Discussion:
svndumpfilter and svnsync?
Chris
2018-10-10 07:04:52 UTC
Permalink
Hi again,

I managed to get some better permissions so I don't have to do svnsync and can get by with doing incremental dumps/loads, but I'm a bit confused by the svndumpfilter + load process so any help would be appreciated.

First of all, my statement about the dump taking 2 weeks was a big fat urban legend. More like 20 minutes so that's good news.

I've trawled through bad commits of data files in our repo and added such paths to a filter file that I'm using for svndumpfilter to get a reasonable-looking dump. In most cases, the files in question existed in a single path (branch) and were no problem. But in some cases, the same files had been copied to a 2nd branch, and then svndumpfilter gave me errors about missing source paths, so I added the same path on the 2nd branch to the filter expressions and tried again. After a few iterations of this process, I have a dump that should do what I want.
So I start "svnadmin load" and, based on initial progress, it might take a couple of days to complete, so I leave it overnight. I get back today and the load has crashed with a missing path. The error was:

svnadmin: E160013: File not found: transaction '16289-ckh', path 'branches/second/dir/datafile'

And looking up the history for that file, I see that "datafile" was added on branch "first" but the path "branches/first/dir" is already in my filter list. So why didn't svndumpfilter throw me an error on this like it did for a lot of other cases?
Since the load process is so much slower, the turnaround time for each error in that step is beyond painful, so if there's anything I can do to ensure that this gets caught by the filter, it would make my life a lot easier.

The syntax I used:
svnadmin dump -q MYREPO | svndumpfilter exclude --targets filterfile > filterdump
svnadmin load -q --no-flush-to-disk --force-uuid -M 2048 --bypass-prop-validation ./NEWREPO < filterdump

(I had to use --bypass-prop-validation due to some newline issues in old log messages, similar to this one: https://groups.google.com/forum/#!topic/subversion_users/P3ohZ-hKhCA. I don't know why they have wrong newlines, but the repo works as it is now...)
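For reference, the filter file passed to --targets above is just one repository path per line; for the case described earlier, the relevant entries look roughly like this:

branches/first/dir
branches/second/dir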
- You can perfectly well use a 1.10 version of svnadmin or svnsync (or svnrdump, to create
a dumpfile from a remote server) to interact with a 1.8 server / repository.
Can I even do this with "svnadmin load"? I thought that would use FSFS format 8 while 1.8 should have format 6. I got that impression from my "research", but I'm probably off base.

TIA,
Chris


--------------------------------------------
On Thu, 10/4/18, Johan Corveleyn <***@gmail.com> wrote:

Subject: Re: svndumpfilter and svnsync?
To: "Chris" <***@yahoo.se>
Cc: "Ryan Schmidt" <subversion-***@ryandesign.com>, "Daniel Shahaf" <***@daniel.shahaf.name>, "Subversion" <***@subversion.apache.org>
Date: Thursday, October 4, 2018, 4:26 PM

On Thu, Oct 4, 2018 at 3:03 PM
(apologies for the top-posting, I really need to stop using this yahoo web interface which is useless with quoting)

Thanks for all the replies. I'll try out what you outlined. There are unfortunately problems outside of my control that make it worse: for company-internal policy reasons, I'm not allowed direct access to the server. I'm only able to get a copy of the repo to work with, and a promise that they can replace the repo with my modified version when I'm done. This might make some of the suggestions hard to work with, but I'll see if it seems possible. Also, the server runs 1.8, and I have no authority to get it upgraded. I think I may have a chance to change the read permissions for the sync user though, so there's a ray of light somewhere in there :)

W.r.t. Johan's question about the time consumption for dumping, I haven't yet been able to test it myself; I only got this as second-hand info from someone who did a dump of the repo last year, so I hope that is completely incorrect. Will try dumping as soon as I get my hands on a repo copy.

Regarding why the repo is so large: my estimate from running some analysis on old revisions is that 90-95% of the data consists of beginners doing accidental commits of things that should not have been allowed to be committed.

Okay, good luck with those "operations". I wanted to add a couple more bits of info:

- After dump+filter+load or svnsync-with-filtering (effectively creating a new repository with an alternate history compared to the original) your new repository will / should have a new UUID. This effectively invalidates all existing working copies out there (which keep track of the UUID they were a checkout from). So all users will have to checkout new working copies.

- You can perfectly well use a 1.10 version of svnadmin or svnsync (or svnrdump, to create a dumpfile from a remote server) to interact with a 1.8 server / repository. So if using a more modern version of svnadmin or svnsync is beneficial, you should use it :).

- A dump file can be (much) larger than the original repository itself, depending on how the dump is created. That's because the repository potentially uses deltification, compression and "representation sharing". If you use the --deltas option for 'svnadmin dump', it will be smaller, at the expense of cpu time for the deltification. Usually people will not use the --deltas option when sending it directly through a pipe (saving on the cpu cycles for deltification), but when writing it to a file you should probably use --deltas.
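For example, with placeholder repository and file names, the two cases would look something like this:

# piping straight into a filter (or load): skip --deltas and save the cpu cycles
svnadmin dump -q REPO | svndumpfilter exclude --targets filterfile > filtered.dump

# writing the dump to a file to keep around: --deltas makes it considerably smaller
svnadmin dump -q --deltas REPO > full.dump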

--
Johan
Ryan Schmidt
2018-10-10 07:16:25 UTC
Permalink
Post by Chris
I've trawled through bad commits of data files in our repo and added such paths to a filter file that I'm using for svndumpfilter to get a reasonable-looking dump. In most cases, the files in question existed in a single path (branch) and were no problem. But in some cases, the same files had been copied to a 2nd branch, and then svndumpfilter gave me errors about missing source paths, so I added the same path on the 2nd branch to the filter expressions and tried again. After a few iterations of this process, I have a dump that should do what I want.
svnadmin: E160013: File not found: transaction '16289-ckh', path 'branches/second/dir/datafile'
And looking up the history for that file, I see that "datafile" was added on branch "first" but the path "branches/first/dir" is already in my filter list. So why didn't svndumpfilter throw me an error on this like it did for a lot of other cases?
Since the load process is so much slower, the turnaround time for each error in that step is beyond painful, so if there's anything I can do to ensure that this gets caught by the filter, it would make my life a lot easier.
svnadmin dump -q MYREPO | svndumpfilter exclude --targets filterfile > filterdump
svnadmin load -q --no-flush-to-disk --force-uuid -M 2048 --bypass-prop-validation ./NEWREPO < filterdump
(I had to use the bypass-prop-validation due to some newline issues in old log message, similar to this one https://groups.google.com/forum/#!topic/subversion_users/P3ohZ-hKhCA, don't know why they have wrong newlines, but the repo works as it is now...)
Instead of ignoring wrong newlines, you could fix them using svndumptool (using its eolfix-revprop command), originally at:

http://svn.borg.ch/svndumptool/

Newer fork at:

https://github.com/jwiegley/svndumptool
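I don't remember the exact argument order offhand (check 'svndumptool.py --help'), but the invocation should look something along these lines, with placeholder dump file names:

svndumptool.py eolfix-revprop svn:log input.dump output.dump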
Post by Chris
- You can perfectly well use a 1.10 version of svnadmin or svnsync (or svnrdump, to create
a dumpfile from a remote server) to interact with a 1.8 server / repository.
Can I even do this with "svnadmin load"; I thought that would use an FSFS version 8 while 1.8 should have 6? I got that impression from my "research", but I'm probably off base.
If you use a newer version of svnadmin (than the one that will be used to serve the repo) to create the new repo and load the dump file, then make sure you pass the right --compatible-version argument to svnadmin create.
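For example, to create the new repository so that a 1.8 server can still serve it:

svnadmin create --compatible-version 1.8 ./NEWREPO
svnadmin load -q ./NEWREPO < filterdump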
Johan Corveleyn
2018-10-10 08:42:37 UTC
Permalink
On Wed, Oct 10, 2018 at 9:16 AM Ryan Schmidt
Post by Ryan Schmidt
Post by Chris
I've trawled through bad commits of data files in our repo and added such paths to a filter file that I'm using for svndumpfilter to get a reasonable-looking dump. In most cases, the files in question existed in a single path (branch) and were no problem. But in some cases, the same files had been copied to a 2nd branch, and then svndumpfilter gave me errors about missing source paths, so I added the same path on the 2nd branch to the filter expressions and tried again. After a few iterations of this process, I have a dump that should do what I want.
svnadmin: E160013: File not found: transaction '16289-ckh', path 'branches/second/dir/datafile'
And looking up the history for that file, I see that "datafile" was added on branch "first" but the path "branches/first/dir" is already in my filter list. So why didn't svndumpfilter throw me an error on this like it did for a lot of other cases?
Since the load process is so much slower, the turnaround time for each error in that step is beyond painful, so if there's anything I can do to ensure that this gets caught by the filter, it would make my life a lot easier.
Hm, not really a clear answer here either. I don't know why
svndumpfilter did not detect these.
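One thing that might help catch these before the (slow) load step: the dump format records copy sources in 'Node-copyfrom-path:' headers, so you could scan the filtered dump for copy sources that still point at excluded paths, something like this (assuming your filter file lists paths without leading slashes, i.e. in the same form the dump file uses):

grep '^Node-copyfrom-path: ' filterdump | grep -F -f filterfile

Any output means a copy still refers to a path you excluded.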

However, you might also give 'svnadmin dump --exclude' a try, if you
can use version 1.10 of svnadmin.
http://subversion.apache.org/docs/release-notes/1.10.html#dump-include-exclude

This feature works similarly to 'svnsync with an authz file that
denies the excluded files'. That means that, when the source of a copy
is excluded, the copy is transformed into an add (so to completely
eliminate a bad file and all its copies, it might be harder to get
hold of those copies ... you won't get any warnings or errors, I think
-- not sure if it emits a notification for such a copy-to-add
conversion). OTOH, 'svnadmin dump --exclude' supports wildcards if you
add the --pattern option, so it might be easier to filter out all
appearances of a specific filename, as in 'svnadmin dump --pattern
--exclude /*/datafile'.
Post by Ryan Schmidt
Post by Chris
svnadmin dump -q MYREPO | svndumpfilter exclude --targets filterfile > filterdump
svnadmin load -q --no-flush-to-disk --force-uuid -M 2048 --bypass-prop-validation ./NEWREPO < filterdump
(I had to use the bypass-prop-validation due to some newline issues in old log message, similar to this one https://groups.google.com/forum/#!topic/subversion_users/P3ohZ-hKhCA, don't know why they have wrong newlines, but the repo works as it is now...)
http://svn.borg.ch/svndumptool/
https://github.com/jwiegley/svndumptool
Also, as of version 1.10, svnadmin finally has an option to normalize
these on-the-fly during 'load':
http://subversion.apache.org/docs/release-notes/1.10.html#normalize-props

It's a lot better to normalize these (either with the
--normalize-props option for 'svnadmin load' or by using svndumptool)
than to "bypass" them. Otherwise you'll run into this again later (if
you would dump+load again sometime in the future).
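With a 1.10 svnadmin that would be, for example:

svnadmin load -q --normalize-props ./NEWREPO < filterdump

(plus whatever other load options you were already using, minus --bypass-prop-validation).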

And another tip: put the repo-to-be-loaded-into (NEWREPO) on as fast a
storage system as possible (SSD, ramdisk if feasible, ...). If you're
satisfied with the result, run 'svnadmin pack' on that fast storage,
and only then copy it over to the final location. Depending on the
final storage that technique might save you a lot of time (especially
if you have to redo it a couple of times).
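Roughly, with /fast/NEWREPO standing in for whatever SSD or ramdisk location you can get:

svnadmin create --compatible-version 1.8 /fast/NEWREPO
svnadmin load -q --no-flush-to-disk /fast/NEWREPO < filterdump
svnadmin pack /fast/NEWREPO
cp -a /fast/NEWREPO /final/destination/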
Post by Ryan Schmidt
Post by Chris
- You can perfectly well use a 1.10 version of svnadmin or svnsync (or svnrdump, to create
a dumpfile from a remote server) to interact with a 1.8 server / repository.
Can I even do this with "svnadmin load"; I thought that would use an FSFS version 8 while 1.8 should have 6? I got that impression from my "research", but I'm probably off base.
If you use a newer version of svnadmin (than the one that will be used to serve the repo) to create the new repo and load the dump file, then make sure you pass the right --compatible-version argument to svnadmin create.
Indeed. It's at 'svnadmin create' time that the FSFS version is
decided. 'svnadmin load' will just "commit" new revisions in the
repository that you first created, and it will follow / respect the
FSFS format that's already set. So it's perfectly doable to create and
load a NEWREPO with 1.10 svnadmin, which you intend to be served by a
1.8 svn server (as long as you use the --compatible-version argument
at create time). (Small note though: 1.8 is no longer supported, so if
you can, plan to do an upgrade to 1.9 or preferably 1.10 soon).
--
Johan
Chris
2018-10-10 09:18:52 UTC
Permalink
Big thanks for the help, it is greatly appreciated!
Some comments and further questions inline below.
Post by Johan Corveleyn
Post by Ryan Schmidt
Post by Chris
I've trawled through bad commits of data files in our repo and added
such paths to a filter file that I'm using for svndumpfilter to get a
reasonable-looking dump. In most cases, the files in question existed in
a single path (branch) and were no problem. But in some cases, the same
files had been copied to a 2nd branch and then svndumpfilter gave me
errors about missing source paths, so I added the same path on the 2nd
branch to the filter expressions and tried again. After a few iterations
of this process, I have a dump that should do what I want.
Post by Ryan Schmidt
Post by Chris
So I start "svnadmin load" and based on initial progress, that might
take a couple of days to complete so I leave it overnight. I get back
Post by Ryan Schmidt
Post by Chris
svnadmin: E160013: File not found: transaction '16289-ckh', path
'branches/second/dir/datafile'
And looking up the history for that file, I see that "datafile" was
added on branch "first" but the path "branches/first/dir" is already in
my filter list. So why didn't svndumpfilter throw me an error on this
like it did for a lot of other cases?
Post by Ryan Schmidt
Post by Chris
Since the load process is so much slower, the turnaround time for
each error in that step is beyond painful, so if there's anything I
can do to ensure that this gets caught by the filter, it would make
my life a lot easier.
Hm, not really a clear answer here either. I don't know why
svndumpfilter did not detect these.
However, you might also give 'svnadmin dump --exclude' a try, if you can
use version 1.10 of svnadmin.
http://subversion.apache.org/docs/release-notes/1.10.html#dump-include-exclude
This feature works similarly to 'svnsync with an authz file that
denies the excluded files'. That means that, when the source of a copy
is excluded, the copy is transformed into an add (so to completely
eliminate a bad file and all its copies, it might be harder to get
hold of those copies ... you won't get any warnings or errors, I think
-- not sure if it emits a notification for such a copy-to-add
conversion). OTOH, 'svnadmin dump --exclude' supports wildcards if you
add the --pattern option, so it might be easier to filter out all
appearances of a specific filename, as in 'svnadmin dump --pattern
--exclude /*/datafile'.
I'll try that. It will be a monster of a command line since dump+exclude
doesn't have the "--targets <file>" option from svndumpfilter and I have 150-ish
exclude statements, but it should be doable.
Not sure how much I can use patterns based on how the bad commits looked,
but they should compress the command line somewhat.
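Maybe I can generate the options from my existing filter file instead of typing them all out, something like this (assuming none of the paths contain spaces or shell glob characters):

svnadmin dump -q MYREPO $(sed 's/^/--exclude /' filterfile) > filterdump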
Post by Johan Corveleyn
Post by Ryan Schmidt
Post by Chris
The syntax I used: svnadmin dump -q MYREPO | svndumpfilter exclude
--targets filterfile > filterdump
svnadmin load -q --no-flush-to-disk --force-uuid -M 2048 --bypass-prop-validation ./NEWREPO < filterdump
(I had to use the bypass-prop-validation due to some newline issues
in old log message, similar to this one
https://groups.google.com/forum/#!topic/subversion_users/P3ohZ-hKhCA,
don't know why they have wrong newlines, but the repo works as it is
now...)
Post by Ryan Schmidt
Instead of ignoring wrong newlines, you could fix them using svndumptool
(using its eolfix-revprop command), originally at:
http://svn.borg.ch/svndumptool/
Newer fork at:
https://github.com/jwiegley/svndumptool
Also, as of version 1.10, svnadmin finally has an option to normalize
these on-the-fly during 'load':
http://subversion.apache.org/docs/release-notes/1.10.html#normalize-props
It's a lot better to normalize these (either with the
--normalize-props option for 'svnadmin load' or by using svndumptool)
than to "bypass" them. Otherwise you'll run into this again later (if
you would dump+load again sometime in the future).
I tried --normalize-props and I still got the same error which is why I
switched over to bypass. Maybe I've run into some bug with --normalize-props.
Unfortunately, I don't think I'll be able to create a script for reproducing
the error since it happens far into a monster dump load.
So I'll stick with the bypass for now or try the tool that Ryan suggested.
Post by Johan Corveleyn
And another tip: put the repo-to-be-loaded-into (NEWREPO) on as fast a
storage system as possible (SSD, ramdisk if feasible, ...). If you're
satisfied with the result, run 'svnadmin pack' on that fast storage,
and only then copy it over to the final location. Depending on the
final storage that technique might save you a lot of time (especially
if you have to redo it a couple of times).
True, I should have thought of that myself.
I'll see what I can do here. Corporate IT policies put some restraints
on me, but it's definitely worth a shot. I just need to manage to
install svn 1.10 on the only machine I have root on, which is a too-old
Ubuntu where I can't find any pre-built packages, and overcome the
too-small disk on that machine as well.
But those are my own problems that I need to find ways around :)
Post by Johan Corveleyn
Post by Ryan Schmidt
Post by Chris
- You can perfectly well use a 1.10 version of svnadmin or svnsync
(or svnrdump, to create a dumpfile from a remote server) to interact
with a 1.8 server /
repository.
Post by Ryan Schmidt
Post by Chris
Can I even do this with "svnadmin load"; I thought that would use an
FSFS version 8 while 1.8 should have 6? I got that impression from my
"research", but I'm probably off base.
Post by Ryan Schmidt
If you use a newer version of svnadmin (than the one that will be used
to serve the repo) to create the new repo and load the dump file, then
make sure you pass the right --compatible-version argument to svnadmin
create.
Indeed. It's at 'svnadmin create' time that the FSFS version is
decided. 'svnadmin load' will just "commit" new revisions in the
repository that you first created, and it will follow / respect the
FSFS format that's already set. So it's perfectly doable to create and
load a NEWREPO with 1.10 svnadmin, which you intend to be served by a
1.8 svn server (as long as you use the --compatible-version argument
at create time).
Good, I will use the compatible-version argument, must have missed that one.
Post by Johan Corveleyn
(Small note though: 1.8 is no longer supported, so if
you can, plan to do an upgrade to 1.9 or preferably 1.10 soon).
Yes, I've tried to get the server upgraded since 1.10 came out, but no
luck so far.

Again, huge thanks for the help!

BR,
Chris
Johan Corveleyn
2018-10-10 10:11:03 UTC
Permalink
On Wed, Oct 10, 2018 at 11:18 AM Chris <***@yahoo.se> wrote:
...
Post by Chris
Post by Johan Corveleyn
Post by Ryan Schmidt
Post by Chris
The syntax I used: svnadmin dump -q MYREPO | svndumpfilter exclude
--targets filterfile > filterdump
svnadmin load -q --no-flush-to-disk --force-uuid -M 2048 --bypass-prop-validation ./NEWREPO < filterdump
(I had to use the bypass-prop-validation due to some newline issues
in old log message, similar to this one
https://groups.google.com/forum/#!topic/subversion_users/P3ohZ-hKhCA,
don't know why they have wrong newlines, but the repo works as it is
now...)
Post by Ryan Schmidt
Instead of ignoring wrong newlines, you could fix them using svndumptool
(using its eolfix-revprop command), originally at:
http://svn.borg.ch/svndumptool/
Newer fork at:
https://github.com/jwiegley/svndumptool
Also, as of version 1.10, svnadmin finally has an option to normalize
these on-the-fly during 'load':
http://subversion.apache.org/docs/release-notes/1.10.html#normalize-props
It's a lot better to normalize these (either with the
--normalize-props option for 'svnadmin load' or by using svndumptool)
than to "bypass" them. Otherwise you'll run into this again later (if
you would dump+load again sometime in the future).
I tried --normalize-props and I still got the same error which is why I
switched over to bypass. Maybe I've run into some bug with --normalize-props.
Unfortunately, I don't think I'll be able to create a script for reproducing
the error since it happens far into a monster dump load.
So I'll stick with the bypass for now or try the tool that Ryan suggested.
In that case the culprit might be a property other than svn:log (or it
might be something like a non-UTF-8 encoding, rather than an EOL issue,
in svn:log). Possibly a "versioned" property like svn:ignore or some
other property in the svn: namespace. This is more difficult to fix,
but it might still be best to get rid of it, or you'll run into it
again in the future.

See the very last bullet in:
http://subversion.apache.org/faq.html#dumpload

If that's indeed the problem, then you'll have to use that svndumptool
that Ryan pointed you to.
Quoting from that last bullet in the FAQ entry above:

"This is more difficult to repair, because 'svn:ignore' is not a
revision property (unlike svn:log, which can be manipulated with
svnadmin setrevprop), but a versioned property (so it's part of
history). Again, you can ignore this with --bypass-prop-validation.
But since this is a corruption "in history", this can only be repaired
with a dump+load, so this might be a good time to try and fix this (or
you'll run into this again in the future). To repair it you can use a
tool like svndumptool. But it only works on dump files, not as part of
a pipe. So a possible way to go about it is: dump that single
(corrupt) revision to a file, repair it ('svndumptool.py eolfix-prop
svn:ignore svn.dump svn.dump.repaired'), load that single dumpfile,
and then continue with a new "piped" command (like step (6) above). "
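Put into commands, that would be roughly the following, with 1234 standing in for the corrupt revision number and assuming everything before it has already been loaded into NEWREPO:

svnadmin dump -q -r 1234 --incremental MYREPO > r1234.dump
svndumptool.py eolfix-prop svn:ignore r1234.dump r1234.repaired.dump
svnadmin load -q ./NEWREPO < r1234.repaired.dump
svnadmin dump -q -r 1235:HEAD --incremental MYREPO | svnadmin load -q ./NEWREPO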

I should note here that svnsync is more powerful in this regard: it
does have the ability to normalize all of these on the fly. It's a
real pity that 'svnadmin load' doesn't (except for the svn:log EOL
fixing). Doesn't *yet*, that is, until a volunteer comes along who
submits a patch for it ;-).

Anyway, I hope you succeed in cleaning this up eventually :-).
--
Johan