Discussion:
eol-style and utf-16
engelbert gruber
2017-10-30 20:57:29 UTC
hi

checking in a file with eol-style native on unix : eol = 0x0a
checking it out on windows : 0x0a is replaced by 0x0d 0x0a

when the file is in utf-16 : eol is 0x00 0x0a
and when checked out on windows this becomes : 0x00 0x0d 0x0a

which breaks utf-16 as far as i understand it
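
The byte sequences described above can be reproduced with a minimal Python sketch. It assumes big-endian UTF-16 without a BOM (the form in which a line feed is stored as 0x00 0x0a), and `naive_lf_to_crlf` is a hypothetical stand-in for any encoding-unaware conversion, not Subversion's actual code.

```python
def naive_lf_to_crlf(data: bytes) -> bytes:
    """Byte-level LF -> CRLF replacement that knows nothing about the encoding."""
    return data.replace(b"\n", b"\r\n")


# "a", line feed, "b" in big-endian UTF-16: 00 61 00 0a 00 62
original = "a\nb".encode("utf-16-be")

# The single inserted 0x0d byte shifts everything after it by one byte.
converted = naive_lf_to_crlf(original)

try:
    converted.decode("utf-16-be")
    decodes_cleanly = True
except UnicodeDecodeError:
    # The odd byte count alone is enough to make strict decoding fail here.
    decodes_cleanly = False

print(original.hex(" "))
print(converted.hex(" "))
print("still valid UTF-16:", decodes_cleanly)
```

With an even number of line feeds the length becomes even again, so a decoder may not raise an error, but the code units between the inserted bytes are still misaligned.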

possible fixes:

* get utf-## aware
* add a charsize property
* document it
* recommend a non-native eol-style: LF, CR or CRLF

all the best
e
Nico Kadel-Garcia
2017-10-31 01:12:38 UTC
On Mon, Oct 30, 2017 at 4:57 PM, engelbert gruber
Post by engelbert gruber
hi
checking in a file with eol-style native on unix : eol = 0x0a
checking it out on windows : 0x0a is replaced by 0x0d 0x0a
when the file is in utf-16 : eol is 0x00 0x0a
and when checked out on windows this becomes : 0x00 0x0d 0x0a
which breaks utf-16 as far as i understand it
* get utf-## aware
* add a charsize property
* document it
* recommend a non-native eol-style: LF, CR or CRLF
all the best
e
So, easy solution. *Never* use eol-style. It's destructive to any
working copy that may be accessed via operating systems with distinct
eol styles. And its destructiveness is insidious when files are
edited, locally, with editors that auto-interpret EOL on the fly,
leading to inconsistent EOL and EOL confusion when creating new files
in the repo.

It doesn't do much for other UTF difficulties, but it sure avoids the
whole inconsistent EOL issues.
Stefan Sperling
2017-10-31 09:11:19 UTC
Post by Nico Kadel-Garcia
So, easy solution. *Never* use eol-style.
I would not point at svn:eol-style as the root cause here.
This feature works fine with text files.
Post by Nico Kadel-Garcia
It's destructive to any
working copy that may be accessed via operating systems with distinct
eol styles.
It works fine unless the operating system is so obscure that it uses
something other than LF, CRLF, or CR as a newline character.
Post by Nico Kadel-Garcia
And its destructiveness is insidious when files are
edited, locally, with editors that auto-interpret EOL on the fly,
leading to inconsistent EOL and EOL confusion when creating new files
in the repo.
If an editor decides to change all the newlines, this creates
a diff where every line in a text file appears as changed,
even if just a single line was modified by the editor's user.
That's a problem svn:eol-style can solve.

If an editor decides to create inconsistent newlines, it has broken
the file. All you can do now is treat it as a binary file because
text content cannot be split into lines anymore. I would put the
blame on the editor here.
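
A file with inconsistent newlines, as described above, can be spotted with a short check. This is only a sketch for single-byte or UTF-8 text (not UTF-16, for the reasons discussed in this thread), and the helper names `eol_counts` and `has_mixed_eols` are made up for illustration.

```python
import re
from collections import Counter

EOL_NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}


def eol_counts(data: bytes) -> Counter:
    """Count each newline style; CRLF is matched first so it is not split in two."""
    return Counter(EOL_NAMES[m] for m in re.findall(rb"\r\n|\r|\n", data))


def has_mixed_eols(data: bytes) -> bool:
    """True if more than one newline style occurs in the data."""
    return len(eol_counts(data)) > 1


sample = b"first\r\nsecond\nthird\r"
print(dict(eol_counts(sample)))
print("mixed:", has_mixed_eols(sample))
```

A check like this could run before commit to catch an editor that wrote mixed line endings.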
Post by Nico Kadel-Garcia
It doesn't do much for other UTF difficulties, but it sure avoids the
whole inconsistent EOL issues.
In my opinion the problem under discussion has nothing to do with eol-style.
Rather, it is that UTF-16 must be treated as binary data in SVN.

The property svn:mime-type should be set to 'application/octet-stream'
on UTF-16 files. And setting svn:eol-style on a binary file is obviously
not a good idea (unfortunately, these features are not mutually exclusive
but they should be).
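
As a rough illustration of the advice above, the following Python sketch flags files that start with a UTF-16 byte order mark and builds the matching `svn propset` and `svn propdel` commands without running them. The BOM check is only a heuristic (UTF-16 files without a BOM are missed, and a UTF-32 little-endian BOM also matches), and the helper names and the `data.csv` path are hypothetical.

```python
import codecs
from typing import List

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def starts_with_utf16_bom(data: bytes) -> bool:
    """Heuristic: True if the data begins with a UTF-16 byte order mark."""
    return data.startswith(UTF16_BOMS)


def suggested_svn_commands(path: str) -> List[List[str]]:
    """Commands that mark the file as binary and drop svn:eol-style."""
    return [
        ["svn", "propset", "svn:mime-type", "application/octet-stream", path],
        ["svn", "propdel", "svn:eol-style", path],
    ]


# Hypothetical file name; the commands are printed, not executed.
for command in suggested_svn_commands("data.csv"):
    print(" ".join(command))
```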

Adding UTF-16 support is not impossible but difficult because Subversion
as a system assumes UTF-8 strings and won't work correctly with strings
that contain embedded NUL bytes, and there are a lot of entry points
for text data in the system.
Daniel Shahaf
2017-10-31 11:55:30 UTC
Post by Stefan Sperling
Post by Nico Kadel-Garcia
It doesn't do much for other UTF difficulties, but it sure avoids the
whole inconsistent EOL issues.
In my opinion the problem under discussion has nothing to do with eol-style.
Rather, it is that UTF-16 must be treated as binary data in SVN.
The property svn:mime-type should be set to 'application/octet-stream'
on UTF-16 files.
"application/octet-stream; charset=utf-16" should work too. I don't
remember off the top of my head which tools consume the additional
information --- httpd mod_magic perhaps? --- but they exist. (Sorry, I
don't have time to look up the details right now.)
Post by Stefan Sperling
And setting svn:eol-style on a binary file is obviously
not a good idea (unfortunately, these features are not mutually exclusive
but they should be).
Adding UTF-16 support is not impossible but difficult because Subversion
as a system assumes UTF-8 strings and won't work correctly with strings
that contain embedded NUL bytes, and there are a lot of entry points
for text data in the system.
I'm not sure which part of the system is not NUL-safe? UTF-8 text files with
svn:eol-style set and embedded NULs seem to be handled correctly.

I agree that in principle it'd be possible to sniff the charset from the
svn:mime-type property and then <handwave>DTRT for UTF-16 files with
svn:eol-style</handwave>. This will happen when someone implements it, aka,
patches welcome.
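
What "sniff the charset and do the right thing" could look like is sketched below in Python. This is not how Subversion works today; both function names are hypothetical, and treating BOM-less data as big-endian is an assumption made only for the example.

```python
import codecs
from typing import Optional


def charset_from_mime_type(value: str) -> Optional[str]:
    """Return the charset parameter of a MIME type string, lowercased, or None."""
    for part in value.split(";")[1:]:
        name, _, val = part.partition("=")
        if name.strip().lower() == "charset":
            return val.strip().strip('"').lower()
    return None


def translate_eol_utf16(data: bytes, eol: str) -> bytes:
    """Normalize line endings of UTF-16 data to `eol`, keeping BOM and byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"
    elif data.startswith(codecs.BOM_UTF16_BE):
        bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"
    else:
        bom, codec = b"", "utf-16-be"  # assumption: no BOM means big-endian
    text = data[len(bom):].decode(codec)
    # Work on characters, not bytes, so every newline stays a whole code unit.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", eol)
    return bom + text.encode(codec)


mime_type = "application/octet-stream; charset=utf-16"
unix_bytes = codecs.BOM_UTF16_BE + "a\nb".encode("utf-16-be")
if charset_from_mime_type(mime_type) == "utf-16":
    windows_bytes = translate_eol_utf16(unix_bytes, "\r\n")
    print(windows_bytes.hex(" "))
```

Because the translation happens after decoding, the carriage return is written as the two bytes 0x00 0x0d and the file stays valid UTF-16.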

Cheers,

Daniel

engelbert gruber
2017-10-31 11:17:09 UTC
sorry one more

On 30 October 2017 at 21:57, engelbert gruber <***@gmail.com> wrote:
Post by engelbert gruber
checking in a file with eol-style native on unix : eol = 0x0a
checking it out on windows : 0x0a is replaced by 0x0d 0x0a
when the file is in utf-16 : eol is 0x00 0x0a
and when checked out on windows this becomes : 0x00 0x0d 0x0a
which breaks utf-16 as far as i understand it
* get utf-## aware
* add a charsize property
* document it
* recommend a non-native eol-style: LF, CR or CRLF
* turning off eol-style is an option, but not in general, as subversion
  comes with eol-style native
  (i like to stick with defaults, to ease setting up systems and because
  subversion-maintainers are more knowledgeable than me)

setting svn:mime-type to 'application/octet-stream' shouldn't be necessary if
http://help.collab.net/index.jsp?topic=/faq/svnbinary.html

Currently, Subversion only looks at the first 1024 bytes of the file; if
any of the bytes are zero, or if more than 15 percent are not ASCII printing
characters, then Subversion calls the file binary.

is correct. by this utf-16 will always be binary.
mine was a csv-file, but the problem might be that it was imported from a
CVS-repo
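
The heuristic quoted from the FAQ can be written down as a small Python sketch. It follows the FAQ wording only, not Subversion's source; which bytes count as "ASCII printing characters" (here 0x20 to 0x7e plus tab, LF and CR) and the handling of empty files are assumptions, and `looks_binary` is a made-up name.

```python
# Assumption: printable ASCII plus tab, line feed and carriage return.
PRINTING = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def looks_binary(data: bytes, sample_size: int = 1024, threshold: float = 0.15) -> bool:
    """Apply the quoted rule to the first `sample_size` bytes of the data."""
    sample = data[:sample_size]
    if not sample:
        return False  # assumption: an empty file is not called binary
    if 0 in sample:
        return True  # any zero byte marks the file as binary
    non_printing = sum(1 for byte in sample if byte not in PRINTING)
    return non_printing / len(sample) > threshold


print(looks_binary(b"a,b,c\n1,2,3\n"))
print(looks_binary("a,b,c\n1,2,3\n".encode("utf-16-le")))
```

ASCII-range text stored as UTF-16 has a zero byte in every code unit, so it trips the first rule immediately.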

cheers
e