Encoding problems in subversion under Mac OS X (HFS+)

Discussion:

Balázs Szabó (dLux)

2005-12-03 01:11:48 UTC

Hi,

I have problems using Subversion on OSX (10.4.3). I have tried a few
different versions and the problem is always the same.

I have checked out a repository, which I created on Linux, and it
contained filenames like "statisztikák.sxc"

I set up the environment before I did anything:

export LC_CTYPE="hu_HU.UTF-8"

The checkout worked fine, but right after the checkout, I had the
following output for svn status (SVN 1.3RC4, but the results are
similar with 1.2.3 as well):

? statisztikák.sxc
! statisztikák.sxc

The problem may be that (as I read elsewhere) HFS+ stores filenames
in decomposed form, and since "á" has different UTF-8 encodings in its
composed and decomposed forms, SVN thinks that the file on disk is
different from the one it just checked out...
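
The two forms can be inspected directly. The following is a small
illustrative sketch in Python (chosen only for illustration; it is not
part of Subversion) that prints the UTF-8 bytes of the composed and
decomposed "á" using the standard unicodedata module. The helper name
utf8_hex is made up for this example.

```python
import unicodedata

composed = "\u00e1"                                  # "á" as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "a" + U+0301 combining acute

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

print(utf8_hex(composed))      # c3 a1
print(utf8_hex(decomposed))    # 61 cc 81
print(composed == decomposed)  # False: a byte-wise comparison sees two names
```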

What can I do with this? I searched through the web, but no one seems
to encounter this problem besides me. Is it so rare??

Thanks in advance,

Balázs Szabó (dLux)
-- -- - - - -- -

Paul Koning

2005-12-03 17:01:56 UTC

Balázs> Hi, I have problems using Subversion on OSX (10.4.3). I have
Balázs> tried a few different versions and the problem is always the
Balázs> same.

Balázs> I have checked out a repository, which I created on Linux,
Balázs> and it contained filenames like "statisztikák.sxc"

Balázs> I set up the environment before I did anything:

Balázs> export LC_CTYPE="hu_HU.UTF-8"

Balázs> The checkout worked fine, but right after the checkout, I had
Balázs> the following output for svn status (SVN 1.3RC4, but the
Balázs> results are similar with 1.2.3 as well):

Balázs> ? statisztikák.sxc
Balázs> ! statisztikák.sxc

Balázs> The problem can be that (as I read elsewhere), HFS+ stores
Balázs> the filenames in decomposed form, and since "á" has two UTF-8
Balázs> code in composed and decomposed forms, SVN thinks that this
Balázs> file is different what is just checked out...

That sounds plausible. This problem can appear anytime you deal with
strings that aren't plain English text -- accents, for example.

There's a standard solution designed in the IETF called Stringprep
(it's an RFC, I don't have the number handy). Basically it involves
translating the string into a single "canonical" format, so that no
matter which choice of encoding you start with, after Stringprep there
is only one possible outcome. Think of it as the UTF analog of
case-insensitive comparison.

So in order to compare UTF strings, you first run the two through
Stringprep, and after that you compare them. That way, two strings
that are the same to the user will also be the same to the program,
and any irrelevant transformations done in storing the strings (like
the HFS+ one) will not confuse things.
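
The normalize-then-compare idea can be sketched in a few lines of
Python. This is only an illustration under assumptions: it uses plain
Unicode normalization (NFC) from the standard unicodedata module as a
stand-in for a full Stringprep profile, and the function name same_name
is hypothetical.

```python
import unicodedata

def same_name(a, b, form="NFC"):
    """Compare two names after mapping both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

from_repository = "statisztik\u00e1k.sxc"   # composed "á", as committed on Linux
from_hfs_plus = "statisztika\u0301k.sxc"    # decomposed "á", as HFS+ reports it

print(from_repository == from_hfs_plus)            # False
print(same_name(from_repository, from_hfs_plus))   # True
```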

paul

Balázs Szabó (dLux)

2005-12-04 21:36:48 UTC

Hi,

Thank you for the explanation and the idea.

But what can I do with it as a subversion user? Does anyone have a
patch or something like this for this problem?

Thanks,

Balázs Szabó (dLux)
-- -- - - - -- -

Balázs Szabó (dLux)

2005-12-05 22:22:04 UTC

Hi Guys,

I really can't believe this can happen: why does Subversion use
Unicode filenames if it cannot handle something as common as the
default Mac OS X filesystem? I understand that OSX is a weird Unix in
many respects, but man, many people use it.

Please, someone just tell me what the heck I can do about this problem:
is it easily solvable? Is there a patch I can apply? Should I just
give up on accents in filenames? Or am I doing something wrong, and it
works fine for everyone else?

Regards,

Balázs Szabó (dLux)
-- -- - - - -- -

Dave Camp

2005-12-05 23:39:51 UTC

I'm not a shell guru by any means, but I'm wondering if you set the
wrong environment variable. I'm using the following in tcsh and I can
checkin/checkout files with non-ASCII chars just fine.

setenv LANG en_US.UTF-8

I assume that for bash, setenv becomes export.

Dave

Paul Koning

2005-12-06 15:23:43 UTC

Dave> I'm not a shell guru by any means, but I'm wondering if you set
Dave> the wrong environment variable. I'm using the following in tcsh
Dave> and I can checkin/checkout files with non-ASCII chars just
Dave> fine.

Dave> setenv LANG en_US.UTF-8

Dave> I assume that for bash, setenv becomes export.

Correct.

Without that setting (on a US Mac) things are utterly broken -- I get
errors about inability to convert string from UTF-8 to native
encoding.
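
As a rough analogy only (this is Python, not Subversion's actual
conversion code), the same kind of failure appears whenever a UTF-8
name has to be represented in a plain ASCII "native" encoding:

```python
name_bytes = "\u00e1.txt".encode("utf-8")   # the name as stored in UTF-8

try:
    name_bytes.decode("ascii")   # what an ASCII-only locale can represent
    convertible = True
except UnicodeDecodeError:
    convertible = False

print(convertible)   # False
```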

With that setting, a checkout or update with a non-English letter in
the filename succeeds. However, past that point things are still
badly broken, as Balázs mentioned.

Test case:

1. On Windows, create a file á.txt (a with accent.txt). I used
TortoiseSVN for that, though I assume that isn't critical. Commit
it.

2. svn update on the Mac. The update reports that it added that new
file, and things look reasonable. The message shows the name
correctly.
(Curiously enough, "ls" butchers the name. Bad Mac...)

3. Do "svn status". File á.txt is shown with status "?".

4. Edit á.txt. File á.txt is now shown twice, once with status ?,
once with status M.

I don't read UTF-8 coding all that well, but it looks to me like
.svn/entries has á.txt listed with the accented a in its combined
(0x00E1) form. And, judging from the butchered output that Mac ls
gives me, Balázs is correct in saying that HFS+ uses the separated
("a" then the accent) form.
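
A toy model of that mismatch, as a hedged Python sketch: the set names
and the raw_status helper are invented for illustration, and standard
NFD from unicodedata is assumed to approximate what the filesystem
returns. A byte-wise comparison of the recorded name against the listed
name reports the same file once as unversioned and once as missing.

```python
import unicodedata

# Names recorded in the working-copy metadata (composed form, as committed).
versioned = {"\u00e1.txt"}
# Names as a decomposing filesystem would list them (NFD assumed here).
on_disk = {unicodedata.normalize("NFD", name) for name in versioned}

def raw_status(versioned, on_disk):
    """Byte-wise comparison: '?' = unversioned, '!' = missing."""
    lines = [("?", n) for n in sorted(on_disk - versioned)]
    lines += [("!", n) for n in sorted(versioned - on_disk)]
    return lines

for flag, name in raw_status(versioned, on_disk):
    print(flag, name)
```

Both printed names render as "á.txt", which matches the confusing pair
of status lines in the original report.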

The problem here is that both are valid. In fact, for some languages
things get messier yet: if you have several diacritical marks on a
letter, as happens all the time in Vietnamese, those marks can occur
in any order.
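
A small Python sketch of the ordering issue, using the Vietnamese
letter "ệ" (e with circumflex and dot below) and the standard
unicodedata module. Canonical ordering only rearranges marks that
belong to different combining classes, which is the case for these two.

```python
import unicodedata

# "ệ" spelled two ways: circumflex and dot below in either order.
circumflex_first = "e\u0302\u0323"
dot_below_first = "e\u0323\u0302"

print(circumflex_first == dot_below_first)   # False

nfd_a = unicodedata.normalize("NFD", circumflex_first)
nfd_b = unicodedata.normalize("NFD", dot_below_first)
print(nfd_a == nfd_b)                        # True: canonical ordering applied
print([f"U+{ord(c):04X}" for c in nfd_a])    # ['U+0065', 'U+0323', 'U+0302']

# Both spellings compose to the single code point U+1EC7.
print(unicodedata.normalize("NFC", circumflex_first) == "\u1ec7")   # True
```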

The point I was making earlier with my reference to the Stringprep RFC
and a canonical encoding of UTF-8 strings is that you have to map all
these various equivalent UTF-8 strings to a single encoding before you
compare them. If you do that, then "svn status" would no longer be
confused because it would recognize the file name as given in
"entries" and the file name returned by the HFS+ file system as
equivalent strings, even though their raw encoding is not the same.

I looked briefly at the code, but it wasn't obvious to me where this
sort of thing would have to be inserted.

Interestingly enough, things seem ok in the other direction. If I add
a file ü.txt on the Mac, commit it, update on the Windows side, change
it, commit that, all is well. In this case, the separated form of the
name encoding appears in the repository and Windows doesn't appear to
have any objection to that.

But if I then rename the file on Windows and commit that rename,
things are broken again for the same reason as before. The other
oddity is that "svn update" on the Mac end adds the new file but
doesn't remove the old file (previous name) from the working
directory. Separate bug?

paul

Paul Koning

2005-12-06 15:47:23 UTC

Paul> The other oddity is that "svn update" on the Mac end adds the
Paul> new file but doesn't remove the old file (previous name) from
Paul> the working directory. Separate bug?

Not a bug, perhaps, but a surprise certainly.

I did the rename, then committed the file (not the directory it lives
in). That adds the new file but doesn't delete the old name.

Only committing the directory will delete the old name.

That makes some vague sense but it isn't entirely intuitive. Might be
worth mentioning in the docs if it isn't already.

paul

Paul Koning

2005-12-06 17:01:11 UTC

Somewhat related:

Scripts like commit-email.pl fail miserably on non-ASCII filenames if
the locale isn't set to some UTF-8 locale. I added a line to my
post-commit hook script to do that:

export LANG=en_US.UTF-8

It may be a good idea to add that to the sample hook files. Depending
on the installation, some other UTF-8 locale may be wanted instead,
but clearly it has to be *some* UTF-8 locale for things like svnlook
to work.

With that change the post commit email message works just fine.

paul

Balázs Szabó (dLux)

2005-12-06 19:57:28 UTC

Hi,

Post by Paul Koning
Without that setting (on a US Mac) things are utterly broken -- I get
errors about inability to convert string from UTF-8 to native
encoding.
With that setting, a checkout or update with a non-English letter in
the filename succeeds. However, past that point things are still
badly broken, as Balázs mentioned.

Exactly. I set the environment variables according to the locale
settings.

Post by Paul Koning
1. On Windows, create a file á.txt (a with accent.txt). I used
TortoiseSVN for that, though I assume that isn't critical. Commit
it.
2. svn update on the Mac. The update reports that it added that new
file, and things look reasonable. The message shows the name
correctly.
(Curiously enough, "ls" butchers the name. Bad Mac...)
3. Do "svn status". File á.txt is shown with status "?".
4. Edit á.txt. File á.txt is now shown twice, once with status ?,
once with status M.

Yes, it is correct!

Post by Paul Koning
I don't read UTF-8 coding all that well, but it looks to me like
.svn/entries has á.txt listed with the accented a in its combined
(0x00E1) form. And, judging from the butchered output that Mac ls
gives me, Balázs is correct in saying that HFS+ uses the separated
("a" then the accent) form.
The problem here is that both are valid.

[...]

Post by Paul Koning
The point I was making earlier with my reference to the Stringprep RFC
and a canonical encoding of UTF-8 strings is that you have to map all
these various equivalent UTF-8 strings to a single encoding before you
compare them.

Yes, the point is right.

Post by Paul Koning
If you do that, then "svn status" would no longer be
confused because it would recognize the file name as given in
"entries" and the file name returned by the HFS+ file system as
equivalent strings, even though their raw encoding is not the same.

[...]

Post by Paul Koning
Interestingly enough, things seem ok in the other direction. If I add
a file ü.txt on the Mac, commit it, update on the Windows side, change
it, commit that, all is well. In this case, the separated form of the
name encoding appears in the repository and Windows doesn't appear to
have any objection to that.
But if I then rename the file on Windows and commit that rename,
things are broken again for the same reason as before.

I figured out why: Windows does not care which encoding you use in
filenames, whereas Mac OS X converts them to a canonical form (in this
case the decomposed form), no matter which form you used originally.

So in Windows, the two "á" characters count as different characters if
they are encoded differently, while they are the same in OSX (on HFS+
at least, which is now the default OSX filesystem).

So when you check out a file whose name uses the composed form, the
filesystem converts it to the decomposed form, and SVN thinks that the
original file was removed and a new one was created.

You can try the following as well:

In an empty directory (on OSX), try:

touch á.txt
svn add á.txt

The result will be a "? á.txt" or something like that.

But if you try:

touch á.txt
svn add *

The result will be fine!

This test shows that the Hungarian keyboard layout of Mac OSX
produces the composed form of the character, while the filename is not
stored that way...

I did some research:

http://developer.apple.com/technotes/tn/tn1150.html

"HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties."

The "Unicode Subtleties" section,
http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties ,
describes the algorithm used for composing and decomposing filenames
from and to UTF-8 format. It is good reading for developers.

I am now sure that this is basically a compatibility problem between
SVN and OSX.

Is there any developer here who can comment on this? Is it easy
to fix? Is it going to be fixed (first of all :-) )?

Regards,

Balázs Szabó (dLux)
-- -- - - - -- -

Paul Koning

2005-12-06 21:45:19 UTC

Balázs> I did some research:

Balázs> http://developer.apple.com/technotes/tn/tn1150.html

Balázs> "HFS Plus stores strings fully decomposed and in canonical
Balázs> order. HFS Plus compares strings in a case-insensitive
Balázs> fashion. Strings may contain Unicode characters that must be
Balázs> ignored by this comparison. For more details on these
Balázs> subtleties, see Unicode Subtleties." ...

Balázs> I am now sure that this is basically a compatibility problem
Balázs> between SVN and OSX.

It's a bit more than that. In some ways, it's analogous to the case
insensitive file systems issue.

Windows allows file names to be encoded any old way. You can create
á.txt twice -- once with a composed á, once with a decomposed á. You
can then commit both. Linux will happily deal with those two files as
distinct files, too.

OS X objects to this. It decomposes all file names, so the composed
and decomposed names conflict. That's very similar to the conflict
between a.txt and A.txt in a case insensitive file system (and in fact
the error messages look similar).

So, contrary to what I suggested before, making all filenames
canonical (decomposed, for example) in Subversion is not necessarily
the right answer, because then you can't handle those two identical
looking but differently encoded file names in Windows. (One might
argue this is a Windows bug -- it shouldn't allow two names that
produce the same pixels on the screen but have different encoding --
and that no doubt is why OS X doesn't. But they are two permitted
file names, so the counter-argument is that the version control system
should allow them both.)

If the file names on the server are maintained as they were on the
client that originally created them, then the fix has to be in the OS
X client. It would have to keep the original file name encodings in
its .svn/entries file. But when comparing those names against
filenames returned by the file system -- i.e., for commands like "svn
status", it has to run those names through the decomposition algorithm
so they will match the names the file system has.
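
A minimal Python sketch of that client-side approach, under stated
assumptions: the function names fs_key and match_entries are
hypothetical, and standard NFD from unicodedata stands in for the HFS+
decomposition table, which may not be identical to it. The recorded
entry names are kept exactly as committed; the decomposed form is used
only as a lookup key.

```python
import unicodedata

def fs_key(name):
    """Key used only for matching against a decomposing filesystem (NFD assumed)."""
    return unicodedata.normalize("NFD", name)

def match_entries(entries, listed):
    """Map each listed filesystem name to the original entry name, or None."""
    by_key = {fs_key(e): e for e in entries}
    return {name: by_key.get(fs_key(name)) for name in listed}

entries = ["\u00e1.txt", "plain.txt"]              # stored exactly as committed
listed = ["a\u0301.txt", "plain.txt", "new.txt"]   # as returned by the filesystem

result = match_entries(entries, listed)
print(result["a\u0301.txt"] == "\u00e1.txt")   # True: matched, original spelling kept
print(result["new.txt"])                       # None: genuinely unversioned
```

Note that two entries differing only in composition would collapse to
the same key in this sketch, which mirrors the conflict OS X itself
reports for such a pair.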

By the way, there's a long discussion about decomposition on the
Unicode website. Alternatively, a good start would be to implement
the mapping table in
http://developer.apple.com/technotes/tn/tn1150table.html . (That does
the decomposing; I don't think it describes the reordering of multiple
accent marks into their canonical order, but that seems like a less
critical issue, except perhaps for Vietnamese.)

paul

Balázs Szabó

2005-12-06 22:31:54 UTC

Hi,

Post by Paul Koning
It's a bit more than that. In some ways, it's analogous to the case
insensitive file systems issue.
Windows allows file names to be encoded any old way. You can create
á.txt twice -- once with a composed á, once with a decomposed á. You
can then commit both. Linux will happily deal with those two files as
distinct files, too.
OS X objects to this. It decomposes all file names, so the composed
and decomposed names conflict. That's very similar to the conflict
between a.txt and A.txt in a case insensitive file system (and in fact
the error messages look similar).
So, contrary to what I suggested before, making all filenames
canonical (decomposed, for example) in Subversion is not necessarily
the right answer, because then you can't handle those two identical
looking but differently encoded file names in Windows. (One might
argue this is a Windows bug -- it shouldn't allow two names that
produce the same pixels on the screen but have different encoding --
and that no doubt is why OS X doesn't. But they are two permitted
file names, so the counter-argument is that the version control system
should allow them both.)
If the file names on the server are maintained as they were on the
client that originally created them, then the fix has to be in the OS
X client. It would have to keep the original file name encodings in
its .svn/entries file. But when comparing those names against
filenames returned by the file system -- i.e., for commands like "svn
status", it has to run those names through the decomposition algorithm
so they will match the names the file system has.

Yes, that is a very good summary of the problem. I want to add that it
might not occur on other OSX filesystems, only on HFS+. So the client
cannot simply decide "OK, I am on OSX, I will do this conversion
stuff"; it needs some heuristic to find out whether the filesystem
behaves in the way described earlier.
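
One possible heuristic, sketched in Python purely as an illustration
(the function name and probe filename are made up, and nothing like
this is claimed to exist in Subversion): create a file with a composed
name and check whether the directory listing hands the same name back.

```python
import os
import tempfile
import unicodedata

def filesystem_rewrites_names(directory):
    """Create a file with a composed name and see what the listing returns."""
    created = "probe-\u00e1.tmp"
    path = os.path.join(directory, created)
    with open(path, "w"):
        pass
    try:
        listed = [n for n in os.listdir(directory) if n.startswith("probe-")]
    finally:
        os.remove(path)
    # True if the name came back in a different (but equivalent) encoding.
    return created not in listed

with tempfile.TemporaryDirectory() as tmp:
    print(filesystem_rewrites_names(tmp))
```

The printed value depends on the filesystem holding the temporary
directory, so no expected output is given here.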

Post by Paul Koning
By the way, there's a long discussion about decomposition on the
Unicode website. Alternatively, a good start would be to implement
the mapping table in
http://developer.apple.com/technotes/tn/tn1150table.html . (That does
the decomposing; I don't think it describes the reordering of multiple
accent marks into their canonical order, but that seems like a less
critical issue, except perhaps for Vietnamese.)

This is Mac OSX and especially HFS+-specific code.

The problem with this is the detection of this issue.

What I suggest is a configuration parameter to allow it to be set up
globally: should the SVN client convert every pathname to a common
canonical form before filename-comparisons (e.g. with stringprep), or
not.

I don't see much sense in allowing filenames that contain the same
character to be encoded differently, but there may be cases where it
is needed. If this configuration parameter is set to "off", then it
is allowed.

In a multi-platform environment where the system is used to store
different document titles, this configuration parameter should be set
to "on", and then OSX and Windows/Linux/UNIX users will all be happy.

So what do you suggest? Should I open a bug for it in the SVN
bug-tracking system?

Regards,

Balázs Szabó (dLux)
-- -- - - - -- -

Scott Palmer

2005-12-06 23:04:41 UTC

Post by Paul Koning
Windows allows file names to be encoded any old way. You can create
á.txt twice -- once with a composed á, once with a decomposed á. You
can then commit both. Linux will happily deal with those two files as
distinct files, too.

Some may call that a feature, but I call that a bug. Windows and
Linux are broken in this way. Mac OS is doing the "right" thing by
making the files user-friendly and not different in some cryptic,
geeky, computerese way.

Post by Paul Koning
OS X objects to this. It decomposes all file names, so the composed
and decomposed names conflict. That's very similar to the conflict
between a.txt and A.txt in a case insensitive file system (and in fact
the error messages look similar).
So, contrary to what I suggested before, making all filenames
canonical (decomposed, for example) in Subversion is not necessarily
the right answer, because then you can't handle those two identical
looking but differently encoded file names in Windows. (One might
argue this is a Windows bug -- it shouldn't allow two names that
produce the same pixels on the screen but have different encoding --
and that no doubt is why OS X doesn't. But they are two permitted
file names, so the counter-argument is that the version control system
should allow them both.)

It is a similar problem, but at least in the uppercase/lowercase
version we aren't talking about the "exact same" character that is
mysteriously not the same.

I prefer case-INsensitive filesystems for the same reason. If my
dog's name is "Spot" and I call him by yelling "spot" he still
comes :) I want the same to work for my files.

Anyway... I don't mean to rant about all this, other than to say that I
think Subversion should do what OS X is doing and only allow a single
representation of the same character (decomposed or whatever) in the
repository. In the end it prevents conflicts in these weird cases.
Anyone that actually WANTS the names to be unique even though they
only differ by the characters being in composed or decomposed form is
asking for trouble - I'd rather they keep the trouble to themselves
than force the rest of us to deal with it. :)

To extend the argument you have listed above, Windows and Mac OS have
case-insensitive filesystems, while most Unix systems have
case-sensitive filesystems... therefore Subversion should support them
both... but it doesn't; it only supports case-sensitive filesystems.
(In the sense that Subversion causes errors on Windows and Mac when
case-only changes to path names are present.)

Scott

Ryan Schmidt

2005-12-06 06:56:19 UTC

Post by Balázs Szabó (dLux)
I really can't believe this thing can happen: why subversion uses
unicode filenames if it cannot handle such a common thing as a Mac
OS X default filesystem. I understand that OSX is a weird Unix in
many aspects, but man, many people use this.
Please someone just tell me what the heck I can do with this
problem: is it solvable easily? Is there any patch I can apply? Or
just forget using accents in the filenames? Or I am doing something
wrong and it works fine for everyone else?

Unfortunately I don't know the answer to your question. I also use
Mac OS X with Subversion, but I use it to manage web sites. In a web
page URL, what charset should be used to encode non-ASCII characters?
I'm not aware of any RFC which answers that question. Therefore to
avoid unpredictable behavior, I restrict myself to ASCII filenames.
This is easy for me to do since I primarily speak English, but I
certainly acknowledge that Hungarian and other languages have a much
tougher time with that restriction, and I agree that Subversion
should be handling non-ASCII filenames correctly. I don't know what
the problem is in your situation, but the previous suggestion that
Subversion isn't running stringprep when it should certainly sounds
possible to me. If that is the case, then a Subversion programmer
will probably have to evaluate all places where Subversion uses UTF-8
data and see whether stringprep is being used and if not whether it
should be used at that point.

If you could provide a step-by-step reproduction recipe for the
problem, I could try to reproduce it, as I have access to both Mac OS
X and Linux machines.

Kalin KOZHUHAROV

2005-12-06 15:48:41 UTC

Please do not top-post ( http://en.wikipedia.org/wiki/Top-posting )

Not _that_ common I guess.

Post by Balázs Szabó (dLux)
I understand that OSX is a weird Unix in many aspects...

you said it.

Post by Balázs Szabó (dLux)
is it solvable easily? Is there any patch I can apply? Or just forget
using accents in the filenames? Or I am doing something wrong and it
works fine for everyone else?

First, tell us what your locale settings are.
In a shell, execute locale.

For example, on my systems:
$ locale
LANG=en
LC_CTYPE=ja_JP.UTF-8
LC_NUMERIC=C
LC_TIME=C
LC_COLLATE=ja_JP.UTF-8
LC_MONETARY=ja_JP.UTF-8
LC_MESSAGES=C
LC_PAPER=C
LC_NAME=C
LC_ADDRESS=C
LC_TELEPHONE=C
LC_MEASUREMENT=C
LC_IDENTIFICATION=C
LC_ALL=

As you see, LANG=en is good to work in English.
{LC_CTYPE,LC_COLLATE,LC_MONETARY}=language_COUNTRY.UTF-8 (I need the
Japanese locale for some input methods)

First try with LANG="en_US.UTF-8".

Next, check whether you actually have these locales installed. On a
normal *nix system they should be under something like
/usr/share/locale/language_COUNTRY.UTF-8; not sure about Mac OS X...
You need the actual definitions in some files.
For more info (Gentoo specific) see:
http://www.gentoo.org/doc/en/guide-localization.xml#doc_chap3

Google hit this:
http://maczealots.com/tutorials/xcode-svn/
http://duke.usask.ca/~dalglb/macosx/Perl_5.6.html

Kalin.

--
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|

Balázs Szabó (dLux)

2005-12-06 19:33:40 UTC

Hi,

Post by Kalin KOZHUHAROV
Please do not top-post ( http://en.wikipedia.org/wiki/Top-posting )

Ok.

Post by Kalin KOZHUHAROV
First tell us what your locale settings are?
In a shell execute locale.

On my system, it is one of the settings. This is the result with
export LANG="hu_HU.UTF-8" set:

LANG="hu_HU.UTF-8"
LC_COLLATE="hu_HU.UTF-8"
LC_CTYPE="hu_HU.UTF-8"
LC_MESSAGES="hu_HU.UTF-8"
LC_MONETARY="hu_HU.UTF-8"
LC_NUMERIC="hu_HU.UTF-8"
LC_TIME="hu_HU.UTF-8"
LC_ALL="hu_HU.UTF-8"

Post by Kalin KOZHUHAROV
As you see, LANG=en is good to work in English.
{LC_CTYPE,LC_COLLATE,LC_MONETARY}=language_COUNTRY.UTF-8 (I need
the japanese locale for some input methods)
First try with LANG="en_US.UTF-8".

I tried it, and the results are the same.

Post by Kalin KOZHUHAROV
Next, check if you have actually these locales installed. In a
normal *nix system these should be like
/usr/share/locale/language_COUNTRY.UTF-8 not sure about Mac OS X...
You need the actual definitions in some files.

Yes, I have the directory for the "hu_HU.UTF-8" as well as the
"en_US.UTF-8".

Post by Kalin KOZHUHAROV
http://www.gentoo.org/doc/en/guide-localization.xml#doc_chap3

I don't really have problems with the localization settings in the
Mac OSX, but thanks for the links.

Post by Kalin KOZHUHAROV
http://maczealots.com/tutorials/xcode-svn/
http://duke.usask.ca/~dalglb/macosx/Perl_5.6.html

You misunderstood the problem, I guess. I don't have problems with
setting up the locale in OSX; I have problems with the character
encoding. I will explain in my other mail, replying to Paul's.

Regards,

Balázs Szabó (dLux)
-- -- - - - -- -

Stuart Celarier

2005-12-07 02:19:22 UTC

Post by Paul Koning
One might argue this is a Windows bug -- it shouldn't allow two names
that produce the same pixels on the screen but have different encoding
-- and that no doubt is why OS X doesn't.

[Stuart] I've got to question this specific line of reasoning. If your
criterion for distinctiveness is "same pixels on the screen", then, at
best, that's an issue to take up with font designers. But that

Consider this. U+0391 (Α) is a capital Greek alpha, which in
virtually every font is visually indistinguishable from A, the
English letter A. Here's a simple HTML document; save it, load it in a
browser, and see for yourself:

<HTML><BODY>AΑ</BODY></HTML>

These are distinct code points, hence different characters, even if
similar or identical glyphs are used. I don't get how this becomes a
Windows (or any other operating system-specific) problem. If the letter
'O' and numeral '0' are visually indistinct on my computer, in whatever
font I happen to use, should the file system prevent me from using one
of these characters? I don't think so.

Cheers,
Stuart

Stuart Celarier | Corillian Corporation

Balázs Szabó

2005-12-07 07:48:47 UTC

Hi,

Post by Stuart Celarier
Consider this. U+0391 (Α) is a capital Greek alpha, which in
virtually every font is visually indistinguishable from A, the
English letter A. Here's a simple HTML document, save it, load it in a
browser, and see for yourself:
<HTML><BODY>AΑ</BODY></HTML>
These are distinct code points, hence different characters, even if
similar or identical glyphs are used. I don't get how this becomes a
Windows (or any other operating system-specific) problem. If the letter
'O' and numeral '0' are visually indistinct on my computer, in whatever
font I happen to use, should the file system prevent me from using one
of these characters? I don't think so.

I don't think we are talking about the same problem. We are not
talking about the same (or similar) glyphs, we are talking about the
same characters: one character with the same encoding. UTF-8 allows
the same character to be encoded with different codes! These characters
have to have the same meaning (that's why OSX converts them to a
canonical form). The characters you mentioned do not have the same
meaning (Greek alpha vs. A and 0 vs. O).

What I suggest is having a config option to at least allow OSX users
to use SVN in its full glory (I mean accented filenames). This
requires some kind of stringprep.

Regards,

Balázs Szabó (dLux)
-- -- - - - -- -

Paul Koning

2005-12-07 14:24:59 UTC

Paul> One might argue this is a Windows bug -- it shouldn't allow two
Paul> names that produce the same pixels on the screen but have
Paul> different encoding -- and that no doubt is why OS X doesn't.

Stuart> [Stuart] I've got to question this specific line of
Stuart> reasoning. If your criterion for distinctiveness is "same
Stuart> pixels on the screen", then, at best, that's an issue to take
Stuart> up with font designers. But that

Stuart> Consider this. U+0391 (Α) is a capital Greek alpha,
Stuart> which in virtually every font is visually indistinguishable
Stuart> from A the English letter A. Here's a simple HTML
Stuart> document, save it, load it in a browser, and see for
Stuart> yourself:

Stuart> <HTML><BODY>AΑ</BODY></HTML>

Stuart> These are distinct code points, hence different characters,
Stuart> even if similar or identical glyphs are used. I don't get how
Stuart> this becomes a Windows (or any other operating
Stuart> system-specific) problem. If the letter 'O' and numeral '0'
Stuart> are visually indistinct on my computer, in whatever font I
Stuart> happen to use, should the file system prevent me from using
Stuart> one of these characters? I don't think so.

Good point. I said it wrong in the previous note.

The Unicode standard, I believe, discusses the point you made about
meaning vs. appearance. It assigns code points to characters based on
what they mean, not based on what they look like. (Well, the
"unified" codes may stretch that rule...)

The right way to describe the issue is like this:

There is a character "Latin small letter a with acute". It looks like
this: "á". There are two ways to encode that character: as a
"combined" character, and as "a" followed by "combining acute accent".

Those two have the same meaning (not just the same appearance) --
which is really the important point. There are transformation
algorithms that recognize their equivalence. If you convert them to
any of the various Normalization Forms, you'll end up with the same
string for both (that's what "normalization" means).
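
The distinction between "same meaning" and "same appearance" can be
checked with the standard Python unicodedata module. This is only an
illustrative sketch; the helper name canonically_equal is made up.

```python
import unicodedata

def canonically_equal(a, b):
    """True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonically_equal("\u00e1", "a\u0301"))   # True: one character, two encodings
print(canonically_equal("A", "\u0391"))         # False: Latin A vs. Greek capital alpha
print(canonically_equal("O", "0"))              # False: letter O vs. digit zero
print(unicodedata.name("\u0391"))               # GREEK CAPITAL LETTER ALPHA
```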

See www.unicode.org/reports/tr15 for the full story.

paul

Stuart Celarier

2005-12-07 11:17:55 UTC

Post by Balázs Szabó
I don't think we are talking about the same problem. We are not
talking about the same (or similar) glyphs, we are talking about the
same characters: one character with the same encoding. UTF-8 allows
the same character to be encoded with different code!

Post by Paul Koning
One might argue this is a Windows bug -- it shouldn't allow two names
that produce the same pixels on the screen but have different encoding
-- and that no doubt is why OS X doesn't.

Post by Balázs Szabó
What I suggest is having a config option to at least allow OSX users
to use SVN with its full glory (I mean accented filenames).

I agree that Subversion (ideally) should work equally well on all platforms and with all languages. Having multiple Unicode representations for a single text element is not limited to accents on (Latin-based) characters, for example see [1]. I bring this up in hopes that any solution be general enough to apply to all languages.

In the interim, could you use a pre-commit hook, similar in concept to check-case-insensitive.pl, to either prevent conflicting Unicode representations or make them canonical? I'm a novice when it comes to using Subversion hooks, so I don't know the extent to which this could fix the problem or ease your pain.
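
The core check such a hook would need can be sketched in Python. This is a hedged illustration only: the function name find_collisions is hypothetical, NFC is an arbitrary choice of canonical form, and how a real hook would obtain the list of paths in a transaction is deliberately left out.

```python
import unicodedata
from collections import defaultdict

def find_collisions(paths):
    """Group paths that become identical after NFC normalization."""
    groups = defaultdict(list)
    for p in paths:
        groups[unicodedata.normalize("NFC", p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

paths = ["docs/\u00e1.txt", "docs/a\u0301.txt", "docs/b.txt"]
print(len(find_collisions(paths)))   # 1
```

A hook built on this could reject a commit whenever the returned list is non-empty, in the same spirit as the case-insensitivity check.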

Cheers,
Stuart

Stuart Celarier | Corillian Corporation

[1] http://developer.apple.com/documentation/Carbon/Conceptual/ProgWithTECM/tecmgr_concepts/chapter_2_section_7.html