Discussion:
Missing LOCALE in post-commit hook leads to weird behaviour of `svnlook log` with unicode characters – broken transliterations
H.-Dirk Schmitt
2018-01-27 17:33:27 UTC
I found a very weird behaviour of `svnlook log` that IMHO is a bug (or
at least a serious missing documentation issue).

Introduction
------------

Consider a log message like: 'Unicode Test → ø ÄÖÜ'

`svnlook log` invoked in a normal terminal session shows the proper
content.
This works because the environment is set to 'en_US.UTF-8'.

Next, `env LC_ALL=C.UTF-8 svnlook log` also shows the
correct result.

Problem
-------
But when falling back to `env LC_ALL=C svnlook log`, I get a badly
mangled result:

Unicode Test {U+2192} {U+00F8} AOU

→ and ø are replaced with their code point notation.
The German umlauts are transliterated in a very uncommon way.
In the old ASCII/typewriter days, Ä was transliterated as Ae (Ö → Oe,
…).

Why this behaviour is not a cosmetic problem
--------------------------------------------

Consider a post-commit hook that fetches the commit message with
`svnlook log`.
The purpose is to post-process the log message, e.g. to append it to
Bugzilla issues.

The actual setup is svn+apache2 with a bash script as post-commit hook.
The machine locale as reported by `localectl`: System Locale:
LANG=en_US.utf8

All commit message content transferred this way is broken as described
above.

This happens because the post-commit hook is running with a very
reduced set of environment variables:
PWD=/
SHLVL=1

In particular, neither `LC_ALL` nor `LANG` is set, which is equivalent to `LC_ALL=C`.
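
For reference, POSIX resolves a locale category such as `LC_CTYPE` from `LC_ALL` first, then from the category variable itself, then from `LANG`, and falls back to an implementation default (normally `C`) when none of them is set. A minimal sketch of that lookup, which a hook could use to record the character set it actually runs under; the function name `effective_ctype` is made up for illustration:

```shell
#!/bin/sh
# Print the locale name that applies to LC_CTYPE, following the POSIX
# precedence: LC_ALL, then LC_CTYPE, then LANG, else the default.
# Empty values count as unset. Assumption: the default is "C".
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}
```

Calling `effective_ctype` at the top of a hook and appending the result to a log file of your choice shows quickly whether the hook sees any locale settings at all.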

Suggested Mitigation/Fixing
---------------------------
1. Subversion should ensure that the system locale is forwarded to the
post-commit hook.
2. `svnlook` should support the `--encoding` switch.
3. German umlauts (and surely some other national characters in the
8-bit range) shouldn't be transliterated differently from other
Unicode characters (see ø / {U+00F8}).


PS: Google et al. haven't shown that this issue is well documented.
Johan Corveleyn
2018-01-29 08:50:09 UTC
Post by H.-Dirk Schmitt
[...]
PS: Google et al. haven't shown that this issue is well documented.
This is documented in the official documentation (the "SVN Book"):
http://svnbook.red-bean.com/nightly/en/svn.reposadmin.create.html#svn.reposadmin.hooks.configuration

(see the first sentence there: "By default, Subversion executes hook
scripts with an empty environment—that is, no environment variables
are set at all, not even $PATH (or %PATH%, under Windows).")
--
Johan
Ryan Schmidt
2018-01-30 13:24:44 UTC
Perhaps the hook script templates could be modified to show how to properly set the environment variables so that UTF-8 log messages can be correctly processed?
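
One way such a template could look, as a rough sketch and not an official Subversion template. Assumptions: the `C.UTF-8` locale is installed on the server, `svnlook` lives at `/usr/bin/svnlook`, and the names `SVNLOOK` and `read_log_message` are hypothetical:

```shell
#!/bin/sh
# post-commit hook sketch. Subversion calls the hook with the
# repository path as $1 and the new revision number as $2.

# Hooks start with an empty environment, so choose the character set
# explicitly. Assumption: C.UTF-8 is available on this machine.
LC_ALL=C.UTF-8
export LC_ALL

# PATH is not set either, so use an absolute path to svnlook.
# SVNLOOK is a hypothetical override, not something Subversion sets.
SVNLOOK="${SVNLOOK:-/usr/bin/svnlook}"

# Print the log message of revision $2 in repository $1.
read_log_message() {
    "$SVNLOOK" log -r "$2" "$1"
}

# Only act when called with the arguments Subversion passes to hooks.
if [ "$#" -ge 2 ]; then
    read_log_message "$1" "$2"
fi
```

Setting the variable inside the script keeps the fix local to one hook; the server-wide alternative is the `hooks-env` mechanism from the 1.8 release notes.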
Stefan Sperling
2018-01-29 08:53:55 UTC
Post by H.-Dirk Schmitt
All the commit messages content transferred is broken as described
above.
This happens because the post-commit hook is running with a very
reduced set of environment variables:
PWD=/
SHLVL=1
See http://subversion.apache.org/docs/release-notes/1.8.html#hooks-env
and http://subversion.apache.org/docs/release-notes/1.8.html#mod-dav-svn-utf8
H.-Dirk Schmitt
2018-01-29 15:46:09 UTC
Post by Stefan Sperling
Post by H.-Dirk Schmitt
All the commit messages content transferred is broken as described
above.
This happens because the post-commit hook is running with a very
reduced set of environment variables:
PWD=/
SHLVL=1
See http://subversion.apache.org/docs/release-notes/1.8.html#hooks-env
and http://subversion.apache.org/docs/release-notes/1.8.html#mod-dav-svn-utf8
[...]
(see the first sentence there: "By default, Subversion executes hook
scripts with an empty environment—that is, no environment variables
are set at all, not even $PATH (or %PATH%, under Windows).")
OK - My „Postscriptum“ was not correct - my apologies.

But these points remain valid:

- Broken transliteration of German umlauts.
- Subversion is ignoring the machine locale settings, which should
normally be the default if not overridden in the environment. This is
considerably bad behaviour for a linux/unix application.
--
H.-Dirk Schmitt


Stefan Sperling
2018-01-29 17:14:39 UTC
Post by H.-Dirk Schmitt
OK - My „Postscriptum“ was not correct - my apologies.
- Broken transliteration of German Umlaut.

I don't see a reason to add support for transliteration if
the locale is incompatible. Just use UTF-8. Paths and log
messages are always stored as UTF-8 inside Subversion anyway.

Post by H.-Dirk Schmitt
- Subversion is ignoring the machine locale settings which should
normally be the default if not overwritten in the Environment. This is
considerably bad behaviour for a linux/unix application.

Generally, I agree that unix applications should heed locale settings,
but servers are a special case.

As mentioned in http://subversion.apache.org/docs/release-notes/1.8.html#mod-dav-svn-utf8
the locale behaviour is the result of a policy decision made by the
Apache HTTPD project, namely that all Apache modules run in the "C"
locale and only the "C" locale, even if the system default locale is
something else! Apache HTTPD does not call the setlocale() function.
This is a reasonable trade-off because locale-dependent behaviour could
potentially result in security issues in the webserver. And therefore,
having a webserver module like mod_dav_svn fiddle with the locale and/or
the environment of the running server would be frowned upon.

Hook scripts are generally only interested in the character set
anyway, i.e. LC_CTYPE. All the other locale settings (LC_TIME,
LC_MESSAGES, LC_NUMERIC, etc.) are not critical for hook scripts.

So we added a custom UTF-8 option to mod_dav_svn to allow SVN users to
configure hook script environments in a way that the default HTTPD
behaviour won't allow for, and to set the character set to UTF-8.
Environment variables set this way are only seen by hook scripts and
do not affect the HTTPD server in any way.
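
Going by the 1.8 release notes linked above, this comes down to two settings: `SVNUseUTF8 On` in the mod_dav_svn configuration and a `LANG` entry in the repository's `conf/hooks-env` file. A sketch that writes the latter; the helper name `write_hooks_env` is hypothetical, the locale name `en_US.UTF-8` is assumed to exist on the server, and the function overwrites any existing file:

```shell
#!/bin/sh
# Write a minimal conf/hooks-env for the repository rooted at $1.
# The [default] section applies to every hook script of that repository.
# Warning: this replaces an existing hooks-env file.
write_hooks_env() {
    mkdir -p "$1/conf"
    cat > "$1/conf/hooks-env" <<'EOF'
[default]
LANG = en_US.UTF-8
EOF
}

# The matching server-side setting is not written by this sketch; it
# belongs in the httpd configuration for mod_dav_svn:
#   SVNUseUTF8 On
```

A typical call would be `write_hooks_env /path/to/repository`, with the path replaced by the real repository root.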

I believe this solution gives you the best of both worlds.

Note that using character sets other than ASCII in hook scripts was
impossible for many years. And the move from ASCII to UTF-8 did happen
a couple of years ago already. I don't think changing this behaviour
again would be worthwhile at this point.
See https://issues.apache.org/jira/browse/SVN-2487 in our bug database.