On 02.08.2017 11:25, Ben RUBSON wrote:
>> On 02 Aug 2017, at 11:17, André Warnier (tomcat) <email@example.com> wrote:
>> On 02.08.2017 10:59, Ben RUBSON wrote:
>>>> On 02 Aug 2017, at 10:52, André Warnier (tomcat) <firstname.lastname@example.org> wrote:
>>>> On 01.08.2017 19:30, Ben RUBSON wrote:
>>>>> The following UTF-8 :
>>>>> warn("warn with special char ééèè");
>>>>> $r->log->error("log with special char ééèè");
>>>>> Produces :
>>>>> warn with special char ééèè at ...
>>>>> [Tue Aug 01 19:25:28.914947 2017] [perl:error] [pid 56938] [client 127.0.0.1:59952] log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8
>>>>> Why all these \x symbols ?
>>>> These represent the *bytes* which correspond to the UTF-8 encoding of your "special" characters above. E.g. the character "é" has the Unicode codepoint 233 (decimal) or E9 (hexadecimal). When encoded using the UTF-8 encoding, this is represented by 2 bytes C3 A9 (hexadecimal). The "\x" prefix is a common way to indicate that the symbols which follow should be interpreted as a hexadecimal number.
>>>> The exact reason why $r->log->error chooses to represent these characters in such a way in the logfile (instead of just printing them as the bytes that constitute their UTF-8 encoding) is not really known to me, but I can make a guess :
>>>> Internally, perl "knows" that these characters are Unicode. But when it writes them out to a file (such as here the logfile of Apache), it does not necessarily know that this file itself is opened "in UTF-8 mode" and that it can just send the characters that way.
>>>> So it "escapes" them in a way that will make them readable by a human, no matter what (*).
>>>> And those are the \x.. (pure ASCII) representations that you see in the logfile.
>>>> On the other hand, the "warn()" that you also use above, that is perl writing directly to its STDERR. And because that is a file that perl opened itself, it knows that it can handle UTF-8, so it writes these characters directly that way.
>>>>> How to avoid them ?
>>>> In this case, I don't know, because it may depend on the way that Apache handles its logfiles, and not only on perl/mod_perl.
>>>> (*) for example, no matter which text editor you later use to view the logfile. All text editors can handle ASCII, but not necessarily UTF-8.
>>>> Ah, and I just saw your follow-up message, and between that and the above, we should have some reasonable explanation together.
>>> Thank you very much for your detailed answer André !
>>> Yes Perl must certainly escape UTF-8 characters as you just explained.
>>> If we convert the string to ASCII first (using Encode), these special characters are still not displayed correctly, this time because of Apache's ap_escape_errorlog_item() function.
>>> Best thing is then to avoid them :)
>> Unfortunately, this is not an option when applications have to deal with multiple languages, and maybe log some important data that just is "not English" (like names of people, or filenames that people use).
>> And unfortunately too, that is an issue which often does not seem so important to a lot of native-English-speaking programmers, who tend to consider such characters as indeed "special" and get very confused by them. To 80% of the people on earth, such characters are not "special" at all; they are an integral part of their language, just like "a" or "b" are an integral part of the English language. Hell, I can't even write my own name correctly without them ! (and neither can a multitude of websites and email programs, still today. I still get called Andr~O or similar all the time).
> Yes you're right, this is an issue if we need to log things such as user input.
> Supporting the extended ASCII table (up to decimal 255) would at least help a little.
> We would then be able to correctly log 'André' :)
> But many characters would still not be supported...
One thing to say, is that the current way in which Apache handles its logfiles, at least
logs these "extended" characters, without generating an error, and without corrupting or
losing data (the bytes composing the correct UTF-8 encoded characters are there, even if
they are difficult to read by a human). That's better than having some undecipherable ""
or "?" replacement symbol.
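Since the bytes are all there, the escaped form can be turned back into readable text
after the fact. Here is a minimal Perl sketch of that idea, assuming that the logged
string was valid UTF-8 before it was escaped. The name unescape_log_line is a
hypothetical helper of mine, not something provided by Apache or mod_perl, and it only
handles the \xHH escapes shown earlier in this thread (not, for example, an escaped
backslash).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the \xHH escapes found in an Apache error-log
# line back into text. Assumes the escaped bytes were valid UTF-8.
sub unescape_log_line {
    my ($line) = @_;
    # Replace each \xHH escape with the single byte it stands for.
    (my $bytes = $line) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    # Interpret the resulting byte string as UTF-8 and return characters.
    return decode('UTF-8', $bytes);
}

# The message part of the log line quoted earlier in this thread.
my $logged = 'log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8';
my $text   = unescape_log_line($logged);

# Tell perl that STDOUT expects UTF-8 before printing the decoded text.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

Something like this could be used as a filter when reading the logfile, so the file
itself stays pure ASCII while a human still gets to see the original characters.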
But I guess that a "proper" or "better" solution would be for Apache to write the logs
(always/optionally) in Unicode/UTF-8, and be able to tell mod_perl's $r->log->error() that
this is the case (or maybe that would not even be necessary then).
One problem with this, however, might be the many "log analysis" programs that
exist (AWStats, for example), which may not currently be able to handle this.
I'll try to float the idea on the Apache httpd list.