Mailing List Archive

Matching multi-character folds
This email is best viewed under utf8.

The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multiple character sequence
when case is ignored.

One of these, often mentioned on this list, is the German lower case
sharp s, ß (U+00DF): 'ss' =~ /ß/i is true.

And perl does currently work that way if and only if the ß is stored in
utf8. For the purposes of this email, I'm assuming all strings are in utf8.

In a recent email, Yves has said that he thinks it is debatable whether
or not it should work this way. My own view is that they should match,
and it is beyond debate that the utf8ness of the strings should not
matter. To quote from the perltodo: "The handling of Unicode is unclean
in many places. For example, the regexp engine matches in Unicode
semantics whenever the string or the pattern is flagged as UTF-8, but
that should not be dependent on an internal storage detail of the
string. Likewise, case folding behaviour is dependent on the UTF8
internal flag being on or off."

To start the discussion about the multi-char folds, I give examples of
the various types defined in the standard. The first case is that of ß.

Another case is ligatures (they don't view ß as a ligature, and I don't
know why). So 'fi' =~ /ﬁ/i is true. (U+FB01)

Another case is where there is no single precomposed upper or title
case character corresponding to a lower case one. For instance
LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)

Still another case is lower case Greek letters with an iota subscript or
an iota adscript. I won't put in an example.

And the final cases all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't
support in Unicode.

I think it is more correct for these things to match than not.
However, I'm not so sure when things are put in a character class. What
should /[ß]/i match? I'm tempted to say not 'ss' because character
classes match only a single character. But with the J with caron, that
really is like a single character, with the caron really just a
modifier. For that I'm tempted to say yes 'ǰ' =~ /[ǰ]/i. The problem
is that the concept of a character class doesn't fit with the Unicode
ideas. I haven't done any research as to what other languages, etc do.

Would you like to know what happens today in perl? Well I'll tell you
anyway. 'ss' =~ /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every
other multi-char fold returns false. This in fact may be the only time
in perl history, savor the moment, when the infamous ß gives an arguably
more correct result than other characters.

Now the code in regcomp.c takes special pains to make all these match.
But it doesn't work, except in the [ß] case. So we don't have to worry
about breaking existing code if we decide it should work differently.

Let's look at it the other direction. Should 'ß' =~ /ss/i ? Should 'ǰ'
=~ /ǰ/i ? They both are true currently. However, things like 'ß' =~
/s{2}/i are false, and that seems inconsistent.

So, I'm not sure what the right answers are, but things are broken today.
Matching multi-character folds
This email is best viewed under utf8.

The Unicode standard lists several different cases where a character (or
code point if you prefer) should match a multiple character sequence
when case is ignored.

One of these, often mentioned on this list, is the German lower case
sharp s, ß (U+00DF): 'ss' =~ /ß/i is true.

And perl does currently work that way if and only if the ß is stored in
utf8. For the purposes of this email, I'm assuming all strings are in utf8.
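
A minimal sketch of that storage dependence, assuming a modern perl (5.14 or later, where the /d and /u modifiers exist; neither existed when this was written). The helper name matches_ss is hypothetical. /d requests the legacy rules under which the answer depends on how the pattern is stored, and /u requests Unicode rules regardless.

```perl
use strict;
use warnings;

# Hypothetical helper: does 'ss' match the given sharp-s pattern string
# caselessly under the legacy, storage-dependent (/d) rules?
sub matches_ss {
    my ($pattern) = @_;
    return 'ss' =~ /\A$pattern\z/di ? 1 : 0;
}

my $native   = "\xDF";       # sharp s stored as a single byte
my $upgraded = "\xDF";
utf8::upgrade($upgraded);    # same character, stored internally as UTF-8

printf "native sharp s:   %s\n", matches_ss($native)   ? 'match' : 'no match';
printf "upgraded sharp s: %s\n", matches_ss($upgraded) ? 'match' : 'no match';

# /u asks for Unicode rules however the pattern happens to be stored.
printf "native sharp s with /u: %s\n",
    ('ss' =~ /\A\xDF\z/iu) ? 'match' : 'no match';
```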

In a recent email, Yves has said that he thinks it is debatable whether
or not it should work this way. My own view is that they should match.
It is beyond debate that the utf8ness of the strings should not matter.
To quote from the perltodo: "The handling of Unicode is unclean in
many places. For example, the regexp engine matches in Unicode semantics
whenever the string or the pattern is flagged as UTF-8, but that should
not be dependent on an internal storage detail of the string. Likewise,
case folding behaviour is dependent on the UTF8 internal flag being on
or off."

Yves has submitted an RFC for the first part of that statement, and I'm
now going to talk about the second. I believe we have established that
there will be a new mode of operation, which will become the default in
5.12, in which characters in the 128-255 range case-fold as the Unicode
standard says. But there are some issues with multi-char folds generally
(the only one in that range being ß).

To start the discussion about the multi-char folds, I give examples of
the various types defined in the standard. The first type is that of ß.

Another type is ligatures (they don't view ß as a ligature, and I don't
know why). So 'fi' =~ /ﬁ/i is true. (U+FB01)

Another type is where there is no single precomposed upper or title
case character corresponding to a lower case one. For instance
LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)

Still another type is lower case Greek letters with an iota subscript or
an iota adscript. I won't put in an example.

And the final types all have to do with putting a combining dot above i
and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't
support in Unicode.

I think it is more correct for these things to match than not.
However, I'm not so sure when things are put in a character class. What
should /[ß]/i match? I'm tempted to say not 'ss' because character
classes match only a single character. But with the J with caron, that
really is like a single character, with the caron just a modifier. For
that I'm tempted to say yes 'ǰ' =~ /[ǰ]/i. The problem is that the
concept of a character class doesn't fit with the Unicode ideas. I
haven't done any research as to what other languages, etc do.

Would you like to know what happens today in perl? Well I'll tell you
anyway. 'ss' =~ /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every
other multi-char case-insensitive fold returns false. This in fact may be
the only time in perl history, savor the moment, when the infamous ß
gives an arguably more correct result than other characters.

The code in regcomp.c takes special pains to make all these match. But
it doesn't work, except in the [ß] case. So we don't have to worry
about breaking existing code if we decide it should work differently.

Let's look at it the other direction. Should 'ß' =~ /ss/i ? Should 'ǰ'
=~ /ǰ/i ? They both are true currently. However, things like 'ß' =~
/s{2}/i are false, and that seems inconsistent.
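
The reverse direction can be probed with a short sketch like the one below, assuming a modern perl with the /u modifier (5.14 or later). The helper names are hypothetical. It prints whether a single sharp s matches the literal sequence and the quantified form, without presuming what any particular perl version answers.

```perl
use strict;
use warnings;

# Hypothetical helpers: one sharp s against a literal 'ss' and against a
# quantified 's{2}', both caselessly and under Unicode rules.
sub matches_literal    { my ($s) = @_; return $s =~ /\Ass\z/iu   ? 1 : 0 }
sub matches_quantified { my ($s) = @_; return $s =~ /\As{2}\z/iu ? 1 : 0 }

my $sharp_s = "\xDF";    # LATIN SMALL LETTER SHARP S

print "sharp s =~ /ss/i   : ", matches_literal($sharp_s),    "\n";
print "sharp s =~ /s{2}/i : ", matches_quantified($sharp_s), "\n";
```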

So, I'm not sure what the right answers are, but things are somewhat
broken today, and I'd like to get clarity on how it should work.
Re: Matching multi-character folds
2008/11/23 karl williamson <public@khwilliamson.com>:
> This email is best viewed under utf8.
>
> The Unicode standard lists several different cases where a character (or
> code point if you prefer) should match a multiple character sequence when
> case is ignored.
>
> One of these is the oft mentioned in this list, German lower case sharp
> s or ß. 'ss' =~ /ß/i is true. (U+00DF)

0xDF is the only multi-codepoint folding character in the latin-1 range.

Also, 0xDF is a "trickyfold" character, meaning that it can match
something that is longer (in terms of bytes) folded than unfolded.

> And perl does currently work that way if and only if the ß is stored in
> utf8. For the purposes of this email, I'm assuming all strings are in utf8.


> In a recent email, Yves has said that he thinks it is debatable whether or
> not it should work this way. My own view is that they should match, and it
> is beyond debate that the utf8ness of the strings should matter or not. To
> quote from the perltodo: "The handling of Unicode is unclean in many places.
> For example, the regexp engine matches in Unicode semantics whenever the
> string or the pattern is flagged as UTF-8, but that should not be dependent
> on an internal storage detail of the string. Likewise, case folding
> behaviour is dependent on the UTF8 internal flag being on or off."

What do you mean by "beyond debate" here?

Seems to me that there is a debate about whether unencoded
nonlocalized strings should be treated as ascii or as latin-1, and if
treated as latin-1 whether they should obey unicode foldcasing rules
or not.

>
> To start the discussion about the multi-char folds, I give examples of the
> various types defined in the standard. The first case is that of ß.
>
> Another case is ligatures (they don't view ß as a ligature, and I don't
> know why) So 'fi' =~ /fi/i is true. (U+FB01)

Prompted by your comment about 'ß' I did some searching for
information on ligatures and unicode and I was surprised how little
there was. The only ligature support seems to be for legacy conversion
reasons (for instance latin-1 equivalency), and it seems that
ligatures are considered to be a presentation issue better left up to
the font and the font rendering engine. A good discussion being this:

http://unicode.org/faq/ligature_digraph.html

When I checked the unicode data files I didn't find anything about
ligatures outside of certain character names including the word
'LIGATURE', and some comments and commentary files mentioning that
some characters are ligatures. So I'm wondering what you were getting
at when you said "they don't view ß as a ligature, and I don't know
why".

> Another case is where there there is no corresponding upper or title
> case single precomposed character to a lower case one. For instance
> LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)
>
> Still another case is lower Greek letters with a iota-subscript or a
> iota adscript. I won't put in an example.
>
> And the final cases all have to do with putting a combining dot above i
> and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't support
> in Unicode.
>
> I think it is more correct for these things to match than not. However, I'm
> not so sure when things are put in a character class. What should /[ß]/i
> match? I'm tempted to say not 'ss' because character classes match only a
> single character. But with the J with caron, that really is like a single
> character, with the caron really just a modifier. For that I'm tempted to
> say yes 'ǰ' =~ /[ǰ]/i. The problem is that the concept of a character
> class doesn't fit with the Unicode ideas. I haven't done any research as to
> what other languages, etc do.
>
> Would you like to know what happens today in perl? Well I'll tell you
> anyway. /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every other

I can't repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
I can tell.

What doesn't work is

fold('ǰ') =~ /[ǰ]/i

where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
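
fold() here is notation rather than a perl builtin. As a minimal sketch, assuming perl 5.16 or later, the fc() builtin computes the full case fold, and the hypothetical helper fold_hex below prints it as hex code points.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs perl 5.16 or later

# Hypothetical helper: full case fold of one code point, shown as
# space-separated hex code points.
sub fold_hex {
    my ($code_point) = @_;
    my $folded = fc(chr $code_point);
    return join ' ', map { sprintf '%04X', ord($_) } split //, $folded;
}

# U+01F0 LATIN SMALL LETTER J WITH CARON
print fold_hex(0x1F0), "\n";
```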

> multi-char fold returns false. This in fact may be the only time in perl
> history, savor the moment, when the infamous ß gives an arguably more
> correct result than other characters.

Hmm. Interesting. I can't decide whether to be happy about this, or sad.

>
> Now the code in regcomp.c takes special pains to make all these match. But
> it doesn't work, except in the [ß] case. So we don't have to worry about
> breaking existing code if we decide it should work differently.
>
> Let's look at it the other direction. Should ß =~ /ss/i ? Should 'ǰ' =~
> /ǰ/i ? They both are true currently. However, things like ß =~ /s{2}/i is
> false, and that seems inconsistent.
>
>
> So, I'm not sure what the right answers are, but things are broken today.
>

Yes, things are. I wrote the attached hacky script to parse
CaseFolding.txt and test all the complex folding rules. The output is
below. The codes 'll', 'lu', 'ul' and 'uu' stand for 'latin' and
'unicode': the first letter gives the encoding of the string and the
second that of the pattern. The description on the right is the test,
with chars shown as hex code points, separated by spaces in the folded
string. The output on 5.8.9 looks different, with more mistakes.

demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib test_case_folding.pl
LATIN SMALL LETTER SHARP S
ll '0073 0073' =~ /00DF/i
ll, ul, uu '0073 0073' =~ /[00DF]/i
LATIN CAPITAL LETTER I WITH DOT ABOVE
uu '0069 0307' =~ /[0130]/i
LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
uu '02BC 006E' =~ /[0149]/i
LATIN SMALL LETTER J WITH CARON
uu '006A 030C' =~ /[01F0]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
uu '03B9 0308 0301' =~ /[0390]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
uu '03C5 0308 0301' =~ /[03B0]/i
ARMENIAN SMALL LIGATURE ECH YIWN
uu '0565 0582' =~ /[0587]/i
LATIN SMALL LETTER H WITH LINE BELOW
uu '0068 0331' =~ /[1E96]/i
LATIN SMALL LETTER T WITH DIAERESIS
uu '0074 0308' =~ /[1E97]/i
LATIN SMALL LETTER W WITH RING ABOVE
uu '0077 030A' =~ /[1E98]/i
LATIN SMALL LETTER Y WITH RING ABOVE
uu '0079 030A' =~ /[1E99]/i
LATIN SMALL LETTER A WITH RIGHT HALF RING
uu '0061 02BE' =~ /[1E9A]/i
LATIN CAPITAL LETTER SHARP S
lu, uu '0073 0073' =~ /[1E9E]/i
GREEK SMALL LETTER UPSILON WITH PSILI
uu '03C5 0313' =~ /[1F50]/i
GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
uu '03C5 0313 0300' =~ /[1F52]/i
GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
uu '03C5 0313 0301' =~ /[1F54]/i
GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
uu '03C5 0313 0342' =~ /[1F56]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
uu '1F00 03B9' =~ /[1F80]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
uu '1F01 03B9' =~ /[1F81]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
uu '1F02 03B9' =~ /[1F82]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
uu '1F03 03B9' =~ /[1F83]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
uu '1F04 03B9' =~ /[1F84]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
uu '1F05 03B9' =~ /[1F85]/i
GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F06 03B9' =~ /[1F86]/i
GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F07 03B9' =~ /[1F87]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
uu '1F00 03B9' =~ /[1F88]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
uu '1F01 03B9' =~ /[1F89]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
uu '1F02 03B9' =~ /[1F8A]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
uu '1F03 03B9' =~ /[1F8B]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
uu '1F04 03B9' =~ /[1F8C]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
uu '1F05 03B9' =~ /[1F8D]/i
GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F06 03B9' =~ /[1F8E]/i
GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F07 03B9' =~ /[1F8F]/i
GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
uu '1F20 03B9' =~ /[1F90]/i
GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
uu '1F21 03B9' =~ /[1F91]/i
GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
uu '1F22 03B9' =~ /[1F92]/i
GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
uu '1F23 03B9' =~ /[1F93]/i
GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
uu '1F24 03B9' =~ /[1F94]/i
GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
uu '1F25 03B9' =~ /[1F95]/i
GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F26 03B9' =~ /[1F96]/i
GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F27 03B9' =~ /[1F97]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
uu '1F20 03B9' =~ /[1F98]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
uu '1F21 03B9' =~ /[1F99]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
uu '1F22 03B9' =~ /[1F9A]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
uu '1F23 03B9' =~ /[1F9B]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
uu '1F24 03B9' =~ /[1F9C]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
uu '1F25 03B9' =~ /[1F9D]/i
GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F26 03B9' =~ /[1F9E]/i
GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F27 03B9' =~ /[1F9F]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
uu '1F60 03B9' =~ /[1FA0]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
uu '1F61 03B9' =~ /[1FA1]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
uu '1F62 03B9' =~ /[1FA2]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
uu '1F63 03B9' =~ /[1FA3]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
uu '1F64 03B9' =~ /[1FA4]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
uu '1F65 03B9' =~ /[1FA5]/i
GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F66 03B9' =~ /[1FA6]/i
GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
uu '1F67 03B9' =~ /[1FA7]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
uu '1F60 03B9' =~ /[1FA8]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
uu '1F61 03B9' =~ /[1FA9]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
uu '1F62 03B9' =~ /[1FAA]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
uu '1F63 03B9' =~ /[1FAB]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
uu '1F64 03B9' =~ /[1FAC]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
uu '1F65 03B9' =~ /[1FAD]/i
GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F66 03B9' =~ /[1FAE]/i
GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
uu '1F67 03B9' =~ /[1FAF]/i
GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
uu '1F70 03B9' =~ /[1FB2]/i
GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
uu '03B1 03B9' =~ /[1FB3]/i
GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
uu '03AC 03B9' =~ /[1FB4]/i
GREEK SMALL LETTER ALPHA WITH PERISPOMENI
uu '03B1 0342' =~ /[1FB6]/i
GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
uu '03B1 0342 03B9' =~ /[1FB7]/i
GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
uu '03B1 03B9' =~ /[1FBC]/i
GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
uu '1F74 03B9' =~ /[1FC2]/i
GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
uu '03B7 03B9' =~ /[1FC3]/i
GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
uu '03AE 03B9' =~ /[1FC4]/i
GREEK SMALL LETTER ETA WITH PERISPOMENI
uu '03B7 0342' =~ /[1FC6]/i
GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
uu '03B7 0342 03B9' =~ /[1FC7]/i
GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
uu '03B7 03B9' =~ /[1FCC]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
uu '03B9 0308 0300' =~ /[1FD2]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
uu '03B9 0308 0301' =~ /[1FD3]/i
GREEK SMALL LETTER IOTA WITH PERISPOMENI
uu '03B9 0342' =~ /[1FD6]/i
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
uu '03B9 0308 0342' =~ /[1FD7]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
uu '03C5 0308 0300' =~ /[1FE2]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
uu '03C5 0308 0301' =~ /[1FE3]/i
GREEK SMALL LETTER RHO WITH PSILI
uu '03C1 0313' =~ /[1FE4]/i
GREEK SMALL LETTER UPSILON WITH PERISPOMENI
uu '03C5 0342' =~ /[1FE6]/i
GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
uu '03C5 0308 0342' =~ /[1FE7]/i
GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
uu '1F7C 03B9' =~ /[1FF2]/i
GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
uu '03C9 03B9' =~ /[1FF3]/i
GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
uu '03CE 03B9' =~ /[1FF4]/i
GREEK SMALL LETTER OMEGA WITH PERISPOMENI
uu '03C9 0342' =~ /[1FF6]/i
GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
uu '03C9 0342 03B9' =~ /[1FF7]/i
GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
uu '03C9 03B9' =~ /[1FFC]/i
LATIN SMALL LIGATURE FF
lu, uu '0066 0066' =~ /[FB00]/i
LATIN SMALL LIGATURE FI
lu, uu '0066 0069' =~ /[FB01]/i
LATIN SMALL LIGATURE FL
lu, uu '0066 006C' =~ /[FB02]/i
LATIN SMALL LIGATURE FFI
lu, uu '0066 0066 0069' =~ /[FB03]/i
LATIN SMALL LIGATURE FFL
lu, uu '0066 0066 006C' =~ /[FB04]/i
LATIN SMALL LIGATURE LONG S T
lu, uu '0073 0074' =~ /[FB05]/i
LATIN SMALL LIGATURE ST
lu, uu '0073 0074' =~ /[FB06]/i
ARMENIAN SMALL LIGATURE MEN NOW
uu '0574 0576' =~ /[FB13]/i
ARMENIAN SMALL LIGATURE MEN ECH
uu '0574 0565' =~ /[FB14]/i
ARMENIAN SMALL LIGATURE MEN INI
uu '0574 056B' =~ /[FB15]/i
ARMENIAN SMALL LIGATURE VEW NOW
uu '057E 0576' =~ /[FB16]/i
ARMENIAN SMALL LIGATURE MEN XEH
uu '0574 056D' =~ /[FB17]/i
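
The attached script is not reproduced in this archive. What follows is only a rough sketch of the same idea, not that script: it parses a few embedded sample lines in the CaseFolding.txt format, keeps the status F (full, multi-character) folds, and tries each folded sequence against its single character, both bare and in a character class. The sub name full_folds and the output layout are assumptions, and only the 'uu' combination is exercised.

```perl
use strict;
use warnings;

# Sample lines in the CaseFolding.txt layout: code; status; mapping; # name
my @sample = (
    '00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S',
    '01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON',
    'FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI',
    '0041; C; 0061; # LATIN CAPITAL LETTER A',
);

# Keep only status F entries; returns a list of hash refs.
sub full_folds {
    my @folds;
    for my $line (@_) {
        my ($code, $status, $mapping, $name) =
            $line =~ /\A([0-9A-F]+); ([CFST]); ([0-9A-F ]+); # (.+)\z/
            or next;
        next unless $status eq 'F';
        push @folds, { code => $code, mapping => $mapping, name => $name };
    }
    return @folds;
}

for my $f (full_folds(@sample)) {
    my $char   = chr(hex($f->{code}));
    my $folded = join '', map { chr(hex($_)) } split ' ', $f->{mapping};
    utf8::upgrade($_) for $char, $folded;    # string and pattern both utf8

    my $bare  = $folded =~ /\A$char\z/i   ? 'ok' : 'FAIL';
    my $class = $folded =~ /\A[$char]\z/i ? 'ok' : 'FAIL';
    print "$f->{name}\n";
    print "  bare: $bare  class: $class  '$f->{mapping}' =~ /[$f->{code}]/i\n";
}
```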


So it's clear that multi-codepoint character class folding is broken
for some definition of expected behaviour.

I personally consider character class notation to be an abbreviation
of alternation. So a character class [xyz] is supposed to match the
same thing as (x|y|z). This implies that character classes have to be
able to match more than one character under case-folding rules. A lot
of external logic and at least some internal logic operates under this
assumption, so I don't think we can change it.
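
A small sketch for comparing the two spellings side by side, assuming a modern perl with the /u modifier. The helper names are hypothetical. It prints, for a few strings, whether the class and the equivalent alternation agree.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same two choices (sharp s or 'a') written as a
# character class and as an alternation, matched caselessly.
sub class_matches       { my ($s) = @_; return $s =~ /\A[\xDFa]\z/iu     ? 1 : 0 }
sub alternation_matches { my ($s) = @_; return $s =~ /\A(?:\xDF|a)\z/iu ? 1 : 0 }

for my $s ('a', 'A', "\xDF", 'ss', 'SS') {
    my $label = $s eq "\xDF" ? 'U+DF' : $s;
    printf "%-4s class=%d alternation=%d\n",
        $label, class_matches($s), alternation_matches($s);
}
```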

cheers,
Yves




--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: Matching multi-character folds
* karl williamson <public@khwilliamson.com> [2008-11-23 04:50]:
> Another case is ligatures (they don't view ß as a ligature, and
> I don't know why) So 'fi' =~ /fi/i is true. (U+FB01)

Because “ﬁ” is always exactly equivalent with “fi” and they can
be mechanically substituted for one another without affecting the
correctness of the text.

In contradistinction, there is a whole range of German words
which are correct only when written with “ss” (“Wasser”) and also
a few examples which are correct only when written with “ß”
(examples escape me right now, since this case is rarer and the
orthography reform has muddied the waters).

In this light it should be noted that back when “ß” was in fact a
ligature, it was actually equivalent to “sz” rather than “ss”.
You can see this in the glyph shape if you trace its evolution
over time. There’s a great article about it on the Typefoundry
blog, <http://typefoundry.blogspot.com/2008/01/esszett-or.html>,
which includes pictorial evidence.

So there you have it: “ß” (today) is not a ligature.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: Matching multi-character folds
demerphq wrote:
> 2008/11/23 karl williamson <public@khwilliamson.com>:
>> This email is best viewed under utf8.
>>
>> The Unicode standard lists several different cases where a character (or
>> code point if you prefer) should match a multiple character sequence when
>> case is ignored.
>>
>> One of these is the oft mentioned in this list, German lower case sharp
>> s or ß. 'ss' =~ /ß/i is true. (U+00DF)
>
> 0xDF is the only multi-codepoint folding character in the latin-1 range.
>
> Also 0xDF is a "trickyfold" character meaning, that it can match
> something of longer length (in terms of bytes) folded than unfolded.
>
There must be more to it than that, as the code indicates there are only
three tricky fold characters, yet there are more that fit this
definition. For example U+023A which takes 2 bytes in UTF-8 folds to
U+2C65 which takes 3. They seem to work.
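
The byte counts can be checked with a sketch like this one, assuming perl 5.16 or later for fc(). The helper names utf8_bytes and fold_growth are hypothetical.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);    # fc() needs perl 5.16 or later

# Number of bytes a string occupies once encoded as UTF-8.
sub utf8_bytes {
    my ($str) = @_;
    utf8::encode($str);    # acts on our copy: characters become UTF-8 bytes
    return length $str;
}

# Hypothetical helper: (bytes unfolded, bytes folded) for one code point.
sub fold_growth {
    my ($code_point) = @_;
    my $char = chr $code_point;
    return (utf8_bytes($char), utf8_bytes(fc($char)));
}

for my $code_point (0xDF, 0x23A) {
    my ($before, $after) = fold_growth($code_point);
    printf "U+%04X: %d byte(s) unfolded, %d folded\n",
        $code_point, $before, $after;
}
```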

>> And perl does currently work that way if and only if the ß is stored in
>> utf8. For the purposes of this email, I'm assuming all strings are in utf8.
>
>
>> In a recent email, Yves has said that he thinks it is debatable whether or
>> not it should work this way. My own view is that they should match, and it
>> is beyond debate that the utf8ness of the strings should matter or not. To
>> quote from the perltodo: "The handling of Unicode is unclean in many places.
>> For example, the regexp engine matches in Unicode semantics whenever the
>> string or the pattern is flagged as UTF-8, but that should not be dependent
>> on an internal storage detail of the string. Likewise, case folding
>> behaviour is dependent on the UTF8 internal flag being on or off."
>
> What do you mean by "beyond debate" here?
>
> Seems to me that there is a debate about whether unencoded
> nonlocalized strings should be treated as ascii or as latin-1, and if
> treated as latin-1 whether they should obey unicode foldcasing rules
> or not.
>
I thought that was settled. While you were taking a break from p5p, I
naively came in and started a discussion on it (there are various
threads, but most include [perl #58182] in the subject). There was
agreement that they should match Unicode and I gave a very detailed
proposal which the 5.12 pumpking said sounded reasonable. It was
pointed out that perl5100delta says:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

Similarly in perltodo, as I quoted in the first email on this thread:
"that should not be dependent on an internal storage detail of the
string" meaning the utf8ness of a string should not affect its external
semantics.

It seems clear that it's been agreed that the utf8ness of a string
should not affect its external behavior. So what should the behavior
be? It has to be the Unicode behavior, for otherwise, the characters
between 128 and 255 would never behave like Unicode.

There are 3 main areas where things don't work. (I believe that the
problems with pack() have been fixed.)

1. uc(), lcfirst(), \U, etc. I have submitted for review code that
gives the same semantics for these whether or not the string is in
utf8.

2. \w, [:graph:], etc re matching. I think the solution to this is in
your RFC to make these just match ASCII or the current locale. Then the
utf8ness won't matter, except if someone's string gets converted to
utf8, and then their locale most likely won't work properly. That is
why I said in an earlier email that I don't think strings should be
upgraded to utf8 when "use locale" is in effect. The RFC also solves
the problem of, for example, \d matching things the programmer never
intended, just because the string silently, somehow, got changed to
utf8. My proposal that I thought had been accepted was, for example, to
make \w match the appropriate Latin1 characters even when not in utf8.
And I had working experimental code to do that. But I think your RFC
makes more sense.

3. Caseless re matching, m/.../i. Again, perl has to change so that the
utf8ness of the pattern doesn't matter. One could do it by adding
modifiers, as you originally suggested, like /u to force unicode
semantics. But I think you had pulled away from that idea. I would be
open to something like that, but I think there has to be a way for a
programmer to make that the default, without forcing them to always
remember to add the modifier. Or one could do it by having the re code
know about latin1 semantics. Again, I have mostly working code which
doesn't change regcomp.c very much that does this. I do think overall
that this is a better solution than the modifier one. One consideration
I have that has been mentioned in the documentation is that latin1
should be faster than utf8. I think Tom may have said that he didn't
find that to be the case in his experiments.
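
For comparison, here is a minimal sketch of how the three areas can be probed on a later perl. It assumes 5.14 or later, where the unicode_strings feature and the /a and /u modifiers exist; none of these was available when this was written, and the helper names are hypothetical.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# 1. Case mapping: compare the result for both internal storage forms.
sub uc_hex {
    my ($str) = @_;
    return join ' ', map { sprintf '%04X', ord($_) } split //, uc($str);
}
my $native   = "\xE0";    # LATIN SMALL LETTER A WITH GRAVE
my $upgraded = "\xE0";
utf8::upgrade($upgraded);
print "uc: ", uc_hex($native), " / ", uc_hex($upgraded), "\n";

# 2. Character classes: /a restricts \w to ASCII, /u uses Unicode rules.
sub is_word_ascii   { my ($s) = @_; return $s =~ /\A\w\z/a ? 1 : 0 }
sub is_word_unicode { my ($s) = @_; return $s =~ /\A\w\z/u ? 1 : 0 }
print "\\w on U+00E9: /a=", is_word_ascii("\xE9"),
    " /u=", is_word_unicode("\xE9"), "\n";

# 3. Caseless matching: /u applies Unicode folding to 128-255 as well.
sub caseless_eq {
    my ($string, $pattern) = @_;
    return $string =~ /\A\Q$pattern\E\z/iu ? 1 : 0;
}
print "U+00E0 vs U+00C0 under /iu: ", caseless_eq("\xE0", "\xC0"), "\n";
```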

>> To start the discussion about the multi-char folds, I give examples of the
>> various types defined in the standard. The first case is that of ß.
>>
>> Another case is ligatures (they don't view ß as a ligature, and I don't
>> know why) So 'fi' =~ /fi/i is true. (U+FB01)
>
> Prompted by your comment about 'ß' I did some searching for
> information on ligatures and unicode and I was surprised how little
> there was. The only ligature support seems to be for legacy conversion
> reasons (for instance latin-1 equivalancy), and it seems that
> ligatures are considered to be a presentation issue better left up to
> the font and the font rendering engine. A good discussion being this:
>
> http://unicode.org/faq/ligature_digraph.html
>
> When I checked the unicode data files I didn't find anything about
> ligatures outside of certain character names including the word
> 'LIGATURE', and some comments and commentary files mentioning that
> some characters are ligatures. So I'm wondering what you were getting
> at when you said "they don't view ß as a ligature, and I don't know
> why".
>
My source for that was lib/unicore/SpecialCasing.txt

>> Another case is where there there is no corresponding upper or title
>> case single precomposed character to a lower case one. For instance
>> LATIN SMALL LETTER J WITH CARON, so 'ǰ' =~ /ǰ/i is true. (U+01F0)
>>
>> Still another case is lower Greek letters with a iota-subscript or a
>> iota adscript. I won't put in an example.
>>
>> And the final cases all have to do with putting a combining dot above i
>> and j in Azeri, Turkish, and Lithuanian locales, which perl doesn't support
>> in Unicode.
>>
>> I think it is more correct for these things to match than not. However, I'm
>> not so sure when things are put in a character class. What should /[ß]/i
>> match? I'm tempted to say not 'ss' because character classes match only a
>> single character. But with the J with caron, that really is like a single
>> character, with the caron really just a modifier. For that I'm tempted to
>> say yes 'ǰ' =~ /[ǰ]/i. The problem is that the concept of a character
>> class doesn't fit with the Unicode ideas. I haven't done any research as to
>> what other languages, etc do.
>>
>> Would you like to know what happens today in perl? Well I'll tell you
>> anyway. /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every other
>
> I cant repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
> i can tell.
>
> What doesnt work is
>
> fold('ǰ') =~ /[ǰ]/i
>
> where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
>

I don't understand. I just tested again with the perl I have on my
machine that I think is today's bleadperl, and it failed. But in any
event as you agree below, there are a number of things broken.

>> multi-char fold returns false. This in fact may be the only time in perl
>> history, savor the moment, when the infamous ß gives an arguably more
>> correct result than other characters.
>
> Hmm. Interesting. I cant decide to be happy about this, or sad.
>
The only reason it works is that single character char classes get
optimized out, and somehow it works. [ßa] doesn't work.

>> Now the code in regcomp.c takes special pains to make all these match. But
>> it doesn't work, except in the [ß] case. So we don't have to worry about
>> breaking existing code if we decide it should work differently.
>>
>> Let's look at it the other direction. Should ß =~ /ss/i ? Should 'ǰ' =~
>> /ǰ/i ? They both are true currently. However, things like ß =~ /s{2}/i is
>> false, and that seems inconsistent.
>>
>>
>> So, I'm not sure what the right answers are, but things are broken today.
>>
>
> Yes, things are. I wrote the attached hacky script to parse out
> CaseFolding.txt and test all the complex folding rules. The output is
> below, the 'll', 'lu','ul','uu' means, 'latin' and 'unicode', with the
> first letter representing the string, and the second the patterns
> encoding. The description on the right is the test, with chars
> represented by their hex representation, and separated by spaces in
> the case of the folded string. The output on 5.8.9 looks different,
> with more mistakes.
>
> demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib
> test_case_folding.pl
> LATIN SMALL LETTER SHARP S
> ll '0073 0073' =~ /00DF/i
> ll, ul, uu '0073 0073' =~ /[00DF]/i
> LATIN CAPITAL LETTER I WITH DOT ABOVE
> uu '0069 0307' =~ /[0130]/i
> LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> uu '02BC 006E' =~ /[0149]/i
> LATIN SMALL LETTER J WITH CARON
> uu '006A 030C' =~ /[01F0]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
> uu '03B9 0308 0301' =~ /[0390]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
> uu '03C5 0308 0301' =~ /[03B0]/i
> ARMENIAN SMALL LIGATURE ECH YIWN
> uu '0565 0582' =~ /[0587]/i
> LATIN SMALL LETTER H WITH LINE BELOW
> uu '0068 0331' =~ /[1E96]/i
> LATIN SMALL LETTER T WITH DIAERESIS
> uu '0074 0308' =~ /[1E97]/i
> LATIN SMALL LETTER W WITH RING ABOVE
> uu '0077 030A' =~ /[1E98]/i
> LATIN SMALL LETTER Y WITH RING ABOVE
> uu '0079 030A' =~ /[1E99]/i
> LATIN SMALL LETTER A WITH RIGHT HALF RING
> uu '0061 02BE' =~ /[1E9A]/i
> LATIN CAPITAL LETTER SHARP S
> lu, uu '0073 0073' =~ /[1E9E]/i
> GREEK SMALL LETTER UPSILON WITH PSILI
> uu '03C5 0313' =~ /[1F50]/i
> GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
> uu '03C5 0313 0300' =~ /[1F52]/i
> GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
> uu '03C5 0313 0301' =~ /[1F54]/i
> GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
> uu '03C5 0313 0342' =~ /[1F56]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
> uu '1F00 03B9' =~ /[1F80]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
> uu '1F01 03B9' =~ /[1F81]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
> uu '1F02 03B9' =~ /[1F82]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
> uu '1F03 03B9' =~ /[1F83]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
> uu '1F04 03B9' =~ /[1F84]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
> uu '1F05 03B9' =~ /[1F85]/i
> GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F06 03B9' =~ /[1F86]/i
> GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F07 03B9' =~ /[1F87]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
> uu '1F00 03B9' =~ /[1F88]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
> uu '1F01 03B9' =~ /[1F89]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
> uu '1F02 03B9' =~ /[1F8A]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
> uu '1F03 03B9' =~ /[1F8B]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
> uu '1F04 03B9' =~ /[1F8C]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
> uu '1F05 03B9' =~ /[1F8D]/i
> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F06 03B9' =~ /[1F8E]/i
> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F07 03B9' =~ /[1F8F]/i
> GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
> uu '1F20 03B9' =~ /[1F90]/i
> GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
> uu '1F21 03B9' =~ /[1F91]/i
> GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
> uu '1F22 03B9' =~ /[1F92]/i
> GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
> uu '1F23 03B9' =~ /[1F93]/i
> GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
> uu '1F24 03B9' =~ /[1F94]/i
> GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
> uu '1F25 03B9' =~ /[1F95]/i
> GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F26 03B9' =~ /[1F96]/i
> GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F27 03B9' =~ /[1F97]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
> uu '1F20 03B9' =~ /[1F98]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
> uu '1F21 03B9' =~ /[1F99]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
> uu '1F22 03B9' =~ /[1F9A]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
> uu '1F23 03B9' =~ /[1F9B]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
> uu '1F24 03B9' =~ /[1F9C]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
> uu '1F25 03B9' =~ /[1F9D]/i
> GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F26 03B9' =~ /[1F9E]/i
> GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F27 03B9' =~ /[1F9F]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
> uu '1F60 03B9' =~ /[1FA0]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
> uu '1F61 03B9' =~ /[1FA1]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
> uu '1F62 03B9' =~ /[1FA2]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
> uu '1F63 03B9' =~ /[1FA3]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
> uu '1F64 03B9' =~ /[1FA4]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
> uu '1F65 03B9' =~ /[1FA5]/i
> GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F66 03B9' =~ /[1FA6]/i
> GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
> uu '1F67 03B9' =~ /[1FA7]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
> uu '1F60 03B9' =~ /[1FA8]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
> uu '1F61 03B9' =~ /[1FA9]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
> uu '1F62 03B9' =~ /[1FAA]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
> uu '1F63 03B9' =~ /[1FAB]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
> uu '1F64 03B9' =~ /[1FAC]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
> uu '1F65 03B9' =~ /[1FAD]/i
> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F66 03B9' =~ /[1FAE]/i
> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
> uu '1F67 03B9' =~ /[1FAF]/i
> GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
> uu '1F70 03B9' =~ /[1FB2]/i
> GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
> uu '03B1 03B9' =~ /[1FB3]/i
> GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
> uu '03AC 03B9' =~ /[1FB4]/i
> GREEK SMALL LETTER ALPHA WITH PERISPOMENI
> uu '03B1 0342' =~ /[1FB6]/i
> GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
> uu '03B1 0342 03B9' =~ /[1FB7]/i
> GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
> uu '03B1 03B9' =~ /[1FBC]/i
> GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
> uu '1F74 03B9' =~ /[1FC2]/i
> GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
> uu '03B7 03B9' =~ /[1FC3]/i
> GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
> uu '03AE 03B9' =~ /[1FC4]/i
> GREEK SMALL LETTER ETA WITH PERISPOMENI
> uu '03B7 0342' =~ /[1FC6]/i
> GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
> uu '03B7 0342 03B9' =~ /[1FC7]/i
> GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
> uu '03B7 03B9' =~ /[1FCC]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
> uu '03B9 0308 0300' =~ /[1FD2]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
> uu '03B9 0308 0301' =~ /[1FD3]/i
> GREEK SMALL LETTER IOTA WITH PERISPOMENI
> uu '03B9 0342' =~ /[1FD6]/i
> GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
> uu '03B9 0308 0342' =~ /[1FD7]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
> uu '03C5 0308 0300' =~ /[1FE2]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
> uu '03C5 0308 0301' =~ /[1FE3]/i
> GREEK SMALL LETTER RHO WITH PSILI
> uu '03C1 0313' =~ /[1FE4]/i
> GREEK SMALL LETTER UPSILON WITH PERISPOMENI
> uu '03C5 0342' =~ /[1FE6]/i
> GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
> uu '03C5 0308 0342' =~ /[1FE7]/i
> GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
> uu '1F7C 03B9' =~ /[1FF2]/i
> GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
> uu '03C9 03B9' =~ /[1FF3]/i
> GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
> uu '03CE 03B9' =~ /[1FF4]/i
> GREEK SMALL LETTER OMEGA WITH PERISPOMENI
> uu '03C9 0342' =~ /[1FF6]/i
> GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
> uu '03C9 0342 03B9' =~ /[1FF7]/i
> GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
> uu '03C9 03B9' =~ /[1FFC]/i
> LATIN SMALL LIGATURE FF
> lu, uu '0066 0066' =~ /[FB00]/i
> LATIN SMALL LIGATURE FI
> lu, uu '0066 0069' =~ /[FB01]/i
> LATIN SMALL LIGATURE FL
> lu, uu '0066 006C' =~ /[FB02]/i
> LATIN SMALL LIGATURE FFI
> lu, uu '0066 0066 0069' =~ /[FB03]/i
> LATIN SMALL LIGATURE FFL
> lu, uu '0066 0066 006C' =~ /[FB04]/i
> LATIN SMALL LIGATURE LONG S T
> lu, uu '0073 0074' =~ /[FB05]/i
> LATIN SMALL LIGATURE ST
> lu, uu '0073 0074' =~ /[FB06]/i
> ARMENIAN SMALL LIGATURE MEN NOW
> uu '0574 0576' =~ /[FB13]/i
> ARMENIAN SMALL LIGATURE MEN ECH
> uu '0574 0565' =~ /[FB14]/i
> ARMENIAN SMALL LIGATURE MEN INI
> uu '0574 056B' =~ /[FB15]/i
> ARMENIAN SMALL LIGATURE VEW NOW
> uu '057E 0576' =~ /[FB16]/i
> ARMENIAN SMALL LIGATURE MEN XEH
> uu '0574 056D' =~ /[FB17]/i
>

What Yves didn't mention to those of you reading along is that only the
failures were printed above. When I run his program on 5.8 vs blead on
the same version of the Unicode database, the only differences I saw
were related, I think, to Yves fixing things in 5.10 with his tricky
fold addition, and to the upper case version of ß that is new in
Unicode 5.1. I don't understand off-hand why that would be different.
>
> So its clear that multicode-point character class folding is broken
> for some definition of expected behaviour.
>
> I personally consider character class notation to be an abbreviation
> of alternation. So a character class [xyz] is supposed to match the
> same thing as (x|y|z). This implies that character classes have to be
> able to match more than one character under case-folding rules. A lot
> of external logic and at least some internal logic operates under this
> assumption, so i dont think we can change it.
>

That sounds right.
Re: Matching multi-character folds [ In reply to ]
2008/11/23 karl williamson <public@khwilliamson.com>:
> demerphq wrote:
>> 2008/11/23 karl williamson <public@khwilliamson.com>:
[snip]
>>> One of these is the oft mentioned in this list, German lower case sharp
>>> s or ß. 'ss' =~ /ß/i is true. (U+00DF)
>>
>> 0xDF is the only multi-codepoint folding character in the latin-1 range.
>>
>> Also 0xDF is a "trickyfold" character meaning, that it can match
>> something of longer length (in terms of bytes) folded than unfolded.
>>
> There must be more to it than that, as the code indicates there are only
> three tricky fold characters, yet there are more that fit this definition.
> For example U+023A which takes 2 bytes in UTF-8 folds to U+2C65 which takes
> 3. They seem to work.

The three trickyfold characters tickle a bug in the minlen logic of the
optimiser. The ones you mention don't, I think because both the
character and its fold are one codepoint long. As far as the mail
history shows, I don't think I ever really got to the bottom of the bug
in the optimiser; I worked around it with the trickyfold construct as
the simplest solution.

As far as I recall /$char/i for unicode $char is stored casefolded at
compile time. The bug basically came down to:

$df=chr(0xdf);
utf8::upgrade($df);
print $df=~/$df/i ? "ok" : "not ok";

which, if inspected under use re 'debug', revealed that the pattern was
internally converted into an EXACTF <ss> opcode. That in turn caused
the minlen logic to fire: the node is two characters long, while the
string being matched is only one. An exhaustive search at the time
turned up problems only in the three codepoints we covered, and my
retest shows that there are more in this class with the update to
Unicode 5.1. Exactly why the others did not fail was never really
clear. The optimiser is a scary beast :-(

[snip]
>> What do you mean by "beyond debate" here?
>>
>> Seems to me that there is a debate about whether unencoded
>> nonlocalized strings should be treated as ascii or as latin-1, and if
>> treated as latin-1 whether they should obey unicode foldcasing rules
>> or not.
>>
> I thought that was settled. While you were taking a break from p5p, I
> naively came in and started a discussion on it (there are various threads,
> but most include [perl #58182] in the subject). There was agreement that
> they should match Unicode and I gave a very detailed proposal which the 5.12
> pumpking said sounded reasonable. It was pointed out that perl5100delta
> says:
>
> | The handling of Unicode still is unclean in several places, where it's
> | dependent on whether a string is internally flagged as UTF-8. This will
> | be made more consistent in perl 5.12, but that won't be possible without
> | a certain amount of backwards incompatibility."
>
> Similarly in perltodo, as I quoted in the first email on this thread:
> "that should not be dependent on an internal storage detail of the string"
> meaning the utf8ness of a string should not affect its external semantics.
>
> It seems clear that it's been agreed that the utf8ness of a string should
> not affect its external behavior. So what should the behavior be? It has
> to be the Unicode behavior, for otherwise, the characters between 128 and
> 255 would never behave like Unicode.
>
> There are 3 main areas where things don't work. (I believe that the
> problems with pack() have been fixed.)
>
> 1. uc(), lcfirst(), \U, etc. I have submitted for review code that gives
> the same semantics for these whether or not the string is in utf8 or not.

This worries me, as it involves a fairly serious behaviour change. But
if it's been decided then fine; at least it will be consistent.

> 2. \w, [:graph:], etc re matching. I think the solution to this is in your
> RFC to make these just match ASCII or the current locale. Then the utf8ness
> won't matter, except if someone's string gets converted to utf8, and then
> their locale most likely won't work properly. That is why I said in an
> earlier email that I don't think strings should be upgraded to utf8 when
> "use locale" is in effect. The RFC also solves the problem of, for example,
> \d matching things the programmer never intended, just because the string
> silently, somehow, got changed to utf8. My proposal that I thought had been
> accepted was, for example, to make \w match the appropriate Latin1
> characters even when not in utf8. And I had working experimental code to do
> that. But I think your RFC makes more sense.

Ok.

> 3. caseless re matching m/.../i Again, perl has to change so that the
> utf8ness of the pattern doesn't matter. One could do it by adding
> modifiers, as you originally suggested, like /u to force unicode semantics.
> But I think you had pulled away from that idea. I would be open to
> something like that, but I think there has to be a way for a programmer to
> make that the default, without forcing them to always remember to add the
> modifier. Or one could do it by having the re code know about latin1
> semantics. Again, I have mostly working code which doesn't change regcomp.c
> very much that does this. I do think overall that this is a better
> solution than the modifier one. One consideration I have that has been
> mentioned in the documentation is that latin1 should be faster than utf8. I
> think Tom may have said that he didn't find that to be the case in his
> experiments.

I'd like to see more on this. I do know that benchmarking the regex
engine is not easy. There are lots of special cases and the like
to consider. I've definitely seen utf8 have serious performance
consequences.

[snip]
>>> Another case is ligatures (they don't view ß as a ligature, and I don't
>>> know why) So 'fi' =~ /fi/i is true. (U+FB01)
>>
>> Prompted by your comment about 'ß' I did some searching for
>> information on ligatures and unicode and I was surprised how little
>> there was. The only ligature support seems to be for legacy conversion
>> reasons (for instance latin-1 equivalancy), and it seems that
>> ligatures are considered to be a presentation issue better left up to
>> the font and the font rendering engine. A good discussion being this:
>>
>> http://unicode.org/faq/ligature_digraph.html
>>
>> When I checked the unicode data files I didn't find anything about
>> ligatures outside of certain character names including the word
>> 'LIGATURE', and some comments and commentary files mentioning that
>> some characters are ligatures. So I'm wondering what you were getting
>> at when you said "they don't view ß as a ligature, and I don't know
>> why".
>>
> My source for that was lib/unicore/SpecialCasing.txt

Right, which includes a comment about some of the unusual forms. But
it is not a formal status or property of the characters.

[snip]
>>> Would you like to know what happens today in perl? Well I'll tell you
>>> anyway. /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every
>>> other
>>
>> I cant repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
>> i can tell.
>>
>> What doesnt work is
>>
>> fold('ǰ') =~ /[ǰ]/i
>>
>> where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
>>
>
> I don't understand. I just tested again with the perl I have on my machine
> that I think is today's bleadperl, and it failed. But in any event as you
> agree below, there are a number of things broken.

Can you post a one-liner to test with that doesn't contain any literal
Unicode characters? In other words, one coded so that it can be
expressed in ASCII, whatever the code itself does?
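
One way such an ASCII-only test could look, as a sketch. It uses \x{...}
escapes for U+01F0 and for its fold, and reports rather than assumes the
outcome, since that is exactly what differs between perls. The helper
name is hypothetical, and fc() assumes perl 5.16 or later (it did not
exist when this thread was written).

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16+

# Hypothetical helper: does $string match /[<char>]/i, where <char> is
# given by code point so the source stays pure ASCII?
sub caseless_class_match {
    my ($string, $cp) = @_;
    my $class = sprintf '[\x{%X}]', $cp;    # e.g. [\x{1F0}]
    return $string =~ /$class/i ? 1 : 0;
}

my $char   = "\x{1F0}";          # LATIN SMALL LETTER J WITH CARON
my $folded = "\x{6A}\x{30C}";    # j followed by COMBINING CARON

printf "fc(char) eq folded : %s\n", fc($char) eq $folded ? "yes" : "no";
printf "char   =~ /[char]/i: %s\n", caseless_class_match($char,   0x1F0) ? "yes" : "no";
printf "folded =~ /[char]/i: %s\n", caseless_class_match($folded, 0x1F0) ? "yes" : "no";
```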

>>> multi-char fold returns false. This in fact may be the only time in perl
>>> history, savor the moment, when the infamous ß gives an arguably more
>>> correct result than other characters.
>>
>> Hmm. Interesting. I cant decide to be happy about this, or sad.
>>
> The only reason it works is because for single character char classes, they
> get optimized out, and somehow, it works. [ßa] doesn't work.

Ah. Sigh. So they turn into EXACTF instead of ANYOF. I forgot about that.

>
>>> Now the code in regcomp.c takes special pains to make all these match.
>>> But
>>> it doesn't work, except in the [ß] case. So we don't have to worry about
>>> breaking existing code if we decide it should work differently.
>>>
>>> Let's look at it the other direction. Should ß =~ /ss/i ? Should 'ǰ' =~
>>> /ǰ/i ? They both are true currently. However, things like ß =~ /s{2}/i
>>> is
>>> false, and that seems inconsistent.
>>>
>>>
>>> So, I'm not sure what the right answers are, but things are broken today.
>>>
>>
>> Yes, things are. I wrote the attached hacky script to parse out
>> CaseFolding.txt and test all the complex folding rules. The output is
>> below, the 'll', 'lu','ul','uu' means, 'latin' and 'unicode', with the
>> first letter representing the string, and the second the patterns
>> encoding. The description on the right is the test, with chars
>> represented by their hex representation, and separated by spaces in
>> the case of the folded string. The output on 5.8.9 looks different,
>> with more mistakes.
>>
>> demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib
>> test_case_folding.pl
>> LATIN SMALL LETTER SHARP S
>> ll '0073 0073' =~ /00DF/i
>> ll, ul, uu '0073 0073' =~ /[00DF]/i

ll is expected to fail here under the current rules.

[snip]
>> LATIN CAPITAL LETTER SHARP S
>> lu, uu '0073 0073' =~ /[1E9E]/i

lu probably fails because of the minlen bug.

[snip]
>> LATIN SMALL LIGATURE FF
>> lu, uu '0066 0066' =~ /[FB00]/i
>> LATIN SMALL LIGATURE FI
>> lu, uu '0066 0069' =~ /[FB01]/i
>> LATIN SMALL LIGATURE FL
>> lu, uu '0066 006C' =~ /[FB02]/i
>> LATIN SMALL LIGATURE FFI
>> lu, uu '0066 0066 0069' =~ /[FB03]/i
>> LATIN SMALL LIGATURE FFL
>> lu, uu '0066 0066 006C' =~ /[FB04]/i
>> LATIN SMALL LIGATURE LONG S T
>> lu, uu '0073 0074' =~ /[FB05]/i
>> LATIN SMALL LIGATURE ST
>> lu, uu '0073 0074' =~ /[FB06]/i

These lu's might fail because of the minlen bug. Are these new in Unicode 5.1?

> What Yves didn't mention to those of you reading along, is that only the
> failures were printed above.

Yes correct, and we only test the possible combinations. So only \xDF
has 'll' or 'ul' and most only have 'uu'.

> When I run his program on 5.8 vs blead on the
> same version of the Unicode database, the only differences I saw were
> related, I think, to Yves fixing things in 5.10 with his tricky fold
> addition, and the new in Unicode 5.1 upper case version of ß. I don't
> understand off-hand why that would be different.

Because it's not being handled by the trickyfold logic. Basically it's
the same problem as the lower case one, but it hasn't been added to
regcharclass.pl. And none of the special cases coded into the regex
engine to deal with 0xDF have been added to the engine for its
majestic brother.

>> So its clear that multicode-point character class folding is broken
>> for some definition of expected behaviour.
>>
>> I personally consider character class notation to be an abbreviation
>> of alternation. So a character class [xyz] is supposed to match the
>> same thing as (x|y|z). This implies that character classes have to be
>> able to match more than one character under case-folding rules. A lot
>> of external logic and at least some internal logic operates under this
>> assumption, so i dont think we can change it.
>>
>
> That sounds right.

I'm trying to imagine a way to do this that doesn't involve a pretty
considerable redesign of how character classes work, and not coming up
with much.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: Matching multi-character folds [ In reply to ]
demerphq wrote:
[snip]
>>>
>>> I personally consider character class notation to be an abbreviation
>>> of alternation. So a character class [xyz] is supposed to match the
>>> same thing as (x|y|z). This implies that character classes have to be
>>> able to match more than one character under case-folding rules. A lot
>>> of external logic and at least some internal logic operates under this
>>> assumption, so i dont think we can change it.
>>>
>> That sounds right.
>
> Im trying to imagine a way to do this that doesn't involve a pretty
> considerable redesign of how character classes work, and not coming up
> with much.
>
> Yves
>

I've only time right now to address this last point in your response.
I'll look at the rest later.

What I know is that regcomp.c attempts to handle some of this. Here
is a little of it starting at line 8324:
    /* Any multicharacter foldings
     * require the following transform:
     * [ABCDEF] -> (?:[ABCabcDEFd]|pq|rst)
     * where E folds into "pq" and F folds
     * into "rst", all other characters
     * fold to single characters. We save
     * away these multicharacter foldings,
     * to be later saved as part of the
     * additional "s" data. */
    SV *sv;

    if (!unicode_alternate)
        unicode_alternate = newAV();
    sv = newSVpvn_utf8((char*)foldbuf, foldlen,
                       TRUE);
    av_push(unicode_alternate, sv);

But it's not working. I never found the time to pursue it. But perhaps
you meant that it doesn't handle things like ß =~ /s{2}/
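
The transform described in that comment can be imitated at the Perl
level. What follows is only a sketch under stated assumptions: the
helper names are made up, fc() requires perl 5.16 or later, and this is
a model of the idea, not what regcomp.c actually does.

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: perl 5.16+

# Render a string as ASCII-only \x{...} escapes for use in a pattern.
sub hex_escape {
    my ($str) = @_;
    return join '', map { sprintf('\\x{%X}', ord($_)) } split(//, $str);
}

# Hypothetical helper: [ABCDEF] -> (?i:[ABCDEF]|pq|rst), where "pq" and
# "rst" are the multi-character folds of members of the class.
sub expand_class {
    my @members = @_;
    my @multi;
    for my $ch (@members) {
        my $fold = fc($ch);
        push @multi, $fold if length($fold) > 1;
    }
    my $class = '[' . join('', map { hex_escape($_) } @members) . ']';
    return '(?i:' . join('|', $class, map { hex_escape($_) } @multi) . ')';
}

my $re = expand_class("\x{DF}", "a", "\x{FB01}");
print "$re\n";
for my $s ("ss", "SS", "A", "fi", "x") {
    printf "%-3s %s\n", $s, $s =~ /\A$re\z/ ? "matches" : "does not match";
}
```

The sketch anchors its tests with \A and \z; in an unanchored pattern the
order of the alternatives decides which one wins, and a real
implementation would have to settle that.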
Re: Matching multi-character folds [ In reply to ]
karl williamson wrote:
> demerphq wrote:
> [snip]
>>>>
>>>> I personally consider character class notation to be an abbreviation
>>>> of alternation. So a character class [xyz] is supposed to match the
>>>> same thing as (x|y|z). This implies that character classes have to be
>>>> able to match more than one character under case-folding rules. A lot
>>>> of external logic and at least some internal logic operates under this
>>>> assumption, so i dont think we can change it.
>>>>
>>> That sounds right.
>>
>> Im trying to imagine a way to do this that doesn't involve a pretty
>> considerable redesign of how character classes work, and not coming up
>> with much.
>>
>> Yves
>>
>
> I've only time right now to address this last point in your response.
> I'll look at the rest later.
>
> What I know is that regcomp.c attempts to handle some of this. Here is
> a little of it starting at line 8324:
> /* Any multicharacter foldings
> * require the following transform:
> * [ABCDEF] -> (?:[ABCabcDEFd]|pq|rst)
> * where E folds into "pq" and F folds
> * into "rst", all other characters
> * fold to single characters. We save
> * away these multicharacter foldings,
> * to be later saved as part of the
> * additional "s" data. */
> SV *sv;
>
> if (!unicode_alternate)
> unicode_alternate = newAV();
> sv = newSVpvn_utf8((char*)foldbuf, foldlen,
> TRUE);
> av_push(unicode_alternate, sv);
>
> But it's not working. I never found the time to pursue it. But perhaps
> you meant that it doesn't handle things like ß =~ /s{2}/
>
>
And another idea that might be helpful. I looked up the discussion in
this list's archives about tricky folds, and someone suggested an idea
that I had also been thinking of independently; it didn't look like
there was any response to it. The idea was, in effect, to drop
trickyfold and instead pretend that each tricky fold character in the
pattern had been written as an alternation of everything it can match.
For ß, for example, pretend it was (?:ß|[Ss][Ss]|\x{1e9e}). Then the
optimizer wouldn't have to be fooled.
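
As a rough illustration of that rewriting idea, done on the pattern text
before compilation rather than inside the engine. The table and function
names are hypothetical, only ß is listed, and the substitution is naive:
it would also rewrite a ß that sits inside a bracketed class or follows
a backslash.

```perl
use strict;
use warnings;

# Hypothetical table: each tricky character maps to an explicit
# alternation of everything it should match caselessly.
my %TRICKY = (
    "\x{DF}" => '(?:\x{DF}|[Ss][Ss]|\x{1E9E})',
);

# Replace every tricky character in the pattern text by its alternation.
sub expand_tricky {
    my ($pattern) = @_;
    $pattern =~ s/(\x{DF})/$TRICKY{$1}/g;
    return $pattern;
}

my $re = expand_tricky("stra\x{DF}e");
print "$re\n";
for my $s ("STRASSE", "stra\x{DF}e", "stra\x{1E9E}e", "strase") {
    printf "%s\n", $s =~ /\A$re\z/i ? "matches" : "does not match";
}
```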
Re: Matching multi-character folds, and FMTEYEWTK on troubles thereof [ In reply to ]
Oh good: *this* one I *can* reply to. :-)

In-Reply-To: Message from karl williamson <public@khwilliamson.com>
of "Mon, 24 Nov 2008 12:20:10 MST." <492AFE6A.6000806@khwilliamson.com>

> demerphq wrote:

>>>> I personally consider character class notation to be an abbreviation
>>>> of alternation. So a character class [xyz] is supposed to match the
>>>> same thing as (x|y|z). This implies that character classes have to be
>>>> able to match more than one character under case-folding rules. A lot
>>>> of external logic and at least some internal logic operates under this
>>>> assumption, so i dont think we can change it.

>>> That sounds right.

>> I'm trying to imagine a way to do this that doesn't involve a pretty
>> considerable redesign of how character classes work, and not coming
>> up with much.

>> Yves

Me, I'm wondering the same thing. See below. Sometimes I feel lost
in one of Borges's labyrinths--or Eco's, though these are the same.

I *sure* wish we had time to investigate what Ken, Rob, and Andrew
did with their UTF-8 regex engines. I sent refs about those earlier.

Each of those 3 alone is probably smarter than the next dozen or 3
of us reading this put together, and I bet we could learn from their
work--acknowledging "learning" *could* mean what not to do as well
as what *to* do. :-)

Unfortunately, 3 dozen bicyclists can't ever move cross-country
the amount of freight that one big semi-trailer can move. So I
hope that's not what we're up against here. :-)

> I've only time right now to address this last point in your response.
> I'll look at the rest later.

> What I know is that regcomp.c attempts to handle some of this. Here
> is a little of it starting at line 8324:

> /* Any multicharacter foldings
> * require the following transform:
> * [ABCDEF] -> (?:[ABCabcDEFd]|pq|rst)
> * where E folds into "pq" and F folds
> * into "rst", all other characters
> * fold to single characters. We save
> * away these multicharacter foldings,
> * to be later saved as part of the
> * additional "s" data. */
> SV *sv;

> if (!unicode_alternate)
> unicode_alternate = newAV();
> sv = newSVpvn_utf8((char*)foldbuf, foldlen,
> TRUE);
> av_push(unicode_alternate, sv);

> But it's not working. I never found the time to pursue it. But perhaps
> you meant that it doesn't handle things like ß =~ /s{2}/

Case-insensitive character-class compares where the case-change changes the
byte-length in utf-8 (like code point 255 mapping to 376 in upper case, or
code point 223 mapping to *double* code points 83 catted to 83--and worse),
are the stuff of massive headaches. I wonder what the Plan9 team did for this?
I really think we should find out.

And that's not even thinking about whether code point 255 has instead been
expressed as code point 121 followed by code point 776. (Yes, those are
all expressed in decimal.)

It's this latter that I had to cope with this weekend, and Perl doesn't
really even acknowledge the problem's existence, save by suggesting the
user preprocess into NFKC forms, or something.
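
A small sketch of that preprocessing step, using the core
Unicode::Normalize module. It uses NFC rather than NFKC for simplicity,
and the helper name is made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = chr(255);               # U+00FF, y with diaeresis
my $decomposed  = chr(121) . chr(776);    # y followed by U+0308

# Hypothetical helper: compare two strings after normalizing both to NFC.
sub same_after_nfc {
    my ($x, $y) = @_;
    return NFC($x) eq NFC($y) ? 1 : 0;
}

printf "raw eq:            %s\n", $precomposed eq $decomposed ? "same" : "different";
printf "after NFC:         %s\n", same_after_nfc($precomposed, $decomposed) ? "same" : "different";
printf "regex, raw:        %s\n", $decomposed =~ /\x{FF}/ ? "matches" : "does not match";
printf "regex, normalized: %s\n", NFC($decomposed) =~ /\x{FF}/ ? "matches" : "does not match";
```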

This weekend, I was pretty happy to finally come up with how to get non-
English comparisons and searches to behave "correctly". I spent all day
and all night Saturday reading specs, downloading current working docs in
icky XML format, and a lot more.

But this only works for character searches, not real regexes.

My task was that I had a series of burgs from Spain or with Spanish names,
and I had to sort them as a Spaniard would expect to see them sorted.

Now, this is itself a somewhat intractable problem, since one Spaniard
in every four doesn't speak "Spanish" as their mother tongue, but
rather, a different lengua española (per their Constitution).

Not only did I have to deal with multiple languages, I had to strip out
leading articles and uncapped intervening particles, like a phonebook
or title-type sort might do.

So I had to account for Castilian, that Iberian tongue the rest of the
world calls "Spanish", as well as Catalan and Galician, at a minimum.
Asturianu fits in with the rest, and I disavow all knowledge of the
alien Euskara. :-)

("Castilian", contrary to popular misconception, is *not* the
accent of northern Spain, but the "Spanish" language itself.
The word Castilian is generally used within the Peninsula to
distinguish it from the many other Iberian languages, such
as those I just listed above.)

My only punting was that I ordered it

a b c ç ch d ... k l ll m n ñ o p ...

CAVEAT LECTOR:
The next stuff below will only look right in a utf8 xterm: it's the
only thing that does combining characters correctly, especially the
multiples I need for the [phonologic] transcriptions that follow the
/phonemic/ ones. Also note that xterm gets tabs wrong, but vim in
that xterm gets them right! Thence my jihad against Text::Tabs.
I think my new application of \X will handle them all there.

It's punting because Castilian doesn't use ç anymore; they use z (named
zeta) but equated to Greek theta or English unvoiced th of thirty not
thither { "zeta" /ˈθeta/ => [ˈθe̞t̪ä] }, because it no longer means the
/ts/ sound heard in the original Çid.

However, Catalan places ç ("ce trencada") *between* c and d. This is
unlike Portuguese or French, which equate ç to c because they discount
diacriticals--except for French where for ties, they resolve right to
left. Galician, while *extremely* closely related to Portuguese,
nevertheless uses Castilian orthography, *not* Portuguese.

Meanwhile, Castilian traditionally placed ch { "che" /tʃe/ => [t̪ʃe̞] }
between c and d, and ll { "elle" /ˈeʎe/ => [ˈe̞ʎe̞], more commonly now
[ˈe̞ʝe̞], save in Valladolid, sometimes Barcelona, etc } between l and
m. They don't do this now, since 1997, but they still put ñ { "eñe"
/ˈeɲje/ => [ˈẽ̞ɲje̞] } between n and o.

My table below thus reflects something of a merge of the traditional
Castilian ordering with Catalan's treatment of ç.

The crux is the fancy collator object I create. Its numbers DO
NOT MATCH the documentation's suggestions, which I believe must
be because the DUCET we use is newer than when the docs were made.

+-----------------------------------------------------------------+
| NB: I think that means we need to update our documentation! (?) |
+-----------------------------------------------------------------+

But mirabile visu, this at long last gives me something to use for
searches, compares, and substitutes.

But golly, is it--um--non-fast!

I can't see how we won't have to someday do something like this for Perl.
Already people think that Unicode compares the way the default collator
does things, and it doesn't--it's strictly codepoint compares (mostly). But
the day we *really* do that--well, I just can't see it being small or fast.
I haven't looked at the Unicode::String class to see if that's what it's
doing yet, but it might be a good idea.

Right now, \X works fine with "\x{E7}" (well, kinda) as it does with
"c\x{0327}". However, we don't have anything that grabs digraphs the way I
have to do for elle and che, and which the collator object enables/permits
(and the NFKD lets me dodge Catalan's ela geminada).
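
For searching, as opposed to collating, a crude approximation of
"letters" in the traditional Castilian sense can be had by trying the
digraphs before falling back to \X. This is only a sketch: the function
name is made up, and it knows nothing about words in which c+h or l+l
are not really a digraph.

```perl
use strict;
use warnings;

# Hypothetical helper: split a word into traditional letters, treating
# "ch" and "ll" as single units and otherwise taking one grapheme (\X).
sub spanish_letters {
    my ($word) = @_;
    return $word =~ /(ch|ll|\X)/gi;
}

for my $word ("chorizo", "Sevilla", "c\x{327}a") {
    my @letters = spanish_letters($word);
    printf "%d letters\n", scalar @letters;
}
```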

Thank *goodness* I'm not doing hyphenation! Then I'd have to add "rr",
named "erre", as opposed to "ere", as an unbreakable digraph, even though
it is not *really* its own letter, pretty much save for hyphenation.

That makes

Mon-te-rrey

the correct hyphenation for the city in Mexico. You mustn't
split the rr! And you always want open syllables if possible;
different rules than in English. Why, even Knuth Himself admits
how VERY hard correctly hyphenating is--and that's in English alone.

But that's a whole "nother" project.

See, if you're interested:

* Unicode Collation Algorithm - UTS #10
http://unicode.org/reports/tr10/

* Unicode Normalization Forms - UAX #15
http://www.unicode.org/reports/tr15/

* Composition Exclusion Table
http://www.unicode.org/Public/UNIDATA/CompositionExclusions.txt

* Normalization Corrections
http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt

* Canonical Equivalence in Applications - UTN #5
http://www.unicode.org/notes/tn5/

*** Unicode CLDR Project: Common Locale Data Repository
http://unicode.org/cldr/

*** CVS Snapshots for CLDR:
ftp://ftp.unicode.org/Public/cldr/cldr-repository-daily.tgz

Yves, there're also French versions of some of the above, s'il te plaît,
but I had trouble getting them to download.

The last, CLDR, contains *VERY* interesting stuff. I wish I could figure
out how to auto-translate these into Unicode::Collate objects. For
example, here's cldr/common/collation/fr.xml, fycnrdths:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://www.unicode.org/cldr/dtd/1.6/ldml.dtd">
<ldml>
<identity>
<version number="$Revision: 1.23 $"/>
<generation date="$Date: 2008/03/10 02:27:54 $"/>
<language type="fr" />
</identity>
<collations validSubLocales="fr_BE fr_CA fr_CH fr_FR fr_LU">
<collation type="standard" >
<settings backwards="on" />
<rules>
<reset>ae</reset>
<s>æ</s>
<t>Æ</t>
<!--
<reset>A</reset>
<x><s>Æ</s><extend>E</extend></x>
<reset>a</reset>
<x><s>æ</s><extend>e</extend></x>
-->
</rules>
</collation >
</collations>
</ldml>

Here's cldr/common/collation/es.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://www.unicode.org/cldr/dtd/1.6/ldml.dtd">
<ldml>
<identity>
<version number="$Revision: 1.25 $"/>
<generation date="$Date: 2008/06/12 17:39:03 $"/>
<language type="es" />
</identity>
<collations validSubLocales="es_AR es_BO es_CL es_CO es_CR es_DO es_EC es_ES es_GT es_HN es_MX es_NI es_PA es_PE es_PR es_PY es_SV es_US es_UY es_VE">
<collation type="standard" >
<rules>
<reset>N</reset>
<p>ñ</p>
<t>Ñ</t>
<reset>ae</reset>
<s>æ</s>
<t>Æ</t>
<!--
<reset>A</reset>
<x><s>Æ</s><extend>E</extend></x>
<reset>a</reset>
<x><s>æ</s><extend>e</extend></x>
-->
</rules>
</collation >
<collation type="traditional" >
<rules>
<reset>N</reset>
<p>ñ</p>
<t>Ñ</t>
<reset>C</reset>
<p>ch</p>
<t>Ch</t>
<t>CH</t>
<reset>l</reset>
<p>ll</p>
<t>Ll</t>
<t>LL</t>
</rules>
</collation >
<collation draft="unconfirmed" alt="proposed" type="traditional" >
<rules>
<reset>N</reset>
<p>ñ</p>
<t>Ñ</t>
<reset>C</reset>
<p>ch</p>
<t>cH</t>
<t>Ch</t>
<t>CH</t>
<reset>l</reset>
<p>ll</p>
<t>lL</t>
<t>Ll</t>
<t>LL</t>
</rules>
</collation >
</collations>
</ldml>

And here's cldr/common/collation/ca.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "http://www.unicode.org/cldr/dtd/1.6/ldml.dtd">
<ldml>
<identity>
<version number="$Revision: 1.25 $"/>
<generation date="$Date: 2008/06/12 17:39:03 $"/>
<language type="ca" />
</identity>
<collations validSubLocales="ca_ES">
<collation type="standard" >
<settings backwards="on" />
<rules>
<reset>C</reset>
<p>ch</p>
<t>Ch</t>
<t>CH</t>
<reset>L</reset>
<p>ll</p>
<t>l·l</t>
<t>Ll</t>
<t>L·l</t>
<t>LL</t>
<t>L·L</t>
</rules>
</collation >
<collation draft="unconfirmed" alt="proposed" type="standard" >
<settings backwards="on" />
<rules>
<reset>C</reset>
<p>ch</p>
<t>cH</t>
<t>Ch</t>
<t>CH</t>
<reset>L</reset>
<p>ll</p>
<t>l·l</t>
<t>lL</t>
<t>l·L</t>
<t>Ll</t>
<t>L·l</t>
<t>LL</t>
<t>L·L</t>
</rules>
</collation >
<collation draft="unconfirmed" alt="cldrbug173" type="standard" >
<settings backwards="on" />
</collation >
</collations>
</ldml>
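
On the wish to turn such rules into collator objects: as far as I know,
later releases of Unicode::Collate bundle a Unicode::Collate::Locale module
with tailorings derived from CLDR, including one named es__traditional.
A minimal sketch, assuming that module is installed; the town names are
just sample input.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode(STDOUT, ":encoding(UTF-8)");

# "\x{ED}" is i-acute, as in Chapineria with its accent.
my @towns = ("Madrid", "Llanes", "Luque", "Dodro", "Chapiner\x{ED}a", "Curtis");

# Untailored default ordering: ch and ll are just two letters each.
my @default = Unicode::Collate->new->sort(@towns);

# Traditional Spanish tailoring: ch follows c, ll follows l.
my $trad        = Unicode::Collate::Locale->new(locale => 'es__traditional');
my @traditional = $trad->sort(@towns);

print join(", ", @default), "\n";
# Chapinería, Curtis, Dodro, Llanes, Luque, Madrid
print join(", ", @traditional), "\n";
# Curtis, Chapinería, Dodro, Luque, Llanes, Madrid
```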

Hm, looks like they forgot ce trencada altogether, and I don't know that
Catalan is supposed to be Frenchly-backwards in resolution. Nor do I
know if I could find a native speaker locally, or if I did, they would
even know. Ask a native English speaker subtle questions about his own
language, and you will *not* find them as authoritatively certain as you
might hope, wish, expect, or wait for (NB: "esperar" being the prettily
parsimonious translation for ALL FOUR of those verbs into Spanish. :-)

D'oh: I'll just look in a Catalan unilingual dictionary!

For those looking to figure out more about what I typed above, or why
I'm doing what I do below, you can also check out those of these that
you can read. The versions in their own languages are ALWAYS better, if
you can make your way through them, than the English versions. Just
look at the way the first one calls it a polemic, which it is, while English
skips the tone.

* http://en.wikipedia.org/wiki/Names_given_to_the_Spanish_language
http://es.wikipedia.org/wiki/Pol%C3%A9mica_en_torno_a_espa%C3%B1ol_o_castellano

* http://en.wikipedia.org/wiki/Spanish_language
http://es.wikipedia.org/wiki/Idioma_espa%C3%B1ol

* http://en.wikipedia.org/wiki/Spanish_phonology
http://es.wikipedia.org/wiki/Fonolog%C3%ADa_del_espa%C3%B1ol

* http://en.wikipedia.org/wiki/Help:IPA_for_Spanish
http://es.wikipedia.org/wiki/Transcripci%C3%B3n_fon%C3%A9tica_del_espa%C3%B1ol_con_el_IPA

* http://en.wikipedia.org/wiki/Catalan_orthography
http://en.wikipedia.org/wiki/Catalan_phonology
http://ca.wikipedia.org/wiki/Alfabet_catal%C3%A0

* http://en.wikipedia.org/wiki/Galician_language
http://en.wikipedia.org/wiki/Galician-Portuguese
http://gl.wikipedia.org/wiki/Lingua_galega

--tom

--

"And the Lord said, Behold, the people is one, and they have all
one language; and this they begin to do; and now nothing will be
restrained from them, which they have imagined to do. Go to, let
us go down, and there confound their language, that they may not
understand one another's speech. So the Lord scattered them
abroad from thence upon the face of all the earth: and they left
off to build the city. Therefore is the name of it called Babel;
because the Lord did there confound the language of all the earth:
and from thence did the Lord scatter them abroad upon the face of
all the earth." Genesis 11:6-9, KJV


PS: My program below runs correctly even if you have PERL_UNICODE set to
S as I do. So NOTE the binmode: that means you can run it
wherever. Extract my program and run it, either in an ISO8859-1,
ISO8859-15, or a UTF-8 environment, and it will behave equally
well, since nothing needs codepoints not in Latin1. This I
construe to be a feature. Note the sorting: lovely, n'est-ce pas?

PPS: *Now* you see why the non-\X-aware bugs in Text::Tabs,
Text::Autoformat, et alios were so very vexing to me. :-(

PPPS: Just *wait* till you find LA in my list!! :-)

#!/usr/bin/perl
# es-sort - sort Spanish city names
# version 1.0
#
# Tom Christiansen <tchrist@perl.com>
# Sun Nov 23 13:47:19 2008 -0700
# Boulder, Colorado [MST]

use 5.010_000;

use Unicode::Collate;

$es = Unicode::Collate->new( entry => <<'YANETUT',
0063 0068 ; [.1000.0020.0002.0063] # ch
0043 0068 ; [.1000.0020.0007.0043] # Ch
0043 0048 ; [.1000.0020.0008.0043] # CH
006C 006C ; [.10F5.0020.0002.006C] # ll
004C 006C ; [.10F5.0020.0007.004C] # Ll
004C 004C ; [.10F5.0020.0008.004C] # LL
00E7 ; [.0FFC.0020.0002.0063] # c-cedilla NFC
0063 0327 ; [.0FFC.0020.0002.0063] # c-cedilla NFD
00C7 ; [.0FFC.0020.0002.0043] # C-Cedilla NFC
0043 0327 ; [.0FFC.0020.0002.0043] # C-Cedilla NFD
00F1 ; [.112B.0020.0002.00F1] # n-tilde NFC
006E 0303 ; [.112B.0020.0002.00F1] # n-tilde NFD
00D1 ; [.112B.0020.0008.00D1] # N-Tilde NFC
004E 0303 ; [.112B.0020.0008.00D1] # N-Tilde NFD
YANETUT

upper_before_lower => 1,

normalization => "NFKD",

preprocess => sub { # strip leading articles
local $_ = shift(); # lexical "my $_" no longer compiles on perl 5.24+

# 1st: strip leading articles/particles
s/^L'\b//; # Catalan

s{ ^ # leading, *leading*, I said!
(?:

# Castilian / castellano
El
| Los
| La
| Las

# Catalan / catalán / català
| Els
| Les
| Sa
| Es

# Galician / gallego / galego
| O
| Os
| A
| As

)

\s+ # gobble whitish space, whatever that means

}{}x;

# 2nd: strip interior particles

s/\b[ndl]'\b//; # Catalan; yes, both \b's are right--and differ!!

s{ # now for the ones in the middles; note lc
\b
(?:

# Castilian / castellano:
y | el | los | la | las | de | del |
# Galician / gallego / galego:
e | o | os | a | as | de | do | dos | da | das |
del | deles| dela | delas |
# alt GA contractions, more ES than PT influenced ^
# Catalan / catalán / català:
i | el|en | els | la | les | de | del | dels |
es | sa
# last are CA for ES "se" and "su", but don't seem to count in
# Catalan city-name sorts.

)

\b

}{}gx;

# Technically, CA has many other contractions, both PROCLITIC
# (la'n, l'en, li'n, me'n, n', se'n, te'n) and ENCLITIC ones
# (-ens-en, -la'n, -l'en, -les-en, -li'n, -los-en, 'ls-en, -me'n,
# 'ns-en, -se'n, -te'n, -us-en, -vos-en). I cover n' above.

# Plus CA has a great deal of futzing with hi and ho and en and such.
# They can go hang.

return $_;

},
) || die;

binmode(DATA, ":utf8") || die;

chomp(@words = <DATA>);

@swords = $es->sort(@words); # time passes...

for $word (@swords) {
say $word;
# printf "%-12s %s\n", $word, $es->viewSortKey($word);
}

__END__
Sant Julià de Cerdanyola
Muros de Nalón
Montgat
Melilla
Malasaña
Collado Villalba
Macharaviaya
Navalagamella
Sant Antoni de Vilamajor
Estellencs
Selaya
Calvià
Manlleu
Abusejo
Salas
Rágama
Osona
Abegondo
Valdemorillo
Leganés
La Pobla de Claramunt
Barcelona
Montalbán de Córdoba
Aldeavieja de Tormes
Montejo
Cariño
Vallejera de Riofrío
Avià
Morata de Tajuña
Grau Roig
Puigpunyent
Villares de Yeltes
Pozos de Hinojo
La Encina
Gelves
Villaharta
Camas
Ventosa del Río Almar
Terradillos
Tamames
Coaña
Valderrodrigo
Navacerrada
Gerena
Pontedeume
Gallifa
Valdepiélagos
Cabranes
Garcibuey
Vallromanes
Roda de Ter
Cerceda
Fuentes de Béjar
Gurb
Lamasón
Buitrago del Lozoya
Cuevas del Becerro
Casabermeja
Valverde de Alcalá
Vilanova i la Geltrú
Corvera de Toranzo
Villacarriedo
Calella
Santa Eulàlia de Ronçana
Cádiz
Alpedrete
Valenzuela
Sora
Valdemanco
Suances
Puebla de la Sierra
Ames
Braojos
Encinas de Abajo
Narón
Zuheros
Juberri
Fígols
Moralzarzal
Fontaneda
Sitges
Sorihuela
Mancera de Abajo
Vilobí del Penedès
Navas del Rey
Castellnou de Bages
Sant Cugat del Vallès
Carrascal del Obispo
Cangas del Narcea
Las Veguillas
Salamanca
Castro Urdiales
Ferreries
Cilleros de la Bastida
Argentona
Vegadeo
Gascones
Campins
Castellolí
Lena
Colmenar Viejo
Ordina
Udías
Villaviciosa
Ávila
Santa Eulàlia de Riuprimer
Perales de Tajuña
Cabrera de Mar
Prats
Martorell
Bárcena de Cicero
Arcediano
Parres
Marbella
Toledo
Torresmenudas
Avinyó
Alameda
Valdemierque
Lloseta
Vilassar de Mar
Riells
Muro
Alt Camp
Sant Martí Sarroca
Carral
Valdelacasa
El Payo
Puerto Seguro
Torelló
La Victoria
Armenteros
Campillos
Colunga
Posadas
Montornès del Vallès
Peñaparda
Castellbell i el Vilar
Villarejo de Salvanés
Pujerra
Ribeira
Taradell
Puente Genil
Saldeana
Escalante
Fonollosa
Estepa
Valencia
Bormujos
Cebadag
Ripollet
Pallars Sobirà
Santa Eulalia de Oscos
Vilafranca de Bonany
Moriles
Ribamontán al Monte
Sant Adrià de Besòs
Reinosa
Banyalbufar
Baix Llobregat
El Serrat
Alcaracejos
Formentera
San Felices de Buelna
Ledesma
Sant Esteve de Palautordera
Monterrey, Mexico
Las Salines
Grado
Liendo
La Alamedilla
Pezuela de las Torres
Villares de la Reina
L'Estany
Alameda del Valle
Iznate
Granera
Cantagallo
Ribadedeva
Cartajima
Laxe
Aldeanueva de Figueroa
Zamarra
El Real de la Jara
Ferrol
Villar de la Yegua
Vilarmaior
Vélez-Málaga
La Coruña
Sierra de Yeguas
Cuevas Bajas
Alanís
Mejorada del Campo
Serranillos del Valle
Alaraz
Guadalcanal
Santa María de Cayón
Piloña
Valdemaqueda
Sant Boi de Llobregat
Castelldefels
Cedeira
Gironella
Ceuta
Montuïri
Sobrado
Álava
Pajares de la Laguna
El Pedroso
Potes
Marinaleda
Valdés
Alcaucín
Fuentidueña de Tajo
Cubelles
As Somozas
Alaior
La Vídola
Sant Quirze Safaja
Gajates
Tremedal de Tormes
Liérganes
Piélagos
Santa Coloma de Gramenet
Collbató
Encinas de Arriba
Belmonte de Tajo
Benamocarra
San Vicente de la Barquera
Pedroche
Villanueva de Oscos
Meruelo
Algaida
Ronda
Arabayona de Mógica
Fuenteguinaldo
Entrambasaguas
Ampuero
Sant Martí de Centelles
Baix Ebre
San Miguel del Robledo
San Pedro del Romeral
Cerdido
Sant Agustí de Lluçanès
Hermandad de Campoo de Suso
Guipúzcoa
Teverga
Bagà
Macotera
Bélmez
Camarma de Esteruelas
Collado Mediano
O Pino
San Martín del Rey Aurelio
Dios le Guarde
Santaella
Valdeprado del Río
Solórzano
El Pedroso de la Armuña
Petra
Aller
Cubas de la Sagra
El Berrueco
Ortigueira
El Cerro
Santibáñez de Béjar
Cañada Rosal
Tolox
Guadarrama
Molledo
Machacón
Tarragonès
Parada de Arriba
Alicante
Palencia de Negrilla
Aguadulce
Pujalt
Viladecans
Molinillo
Villaverde de Guareña
Viver i Serrateix
San Fernando de Henares
Martiago
Badalona
Piera
Sobremunt
Oza dos Ríos
Valderredible
Aznalcóllar
Meritxell
Santa Maria de Besora
Espino de la Orbada
Manresa
Baena
Vila
Tineo
La Zarza de Pumareda
Los Molares
Titulcia
Medio Cudeyo
Mogarraz
Villafufre
Santa Cruz de Bezana
El Paso, New Mexico
Lozoyuela-Navas-Sieteiglesias
Tordera
Santillana del Mar
Calvarrasa de Abajo
Cee
Baix Penedès
Rubió
Es Mercadal
Espadaña
Llanes
Copons
Els Prats de Rei
Porto do Son
Zaragoza
Villaseco de los Gamitos
Belalcázar
Utrera
El Sahugo
Castañeda
Castilleja de la Cuesta
Villaconejos
Aranga
Malgrat de Mar
Vilanova del Vallès
Las Rozas de Madrid
Vilasantar
Corvera de Asturias
Pruna
Pereña de la Ribera
Mazcuerras
Zamayón
Cazalla de la Sierra
Talamanca de Jarama
Navamorales
Sant Vicenç de Castellet
Aldehuela de la Bóveda
Callús
Palafrugell
Navia
Fresno Alhándiga
Villar de Peralonso
Ibias
Poveda de las Cintas
Valdetorres de Jarama
Villoruela
Fuenteliante
Islas Baleares
Parets del Vallès
Bárcena de Pie de Concha
Andratx
Martín de la Jara
Las Casas del Conde
Rionansa
Sant Vicenç de Torelló
València
El Madroño
Perafita
Atajate
Outes
Seva
Es Migjorn Gran
Noguera
Sanchotello
Sant Boi de Lluçanès
Alt Penedès
Riotuerto
Basauri
La Lantejuela
Alamosa, Colorado
Villamayor
Sant Mateu de Bages
Santa Fe, New Mexico
Villamanta
Puebla de San Medel
Sant Josep de sa Talaia
A Coruña
Gallegos de Argañán
Batres
Monleras
Collsuspina
Torrelavega
Cangas de Onís
Cantaracillo
Villar de Argañán
Coripe
Arapiles
Gisclareny
El Campo de Peñaranda
El Masnou
Inca
San Pedro de Rozados
Lora del Río
Villagonzalo de Tormes
Casarabonela
Villar del Olmo
Pastores
Soto del Real
Tarragona
Llucmajor
Villanueva del Duque
Irún
Quirós
Hornachuelos
Pontons
Campo Real
Marganell
Estremera
Olmeda de las Fuentes
Sant Antoni de Portmany
Mollina
San Miguel de Aguayo
Añora
Begues
Villanueva del Río y Minas
Rupit i Pruit
Puentes Viejas
Campos
Palenciana
Bages
Aranjuez
Carmona
Sant Quirze de Besora
Esparreguera
Galinduste
Teba
Capdepera
La Granjuela
Valdoviño
San Tirso de Abres
Sada
Sant Vicenç de Montalt
Paderne
Mañón
Cabeza del Caballo
Lozoya
Sant Esteve Sesrovires
Esporles
Rascafría
Puente Viesgo
Boal
Peñamellera Alta
Doñinos de Ledesma
La Hiruela
Villanueva del Pardillo
Morón de la Frontera
Mieres
Colmenarejo
Llubí
Hazas de Cesto
A Laracha
El Arco
Fresnedoso
Móstoles
Alcalá de Henares
Tres Cantos
Torremocha de Jarama
Cútar
Lagunilla
Navasfrías
La Cabrera
Malpartida
Rentería
Villasrubias
Écija
Sant Andreu de la Barca
Ajalvir
Bellprat
Fuente Obejuna
Toques
Berga
Guillena
Benacazón
Barceo
Barberà del Vallès
Pedrera
Torrelles de Llobregat
Villa del Río
Marina de Cudeyo
La Fregeneda
Castellví de la Marca
La Alberca
Martorelles
San Sebastián
Sentmenat
Pelayos
La Rinconada
Ardales
Humanes de Madrid
Pinedas
Dumbría
Ribatejada
Fernán-Núñez
Coín
Endrinal
Salteras
Manzanares el Real
Sispony
Villanueva del Ariscal
Guaro
Torrejón de Velasco
Las Rozas de Valdearroyo
Gelida
Sa Pobla
Boqueixón
La Quar
Berzosa del Lozoya
Fuentes de Andalucía
Jimera de Líbar
El Pino de Tormes
Navarcles
Santa María del Camí
Ribadesella
Langreo
Santiurde de Toranzo
Sant Pere Sallavinera
Periana
Binissalem
Robregordo
Neda
Vizcaya
Òrrius
Vilanova del Camí
Ribera de Arriba
Orpí
Onís
La Rioja
La Llacuna
Marchena
Pueblo, Colorado
Castellar del Vallès
Siero
Cambre
Tapia de Casariego
Membribe de la Sierra
Valverdón
Villarmayor
Pedrezuela
Alcalá de Guadaíra
La Hoya
Cabrerizos
Guadramiro
Masella
Sedella
Casarrubuelos
Estepona
Fogars de Montclús
La Llagosta
Bimenes
Valdáliga
Ejeme
Cristóbal
Sant Salvador de Guardiola
Valle de Abdalajís
Agallas
Sepulcro-Hilario
La Orbada
Vega de Tirados
Montcada i Reixac
Trinidad, Colorado
Santa Eulària del Riu
Los Santos de la Humosa
Fuente de Piedra
Laredo
Dosrius
Bogajo
Topas
Premià de Dalt
Santa Comba
El Prat de Llobregat
Gualba
Alozaina
Algatocín
Noja
Cespedosa de Tormes
Cabezón de Liébana
Soldeu
Llafranc
La Algaba
Galapagar
Canillas de Albaida
Alacant
Masquefa
Navalafuente
Gaià
Villaviciosa de Odón
Encina de San Silvestre
Candamo
Daganzo de Arriba
Zamora
Badajoz
Rollán
El Bruc
Santiponce
Villanueva del Rosario
Santoña
Miengo
Ullastrell
La Calzada de Béjar
Almería
La Peña
Moronta
Mairena del Aljarafe
El Garrobo
Pelayos de la Presa
Cáceres
Carpio de Azaba
Ciudad Real
Pallejà
Vecinos
Igualeja
Fuenterroble de Salvatierra
Xixón
Arinsal
Sallent de Llobregat
Calvarrasa de Arriba
Sant Llorenç des Cardassar
Monistrol de Montserrat
Tielmes
Caldes de Montbui
Lluçà
Deià
Aldeacipreste
Castellet i la Gornal
Pas de la Casa
Torrelodones
Palafolls
Ourense
Ciutadella
Sant Lluís
Villaflores
Granada
Castropol
Sant Martí Sesgueioles
Masueco
Canovelles
Benalauría
San Morales
Taramundi
Serradilla del Llano
Guijuelo
Villarmuerto
Martinamor
León
Vallgorguina
Guadalix de la Sierra
Bañobárez
Teruel
Tordoia
Villoria
San Agustín del Guadalix
Alfarnate
Tarazona de Guareña
Sant Julià de Vilatorta
Villalba de los Llanos
Carballo
Rasines
La Maya
Valdecarros
Viñuela
Valdelageve
Navàs
Arenas de Iguña
Ojén
Villamanrique de la Condesa
Boimorto
Alcalá del Río
Oroso
Getafe
Piñuécar-Gandullas
San Sebastián de los Ballesteros
Salares
Malpica de Bergantiños
Navacarros
Júzcar
Sobradillo
Alba de Yeltes
Becerril de la Sierra
Alt Urgell
Morasverdes
L'Espunyola
Villanueva del Trabuco
Castellón
Escurial de la Sierra
Sallent
Florida de Liébana
Mura
Cudillero
San Pelayo de Guareña
Corcubión
La Carlota
Cantillana
Ramales de la Victoria
Monforte de la Sierra
Castellbisbal
Aiguafreda
Vallès Occidental
Carabaña
Saldes
Morille
Alt Empordà
El Burgo
Cornellà de Llobregat
Ses Salines
Espeja
Santa Eugènia
La Sagrada
La Ciudad de Nuestra Señora la Reina de Los Angeles de Porciúncula, California
Sant Cebrià de Vallalta
Alcorcón
Noreña
Barcelonès
Rivas-Vaciamadrid
Miranda del Castañar
Gallegos de Solmirón
La Torre de Claramunt
Santa Cruz de Tenerife
Sanchón de la Sagrada
Montsià
Viladecavalls
Sant Llorenç d'Hortons
Chagarcía Medianero
El Papiol
Rois
Bollullos de la Mitación
Olèrdola
Sanchón de la Ribera
Vilaller
Gejuelo del Barro
Sant Sadurní d'Anoia
Prádena del Rincón
Pravia
Valdunciel
Castelló
Moià
Zas
Sant Fruitós de Bages
Coria del Río
Sant Joan
Ahigal de los Aceiteros
Cabanillas de la Sierra
El Tejado
Nerja
Mieza
Segarra
Oviedo
Colmenar
Torrelaguna
Nava de Francia
Porreres
Ribera d'Ebre
Mollet del Vallès
San Sadurniño
Cabezabellosa de la Calzada
Es Castell
Barruecopardo
Sant Martí d'Albars
Algarrobo
Durango, Colorado
Santa Maria de Miralles
Esplugues de Llobregat
Vallès Oriental
Sagàs
San Juan de Aznalfarache
Algámitas
Ruesga
Los Corrales
Casariche
Olivella
Vitigudino
El Coronil
Valle de Villaverde
Alconada
Priorat
Valdeolea
La Roda de Andalucía
Pozuelo de Alarcón
Arteixo
Bustarviejo
Castellar del Riu
Iruelos
Robledo de Chavela
Cipérez
La Acebeda
El Cubo de Don Sancho
Balsareny
Paradinas de San Juan
Alfoz de Lloredo
Fornalutx
Baix Empordà
Guriezo
Prats de Lluçanès
Canillo
Les Franqueses del Vallès
Valdefuentes de Sangusín
Villamantilla
Huerta
Valldemossa
La Tala
As Pontes de García Rodríguez
Castell de l'Areny
Cañete la Real
Santa Maria de Palautordera
Logroño
Soba
Lousame
Doña Mencía
El Boalo
La Garriga
El Milano
Coslada
Madroñal
Val de San Vicente
Puebla de Azaba
El Brull
Cuevas de San Marcos
Moriscos
Cabanas
Castraz
Tresviso
Comillas
Torremolinos
Pedro Abad
Dos Hermanas
Vilafranca del Penedès
La Cortinada
Rellinars
Lleida
Aldea del Fresno
Peratallada
Alcolea del Río
San Martín de la Vega
Villaralto
Santa Margarida i els Monjos
Los Santos
Peñarrubia
Sant Bartomeu del Grau
Los Corrales de Buelna
Matilla de los Caños del Río
Rubí
Pizarral
Arahal
Carrión de los Céspedes
Santa Margarida de Montbui
Illas
Montoro
Griñón
Ituero de Azaba
Montmeló
Garcirrey
Juzbado
Caldes d'Estrac
Montmaneu
Soto del Barco
Tardáguila
Meco
Vega de Pas
Venturada
Bergondo
La Campana
Constantina
Dodro
Obejo
Cómpeta
La Bastida
Las Navas de la Concepción
Aldeanueva de la Sierra
Bilbao
Mesía
Calzada de Valdunciel
Somosierra
Fresno de Torote
Robledillo de la Jara
Buenamadre
Ledrada
Arenys de Mar
Palma del Río
Salvatierra de Tormes
Ratón, New Mexico
Villaseco de los Reyes
Cardona
Manilva
Berguedà
Rajadell
Amieva
San Lorenzo de El Escorial
Calldetenes
Tavertet
Arans
Torrelavit
El Escorial
Cabrera d'Igualada
Villasdardo
Villar de Gallimazo
Babilafuente
Vega de Liébana
Degaña
Costitx
Adamuz
Villanueva del Rey
Casserres
Sant Andreu de Llavaneres
Castillejo de Martín Viejo
Torrejón de Ardoz
Benaoján
Gines
Santpedor
Ciudad Rodrigo
Villayón
Alhaurín el Grande
El Manzano
San Esteban de la Sierra
Valladolid
Castilblanco de los Arroyos
Pinto
Villaescusa
Pozuelo del Rey
Peñaranda de Bracamonte
El Saucejo
Oristà
Muros
Aldeaseca de la Frontera
Jorba
Mancor de la Vall
Pozoblanco
Pilas
Colmenar del Arroyo
Cervelló
Noia
Tenebrón
Trabanca
Masies de Roda
Doñinos de Salamanca
Monsagro
Pesoz
Santa Coloma de Cervelló
Madrid
Chapinería
Peralejos de Arriba
Tabera de Abajo
Castrillón
Herguijuela del Campo
Allande
Penagos
Montmajor
Coirós
Zorilla
Linares de Riofrío
Santiso
Sanlúcar la Mayor
El Franco
Canillas de Aceituno
Carnota
Negreira
Alfarnatejo
Villavieja de Yeltes
Vilada
La Nou de Berguedà
La Alameda de Gardón
Gijón
Moclinejo
Saelices el Chico
Betanzos
Subirats
Árchez
Tiana
Boiro
Boadilla del Monte
Canyelles
Fuentes de Oñoro
Seville
Vilalba Sasserra
San Pedro del Valle
La Rinconada de la Sierra
Vic
Consell
Sant Martí de Tous
Vilvestre
Golpejas
Maresme
Polanco
Villanueva de Algaidas
Montilla
Sevilla
Robleda
Pelarrodríguez
Calaf
Zorita de la Frontera
Colmenar de Oreja
Urgell
Tagamanent
Ruente
Ciempozuelos
Sa Riera
Vallirana
Carratraca
Hinojosa de Duero
Val do Dubra
Borredà
Gargantilla del Lozoya y Pinilla de Buitrago
Les Masies de Voltregà
Canencia
Rute
Monterey, California
Frades de la Sierra
Villanueva del Conde
Villanueva de San Juan
Cobeña
Mazaricos
Aguilar de Segarra
Rozas de Puerto Real
El Borge
Herguijuela de la Sierra
Huelva
Sant Pere de Ribes
Sant Sadurní d'Osormort
Ponteceso
Belmonte de Miranda
Val d'Aran
Castellanos de Moriscos
Algete
Carme
Gozón
Cadalso de los Vidrios
Velilla de San Antonio
Montseny
Artés
Navalcarnero
Comares
Almodóvar del Río
Orusco de Tajuña
Sariego
El Cabaco
Santa Maria de Martorelles
Villanueva de Perales
Istán
Brea de Tajo
Alcobendas
El Atazar
Valdilecha
Coca de Alba
Alhaurín de la Torre
Sant Pol de Mar
Tomares
Aznalcázar
Nava
Cártama
Moraleja de Enmedio
Getxo
Olvan
Martín de Yeltes
Somiedo
Fuengirola
Montemayor del Río
Segovia
Son Servera
El Molar
Almogía
Aldeatejada
Sant Joan de Vilatorrada
Torrejón de la Calzada
Valdeolmos-Alalpardo
Fene
El Rubio
Coristanco
La Roca del Vallès
Anoia
Llanera
Sequeros
Barakaldo
Villaverde del Río
Puigdàlber
Carreño
Santa Susanna
Escorca
Baix Camp
Abrera
Cepeda
Sant Just Desvern
Hinojosa del Duque
Gilena
Salmoral
Limpias
San Nicolás del Puerto
Robliza de Cojos
Sabadell
Sencelles
Canillas de Abajo
Gavà
Proaza
Cerdanyola del Vallès
Castilleja de Guzmán
Font-rubí
Barbadillo
Berrocal de Huebra
Aguilar de la Frontera
Villavieja del Lozoya
Guadalcázar
El Cuervo
Parauta
Garraf
Fuenlabrada
Pedraza de Alba
Moeche
Priego de Córdoba
Villa del Prado
Bujalance
Terrassa
Ribamontán al Mar
Sóller
Alella
Nuevo Baztán
Cañon City, Colorado
Artà
Lumbrales
Pelabravo
Sineu
El Castillo de las Guardas
Saro
La Redonda
Vitoria
Valdelaguna
Brión
L'Ametlla del Vallès
Riogordo
Sardón de los Frailes
Figaró-Montmany
Arredondo
Encinasola de los Comendadores
Cerdanya
Eivissa Vila
Paradas
El Vellón
Hoyo de Manzanares
La Pobla de Lillet
Cervera de Buitrago
Horcajo de Montemayor
Nava de Sotrobal
Tona
Valdelosa
Zarzalejo
Pesaguero
Castellterçol
Espiel
Gaucín
Cortez, Colorado
Montejaque
Olesa de Bonesvalls
Cabrillas
Ambite
Benalmádena
Morcín
Córdoba
Cenicientos
Capellades
Bóveda del Río Almar
Arenys de Munt
Palencia
Humilladero
Añover de Tormes
Trazo
Caso
Santa María de Sando
Corpa
Cercs
Santorcaz
Riosa
Santa Maria de Corcó
La Vansa i Fórnols
Búger
Santa Cecília de Voltregà
Huévar del Aljarafe
Veciana
Sando
Cabuérniga
Avilés
Tordillos
Casafranca
Puertas
Peromingo
Bermellar
Moríñigo
Calzada de Don Diego
Peralejos de Abajo
Ariany
Sant Climent de Llobregat
Burgos
Cieza
La Palma de Cervelló
Zarapicos
A Baña
Reocín
Monleón
Oleiros
Pedrosillo el Ralo
Santiago de la Puebla
Sant Hipòlit de Voltregà
La Bouza
Òdena
Polinyà
Herrera
Montemayor
San Muñoz
Bixessari
Cardeña
Mozárbez
Santa Margalida
Puebla de Yeltes
Larrodrigo
Llorts
Fuente Palmera
Argençola
Sayalonga
Aldeadávila de la Ribera
Escaldes-Engordany
Yecla de Yeltes
Málaga
Els Hostalets de Pierola
Horcajo Medianero
Frigiliana
Llumeneres
Pontevedra
Colmenar de Montemayor
Castellgalí
Polaciones
Albacete
Horcajo de la Sierra
Cabezón de la Sal
Anaya de Alba
Cartes
Sant Llorenç Savall
Aldealengua
Sant Pere de Riudebitlles
Sant Jaume de Frontanyà
La Massana
Calders
Miño
Santa Coloma
Ordes
El Ronquillo
Badia del Vallès
Valverde de Valdelacasa
Monterrubio de la Sierra
San Miguel de Valero
Guardiola de Berguedà
Santander
Santa Fe del Penedès
El Pla del Penedès
Randsol
Montesquiu
Paracuellos de Jarama
Marratxí
Cànoves i Samalús
Fresnedillas de la Oliva
Almáchar
Ripollès
Espejo
Villanueva de la Cañada
Villar de Samaniego
Caravia
Calella de Palafrugell
A Pobra do Caramiñal
Guijo de Ávila
Torrelles de Foix
Laviana
Les Bons
Puente del Congosto
Cerralbo
Herguijuela de Ciudad Rodrigo
Fuente la Lancha
Sant Fost de Campsentelles
Piñon, Colorado
Yernes y Tameza
Sant Celoni
Campoo de Enmedio
El Álamo
Portugalete
Bareyo
Fuente-Tójar
La Mata de Ledesma
Umbrete
Colindres
L'Hospitalet de Llobregat
Les Cabanyes
Encinas Reales
Miraflores de la Sierra
Espartinas
Pacs del Penedès
Padrón
Monda
Villanueva de Tapia
San Roque de Riomiera
Cerezal de Peñahorcada
Castellcir
Conca de Barberà
Cordovilla
Santa Eugènia de Berga
Quijorna
Montclar
Tudanca
Illano
Arroyomolinos
Sobrescobio
Lugo
Selva
Retortillo
Grandas de Salime
Loeches
Iznájar
Fogars de la Selva
Aldearrubia
La Atalaya
Molins de Rei
Torres de la Alameda
Lliçà de Vall
Solsonès
Berrocal de Salvatierra
Cereceda de la Sierra
Alaró
Pedrosillo de Alba
Arganda del Rey
Santa Maria de Merlès
Murcia
Benamejí
Lora de Estepa
Cantalapiedra
Andorra la Vella
Navarredonda y San Mamés
Almedinilla
Alba de Tormes
Vimianzo
Presidio, Texas
San Sebastián de los Reyes
Puig-reig
Benahavís
Orís
Galisancho
Pizarra
Luena
Sant Feliu de Llobregat
El Carpio
Voto
Genalguacil
Mairena del Alcor
Sant Julia de Loria
Irixoa
Maria de la Salut
Pinilla del Valle
Pineda de Mar
La Cabeza de Béjar
El Astillero
El Viso
Sant Quintí de Mediona
Santanyí
Anyos
Torrox
Campillo de Azaba
Vilanova de Sau
La Fuente de San Esteban
Peña Blvd, Denver
Valsalabroso
Navales
Torrecampo
Vilassar de Dalt
Benarrabá
Castellanos de Villiquera
Huesca
Aldehuela de Yeltes
Lucena
Isla Mayor
Sant Joan de Labritja
Capolat
Sotoserrano
Arnuero
Vallbona d'Anoia
Anchuelo
Badolatosa
Almendra
Llinars del Vallès
Arzúa
Las Cabezas de San Juan
Valencina de la Concepción
Cortes de la Frontera
Castellfollit del Boix
Balenyà
Almadén de la Plata
Sant Feliu Sasserra
Calonge de Segarra
Valdehijaderos
Álora
Brunete
Monfero
Santo Adriano
Corbera de Llobregat
Nava de Béjar
Cabra
La Vellés
Aldeaseca de Alba
Jaén
Sant Cugat Sesgarrigues
Garrotxa
El Tornadizo
Valdeavero
Folgueroles
Bunyola
Castellfollit de Riubregós
Cabana de Bergantiños
La Alberguería de Argañán
Puerto de Béjar
Totalán
Encamp
Anievas
Pallars Jussà
Teià
Madarcos
Segrià
Ruiloba
Santa Eufemia
Carcabuey
Horcajuelo de la Sierra
Maó
Mediona
Olost
Lebrija
Santurce
Olmedo de Camaces
Peñaflor
Villasbuenas
Montejo de la Sierra
Malla
Almensilla
Saucelle
Beleña
Brenes
Buenavista
Parada de Rubiales
Arriate
Miranda de Azán
Santiz
La Puebla del Río
Archidona
Camariñas
Centelles
Avinyonet del Penedès
Carrascal de Barregas
Melide
Villar de Ciervo
Villalbilla
Benamargosa
Súria
El Guijo
El Bodón
Ahigal de Villarino
Valdaracete
Girona
Castro del Río
Gironès
Majadahonda
Nueva Carteya
Cuenca
Galindo y Perahuy
Los Gatos, California
San Cristóbal de la Cuesta
Villanueva de Córdoba
Tejeda y Segoyuela
La Serna del Monte
Monistrol de Calders
Las Regueras
Negrilla de Palencia
Palma
Valsequillo
Osuna
Sancti-Spíritus
Sant Pere de Vilamajor
Cardedeu
Sant Feliu de Codines
Vacarisses
Canet de Mar
Soria
Campoo de Yuso
Garcihernández
Vallcebre
Castellví de Rosanes
Garganta de los Montes
Palomares del Río
Sant Quirze del Vallès
Cillorigo de Liébana
Fisterra
Monterrubio de Armuña
Aldea del Obispo
Gomecello
Cantalpino
Muntanyola
Palau-solità i Plegamans
Las Palmas
Santa Marta de Tormes
Cesuras
Serradilla del Arroyo
Santa María de la Alameda
Faraján
Santa Perpètua de Mogoda
La Puebla de los Infantes
Los Tojos
Pitiegua
Monturque
Tamariu
Peñarroya-Pueblonuevo
Olesa de Montserrat
Argoños
Los Palacios y Villafranca
El Pont de Vilomara i Rocafort
Burguillos
Casas de Monleón
Igualada
Matadepera
Olivares
Ares
Pollença
Pedrosillo de los Aires
A Capela
Castilleja del Campo
Navarra
Almargen
Peñarandilla
Cabrales
Cercedilla
El Viso del Alcor
Teo
Guadalajara
Tocina
Casillas de Flores
Peñamellera Baja
Granollers
Peñacaballera
Patones
Palaciosrubios
Barbalos
Cabrils
Montellano
Miera
Manacor
Talamanca
Santa Maria d'Oló
Conquista
Culleredo
Valero
Rincón de la Victoria
Felanitx
Brincones
Benadalid
Frades
Villaviciosa de Córdoba
Curtis
La Rambla
Palacios del Arzobispo
Chinchón
Dos Torres
Los Molinos
Sant Joan Despí
Alcúdia
Boada
Bigues i Riells
Villafranca de Córdoba
Yunquera
Alpens
San Felices de los Gallegos
Mijas
Les Masies de Roda
Navarredonda de la Rinconada
Candelario
Almenara de Tormes
San Martín de Oscos
Sant Pere de Torelló
Ponga
Castellar de n'Hug
Santiago de Compostela
Cañete de las Torres
Sevilla la Nueva
Aldearrodrigo
Béjar
Cantabria
Mataró
Premià de Mar
La Luisiana
Los Blázquez
Villamanrique de Tajo
Forfoleda
Sant Vicenç dels Horts
Herrerías
La Sierpe
Santibáñez de la Sierra
Antequera
Campanet
Villarino de los Aires
Pesquera
Rianxo
El Maíllo
Narros de Matalayegua
Touro
Albaida del Aljarafe
Parla
Redueña
Lloret de Vistalegre
Luque
La Granada
Sieteiglesias de Tormes
Jubrique
Tavèrnoles
Muxía
Casares
L'Estartit
Sant Iscle de Vallalta
Mugardos
Santiurde de Reinosa
Camargo
Carbajosa de la Sagrada
Lliçà d'Amunt
Fuente el Saz de Jarama
Camaleño
Alpandeire
Vedra
Valdemoro
Navalmoral de Béjar
Arenas
La Puebla de Cazalla
Asturias
San Martín de Valdeiglesias
San Martín del Castañar
Re: Matching multi-character folds, and FMTEYEWTK on troubles thereof
2008/11/25 Tom Christiansen <tchrist@perl.com>:
> *** Unicode CLDR Project: Common Locale Data Repository
> http://unicode.org/cldr/
>
> *** CVS Snapshots for CLDR:
> ftp://ftp.unicode.org/Public/cldr/cldr-repository-daily.tgz
>
> Yves, there're also French versions of some of the above, s'il te plaît,
> but I had trouble getting them to download.
>
> The last, CLDR, contains *VERY* interesting stuff. I wish I could figure
> out how to auto-translate these into Unicode::Collation objects. For
> example, here's cldr/common/collation/fr.xml, fycnrdths:

Strange, it doesn't seem to contain collations for the French "Œ" ("e
dans l'o"), which sorts exactly as "oe". Does that mean that the CLDR
still has bugs too?
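
One possible explanation, offered as a guess rather than a ruling: the
default table may already treat the ligature as "oe" at the primary level,
in which case a locale file needs no rule for it. A minimal sketch for
checking this against the installed Unicode::Collate; the result shown is
what I would expect from the DUCET versions I know of.

```perl
use strict;
use warnings;
use Unicode::Collate;

# "\x{153}" is LATIN SMALL LIGATURE OE.
my $ligature = "c\x{153}ur";
my $spelled  = "coeur";

# level => 1 compares base letters only, ignoring accents and case.
my $primary = Unicode::Collate->new(level => 1);

# The default collator uses all levels, so finer differences still count.
my $full = Unicode::Collate->new;

my $same_primary = $primary->eq($ligature, $spelled) ? 1 : 0;
my $same_full    = $full->eq($ligature, $spelled)    ? 1 : 0;

print "primary level equal: $same_primary\n";   # 1
print "all levels equal:    $same_full\n";      # 0
```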

Anyway. I don't think that it's the core's job to handle localisation
data and collations. There are too many of them, not counting the ones
you might invent for specific purposes (like, where to put Planck's
constant in a quantum physics book index?) Let us begin by trying to
get the Turkish capitalisation right. And even for this, I'm not sure
we want it really in the core.

> <?xml version="1.0" encoding="UTF-8" ?>
> <!DOCTYPE ldml SYSTEM "http://www.unicode.org/cldr/dtd/1.6/ldml.dtd">
> <ldml>
> <identity>
> <version number="$Revision: 1.23 $"/>
> <generation date="$Date: 2008/03/10 02:27:54 $"/>
> <language type="fr" />
> </identity>
> <collations validSubLocales="fr_BE fr_CA fr_CH fr_FR fr_LU">
> <collation type="standard" >
> <settings backwards="on" />
> <rules>
> <reset>ae</reset>
> <s>æ</s>
> <t>Æ</t>
> <!--
> <reset>A</reset>
> <x><s>Æ</s><extend>E</extend></x>
> <reset>a</reset>
> <x><s>æ</s><extend>e</extend></x>
> -->
> </rules>
> </collation >
> </collations>
> </ldml>

--
A system is nothing more than the subordination of all aspects of
the universe to any one such aspect.
-- Borges
Re: Matching multi-character folds
demerphq wrote:
> 2008/11/23 karl williamson <public@khwilliamson.com>:
>> demerphq wrote:
>>> 2008/11/23 karl williamson <public@khwilliamson.com>:
> [snip]
>>>> One of these is the oft mentioned in this list, German lower case sharp
>>>> s or ß. 'ss' =~ /ß/i is true. (U+00DF)
>>> 0xDF is the only multi-codepoint folding character in the latin-1 range.
>>>
>>> Also 0xDF is a "trickyfold" character meaning, that it can match
>>> something of longer length (in terms of bytes) folded than unfolded.
>>>
>> There must be more to it than that, as the code indicates there are only
>> three tricky fold characters, yet there are more that fit this definition.
>> For example U+023A which takes 2 bytes in UTF-8 folds to U+2C65 which takes
>> 3. They seem to work.
>
> The three trickyfold characters tickle a bug in minlen logic of the
> optimiser. The ones you mention dont, I think because they are both
> one codepoint long. As far as the mail history shows, I dont think I
> really got to the bottom of the bug in the optimiser, and I worked around
> it with the trickyfold construct as being the simplest solution.
>
> As far as I recall /$char/i for unicode $char is stored casefolded at
> compile time. The bug basically came down to:
>
> $df=chr(0xdf);
> utf8::upgrade($df);
> print $df=~/$df/i ? "ok" : "not ok";
>
> which if inspected under use re 'debug' revealed that this was
> internally converted into an EXACTF <ss> opcode. Which in turn caused
> the minlen logic to fire, as it is two characters long. An exhaustive
> search for these revealed problems only in the three codepoints we
> covered, and my retest shows that we have more of this class with the
> updates to unicode 5.1. Exactly why the others did not fail was never
> really clear. I did an exhaustive search and those were the ones I
> found. The optimiser is a scary beast :-(
>
> [snip]
>>> What do you mean by "beyond debate" here?
>>>
>>> Seems to me that there is a debate about whether unencoded
>>> nonlocalized strings should be treated as ascii or as latin-1, and if
>>> treated as latin-1 whether they should obey unicode foldcasing rules
>>> or not.
>>>
>> I thought that was settled. While you were taking a break from p5p, I
>> naively came in and started a discussion on it (there are various threads,
>> but most include [perl #58182] in the subject). There was agreement that
>> they should match Unicode and I gave a very detailed proposal which the 5.12
>> pumpking said sounded reasonable. It was pointed out that perl5100delta
>> says:
>>
>> | The handling of Unicode still is unclean in several places, where it's
>> | dependent on whether a string is internally flagged as UTF-8. This will
>> | be made more consistent in perl 5.12, but that won't be possible without
>> | a certain amount of backwards incompatibility.
>>
>> Similarly in perltodo, as I quoted in the first email on this thread:
>> "that should not be dependent on an internal storage detail of the string"
>> meaning the utf8ness of a string should not affect its external semantics.
>>
>> It seems clear that it's been agreed that the utf8ness of a string should
>> not affect its external behavior. So what should the behavior be? It has
>> to be the Unicode behavior, for otherwise, the characters between 128 and
>> 255 would never behave like Unicode.
>>
>> There are 3 main areas where things don't work. (I believe that the
>> problems with pack() have been fixed.)
>>
>> 1. uc(), lcfirst(), \U, etc. I have submitted for review code that gives
>> the same semantics for these whether or not the string is in utf8.
>
> This worries me, as it involves a fairly serious behaviour change. But
> if its been decided then fine, at least it will be consistent.
>
And since it is such a change, it will require a pragma to enable in
5.10, becoming the default in 5.12.
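
A minimal sketch of what such an opt-in looks like, assuming a perl that
provides the unicode_strings feature (5.12 or later; it is not available
in 5.10). The two helper subs are hypothetical names used only for this
illustration. Both hold U+00E9 as a native one-byte string and differ
only in whether the feature is in scope:

```perl
use strict;
use warnings;

# uc() on a native (non-utf8) U+00E9 without the feature:
# the byte is treated as caseless and comes back unchanged.
sub uc_e_acute_legacy {
    no feature 'unicode_strings';
    my $s = "\xE9";
    utf8::downgrade($s);          # make sure the UTF8 flag is off
    return ord uc $s;
}

# Same string, same storage, but with Unicode semantics requested.
sub uc_e_acute_unicode {
    use feature 'unicode_strings';
    my $s = "\xE9";
    utf8::downgrade($s);          # storage is still one native byte
    return ord uc $s;
}

printf "legacy:  U+%04X\n", uc_e_acute_legacy();
printf "unicode: U+%04X\n", uc_e_acute_unicode();
```

The point of the sketch is that the storage is identical in both subs, so
any difference in the result comes from the lexical pragma and not from
the utf8ness of the string.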
>> 2. \w, [:graph:], etc re matching. I think the solution to this is in your
>> RFC to make these just match ASCII or the current locale. Then the utf8ness
>> won't matter, except if someone's string gets converted to utf8, and then
>> their locale most likely won't work properly. That is why I said in an
>> earlier email that I don't think strings should be upgraded to utf8 when
>> "use locale" is in effect. The RFC also solves the problem of, for example,
>> \d matching things the programmer never intended, just because the string
>> silently, somehow, got changed to utf8. My proposal that I thought had been
>> accepted was, for example, to make \w match the appropriate Latin1
>> characters even when not in utf8. And I had working experimental code to do
>> that. But I think your RFC makes more sense.
>
> Ok.
>
>> 3. caseless re matching m/.../i Again, perl has to change so that the
>> utf8ness of the pattern doesn't matter. One could do it by adding
>> modifiers, as you originally suggested, like /u to force unicode semantics.
>> But I think you had pulled away from that idea. I would be open to
>> something like that, but I think there has to be a way for a programmer to
>> make that the default, without forcing them to always remember to add the
>> modifier. Or one could do it by having the re code know about latin1
>> semantics. Again, I have mostly working code which doesn't change regcomp.c
>> very much that does this. I do think overall that this is a better
>> solution than the modifier one. One consideration I have that has been
>> mentioned in the documentation is that latin1 should be faster than utf8. I
>> think Tom may have said that he didn't find that to be the case in his
>> experiments.
>
> I'd like to see more on this. I do know that benchmarking the regex
> engine is not easy. There are lots of special cases and things like
> that to consider. Ive definitely seen utf8 have serious performance
> consequences.
>
The goal should be to not have a programmer have to know about the
internal storage method of a string. From looking at the code, I don't
see how going to utf8 could possibly not have a significant impact. In
a program I wrote, I looked at the documentation and bent over backwards
to keep from going outside the Latin1 range, so as to not invoke utf8.
Then I discovered that Encode always goes to utf8, so my efforts were
for naught.
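
A small sketch of the effect described above, assuming only the core
Encode module. The storage() helper is a hypothetical name for this
illustration. It shows the same character held two ways: once as a
literal native byte, once after going through Encode:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $native  = "\xE9";                        # one byte, UTF8 flag off
my $decoded = decode('ISO-8859-1', "\xE9");  # same character via Encode

# Report how perl is storing a string internally.
sub storage { return utf8::is_utf8($_[0]) ? 'utf8' : 'native' }

printf "literal: %s, decoded: %s, equal: %s\n",
    storage($native), storage($decoded),
    ($native eq $decoded ? 'yes' : 'no');
```

The two strings compare equal, yet they are stored differently, which is
exactly the situation in which storage-dependent semantics surprise the
programmer.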
> [snip]
>>>> Another case is ligatures (they don't view ß as a ligature, and I don't
>>>> know why) So 'fi' =~ /fi/i is true. (U+FB01)
>>> Prompted by your comment about 'ß' I did some searching for
>>> information on ligatures and unicode and I was surprised how little
>>> there was. The only ligature support seems to be for legacy conversion
>>> reasons (for instance latin-1 equivalency), and it seems that
>>> ligatures are considered to be a presentation issue better left up to
>>> the font and the font rendering engine. A good discussion being this:
>>>
>>> http://unicode.org/faq/ligature_digraph.html
>>>
>>> When I checked the unicode data files I didn't find anything about
>>> ligatures outside of certain character names including the word
>>> 'LIGATURE', and some comments and commentary files mentioning that
>>> some characters are ligatures. So I'm wondering what you were getting
>>> at when you said "they don't view ß as a ligature, and I don't know
>>> why".
>>>
>> My source for that was lib/unicore/SpecialCasing.txt
>
> Right, which includes a comment about some of the unusual forms. But
> it is not a formal status or property of the characters.
>
> [snip]
>>>> Would you like to know what happens today in perl? Well I'll tell you
>>>> anyway. /[ß]/i is true and 'ǰ' =~ /[ǰ]/i is false. In fact, every
>>>> other
>>> I cant repeat that. In bleadperl 'ǰ' =~ /[ǰ]/i matches fine as far as
>>> i can tell.
>>>
>>> What doesnt work is
>>>
>>> fold('ǰ') =~ /[ǰ]/i
>>>
>>> where fold('ǰ') is equivalent to "\x{6A}\x{30C}".
>>>
>> I don't understand. I just tested again with the perl I have on my machine
>> that I think is today's bleadperl, and it failed. But in any event as you
>> agree below, there are a number of things broken.
>
> Can you post a oneliner that doesnt contain unicode in it to test
> with? In other words coded so it can be expressed in ascii, whatever
> the code itself does?
>
Actually, when I look at your test cases, I see it is one that failed:
LATIN SMALL LETTER J WITH CARON
uu '006A 030C' =~ /[01F0]/i
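
Here is a sketch of an ASCII-only test along the lines requested, with
every character written as a \x{...} escape. The run_case() helper and
the labels are hypothetical names for this illustration. No expected
results are given for the folded cases, since those depend on the perl
being tested:

```perl
use strict;
use warnings;

# U+01F0 LATIN SMALL LETTER J WITH CARON folds to U+006A U+030C.
my $char   = "\x{1F0}";
my $folded = "\x{6A}\x{30C}";

# Each entry: [ label, string, compiled pattern ].
my @cases = (
    [ 'char   =~ /char/i',   $char,   qr/\x{1F0}/i   ],
    [ 'char   =~ /[char]/i', $char,   qr/[\x{1F0}]/i ],
    [ 'folded =~ /char/i',   $folded, qr/\x{1F0}/i   ],
    [ 'folded =~ /[char]/i', $folded, qr/[\x{1F0}]/i ],
);

# Run one match and describe the outcome.
sub run_case {
    my ($str, $re) = @_;
    return $str =~ $re ? 'match' : 'no match';
}

printf "%-22s %s\n", $_->[0], run_case($_->[1], $_->[2]) for @cases;
```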
>>>> multi-char fold returns false. This in fact may be the only time in perl
>>>> history, savor the moment, when the infamous ß gives an arguably more
>>>> correct result than other characters.
>>> Hmm. Interesting. I cant decide to be happy about this, or sad.
>>>
>> The only reason it works is because for single character char classes, they
>> get optimized out, and somehow, it works. [ßa] doesn't work.
>
> Ah. Sigh. So they turn into EXACTF instead of ANYOF. I forgot about that.
>
>>>> Now the code in regcomp.c takes special pains to make all these match.
>>>> But
>>>> it doesn't work, except in the [ß] case. So we don't have to worry about
>>>> breaking existing code if we decide it should work differently.
>>>>
>>>> Let's look at it the other direction. Should ß =~ /ss/i ? Should 'ǰ' =~
>>>> /ǰ/i ? They both are true currently. However, things like ß =~ /s{2}/i
>>>> is
>>>> false, and that seems inconsistent.
>>>>
>>>>
>>>> So, I'm not sure what the right answers are, but things are broken today.
>>>>
>>> Yes, things are. I wrote the attached hacky script to parse out
>>> CaseFolding.txt and test all the complex folding rules. The output is
>>> below, the 'll', 'lu','ul','uu' means, 'latin' and 'unicode', with the
>>> first letter representing the string, and the second the patterns
>>> encoding. The description on the right is the test, with chars
>>> represented by their hex representation, and separated by spaces in
>>> the case of the folded string. The output on 5.8.9 looks different,
>>> with more mistakes.
>>>
>>> demerphq@gemini:~/blead/p4/lib/unicore$ ../../perl -I../../lib
>>> test_case_folding.pl
>>> LATIN SMALL LETTER SHARP S
>>> ll '0073 0073' =~ /00DF/i
>>> ll, ul, uu '0073 0073' =~ /[00DF]/i
>
> ll is expected to fail here under the current rules.
>
> [snip]
>>> LATIN CAPITAL LETTER SHARP S
>>> lu, uu '0073 0073' =~ /[1E9E]/i
>
> lu probably fails because of the minlen bug.
>
> [snip]
>>> LATIN SMALL LIGATURE FF
>>> lu, uu '0066 0066' =~ /[FB00]/i
>>> LATIN SMALL LIGATURE FI
>>> lu, uu '0066 0069' =~ /[FB01]/i
>>> LATIN SMALL LIGATURE FL
>>> lu, uu '0066 006C' =~ /[FB02]/i
>>> LATIN SMALL LIGATURE FFI
>>> lu, uu '0066 0066 0069' =~ /[FB03]/i
>>> LATIN SMALL LIGATURE FFL
>>> lu, uu '0066 0066 006C' =~ /[FB04]/i
>>> LATIN SMALL LIGATURE LONG S T
>>> lu, uu '0073 0074' =~ /[FB05]/i
>>> LATIN SMALL LIGATURE ST
>>> lu, uu '0073 0074' =~ /[FB06]/i
>
> These lu's might fail because of the minlen bug. Are these new to 5.1?
>
These latin ligatures were in Unicode 1.1. There's some code in
regclass() in regcomp.c for EBCDIC only that looks bogus to me that is
attempting to handle some of these. I don't understand why just some
would need special handling.

>> What Yves didn't mention to those of you reading along, is that only the
>> failures were printed above.
>
> Yes correct, and we only test the possible combinations. So only \xDF
> has 'll' or 'ul' and most only have 'uu'.
>
>> When I run his program on 5.8 vs blead on the
>> same version of the Unicode database, the only differences I saw were
>> related, I think, to Yves fixing things in 5.10 with his tricky fold
>> addition, and the new in Unicode 5.1 upper case version of ß. I don't
>> understand off-hand why that would be different.
>
> Because its not being handled by the trickyfold logic. Basically its
> the same problem as the lower case but it hasn't been added to
> regcharclass.pl. And none of the special cases coded into the regex
> engine to deal with 0xDF have been added to the engine for its
> majestic brother.
>
>>> So its clear that multicode-point character class folding is broken
>>> for some definition of expected behaviour.
>>>
>>> I personally consider character class notation to be an abbreviation
>>> of alternation. So a character class [xyz] is supposed to match the
>>> same thing as (x|y|z). This implies that character classes have to be
>>> able to match more than one character under case-folding rules. A lot
>>> of external logic and at least some internal logic operates under this
>>> assumption, so i dont think we can change it.
>>>
>> That sounds right.
>
> Im trying to imagine a way to do this that doesn't involve a pretty
> considerable redesign of how character classes work, and not coming up
> with much.
>
> Yves
>
Keep in mind that it works for the vast majority of Unicode characters,
and fails only on a few, and only when there is a multi-character fold.
However, we don't even attempt to implement some things that Unicode
would want us to, such as treating two strings that are in different
canonical normalizations as equivalent.
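
To illustrate the normalization point, here is a minimal sketch assuming
the core Unicode::Normalize module. The canon_eq() helper is a
hypothetical name for this illustration, not something perl provides. It
compares a precomposed e-acute with its decomposed spelling:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $precomposed = "\x{E9}";     # U+00E9
my $decomposed  = "e\x{301}";   # U+0065 U+0301

# Compare two strings after bringing both to the same canonical form.
sub canon_eq { return NFD($_[0]) eq NFD($_[1]) }

print "raw eq:   ",
    ($precomposed eq $decomposed ? "same" : "different"), "\n";
print "canon_eq: ",
    (canon_eq($precomposed, $decomposed) ? "same" : "different"), "\n";
```

Plain eq reports the two as different, while the normalized comparison
reports them as the same. The regex engine behaves like the former.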
Re: Matching multi-character folds, and FMTEYEWTK on troubles thereof
"Rafael Garcia-Suarez" <rgarciasuarez@gmail.com>
wrote on "Wed, 26 Nov 2008 08:49:02 +0100.":

> 2008/11/25 Tom Christiansen <tchrist@perl.com>:

( >> Me, I'm wondering the same thing. See below. Sometimes I feel lost )
( >> in one of Borges's labyrinths--or Eco's, though these are the same. )

>> *** Unicode CLDR Project: Common Locale Data Repository
>> http://unicode.org/cldr/
>>
>> *** CVS Snapshots for CLDR:
>> ftp://ftp.unicode.org/Public/cldr/cldr-repository-daily.tgz

>> Yves, there're also French versions of some of the above, s'il te plaît,
>> but I had trouble getting them to download.

>> The last, CLDR, contains *VERY* interesting stuff. I wish I could figure
>> out how to auto-translate these into Unicode::Collation objects. For
>> example, here's cldr/common/collation/fr.xml, fycnrdths:

> Strange, it doesn't seem to contain collations for the French "Œ" ("e
> dans l'o"), which sorts exactly as "oe".

That I found odd, too. But I'm thinking that OE and the digraph have
different titlecase renderings. Is that correct? See here:

% perl -E 'say ucfirst "oeuf"'
Oeuf
% perl -E 'say ucfirst "\x{152}uf"'
Œuf

Reminds me of the old joke:

Q: Why are the French so svelte?
A: Light breakfasts, where un œuf is always enough. :-)

So œ and Œ work more like the English ae digraph, once a separate
letter for the sound of "cat" or "sat", and written Æ and æ. It wasn't
considered a ligature as it is today, as still seen in Icelandic or in
Old English where you find Ǣ and ǣ or Ǽ and ǽ.

That is, neither French œ nor English æ have the tripartite case
system of Hungarian:

% perl -E 'say for chr(0x01F3), ucfirst(chr 0x01F3), uc(chr 0x01F3)'
dz
Dz
DZ
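
The same comparison can be sketched in pure ASCII by printing code points
instead of glyphs, which sidesteps terminal encoding problems. The
case_forms() helper is a hypothetical name for this illustration:

```perl
use strict;
use warnings;

# Return the lower/title/upper code points of one character, in hex.
sub case_forms {
    my ($cp) = @_;
    my $c = chr $cp;
    return join '/',
        map { sprintf('U+%04X', ord $_) } (lc($c), ucfirst($c), uc($c));
}

print "dz digraph:  ", case_forms(0x01F3), "\n";
print "oe ligature: ", case_forms(0x0153), "\n";
```

The dz digraph yields three distinct code points (U+01F3/U+01F2/U+01F1),
while the oe ligature yields only two (U+0153/U+0152/U+0152), its
titlecase and uppercase forms being the same character.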

Still, people get confused about capitalizing ligatures and digraphs
(like "th" or "ch" in English) even in places they don't go. Think of
road signs that say that "LLeida" is this way, for example, or "LLiçà
de Vall". [Can you tell I was once lost in Andorra, hitchhiking, and
driving around Catalunya, equally lost? :-)]

At least we don't have to decompose "ß" so that "ß" =~ /ſs/i, or
"Œ" =~ /oe/i, or perhaps even "ÿ" =~ /ij/ (ducks from Johan and Abigail :-).

I can just see people wanting weird matches on this sort of thing:

Loſt be yᵉ, and on so trafficked a way?

> Does that mean that the CLDR still have bugs too ?

Yes, I think you are correct. You can read that in their XML, where they
mention bugfixes from now and then. And I now know how to write them for
French, using the tabular approach I used for the Iberian tongues.

> Anyway. I don't think that it's the core's job to handle localisation
> data and collations. There are too many of them, not counting the ones
> you might invent for specific purposes (like, where to put Planck's
> constant in a quantum physics book index?) Let us begin by trying to
> get the Turkish capitalisation right. And even for this, I'm not sure
> we want it really in the core.

True enough, and you are nearly certainly correct; I still am amazed we
get as much right as we do.

But I still long for [=e=] though. I know, I know: modules.

After the Iberian stuff I did, I **really** wonder whether bending
over backwards for ß to cope with SS and Ss may prove to have
been a bad idea in the long run.

I wonder what the perl6 folks are thinking re this?

And, um, er, if? :-(

> A system is nothing more than the subordination of all aspects of
> the universe to any one such aspect.
> -- Borges

I haven't played with the default DUCET, just my modified one.
I should change that.

--tom
--

Como todo poseedor de una biblioteca, Aureliano
se sabía culpable de no conocerla hasta el fin.
--Jorge Luis Borges, 'Los teólogos' in _El Aleph_

~ Like all those possessing a library, Aurelian was aware
that he was guilty of not knowing his in its entirety. ~