Mailing List Archive

Different apostrophe signs and MediaWiki internal search
Hi!

There are several different apostrophe signs. Let's consider two of them:
U+0027 and U+2019. They have the same meaning, and both are acceptable
apostrophes in English, for instance. The problem is that the MediaWiki
internal search distinguishes these two apostrophes: words containing
U+2019 can't be found with a query containing U+0027, and vice versa.

MediaWiki uses a search index for the internal search, and the index is
rebuilt every time an article is saved. I have found that if the
stripForSearch() function in the language class is overridden with a new
function which replaces U+2019 with U+0027 in the search index, the
internal search starts to work properly regardless of which apostrophe,
U+0027 or U+2019, is provided in the search query. Admittedly, the
context is not highlighted when the apostrophes differ between the query
and the result, but the search returns what is really needed.

The question is: if we override the stripForSearch() function in the
language class in this way, won't it cause any problems?

The code of the override function is the following:

function stripForSearch( $string ) {
  $s = $string;
  // Fold U+2019 (RIGHT SINGLE QUOTATION MARK, UTF-8 bytes E2 80 99)
  // to the plain apostrophe U+0027 before the text is indexed.
  $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
  return parent::stripForSearch( $s );
}

We want to introduce such a fix for Belarusian, but I think the
Ukrainian language may experience the same problem with the different
apostrophes, as U+0027 is not a valid apostrophe there either, yet only
U+0027 (the typewriter apostrophe) is available on the majority of
Belarusian and Ukrainian keyboard layouts.

Thanks,
zedlik

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Different apostrophe signs and MediaWiki internal search
Jaska Zedlik skrev:
<...>
> The code of the override function is the following:
>
> function stripForSearch( $string ) {
> $s = $string;
> $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
> return parent::stripForSearch( $s );
> }

I'm not a PHP programmer, but why use the extra assignment to $s
instead of using $string directly in the parent call, like so:

function stripForSearch( $string ) {
  $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
  return parent::stripForSearch( $s );
}

or even:

function stripForSearch( $string ) {
  return parent::stripForSearch(
    preg_replace( '/\xe2\x80\x99/', '\'', $string ) );
}

... ?

Regards,

// Rolf Lampa



Re: Different apostrophe signs and MediaWiki internal search
Jaska Zedlik wrote:
> Hi!
>
> There are different apostrophe signs exist. Let's consider 2 of them:
> U+0027 and U+2019. They have the same meaning and both of them are
> acceptable and apostrophes for the English language, for instance. The
> problem is that MediaWiki internal search distinguishes these two
> apostrophes and the words containing U+2019 can't be found with the
> request containing U+0027 and vice versa.
>
Probably what we should be doing in this area is running text through
Unicode compatibility composition normalization as well as some other
character folding for punctuation forms where necessary.
(UtfNormal::toNFKC() will merge things like full-width Roman characters
but won't merge these related-but-not-quite-the-same punctuation forms.)
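To illustrate the distinction Brion is drawing (a quick Python sketch using the standard unicodedata module, not MediaWiki's UtfNormal): NFKC folds compatibility forms such as fullwidth letters, but U+2019 has no compatibility decomposition, so it survives NFKC and would still need a separate punctuation fold.

```python
import unicodedata

fullwidth_a = "\uFF21"  # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
curly = "\u2019"        # U+2019 RIGHT SINGLE QUOTATION MARK

# Compatibility forms are merged: fullwidth "A" becomes plain "A".
print(unicodedata.normalize("NFKC", fullwidth_a))  # -> A

# But U+2019 is unchanged by NFKC, so apostrophe variants stay distinct.
print(unicodedata.normalize("NFKC", curly) == curly)  # -> True
```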

-- brion
> MediaWiki uses a search index for the internal search and the index is
> renewed every time the article is saved. I have found that if to
> override the function stripForSearch() in the language class with the
> new function wich relpaces the U+2019 with U+0027 for search index it
> appears that the internal search begins to work properly not paying
> attention to which exactly apostrophe was provided in the search
> query, either U+0027 or U+2019. For sure, the context is not
> highlighted if the apostrophes differ in the query and in the result,
> but the search returns what is really needed.
>
> The question is, if we override the stripForSearch() function in the
> language class in such a way, won't this cause any problems?
>
> The code of the override function is the following:
>
> function stripForSearch( $string ) {
> $s = $string;
> $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
> return parent::stripForSearch( $s );
> }
>
> We want to introduce such an issue for Belarusian, but I think
> Ukrainian language may experience the same problem with the different
> apostrophes, as U+0027 is not a valid apostrophe here as well, but
> only U+0027 (the typewriter apostrophe) is available on the majority
> of Belarusian and Ukrainian keyboard layouts.
>
> Thanks,
> zedlik
>


Re: Different apostrophe signs and MediaWiki internal search
Hello,
On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:

> Jaska Zedlik skrev:
> <...>
> > The code of the override function is the following:
> >
> > function stripForSearch( $string ) {
> > $s = $string;
> > $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
> > return parent::stripForSearch( $s );
> > }
>
> I'm not a PHP programmer, but why using the extra assignment of $s
> instead of using $string directly in the parent call, like so:
>
> function stripForSearch( $string ) {
> $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
> return parent::stripForSearch( $s );
> }
>
You are right; in the real function all these redundant assignments
should be stripped for performance. I just reused the skeleton of the
Japanese language class, which does some Japanese-specific reduction,
but I agree with your point.

zedlik
Re: Different apostrophe signs and MediaWiki internal search
Hello,

On Fri, Jun 19, 2009 at 23:28, Brion Vibber <brion@wikimedia.org> wrote:

> Jaska Zedlik wrote:
>
>> Hi!
>>
>> There are different apostrophe signs exist. Let's consider 2 of them:
>> U+0027 and U+2019. They have the same meaning and both of them are
>> acceptable and apostrophes for the English language, for instance. The
>> problem is that MediaWiki internal search distinguishes these two
>> apostrophes and the words containing U+2019 can't be found with the
>> request containing U+0027 and vice versa.
>>
>>
> Probably what we should be doing in this area is running text through
> Unicode compatibility composition normalization as well as some other
> character folding for punctuation forms where necessary.
> (UtfNormal::toNFKC() will merge things like full-width Roman characters but
> won't merge these related-but-not-quite-the-same punctuation forms.)
>
> -- brion

As I understand it, this is not a case of Unicode compatibility
composition, as these are two distinct characters (U+2019 is even
defined as RIGHT SINGLE QUOTATION MARK), though in some languages
(certainly not in all) they can have identical meaning. Since the
characters are distinct, I'm afraid they are not covered by the Unicode
normalization process, and we have to handle them in the functions
available in the language class.

zedlik
Re: Different apostrophe signs and MediaWiki internal search
2009/6/20 Jaska Zedlik <jz53zc@gmail.com>:
> Hello,
> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:
>
>> Jaska Zedlik skrev:
>> <...>
>> > The code of the override function is the following:
>> >
>> > function stripForSearch( $string ) {
>> >   $s = $string;
>> >   $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>> >   return parent::stripForSearch( $s );
>> > }
>>
>> I'm not a PHP programmer, but why using the extra assignment of $s
>> instead of using $string directly in the parent call, like so:
>>
>> function stripForSearch( $string ) {
>>     $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>     return parent::stripForSearch( $s );
>> }
>>
> Really, you are right, for the real function all these redundant assignments
> should be strepped for the productivity purposes, I just used a framework
> from the Japanese language class which does soma Japanese-specific
> reduction, but I agree with your notice.

The username anti-spoofing code already knows about a lot of "looks similar"
characters which may be of some help.

Andrew Dunbar (hippietrail)


> zedlik



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

Re: Different apostrophe signs and MediaWiki internal search
Andrew Dunbar wrote:
> 2009/6/20 Jaska Zedlik <jz53zc@gmail.com>:
>
>> Hello,
>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:
>>
>>
>>> Jaska Zedlik skrev:
>>> <...>
>>>
>>>> The code of the override function is the following:
>>>>
>>>> function stripForSearch( $string ) {
>>>> $s = $string;
>>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>> return parent::stripForSearch( $s );
>>>> }
>>>>
>>> I'm not a PHP programmer, but why using the extra assignment of $s
>>> instead of using $string directly in the parent call, like so:
>>>
>>> function stripForSearch( $string ) {
>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>> return parent::stripForSearch( $s );
>>> }
>>>
>>>
>> Really, you are right, for the real function all these redundant assignments
>> should be strepped for the productivity purposes, I just used a framework
>> from the Japanese language class which does soma Japanese-specific
>> reduction, but I agree with your notice.
>>
>
> The username anti-spoofing code already knows about a lot of "looks similar"
> characters which may be of some help.
>
> Andrew Dunbar (hippietrail)
>
>
>
Of itself, the username anti-spoofing code table -- which I originally
wrote -- is rather too thorough for this purpose, since it deliberately
errs on the side of mapping even vaguely similar-looking characters to
one another, regardless of character type and script system, and this,
combined with case-folding and transitivity, leads to some apparently
bizarre mappings that are of no practical use for any other application.

If you're interested, I can take a look at producing a more limited
punctuation-only version.

-- Neil


Re: Different apostrophe signs and MediaWiki internal search
Neil Harris wrote:
> Andrew Dunbar wrote:
>
>> 2009/6/20 Jaska Zedlik <jz53zc@gmail.com>:
>>
>>
>>> Hello,
>>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:
>>>
>>>
>>>
>>>> Jaska Zedlik skrev:
>>>> <...>
>>>>
>>>>
>>>>> The code of the override function is the following:
>>>>>
>>>>> function stripForSearch( $string ) {
>>>>> $s = $string;
>>>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>>> return parent::stripForSearch( $s );
>>>>> }
>>>>>
>>>>>
>>>> I'm not a PHP programmer, but why using the extra assignment of $s
>>>> instead of using $string directly in the parent call, like so:
>>>>
>>>> function stripForSearch( $string ) {
>>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>>> return parent::stripForSearch( $s );
>>>> }
>>>>
>>>>
>>>>
>>> Really, you are right, for the real function all these redundant assignments
>>> should be strepped for the productivity purposes, I just used a framework
>>> from the Japanese language class which does soma Japanese-specific
>>> reduction, but I agree with your notice.
>>>
>>>
>> The username anti-spoofing code already knows about a lot of "looks similar"
>> characters which may be of some help.
>>
>> Andrew Dunbar (hippietrail)
>>
>>
>>
>>
> Of itself, the username anti-spoofing code table -- which I originally
> wrote -- is rather too thorough for this purpose, since it deliberately
> errs on the side of mapping even vaguely similar-looking characters to
> one another, regardless of character type and script system,and this,
> combined with case-folding and transitivity, leads to some apparently
> bizarre mappings that are of no practical use for any other application.
>
> If you're interested, I can take a look at producing a more limited
> punctuation-only version.
>
> -- Neil
>
>
http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
single best source for information about visual confusables.

Staying entirely within the Latin punctuation repertoire, and avoiding
combining characters and other exotica such as math characters and
dingbats, you might want to consider the following characters as
possible unintentional lookalikes for the apostrophe:

U+0027 APOSTROPHE
U+2019 RIGHT SINGLE QUOTATION MARK
U+2018 LEFT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+2032 PRIME
U+00B4 ACUTE ACCENT
U+0060 GRAVE ACCENT
U+FF40 FULLWIDTH GRAVE ACCENT
U+FF07 FULLWIDTH APOSTROPHE

There are also lots of other characters that look like these from other
languages, and various combining character combinations which could also
look the same, but I doubt whether they would be generated in Latin text
by accident.

Please check these against the actual code tables for reasonableness and
accuracy before putting them in any code.
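If a table like the one above were applied, the fold itself could be a simple character translation run over both the indexed text and the search query. A minimal sketch (Python for illustration; the mapping is just the candidate list above, unvetted, and every lookalike is folded to plain U+0027):

```python
# Candidate fold table from the list above (unvetted; check against the
# Unicode code charts before using in real code).
APOSTROPHE_FOLDS = str.maketrans({
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u201B": "'",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    "\u2032": "'",  # PRIME
    "\u00B4": "'",  # ACUTE ACCENT
    "\u0060": "'",  # GRAVE ACCENT
    "\uFF40": "'",  # FULLWIDTH GRAVE ACCENT
    "\uFF07": "'",  # FULLWIDTH APOSTROPHE
})

def fold_apostrophes(text: str) -> str:
    """Map apostrophe lookalikes to U+0027 before indexing/searching."""
    return text.translate(APOSTROPHE_FOLDS)

print(fold_apostrophes("don\u2019t"))  # -> don't
```

Applying the same fold on both sides (index and query) is what makes the match direction-independent.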

-- Neil


Re: Different apostrophe signs and MediaWiki internal search
2009/6/20 Neil Harris <usenet@tonal.clara.co.uk>:
> Neil Harris wrote:
>> Andrew Dunbar wrote:
>>
>>> 2009/6/20 Jaska Zedlik <jz53zc@gmail.com>:
>>>
>>>
>>>> Hello,
>>>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:
>>>>
>>>>
>>>>
>>>>> Jaska Zedlik skrev:
>>>>> <...>
>>>>>
>>>>>
>>>>>> The code of the override function is the following:
>>>>>>
>>>>>> function stripForSearch( $string ) {
>>>>>>   $s = $string;
>>>>>>   $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>>>>   return parent::stripForSearch( $s );
>>>>>> }
>>>>>>
>>>>>>
>>>>> I'm not a PHP programmer, but why using the extra assignment of $s
>>>>> instead of using $string directly in the parent call, like so:
>>>>>
>>>>> function stripForSearch( $string ) {
>>>>>     $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>>>>     return parent::stripForSearch( $s );
>>>>> }
>>>>>
>>>>>
>>>>>
>>>> Really, you are right, for the real function all these redundant assignments
>>>> should be strepped for the productivity purposes, I just used a framework
>>>> from the Japanese language class which does soma Japanese-specific
>>>> reduction, but I agree with your notice.
>>>>
>>>>
>>> The username anti-spoofing code already knows about a lot of "looks similar"
>>> characters which may be of some help.
>>>
>>> Andrew Dunbar (hippietrail)
>>>
>>>
>>>
>>>
>> Of itself, the username anti-spoofing code table -- which I originally
>> wrote -- is rather too thorough for this purpose, since it deliberately
>> errs on the side of mapping even vaguely similar-looking characters to
>> one another, regardless of character type and script system,and this,
>> combined with case-folding and transitivity, leads to some apparently
>> bizarre mappings that are of no practical use for any other application.
>>
>> If you're interested, I can take a look at producing a more limited
>> punctuation-only version.
>>
>> -- Neil
>>
>>
> http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
> single best source for information about visual confusables.
>
> Staying entirely within the Latin punctuation repertoire, and avoiding
> combining characters and other exotica such as math characters and
> dingbats, you might want to consider the following characters as
> possible unintentional lookalikes for the apostrophe:
>
> U+0027 APOSTROPHE
> U+2019 RIGHT SINGLE QUOTATION MARK
> U+2018 LEFT SINGLE QUOTATION MARK
> U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
> U+2032 PRIME
> U+00B4 ACUTE ACCENT
> U+0060 GRAVE ACCENT
> U+FF40 FULLWIDTH GRAVE ACCENT
> U+FF07 FULLWIDTH APOSTROPHE
>
> There are also lots of other characters that look like these from other
> languages, and various combining character combinations which could also
> look the same, but I doubt whether they would be generated in Latin text
> by accident.

I would add:
U+02BB MODIFIER LETTER TURNED COMMA (the Hawaiian 'okina)
U+02C8 MODIFIER LETTER VERTICAL LINE (the IPA primary stress mark)

It might be worthwhile folding some dashes and hyphens too.

Andrew Dunbar (hippietrail)

> Please check these against the actual code tables for reasonableness and
> accuracy before putting them in any code.
>
> -- Neil
>
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

Re: Different apostrophe signs and MediaWiki internal search
Andrew Dunbar wrote:
> 2009/6/20 Neil Harris <usenet@tonal.clara.co.uk>:
>
>> Neil Harris wrote:
>>
>>> Andrew Dunbar wrote:
>>>
>>>
>>>> 2009/6/20 Jaska Zedlik <jz53zc@gmail.com>:
>>>>
>>>>
>>>>
>>>>> Hello,
>>>>> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Jaska Zedlik skrev:
>>>>>> <...>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> The code of the override function is the following:
>>>>>>>
>>>>>>> function stripForSearch( $string ) {
>>>>>>> $s = $string;
>>>>>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
>>>>>>> return parent::stripForSearch( $s );
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I'm not a PHP programmer, but why using the extra assignment of $s
>>>>>> instead of using $string directly in the parent call, like so:
>>>>>>
>>>>>> function stripForSearch( $string ) {
>>>>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
>>>>>> return parent::stripForSearch( $s );
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Really, you are right, for the real function all these redundant assignments
>>>>> should be strepped for the productivity purposes, I just used a framework
>>>>> from the Japanese language class which does soma Japanese-specific
>>>>> reduction, but I agree with your notice.
>>>>>
>>>>>
>>>>>
>>>> The username anti-spoofing code already knows about a lot of "looks similar"
>>>> characters which may be of some help.
>>>>
>>>> Andrew Dunbar (hippietrail)
>>>>
>>>>
>>>>
>>>>
>>>>
>>> Of itself, the username anti-spoofing code table -- which I originally
>>> wrote -- is rather too thorough for this purpose, since it deliberately
>>> errs on the side of mapping even vaguely similar-looking characters to
>>> one another, regardless of character type and script system,and this,
>>> combined with case-folding and transitivity, leads to some apparently
>>> bizarre mappings that are of no practical use for any other application.
>>>
>>> If you're interested, I can take a look at producing a more limited
>>> punctuation-only version.
>>>
>>> -- Neil
>>>
>>>
>>>
>> http://www.unicode.org/reports/tr39/data/confusables.txt is probably the
>> single best source for information about visual confusables.
>>
>> Staying entirely within the Latin punctuation repertoire, and avoiding
>> combining characters and other exotica such as math characters and
>> dingbats, you might want to consider the following characters as
>> possible unintentional lookalikes for the apostrophe:
>>
>> U+0027 APOSTROPHE
>> U+2019 RIGHT SINGLE QUOTATION MARK
>> U+2018 LEFT SINGLE QUOTATION MARK
>> U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
>> U+2032 PRIME
>> U+00B4 ACUTE ACCENT
>> U+0060 GRAVE ACCENT
>> U+FF40 FULLWIDTH GRAVE ACCENT
>> U+FF07 FULLWIDTH APOSTROPHE
>>
>> There are also lots of other characters that look like these from other
>> languages, and various combining character combinations which could also
>> look the same, but I doubt whether they would be generated in Latin text
>> by accident.
>>
>
> I would add
> U+02BB MODIFIER LETTER TURNED COMMA (hawaiian 'okina)
> U+02C8 MODIFIER LETTER VERTICAL LINE (IPA primary stress mark)
>
> It might be worthwhile folding some dashes and hyphens too.
>
> Andrew Dunbar (hippietrail)
>

Interestingly, following up the above, I've found one source
(http://snowball.tartarus.org/texts/apostrophe.html) that states that
U+201B may be deliberately used as an apostrophe variant by some
publishers in some contexts.

Regarding dashes and hyphens, I've now found my original data set, and a
quick inspection gives this set of various similar-looking Latin
hyphens, dashes and minus signs:

U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2212 MINUS SIGN
U+FE58 SMALL EM DASH
U+FF0D FULLWIDTH HYPHEN-MINUS

I can send the full data set of lookalikes to anyone who is interested:
it can be quite easily extended by regarding the relation "looks like"
as transitive, to include more distant and linguistically dubious visual
confusables such as (just for example) U+2015 HORIZONTAL BAR, U+1173
HANGUL JUNGSEONG EU and U+2F00 KANGXI RADICAL ONE.
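If dash folding turned out to be wanted at all, the same translation approach would extend naturally. A Python sketch using just the list above (again an unvetted illustration, folding everything to plain U+002D HYPHEN-MINUS):

```python
# Fold the dash/hyphen lookalikes listed above to U+002D HYPHEN-MINUS.
DASH_FOLDS = str.maketrans(dict.fromkeys(
    ["\u2010",   # HYPHEN
     "\u2011",   # NON-BREAKING HYPHEN
     "\u2012",   # FIGURE DASH
     "\u2013",   # EN DASH
     "\u2212",   # MINUS SIGN
     "\ufe58",   # SMALL EM DASH
     "\uff0d"],  # FULLWIDTH HYPHEN-MINUS
    "-"))

def fold_dashes(text: str) -> str:
    """Map dash lookalikes to U+002D for search indexing."""
    return text.translate(DASH_FOLDS)

print(fold_dashes("1\u20132"))  # -> 1-2
```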

-- Neil




Re: Different apostrophe signs and MediaWiki internal search
Neil Harris wrote:
>
> Regarding dashes and hyphens, I've now found my original data set, and
> a quick inspection gives this set of various similar-looking Latin
> hyphens, dashes and minus signs:
> U+002D HYPHEN-MINUS
> U+2010 HYPHEN
> U+2011 NON-BREAKING HYPHEN
> U+2012 FIGURE DASH
> U+2013 EN DASH
>
and at this point I missed out U+2014 EM DASH, which was hiding in the
world of transitive closure mentioned below...
> U+2212 MINUS SIGN
> U+FE58 SMALL EM DASH
> U+FF0D FULLWIDTH HYPHEN-MINUS
>
> I can send the full data set of lookalikes to anyone who is interested:
> it can be quite easily extended by regarding the relation "looks like"
> as transitive, to include more distant and linguistically dubious visual
> confusables such as (just for example) U+2015 HORIZONTAL BAR, U+1173
> HANGUL JUNGSEONG EU and U+2F00 KANGXI RADICAL ONE.
>
> -- Neil
>


Re: Different apostrophe signs and MediaWiki internal search
Speaking of AntiSpoof, there is a freshly-opened bug that could use
attention: https://bugzilla.wikimedia.org/show_bug.cgi?id=19273

Thanks,
-Mike

On Sat, 2009-06-20 at 10:39 +0100, Neil Harris wrote:

> Andrew Dunbar wrote:
> > 2009/6/20 Jaska Zedlik <jz53zc@gmail.com>:
> >
> >> Hello,
> >> On Fri, Jun 19, 2009 at 20:31, Rolf Lampa <rolf.lampa@rilnet.com> wrote:
> >>
> >>
> >>> Jaska Zedlik skrev:
> >>> <...>
> >>>
> >>>> The code of the override function is the following:
> >>>>
> >>>> function stripForSearch( $string ) {
> >>>> $s = $string;
> >>>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $s );
> >>>> return parent::stripForSearch( $s );
> >>>> }
> >>>>
> >>> I'm not a PHP programmer, but why using the extra assignment of $s
> >>> instead of using $string directly in the parent call, like so:
> >>>
> >>> function stripForSearch( $string ) {
> >>> $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
> >>> return parent::stripForSearch( $s );
> >>> }
> >>>
> >>>
> >> Really, you are right, for the real function all these redundant assignments
> >> should be strepped for the productivity purposes, I just used a framework
> >> from the Japanese language class which does soma Japanese-specific
> >> reduction, but I agree with your notice.
> >>
> >
> > The username anti-spoofing code already knows about a lot of "looks similar"
> > characters which may be of some help.
> >
> > Andrew Dunbar (hippietrail)
> >
> >
> >
> Of itself, the username anti-spoofing code table -- which I originally
> wrote -- is rather too thorough for this purpose, since it deliberately
> errs on the side of mapping even vaguely similar-looking characters to
> one another, regardless of character type and script system,and this,
> combined with case-folding and transitivity, leads to some apparently
> bizarre mappings that are of no practical use for any other application.
>
> If you're interested, I can take a look at producing a more limited
> punctuation-only version.
>
> -- Neil
>
>
>
Re: Different apostrophe signs and MediaWiki internal search
Mike.lifeguard wrote:
> Speaking of AntiSpoof, there is a freshly-opened bug that could use
> attention: https://bugzilla.wikimedia.org/show_bug.cgi?id=19273
>
> Thanks,
> -Mike
>

Unless the normalization code has changed radically since I first wrote
it, this should not be an issue with the normalization function itself:
case folding happens naturally as part of the normalization process.
Based on reading the report for bug 18447, it's possible that the
surrounding storage and lookup code may now be broken.

-- Neil


Re: Different apostrophe signs and MediaWiki internal search
Thanks to everybody for the replies. The one remaining problem is how
to integrate this into MediaWiki and enable search compatibility for the
different apostrophes. Would overriding the stripForSearch() function in
the local language class be a good solution for this?

zedlik

On Sat, Jun 20, 2009 at 16:27, Neil Harris <usenet@tonal.clara.co.uk> wrote:

> Mike.lifeguard wrote:
> > Speaking of AntiSpoof, there is a freshly-opened bug that could use
> > attention: https://bugzilla.wikimedia.org/show_bug.cgi?id=19273
> >
> > Thanks,
> > -Mike
> >
>
> Unless the normalization code has changed radically since I first wrote
> it, this should not be an issue with the normalization function itself:
> case folding happens naturally as part of the normalization process.
> Based on reading the report for bug 18447, it's possible that the
> surrounding storage and lookup code may now be broken.
>
> -- Neil
>
>
Re: Different apostrophe signs and MediaWiki internal search
>>>>> "JZ" == Jaska Zedlik <jz53zc@gmail.com> writes:
JZ> The one problem which is left, is how to integrate this into
JZ> MediaWiki and enable search compatibility for different apostrophes?

In https://bugzilla.wikimedia.org/show_bug.cgi?id=8445 I had to insert
quote marks into people's searches in order to make them work for
Chinese... (one more point in the quote/apostrophe mess.)

Re: Different apostrophe signs and MediaWiki internal search
On Sat, Jun 20, 2009 at 9:46 PM, Neil Harris<usenet@tonal.clara.co.uk> wrote:
>> Regarding dashes and hyphens, I've now found my original data set, and
>> a quick inspection gives this set of various similar-looking Latin
>> hyphens, dashes and minus signs:
>> U+002D HYPHEN-MINUS
>> U+2010 HYPHEN
>> U+2011 NON-BREAKING HYPHEN
>> U+2012 FIGURE DASH
>> U+2013 EN DASH
>>
> and at this point I missed out U+2014 EM DASH , which was hiding in the
> world of transitive closure mentioned below...
>> U+2212 MINUS SIGN
>> U+FE58 SMALL EM DASH
>> U+FF0D FULLWIDTH HYPHEN-MINUS

I think you have to be mindful of the original goal here: for each
character a user is likely to enter from their keyboard in the search
box, what possible range of characters would they expect to match?

So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
The other way around... probably not, unless U+2019 actually exists on
some keyboards.

Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
search for "clock-work", you probably don't want to match a sentence
like "He was building a clock—work that is never easy—at the time."
(contrived, sure)

My point is just that you probably don't want the full range of
"lookalikes": the left side of each mapping should be a keyboard
character, and the right side should be semantically equivalent, unless
it is commonly used incorrectly.

Steve

Re: Different apostrophe signs and MediaWiki internal search
Steve Bennett wrote:
> I think you have to be mindful of the original goal here: for each
> character a user is likely to enter from their keyboard in the search
> box, what possible range of characters would they expect to match?
>
> So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
> The other way around...probably not, unless that U+2019 exists on any keyboards.
>
> Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
> search for "clock-work", you probably don't want to match a sentence
> like "He was building a clock—work that is never easy—at the time."
> (contrived, sure)
>
> Just saying you probably don't want the full range of "lookalikes" -
> the left side of each mapping should be a keyboard character, and the
> right side should be semantically equivalent, unless commonly used
> incorrectly.

Good point! Likewise, two hyphen-minuses, "--", _could_ be considered to
match the em dash.


Tim
--
Tim Larson

Re: Different apostrophe signs and MediaWiki internal search
Steve Bennett wrote:
> So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
> The other way around...probably not, unless that U+2019 exists on any keyboards.
>
> Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
> search for "clock-work", you probably don't want to match a sentence
> like "He was building a clock—work that is never easy—at the time."
> (contrived, sure)
>
> Just saying you probably don't want the full range of "lookalikes" -
> the left side of each mapping should be a keyboard character, and the
> right side should be semantically equivalent, unless commonly used
> incorrectly.

Unless you cut and paste a term containing a fancy character from
another window, but the page uses the plain character...

-- brion

Re: Different apostrophe signs and MediaWiki internal search
2009/6/23 Brion Vibber <brion@wikimedia.org>:
> Steve Bennett wrote:
>> So, apostrophe (U+0027) -> curved right single quote (U+2019): yes, probably.
>> The other way around...probably not, unless that U+2019 exists on any keyboards.
>>
>> Hyphen-minus (U+002D) -> em dash (U+2014): I would say no. If you
>> search for "clock-work", you probably don't want to match a sentence
>> like "He was building a clock—work that is never easy—at the time."
>> (contrived, sure)
>>
>> Just saying you probably don't want the full range of "lookalikes" -
>> the left side of each mapping should be a keyboard character, and the
>> right side should be semantically equivalent, unless commonly used
>> incorrectly.
>
> Unless you cut and paste a term containing a fancy character from
> another window, but the page uses the plain character...

Indeed, keyboards are not the only place characters come from. Word
processors often upgrade apostrophes, hyphens, and other characters;
this is the general phenomenon of which "smart quotes" is a specific
case. Input methods can also insert characters not directly on the
keyboard, as can cutting and pasting from web pages where the author
tried to choose specific characters with HTML entities and such.

I have definitely seen edits on Wikipedia where people were
"correcting" various kinds of hyphens and dashes. And of course, while
the English Wikipedia forbids curved quotes, each other wiki may well
have its own policy.

Andrew Dunbar (hippietrail)



> -- brion
>



--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

Re: Different apostrophe signs and MediaWiki internal search
On Wed, Jun 24, 2009 at 4:38 AM, Brion Vibber<brion@wikimedia.org> wrote:
> Unless you cut and paste a term containing a fancy character from
> another window, but the page uses the plain character...

Yeah, I know. But my point is made and understood: think through the
reasons for each equivalence, rather than automatically grouping
together all characters of similar appearance.

Steve

Re: Different apostrophe signs and MediaWiki internal search
On Tue, Jun 23, 2009 at 04:26, <jidanni@jidanni.org> wrote:

> In https://bugzilla.wikimedia.org/show_bug.cgi?id=8445 I had to insert
> quote marks into peoples searches in order to make them work for
> Chinese... (one more point in the quote/apostrophe mess.)

This is the same idea we would like to use, and it seems to work. We'll
try to move in this direction. Thank you!

zedlik