On Dec 2, 2006, at 22:23, Alex Aver wrote:
> 2006/12/1, Marvin Humphrey <email@example.com>:
>> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
> Why can't I use a simple $word_char_tokenizer for this set of languages?
> A universal stemmer for mixed texts is the problem. I can separate words
> in Latin & Cyrillic characters and use a special stemmer for Russian
> words. But how can I separate English & French?
You don't necessarily need to. 80% of the job an English stemmer does is
to remove "s"/"es" at the end of a word, which also works fine for
French. The other rules (such as s/ed$//) won't hurt, because they
don't match French words.
You can also add some French rules to your stemmer, such as
s/aux$/al/, which won't have any effect on English words.
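To make that concrete, here is a minimal sketch of such a combined
suffix-stripper in Perl. The rule list is illustrative only; it is not
the Snowball stemmer, just the handful of rules mentioned above:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive suffix-stripping sketch combining English and French rules.
    sub crude_stem {
        my ($word) = @_;
        $word = lc $word;
        $word =~ s/aux$/al/;    # French plural: chevaux -> cheval (no effect on English)
        $word =~ s/(ed|ing)$//; # English verb endings (rarely match French words)
        $word =~ s/(es|s)$//;   # plural "s"/"es" -- works for both languages
        return $word;
    }

    print crude_stem($_), "\n" for qw(cats churches chevaux walked walking);
    # prints: cat church cheval walk walk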
In fact, the most important thing is that you use the *same* stemmer
for indexing and querying, whatever stemming it performs.
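With KinoSearch, that means handing the same analyzer object to both the
indexer and the searcher. A rough sketch along the lines of the 0.x docs
(the index path is made up):

    use KinoSearch::InvIndexer;
    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    # One analyzer, shared by indexing and querying.
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/path/to/invindex',   # hypothetical path
        analyzer => $analyzer,
        create   => 1,
    );
    # ... add docs, then $invindexer->finish ...

    my $searcher = KinoSearch::Searcher->new(
        invindex => '/path/to/invindex',
        analyzer => $analyzer,             # same analyzer at query time
    );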
>> Tokenizing Japanese is really, really hard
>> anyway, and KinoSearch provides no native support for it.
> Yes, tokenizing Japanese is hard, but possible - AFAIR dpsearch &
> mnogosearch can index and search in Japanese. But it isn't a critical
> point at this moment ;)
MnogoSearch uses ChaSen, a free Japanese morphological parser that has a Perl
front-end. See http://rpmfind.net/linux/RPM/suse/9.3/i386/suse/i586/
More generally, there are some pointers on analyzing Japanese here: http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hoary/japanese/
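If you want to try the Perl front-end, the CPAN module is Text::ChaSen.
A minimal sketch, assuming ChaSen and its dictionary are installed (check
the module's docs for the exact option strings on your build):

    use strict;
    use warnings;
    use Text::ChaSen;

    # Initialize ChaSen. -F sets the output format: "%m " prints each
    # morpheme's surface form followed by a space.
    Text::ChaSen::getopt_argv('chasen-perl', '-F', '%m ');

    # Segment a Japanese sentence into space-separated tokens.
    my $tokens = Text::ChaSen::sparse_tostr("すもももももももものうち");
    print $tokens, "\n";

The space-separated output can then be fed to a whitespace tokenizer for
indexing.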