Mailing List Archive

ClassicAnalyzer behavior on accented characters
Hi,
I indexed the term '?e???????' (aeroplane), and it was
indexed as "er l n"; some characters were trimmed during indexing.

Here is my code:

protected Analyzer.TokenStreamComponents createComponents(final String fieldName,
                                                          final Reader reader)
{
    final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
    src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);

    TokenStream tok = new ClassicFilter(src);
    tok = new LowerCaseFilter(getVersion(), tok);
    tok = new StopFilter(getVersion(), tok, stopwords);
    tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search

    return new Analyzer.TokenStreamComponents(src, tok)
    {
        @Override
        protected void setReader(final Reader reader) throws IOException
        {
            // re-apply the max token length whenever the reader is reset
            src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
            super.setReader(reader);
        }
    };
}
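A quick way to see exactly which terms the analyzer emits is to consume its
TokenStream directly. A minimal sketch, assuming the same Lucene 4.x APIs as
the code above (the class name, the helper name, and the field name "f" are
arbitrary):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDebug
    {
        // Collects the terms the analyzer produces for the given text.
        public static List<String> analyze(Analyzer analyzer, String text) throws IOException
        {
            List<String> terms = new ArrayList<>();
            try (TokenStream ts = analyzer.tokenStream("f", text)) // field name is arbitrary
            {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();                     // mandatory before the first incrementToken()
                while (ts.incrementToken())
                {
                    terms.add(term.toString());
                }
                ts.end();                       // consume end-of-stream state
            }
            return terms;
        }
    }

Running the problem term through the tokenizer-only chain as well as the full
analyzer narrows down which stage drops the characters.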



Am I missing anything? Is this expected behavior for my input, or is there a
reason behind such abnormal behavior?

--
Regards,
Chitra
Re: ClassicAnalyzer behavior on accented characters
Easy: don't use ClassicTokenizer, use StandardTokenizer instead.
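For example, the analyzer above could be rewritten roughly like this -- a
sketch against the same Lucene 4.x APIs as the original post, mirroring what
StandardAnalyzer itself does (stopwords and getVersion() come from the
surrounding class, as in the original code):

    protected Analyzer.TokenStreamComponents createComponents(final String fieldName,
                                                              final Reader reader)
    {
        // StandardTokenizer implements the Unicode (UAX#29) word-break rules,
        // so it keeps accented letters inside tokens instead of splitting on them.
        final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);

        TokenStream tok = new StandardFilter(getVersion(), src);
        tok = new LowerCaseFilter(getVersion(), tok);
        tok = new StopFilter(getVersion(), tok, stopwords);
        tok = new ASCIIFoldingFilter(tok); // folds the surviving accented letters to ASCII

        return new Analyzer.TokenStreamComponents(src, tok)
        {
            @Override
            protected void setReader(final Reader reader) throws IOException
            {
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }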


Re: ClassicAnalyzer behavior on accented characters
Hi Robert,
Yes, StandardTokenizer solves my case... could you please
explain the difference between ClassicTokenizer and StandardTokenizer?
How does StandardTokenizer solve my case? I searched the web but was
unable to understand.


Any help is greatly appreciated.

On Fri, Oct 20, 2017 at 12:10 AM, Robert Muir <rcmuir@gmail.com> wrote:

> Easy: don't use ClassicTokenizer, use StandardTokenizer instead.

--
Regards,
Chitra
Re: ClassicAnalyzer behavior on accented characters
Hi,
I found the difference and now understand the behavior of both
tokenizers.

Could you please suggest which one is better to use:
ClassicTokenizer or StandardTokenizer?

--
Regards,
Chitra
Re: ClassicAnalyzer behavior on accented characters
Classic is ... "classic" ... it exists largely for historical purposes, to
provide a tokenizer that does exactly what its javadocs say it does
(regarding punctuation, product numbers, and email addresses), so that
people who depend on that behavior can continue to rely on it.

Standard is ... "standard" ... it implements the Unicode standard text
segmentation rules (UAX#29).
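A quick way to see the difference is to run the same string through both
tokenizers. A minimal sketch, again assuming the Lucene 4.x APIs used earlier
in the thread (the Version constant and the sample text are placeholders):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class TokenizerCompare
    {
        // Prints every token the tokenizer produces for its input.
        static void dump(String label, Tokenizer tok) throws IOException
        {
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            System.out.print(label + ":");
            while (tok.incrementToken())
            {
                System.out.print(" [" + term + "]");
            }
            System.out.println();
            tok.end();
            tok.close();
        }

        public static void main(String[] args) throws IOException
        {
            String text = "..."; // put the problem term here
            dump("classic ", new ClassicTokenizer(Version.LUCENE_4_10_4, new StringReader(text)));
            dump("standard", new StandardTokenizer(Version.LUCENE_4_10_4, new StringReader(text)));
        }
    }

Characters that ClassicTokenizer's fixed grammar does not recognize as word
characters act as token breaks, while StandardTokenizer keeps them inside the
token, where ASCIIFoldingFilter can later fold them.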



-Hoss
http://www.lucidworks.com/

Re: ClassicAnalyzer behavior on accented characters
That's expected. Non-letters are not mapped to letters, which is correct.
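To illustrate: ASCIIFoldingFilter only folds Unicode characters that have a
reasonable ASCII equivalent; everything else passes through unchanged. A small
sketch using its static helper, under the same Lucene 4.x assumption as above
(the class name and the sample strings are made up):

    import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

    public class FoldDemo
    {
        static String fold(String s)
        {
            char[] in = s.toCharArray();
            char[] out = new char[in.length * 4]; // folding can expand a char up to 4x
            int len = ASCIIFoldingFilter.foldToASCII(in, 0, out, 0, in.length);
            return new String(out, 0, len);
        }

        public static void main(String[] args)
        {
            System.out.println(fold("aéroplàne")); // accented letters fold: aeroplane
            System.out.println(fold("a?b"));       // '?' has no mapping, so: a?b
        }
    }

The filter can only fold what the tokenizer hands it; characters dropped at the
tokenization stage never reach it.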
