Mailing List Archive

Small Vocabulary
Dear list,
I'm considering using Lucene for indexing sequences of part-of-speech
(POS) tags instead of words; for those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe their morpho-syntactic function. Instead of sequences of words, I
would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search (efficiently) for occurrences of "ADJA".

The question is whether Lucene can be applied to deal with that data
cleverly, because the statistical properties of such pseudo-texts are very
distinct from natural language texts and make me wonder whether Lucene's
inverted indexes are suitable. Especially the small vocabulary size (<50
distinct tokens, depending on the tagging system) is problematic, I suppose.
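
To make the data structure under discussion concrete, here is a minimal sketch of a positional inverted index over POS tags. It is plain Python rather than Lucene code, and the function name and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each tag to a list of (doc_id, position) pairs, in document order."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, tag in enumerate(text.split()):
            index[tag].append((doc_id, position))
    return index


# Hypothetical tag sequences, one string per document.
docs = ["ART ADV ADJA NN", "ART ADJA ADJA NN", "ART NN"]
index = build_positional_index(docs)

# Searching for occurrences of "ADJA" is a single dictionary lookup.
adja_hits = index["ADJA"]
print(adja_hits)
```

With fewer than 50 distinct tags the dictionary stays tiny while every list grows with the corpus, which is the property the question is about.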

First trials for which I have implemented an analyzer that just outputs
Lucene tokens such as "ART", "ADV", "ADJA", etc. yield results that are
not exactly perfect regarding search performance, in a test corpus with
a few million tokens. The number of tokens in production mode is
expected to be much larger, so I wonder whether this approach is
promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober


--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Small Vocabulary
Lucene 4.0 allows you to use custom codecs and there may be one that
would be better for this sort of data, or you could write one.

In your tests is it the searching that is slow or are you reading lots
of data for lots of docs? The latter is always likely to be slow.
General performance advice as in
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be
relevant. SSDs and loads of RAM never hurt.


--
Ian.


On Mon, Jul 30, 2012 at 2:07 PM, Carsten Schnober
<schnober@ids-mannheim.de> wrote:

Re: Small Vocabulary
On 31.07.2012 at 12:10, Ian Lea wrote:

Hi Ian,

> Lucene 4.0 allows you to use custom codecs and there may be one that
> would be better for this sort of data, or you could write one.
>
> In your tests is it the searching that is slow or are you reading lots
> of data for lots of docs? The latter is always likely to be slow.
> General performance advice as in
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be
> relevant. SSDs and loads of RAM never hurt.

You are very right, there are many results from many docs for the
slower searches performed on that index. However, I am still wondering
about the theoretical implications: having a small vocabulary with many
tokens in an inverted index would yield a rather long list of
occurrences for some/many/all (depending on the actual distribution) of
the search terms.
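
A rough back-of-the-envelope sketch of that point, with assumed numbers: 5 million tokens stands in for "a few million", 50 is the upper bound on distinct tags given in the original question, and 100,000 word types is a hypothetical figure for ordinary text.

```python
def average_postings_length(total_tokens, vocabulary_size):
    """Average number of occurrences per distinct term."""
    return total_tokens / vocabulary_size


total_tokens = 5_000_000  # assumed corpus size

# POS index: at most about 50 distinct tags.
pos_average = average_postings_length(total_tokens, 50)

# Word index: hypothetical vocabulary of 100,000 distinct words.
word_average = average_postings_length(total_tokens, 100_000)

print(pos_average, word_average)
```

These are only averages; real tag distributions are skewed, so the most frequent tags will have lists considerably longer than the mean.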
Thanks for your pointer to the codecs in Lucene 4; I suppose that this
will be the actual point of attack for that scenario. It may be a silly
question, but one that might be of interest to the whole community ;-):
can someone point me to in-depth documentation of Lucene 4 codecs,
ideally covering both theoretical background and implementation? There
are numerous helpful blog entries, presentations, etc. available on the
net, but if there is a central reference, I have not been able to
find it.
Thanks!
Best regards,
Carsten

Re: Small Vocabulary
There was some interesting work done on optimizing queries including
very common words (stop words) that I think overlaps with your problem.
See this blog post
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
from the Hathi Trust.

The upshot in a nutshell was that queries including terms with very
large postings lists (ie high occurrences) were slow, and the approach
they took to dealing with this was to index n-grams (ie pairs and
triplets of adjacent tokens). However I'm not sure this would help much
if your queries will typically include only a single token.
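
A small sketch of the n-gram idea, in plain Python with made-up tag data. It is not the HathiTrust implementation, only an illustration of why indexing adjacent pairs shortens the lists that multi-token queries have to read.

```python
from collections import defaultdict


def index_ngrams(tags, n):
    """Map each n-gram of adjacent tags to the start positions where it occurs."""
    postings = defaultdict(list)
    for start in range(len(tags) - n + 1):
        postings[" ".join(tags[start:start + n])].append(start)
    return postings


# Hypothetical tag stream.
tags = "ART ADJA NN ART ADV ADJA NN ART NN".split()
unigrams = index_ngrams(tags, 1)
bigrams = index_ngrams(tags, 2)

art_positions = unigrams["ART"]            # every occurrence of the tag
art_adja_positions = bigrams["ART ADJA"]   # only where the pair occurs

print(art_positions, art_adja_positions)
```

A query for the pair reads the short bigram list instead of two long unigram lists. A query for a single tag still has to read the full unigram list, which is the limitation noted above.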

-Mike

On 07/30/2012 09:07 AM, Carsten Schnober wrote:

Re: Small Vocabulary
On 06.08.2012 at 20:29, Mike Sokolov wrote:

Hi Mike,

> There was some interesting work done on optimizing queries including
> very common words (stop words) that I think overlaps with your problem.
> See this blog post
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
> from the Hathi Trust.
>
> The upshot in a nutshell was that queries including terms with very
> large postings lists (ie high occurrences) were slow, and the approach
> they took to dealing with this was to index n-grams (ie pairs and
> triplets of adjacent tokens). However I'm not sure this would help much
> if your queries will typically include only a single token.

This is very interesting for our use case indeed. However, you are right
that indexing n-grams is not (per se) a solution for my given problem
because I'm working on an application using multiple indexes. A query
for one isolated frequent term will indeed be rare presumably, or at
least rare enough to tolerate slow response times, but the results will
typically be intersected with results from other indexes.

To illustrate this more practically: the index I described, which has
relatively few distinct tokens, some of them extremely frequent, indexes
part-of-speech (POS) tags with positional information stored in the
payload. A parallel index indexes actual text; a typical query may look
for a certain POS tag in one index and a word X at the same position
with a matching payload in the other index. So both indexes need to be
queried completely before the intersection can be performed.
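
The intersection step could look roughly like the following linear merge. This is a simplified assumption: hits are modelled as sorted (doc_id, position) pairs rather than Lucene postings with payloads, and all names and data are hypothetical.

```python
def intersect_sorted(left, right):
    """Return the entries present in both sorted lists, using a linear merge."""
    matches = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matches


# Hypothetical hits as (doc_id, position) pairs, sorted in index order.
adja_hits = [(0, 2), (1, 1), (1, 2), (2, 5)]  # POS index: tag "ADJA"
brown_hits = [(1, 2), (2, 0)]                 # text index: word "brown"

both = intersect_sorted(adja_hits, brown_hits)
print(both)
```

The merge visits each entry at most once, so its cost is bounded by the combined length of both lists; a very long POS list dominates even when the word list is short.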

Best,
Carsten



Re: Small Vocabulary
If you do intersection (not join), maybe it makes sense to put
everything into one index?

Just transform your input like "brown fox" into "ADJ:brown|<your
payload> NOUN:fox|<other payload>"

Write a custom tokenizer, some filters and that's it.
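
One way to picture this transformation, as a plain Python sketch rather than a real Lucene tokenizer. The input triples, the function name and the one-byte payloads are assumptions for illustration.

```python
def combine(tagged_words):
    """Turn (word, tag, payload) triples into (token, payload) pairs."""
    return [(f"{tag}:{word}", payload) for word, tag, payload in tagged_words]


# Hypothetical tagged input for the text "brown fox".
tokens = combine([("brown", "ADJ", b"\x01"), ("fox", "NOUN", b"\x02")])
print(tokens)
```

Each combined token carries both the tag and the word, so a lookup for one term answers what previously needed an intersection across two indexes.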

Of course I'm not aware of all the details, so my solution might not
be applicable to your project.
Maybe you could share more details, so this won't turn into an "XY problem".

Keep in mind: always optimize your index for the query use case,
instead of blindly processing the input data.


On Tue, Aug 7, 2012 at 10:29 AM, Carsten Schnober
<schnober@ids-mannheim.de> wrote:

Re: Small Vocabulary
On 07.08.2012 at 10:20, Danil ŢORIN wrote:

Hi Danil,

> If you do intersection (not join), maybe it make sense to put every
> thing into 1 index?

Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure
has to be ready for both, but intersections are clearly more complicated.

> Just transform your input like "brown fox" into "ADJ:brown|<your
> payload> NOUN:fox|<other payload>"

I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
actual token and "brown" and "fox" as payloads (followed by <other
payload>), right?

This is a very neat approach and I have vaguely considered that. One
problem is that I aim for a very high level of flexibility, meaning that
additional annotations have to be addable at any point and different
tokenizations apply. However, I will re-consider your suggestion,
possibly applying one of multiple tokenizations as a default in this sense.

> Of course I'm not aware of all the details, so my solution might not
> be applicable to your project.
> Maybe you could share more details, so this won't transform in "XY problem".
>
> Keep in mind : always optimize your index for the query usecase,
> instead of blindly processing the input data.

Thanks for that reminder; this becomes quite difficult in my scenario
though since we want to allow for flexible changes in the index types,
representing different annotations, tokenization logics etc.
Best,
Carsten


Re: Small Vocabulary
Hi Danil,

>> Just transform your input like "brown fox" into "ADJ:brown|<your
>> payload> NOUN:fox|<other payload>"
>
> I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
> actual token and "brown" and "fox" as payloads (followed by <other
> payload>), right?

Sorry for replying to myself, but I've realised only now that you
probably meant to replace the full token string ("brown") by "ADJ:brown"
and use the payload otherwise, right? Regarding incoming queries, this
method makes it necessary to perform a Wildcard query (e.g. "NOUN:*")
when I am not interested in the actual text ("brown") -- which may
happen more or less frequently -- am I right? However, this might be an
acceptable trade-off...
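
To illustrate the cost behind such a wildcard: a query like "NOUN:*" is effectively a prefix lookup that has to enumerate every matching term before their occurrence lists can be combined. The sketch below shows that enumeration over a sorted term list in plain Python; it is a toy model with invented terms, not Lucene's implementation.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return all terms starting with prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


# Hypothetical term dictionary of combined tag:word tokens.
terms = sorted(["ADJ:brown", "ADJ:quick", "NOUN:dog", "NOUN:fox", "VERB:jumps"])
noun_terms = terms_with_prefix(terms, "NOUN:")
print(noun_terms)
```

The number of terms to visit grows with the number of distinct words carrying the tag, so a frequent tag makes the wildcard correspondingly expensive.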
Best regards,
Carsten


Re: Small Vocabulary
I mean "ADJ:brown" as a token and only the <payload> as payload, since
you probably only use it for some scoring/postprocessing, not the
actual matching.

You can even write a filter that will emit both tokens "ADJ" and
"ADJ:brown" at the same position (so you'll be able to do phrase queries),
and still maintain join capability.
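
A sketch of that filter idea in plain Python, with hypothetical names and data. In Lucene itself, stacking two tokens on one position is usually done by giving the second token a position increment of 0, as synonym filters do; the code below only models the resulting index.

```python
from collections import defaultdict


def index_stacked(tagged_words):
    """Index both "TAG" and "TAG:word" at the same position for each word."""
    postings = defaultdict(set)
    for position, (word, tag) in enumerate(tagged_words):
        postings[tag].add(position)
        postings[f"{tag}:{word}"].add(position)
    return postings


def phrase_positions(postings, first, second):
    """Positions where `first` is immediately followed by `second`."""
    return sorted(p for p in postings[first] if p + 1 in postings[second])


# Hypothetical tagged sentence.
sentence = [("the", "ART"), ("brown", "ADJ"), ("fox", "NOUN")]
postings = index_stacked(sentence)

tag_only = sorted(postings["ADJ"])                             # no wildcard needed
mixed_phrase = phrase_positions(postings, "ADJ", "NOUN:fox")   # tag followed by word

print(tag_only, mixed_phrase)
```

The trade-off is index size: every word position now stores two terms instead of one.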


On Tue, Aug 7, 2012 at 12:13 PM, Carsten Schnober
<schnober@ids-mannheim.de> wrote:

Re: Small Vocabulary
To avoid wildcard queries, you can write a TokenFilter that will
create both tokens "ADJ" and "ADJ:brown" at the same position,
so you can use your index for both lookups without a wildcard query.


On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober
<schnober@ids-mannheim.de> wrote: