Mailing List Archive: Re: Wildcards

Re: Wildcards

Jan 25, 2008, 2:28 AM

Post #1 of 21 (6036 views)

On Jan 23, 2008, at 11:47 PM, Nathan Kurz wrote:

> Don't punt on the scoring!

Well, here's the problem, which afflicts the current implementation
of wildcards in Lucene. If we transform the wildcard into an array
of TermQuery objects, then each of them has an individual IDF -- so
in a search for "pet*", the rare term "petard" will contribute more
than the more common term "pets". Should it? The consensus is that
such behavior is sub-optimal.
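
To make the effect concrete, here is a minimal sketch. It assumes a Lucene-style formula, idf = 1 + ln(num_docs / (doc_freq + 1)), and the document frequencies are invented for illustration; none of the numbers come from a real index.

```perl
use strict;
use warnings;

# Lucene-style IDF, assumed here for illustration only.
sub idf {
    my ( $num_docs, $doc_freq ) = @_;
    return 1 + log( $num_docs / ( $doc_freq + 1 ) );
}

# Hypothetical document frequencies for terms matching "pet*".
my $num_docs = 10_000;
my %doc_freq = ( pets => 800, petal => 40, petard => 3 );

# Print rarest terms first: the rarer the term, the larger its IDF.
for my $term ( sort { $doc_freq{$a} <=> $doc_freq{$b} } keys %doc_freq ) {
    printf "%-8s df=%4d idf=%.2f\n", $term, $doc_freq{$term},
        idf( $num_docs, $doc_freq{$term} );
}
```

Under these assumptions "petard" gets the largest per-term weight purely because it is rare, which is the behavior in question.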

> From my naive point of view, a wildcard just looks like another way of
> specifying a boolean OR. Why not expand it out at the parser level?
> Sure it might be really big, but there's nothing wrong with providing
> support for industrial strength boolean queries.

However any particular WildcardQuery gets implemented, it will need
some sort of safety valve to prevent "a*" from swamping the server.

> Of course, I say
> that because I'm going to want them one day for my own nefarious
> purposes, and with flexible scoring at that.

Another reason for core KS to concentrate on providing a plugin
scaffolding on which you can hang various KSx extensions, rather than
a smorgasbord of Query subclasses.

>> Actually, if we iterate up front, we could find out the IDF of the
>> fragment and then use that to assess a crude score.
>
> I will be so appreciative some day if you move away from architectures
> that presume IDF is always going to be the way that things are
> scored.

TF/IDF is hard to beat as a default system. However, I'd like to
make it possible to override, not just at search time, but at index
time. That's the rationale behind the introduction of the abstract
base classes KinoSearch::Index::Reader and
KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
first distro to use these capabilities.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards [ In reply to ]

sprout at cpan

Jan 25, 2008, 8:40 AM

Post #2 of 21 (5942 views)

On Jan 25, 2008, at 2:28 AM, Marvin Humphrey wrote:

> However any particular WildcardQuery gets implemented, it will need
> some sort of safety valve to prevent "a*" from swamping the server.

You mean like this?

KSx::Search::RegexpQuery->new(
    re        => qr/^foo.*/,
    field     => 'content',
    max_terms => 1024,
);

Re: Wildcards [ In reply to ]

marvin at rectangular

Jan 25, 2008, 11:41 AM

Post #3 of 21 (5939 views)

On Jan 25, 2008, at 8:40 AM, Father Chrysostomos wrote:

>> However any particular WildcardQuery gets implemented, it will
>> need some sort of safety valve to prevent "a*" from swamping the
>> server.
>
> You mean like this?
>
> KSx::Search::RegexpQuery->new(
>     re        => qr/^foo.*/,
>     field     => 'content',
>     max_terms => 1024,
> );

Yes, that would work.

It would probably be best if exceeding max_terms in the constructor
caused an exception object to be thrown, allowing code like this:

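use Scalar::Util qw(blessed);

# Build the query; the constructor would throw if max_terms is exceeded.
my $query = eval {
    KSx::Search::RegexpQuery->new(
        re        => qr/^foo.*/,
        field     => 'content',
        max_terms => 1024,
    );
};
my $exception = $@;
# blessed() guards against calling isa() on a plain string or unblessed ref.
if ( blessed($exception) and $exception->isa('MyCustomException') ) {
    tell_user_about_error($exception);
}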
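
For readers who want to try the pattern without KinoSearch installed, here is a self-contained sketch of the throwing side. Everything in it is hypothetical: the `MyCustomException` package, the `expand_terms` helper, and the four-word lexicon stand in for whatever a real query class would do internally.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical exception class; not part of KinoSearch.
package MyCustomException;

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}
sub message { $_[0]{message} }

package main;

# Expand a regex against a list of terms, throwing an exception object
# as soon as the number of matches exceeds the cap.
sub expand_terms {
    my ( $re, $max_terms, @lexicon ) = @_;
    my @matches;
    for my $term (@lexicon) {
        next unless $term =~ $re;
        push @matches, $term;
        die MyCustomException->new(
            message => "pattern matches more than $max_terms terms" )
            if @matches > $max_terms;
    }
    return @matches;
}

my @lexicon = qw( food fool foot fork );
my @terms   = eval { expand_terms( qr/^foo.*/, 2, @lexicon ) };
my $err     = $@;
if ( blessed($err) and $err->isa('MyCustomException') ) {
    print "Refused: ", $err->message, "\n";
}
```

Checking the cap while expanding, rather than after, means an overly broad pattern is rejected before the full term list is built.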

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Wildcards [ In reply to ]

nate at verse

Jan 25, 2008, 12:11 PM

Post #4 of 21 (5937 views)

On 1/25/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> If we transform the wildcard into an array
> of TermQuery objects, then each of them has an individual IDF -- so
> in a search for "pet*", the rare term "petard" will contribute more
> than the more common term "pets". Should it? The consensus is that
> such behavior is sub-optimal.

Definitely sub-optimal, but to my mind this points out the
shortcomings of TF/IDF when used with Boolean subqueries rather than
the downside of using a Boolean query for wildcards. I hit the same
problem when using Boolean ORs to search for common spelling errors.
Does one really want a search for "speling OR spelling" to prefer
the mis-speling?

In both of these cases, one does not want to automatically prefer the
rarer word. My guess would be that any generated query (and thus from
a practical point of view, any Boolean query) does not want this
behaviour. It's only when dealing directly with user entered
keywords that this is a good choice.

In my opinion, one wants the parser to have access to the TF
information and to (optionally) use it when creating the query. And
one wants the IDF information to be available to the scorer for
its optional use. But the scorer should not care directly about TF,
only about the weight that has been input for each query term.

> > Of course, I say
> > that because I'm going to want them one day for my own nefarious
> > purposes, and with flexible scoring at that.
>
> Another reason for core KS to concentrate on providing a plugin
> scaffolding on which you can hang various KSx extensions, rather than
> a smorgasbord of Query subclasses.

Agreed. I don't think you need or want a built-in WildcardQuery
class. The core should provide rock solid Boolean components, and a
means of plugging in alternate parsers and scorers.

> TF/IDF is hard to beat as a default system.

TF/IDF is an excellent means for sorting a large database of full text
news articles by relevance based on naively entered keywords. To a
reasonable approximation, web search can be viewed in this light. But
its utility in other situations varies :).

> However, I'd like to make it possible to override, not just at search time,
> but at index time.

I'm not sure I understand this. Is this in the sense of making
certain parts of the index optional, or does it go deeper than this?

> That's the rationale behind the introduction of the abstract
> base classes KinoSearch::Index::Reader and
> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
> first distro to use these capabilities.

I've been watching the commits, but haven't really had an idea of
where you are going. Could you offer an overview when you have the
time?

Nathan Kurz
nate@verse.com

Re: Wildcards [ In reply to ]

nate at verse

Jan 25, 2008, 12:32 PM

Post #5 of 21 (5938 views)

On 1/25/08, Father Chrysostomos <sprout@cpan.org> wrote:
> On Jan 25, 2008, at 2:28 AM, Marvin Humphrey wrote:
>
> > However any particular WildcardQuery gets implemented, it will need
> > some sort of safety valve to prevent "a*" from swamping the server.
>
> You mean like this?
>
> KSx::Search::RegexpQuery->new(
> re => qr/^foo.*/,
> field => 'content',
> max_terms => 1024,

While one could do this, if the goal is to be a safety valve you might
want this check to be enforced at the core Boolean level instead of by
each extension.

But my instinct would be to figure out a way to simply make the search
work, rather than throwing it out as an exception. Supporting a search for
"a* lovelace" seems reasonable, and shouldn't actually be that
expensive if implemented lazily.

If one was to have a limit, it should probably be on the total length
of the records that need to be searched, not on the number of terms
involved.

Nathan Kurz
nate@verse.com

Re: Wildcards [ In reply to ]

marvin at rectangular

Jan 27, 2008, 8:10 PM

Post #6 of 21 (5939 views)

On Jan 25, 2008, at 12:11 PM, Nathan Kurz wrote:

> Does one really want a search for "speling OR spelling" to prefer
> the mis-speling?

Heh. Perhaps one way to solve this would be via a special TermQuery
subclass that overrides the IDF calculation.

Or maybe the default TermQuery class can do flat scoring and
TFIDFTermQuery would override? I imagine that would make you happy. ;)

> In my opinion, one wants the parser to have access to the TF
> information and to (optionally) use it when creating the query.

IDF is known when compiling the Query to a Weight to a Scorer, but TF
is per-document. You aren't going to have access to TF at the Scorer-
compilation stage.

> And
> one wants the IDF information to be available to the scorer for
> its optional use. But the scorer should not care directly about TF,
> only about the weight that has been input for each query term.

Well, this is certainly do-able in theory.

> The core should provide rock solid Boolean components, and a
> means of plugging in alternate parsers and scorers.

Nicely put. I concur.

>> TF/IDF is hard to beat as a default system.
>
> TF/IDF is an excellent means for sorting a large database of full text
> news articles by relevance based on naively entered keywords. To a
> reasonable approximation, web search can be viewed in this light. But
> its utility in other situations varies :).

Heh. :)

TF/IDF needs to continue to be the IR model you get when you fire up
standard KS. But the idea of focusing on pure boolean components is
attractive. It would be killer if we could abstract TF/IDF to a
higher level.

KinoSearch's Query/Weight/Scorer compile phase, though less
convoluted than the Lucene model, is still complex. Perhaps we can
divide things up into layers and make the bottom layer boolean-only
and simpler. To get the good TF/IDF scoring we still have to do some
complex weighting, but maybe we can move that up into
KinoSearch::Search::TFIDF::TFIDFTermQuery and such.

>> However, I'd like to make it possible to override, not just at
>> search time,
>> but at index time.
>
> I'm not sure I understand this. Is this in the sense of making
> certain parts of the index optional, or does it go deeper than this?

I don't think I'd stated it that well. ;)

KinoSearch is, at its heart, a segmented inverted indexer. Right
now, each segment has four main components:

* Doc Storage
* Postings
* Lexicons
* Term Vectors

I'd like to make it possible to add other components.

Consider the problem of determining whether a particular lat/lon
pairing falls within a given square. You can use doubled-up
RangeFilters for that, but it's not efficient. You have to take the
intersection of two slices of the index: EVERYTHING with a $lat
between $lat_min and $lat_max, and EVERYTHING with a $lon between
$lon_min and $lon_max -- thus you end up churning through a lot of
docs that are *way* outside the box.
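
A toy sketch of the doubled-up range approach shows where the waste comes from. The five documents and the bounding box are invented, and plain hashes stand in for index slices; this is not how RangeFilter is implemented.

```perl
use strict;
use warnings;

# Hypothetical documents: id => [ lat, lon ].
my %docs = (
    1 => [ 37.7, -122.4 ],
    2 => [ 37.8, 10.0 ],
    3 => [ -5.0, -122.3 ],
    4 => [ 37.6, -122.5 ],
    5 => [ 60.0, 25.0 ],
);

my ( $lat_min, $lat_max ) = ( 37.0,   38.0 );
my ( $lon_min, $lon_max ) = ( -123.0, -122.0 );

# Each "filter" is a full slice of the collection along one axis.
my %lat_slice = map { $_ => 1 }
    grep { $docs{$_}[0] >= $lat_min && $docs{$_}[0] <= $lat_max } keys %docs;
my %lon_slice = map { $_ => 1 }
    grep { $docs{$_}[1] >= $lon_min && $docs{$_}[1] <= $lon_max } keys %docs;

# The box is the intersection of the two slices.
my @in_box = sort { $a <=> $b } grep { $lon_slice{$_} } keys %lat_slice;

# Count every slice entry touched, including docs far outside the box.
my $visited = scalar( keys %lat_slice ) + scalar( keys %lon_slice );
printf "in box: %s; slice entries visited: %d\n", "@in_box", $visited;
```

Docs 2 and 3 each fall inside one slice while being nowhere near the box, so they are visited and then discarded. On a real collection those one-axis slices can be very large.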

R-trees are a more efficient data structure for geospatial
searching. However, there's no RTreeWriter writing R-tree data to
each segment in KS by default. I'd like to write one and make it
easy to integrate via InvIndexer/SegWriter.

A fellow actually hacked R-trees into Lucene -- the project is called
"GeoLucene" -- but for a variety of reasons, it wasn't very easy or
elegant.

http://www.doc.ic.ac.uk/~es106/thesis/MultidimensionalIndexingInLucene.pdf
https://sourceforge.net/projects/geolucene/
http://www.gossamer-threads.com/lists/lucene/java-dev/53378

I think it's possible to make KS quite friendly to such extensions --
much more friendly than Lucene, with its crazy file format tightly
coupled to a gigantic core code base.

>> That's the rationale behind the introduction of the abstract
>> base classes KinoSearch::Index::Reader and
>> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
>> first distro to use these capabilities.
>
> I've been watching the commits, but haven't really had an idea of
> where you are going. Could you offer an overview when you have the
> time?

Well, a lot of what's gone in during the last month or so has been
straightforward porting of Perl code to C code. Figuring out an
elegant way for abstract methods to call back to the host language
from C, allowing them to be overridden via *either* Perl OR C, was a
real breakthrough. I'd like to finish the porting task and get some
experimental Java bindings up and running. One motivation is to make
it possible to benchmark KS using the Lucene benchmarking contrib code
directly -- eliminating the need to port it.

But beyond that, a central goal is to make KS as extensible as
possible, so that it is reasonably easy for motivated hackers -- like
you, like the GeoLucene guy -- to try out IR models that are suitable
for use within the context of a segmented inverted index.

Powerful search engines (Google being the archetype) don't rely on
one IR mechanism alone, but rather balance a slew of them. You can
do this in KS at a basic level by indexing both stemmed and unstemmed
versions of the same text, increasing the size of the index and your
search-time costs, but improving search accuracy: a search for
"horsing" still matches "horse", "horses", etc., but *prefers* the
exact match of "horsing".
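
The idea can be sketched in a few lines. The suffix stripper below is a toy standing in for a real stemmer, and the scoring rule of one point per matching field is an assumption made for illustration, not KS's actual scoring.

```perl
use strict;
use warnings;

# Toy suffix stripper standing in for a real stemmer (assumption).
sub toy_stem {
    my $word = lc shift;
    $word =~ s/(?:ing|es|e|s)$//;
    return $word;
}

# Score one document: one point if the unstemmed field matches the query,
# plus one point if the stemmed field matches.
sub score {
    my ( $query, @doc_words ) = @_;
    my $exact   = grep { lc($_) eq lc($query) } @doc_words;
    my $stemmed = grep { toy_stem($_) eq toy_stem($query) } @doc_words;
    return ( $exact ? 1 : 0 ) + ( $stemmed ? 1 : 0 );
}

# Hypothetical three-document collection.
my %docs = (
    a => [qw( a horse ran )],
    b => [qw( horses ran )],
    c => [qw( stop horsing around )],
);

my %score = map { $_ => score( 'horsing', @{ $docs{$_} } ) } keys %docs;
for my $id ( sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score ) {
    print "$id: $score{$id}\n";
}
```

All three documents match through the stemmed field, but only the one containing "horsing" also matches the unstemmed field, so it sorts first.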

In order to improve search accuracy beyond the limits of TF/IDF,
especially when dealing with large collections, we need to be able to
scale up both by spreading to multiple machines AND by layering
different IR models on top of each other. That's where KS is headed,
and as things progress, I'm more and more confident that it's going
to work out well.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Wildcards [ In reply to ]

marvin at rectangular

Jan 27, 2008, 8:15 PM

Post #7 of 21 (5930 views)

On Jan 25, 2008, at 12:32 PM, Nathan Kurz wrote:

> But my instinct would be to figure out a way to simply make the search
> work, rather than throwing it out as an exception. Supporting a search for
> "a* lovelace" seems reasonable, and shouldn't actually be that
> expensive if implemented lazily.

If you want correct results, you have to cruise through all the docs
that match "a*" no matter what, because you won't know what the top
scorers are until you've seen everything.

> If one was to have a limit, it should probably be on the total length
> of the records that need to be searched, not on the number of terms
> involved.

Or perhaps by introducing search timeouts.

https://issues.apache.org/jira/browse/LUCENE-997

Unfortunately, it's not easy to integrate a bulletproof timeout
mechanism into KS. I think the most efficient approach would be to
use threads: have a timer thread that checks back every once in a
while to see whether the query has finished, and throws an exception
if time runs out. However, KS doesn't support threads.
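
For comparison, a thread-free alternative is a cooperative deadline check inside the collection loop. This is a different technique from the timer thread described above, and it is only a sketch: the `collect_with_deadline` helper and its parameters are hypothetical, and it cannot interrupt a single slow step the way a separate thread could.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Walk the matching docs, checking the clock once every $check_every docs.
sub collect_with_deadline {
    my ( $doc_ids, $timeout_secs, $check_every ) = @_;
    my $deadline = time() + $timeout_secs;
    my @hits;
    my $seen = 0;
    for my $doc_id (@$doc_ids) {
        if ( $seen++ % $check_every == 0 and time() > $deadline ) {
            die "search timed out after $seen docs\n";
        }
        push @hits, $doc_id;    # stand-in for real scoring work
    }
    return \@hits;
}

my $hits = eval { collect_with_deadline( [ 1 .. 100_000 ], 5, 1024 ) };
print defined $hits ? scalar(@$hits) . " hits\n" : "error: $@";
```

Checking the clock only every `$check_every` docs keeps the overhead small, at the cost of overshooting the deadline by up to one batch.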

I don't think we should get hung up on this detail, though. For
small collections, the cost won't be high enough to matter.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Wildcards [ In reply to ]

nate at verse

Jan 29, 2008, 8:06 PM

Post #8 of 21 (5930 views)

On 1/27/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> IDF is known when compiling the Query to a Weight to a Scorer, but TF
> is per-document. You aren't going to have access to TF at the Scorer-
> compilation stage.

Sometimes I worry that my arguments would be more persuasive if I were
able to use common terms correctly. :) What I meant to say was that
the global information doesn't need to be known by the Query, only by
the Scorer. The Query would deal with only the per-document data.
This seems to be how you correctly interpreted it, despite my
mangling.

> Or maybe the default TermQuery class can do flat scoring and
> TFIDFTermQuery would override? I imagine that would make you happy. ;)

Given the smileys, I'm not sure if this is a joke or not. To be
clear, this solution would make me ill. My desire is to separate the
query from the scoring, so having a different Query class for each
possible scoring option is the antithesis of what I want. What I want
is to have a number of independent Scorers that can be plugged into a
Scorer-agnostic set of Queries: simple Queries, simple Scorers,
complex combinations.
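
A rough sketch of that separation, using plain code refs rather than any real KinoSearch classes. The statistics are invented, and the TF/IDF variant assumes a Lucene-style sqrt(tf) * (1 + ln(num_docs / (doc_freq + 1))) weighting.

```perl
use strict;
use warnings;

# The query only names the terms to match; it carries no scoring logic.
my @query_terms = qw( speling spelling );

# Hypothetical statistics for one document and for the collection.
my %tf       = ( speling => 1, spelling => 3 );      # term freq in this doc
my %doc_freq = ( speling => 2, spelling => 500 );    # docs containing term
my $num_docs = 1_000;

# Scorers are independent code refs sharing one signature.
my %scorer = (
    flat  => sub { my ($term) = @_; return $tf{$term} ? 1 : 0 },
    tfidf => sub {
        my ($term) = @_;
        my $idf = 1 + log( $num_docs / ( $doc_freq{$term} + 1 ) );
        return sqrt( $tf{$term} ) * $idf;
    },
);

# Apply any scorer to the same, scorer-agnostic list of query terms.
sub score_query {
    my ( $scorer, @terms ) = @_;
    my %score = map { $_ => $scorer->($_) } @terms;
    return \%score;
}

for my $name ( sort keys %scorer ) {
    my $score = score_query( $scorer{$name}, @query_terms );
    printf "%-6s speling=%.2f spelling=%.2f\n", $name,
        $score->{speling}, $score->{spelling};
}
```

The same term list runs through either scorer unchanged. With these made-up numbers the flat scorer treats both spellings alike, while the TF/IDF scorer gives the misspelling the larger contribution.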

> TF/IDF needs to continue to be the IR model you get when you fire up
> standard KS. But the idea of focusing on pure boolean components is
> attractive. It would be killer if we could abstract TF/IDF to a
> higher level.

Yes, yes, exactly this. Although I do worry that I mean a different
thing by 'this' than you. :( But regardless of how it is abstracted,
I applaud the desire.

> R-trees are a more efficient data structure for geospatial
> searching. However, there's no RTreeWriter writing R-tree data to
> each segment in KS by default. I'd like to write one and make it
> easy to integrate via InvIndexer/SegWriter.

This is a beautiful concrete example. If KinoSearch was flexible enough to
accommodate this smoothly, it seems likely it would be able to
accommodate a very wide range of other uses as well.

> In order to improve search accuracy beyond the limits of TF/IDF,
> especially when dealing with large collections, we need to be able to
> scale up both by spreading to multiple machines AND by layering
> different IR models on top of each other. That's where KS is headed,
> and as things progress, I'm more and more confident that it's going
> to work out well.

This seems like a wonderful goal!

Nathan Kurz
nate@verse.com

Re: Wildcards [ In reply to ]

sprout at cpan

Feb 7, 2008, 4:26 PM

Post #9 of 21 (5920 views)

On Jan 25, 2008, at 11:34 AM, Marvin Humphrey wrote:

> Using regexes to modify regexes is probably not something I would
> have thought to do, but that's all the better. My goal is to
> provide the scaffolding. I'm focused on getting Lexicon and
> PostingList right, so that you can use or abuse them as you wish. :)

I'm trying to implement my RegexpTermQuery class right now. I'm stuck
on one thing. I'd like to use the TermScorer, so that it will be
scored the same way as a single term. But looking at TermWeight, I see
that $plist->make_scorer is called inside sub scorer. I can't call
that method on a posting list because I don't have just one. Am I
using the right approach? Or should I be subclassing Scorer? If that
is the case, what methods should I override? (I'm still not sure
exactly what the scorer is doing.)

I suppose the answers to these questions are precisely what you are
working on. :-)

Anyway, the attached patch shows what I've been trying to do so far
(completely untested).

Re: Wildcards [ In reply to ]

marvin at rectangular

Feb 8, 2008, 3:59 AM

Post #10 of 21 (5917 views)

Father Chrysostomos:

> I suppose the answers to these questions are precisely what you are
> working on. :-)

Please take a look at the newly committed
KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
<http://xrl.us/bfust>

(We'll have to expose Scorer and Tally as public classes, plus all the methods
overridden in the cookbook examples.)

> Anyway, the attached patch shows what I've been trying to do so far
> (completely untested).

I didn't see this because my email machine just crashed, I've had to restore
from backup, and the list archive didn't preserve it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
