Mailing List Archive

Re: Wildcards
On Jan 23, 2008, at 11:47 PM, Nathan Kurz wrote:

> Don't punt on the scoring!

Well, here's the problem, which afflicts the current implementation
of wildcards in Lucene. If we transform the wildcard into an array
of TermQuery objects, then each of them has an individual IDF -- so
in a search for "pet*", the rare term "petard" will contibute more
than the more common term "pets". Should it? The consensus is that
such behavior is sub-optimal.

> From my naive point of view, a wildcard just looks like another way of
> specifying a boolean OR. Why not expand it out with the parser level?
> Sure it might be really big, but there's nothing wrong with providing
> support for industrial strength boolean queries.

However any particular WildcardQuery gets implemented, it will need
some sort of safety valve to prevent "a*" from swamping the server.

> Of course, I say
> that because I'm going to want them one day for my own nefarious
> purposes, and with flexible scoring at that.

Another reason for core KS to concentrate on providing a plugin
scaffolding on which you can hang various KSx extensions, rather than
a smorgasbord of Query subclasses.

>> Actually, if we iterate up front, we could find out the IDF of the
>> fragment and then use that to assess a crude score.
>
> I will be so appreciative some day if you move away from architectures
> that presumes IDF is always going to be the way that things are
> scored.

TF/IDF is hard to beat as a default system. However, I'd like to
make it possible to override, not just at search time, but at index
time. That's the rationale behind the introduction of the abstract
base classes KinoSearch::Index::Reader and
KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
first distro to use these capabilities.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On Jan 25, 2008, at 2:28 AM, Marvin Humphrey wrote:

> However any particular WildcardQuery gets implemented, it will need
> some sort of safety valve to prevent "a*" from swamping the server.

You mean like this?

KSx::Search::RegexpQuery->new(
re => qr/^foo.*/,
field => 'content',
max_terms => 1024,
);



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On Jan 25, 2008, at 8:40 AM, Father Chrysostomos wrote:

>> However any particular WildcardQuery gets implemented, it will
>> need some sort of safety valve to prevent "a*" from swamping the
>> server.
>
> You mean like this?
>
> KSx::Search::RegexpQuery->new(
> re => qr/^foo.*/,
> field => 'content',
> max_terms => 1024,
> );

Yes, that would work.

It would probably be best if exceeding max_terms in the constructor
caused an exception object to be thrown, allowing code like this:

my $query = eval {
KSx::Search::RegexpQuery->new(
re => qr/^foo.*/,
field => 'content',
max_terms => 1024,
);
};
my $exception = $@;
if ( ref($exception) and $exception->isa('MyCustomException') ) {
tell_user_about_error($exception);
}

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On 1/25/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> If we transform the wildcard into an array
> of TermQuery objects, then each of them has an individual IDF -- so
> in a search for "pet*", the rare term "petard" will contibute more
> than the more common term "pets". Should it? The consensus is that
> such behavior is sub-optimal.

Definitely sub-optimal, but to my mind this points out the
shortcomings of TF/IDF when used with Boolean subqueries rather than
the downside of using a Boolean query for wildcards. I hit the same
problem when using Boolean OR's to search for common spelling errors.
Does one really want a search for "speling OR spelling" to prefer
the mis-speling?

In both of these cases, one does not want automatically prefer the
rarer word. My guess would be that any generated query (and thus from
a practical point of view, any Boolean query) does not want this
behaviour. It's only when dealing directly with user entered
keywords that this is a good choice.

In my opinion, one wants the parser to have access to the TF
information and to (optionally) use it when creating the query. And
one wants the the IDF information to be available to the scorer for
it's optional use. But the scorer should not care directly about TF,
only about the weight that has been input for each query term.

> > Of course, I say
> > that because I'm going to want them one day for my own nefarious
> > purposes, and with flexible scoring at that.
>
> Another reason for core KS to concentrate on providing a plugin
> scaffolding on which you can hang various KSx extensions, rather than
> a smorgasbord of Query subclasses.

Agreed. I don't think you need or want a built-in WildcardQuery
class. The core should provide rock solid Boolean components, and a
means of plugging in alternate parsers and scorers.

> TF/IDF is hard to beat as a default system.

TF/IDF is an excellent means for sorting a large database of full text
news articles by relevance based on naively entered keywords. To a
reasonable approximation, web search can be viewed in this light. But
its utility in other situations varies :).

> However, I'd like to make it possible to override, not just at search time,
> but at index time.

I'm not sure I understand this. Is this in the sense of making
certain parts of the index optional, or does it go deeper than this?

> That's the rationale behind the introduction of the abstract
> base classes KinoSearch::Index::Reader and
> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
> first distro to use these capabilities.

I've been watching the commits, but haven't really had an idea of
where you are going. Could you offer an overview when you have the
time?

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On 1/25/08, Father Chrysostomos <sprout@cpan.org> wrote:
> On Jan 25, 2008, at 2:28 AM, Marvin Humphrey wrote:
>
> > However any particular WildcardQuery gets implemented, it will need
> > some sort of safety valve to prevent "a*" from swamping the server.
>
> You mean like this?
>
> KSx::Search::RegexpQuery->new(
> re => qr/^foo.*/,
> field => 'content',
> max_terms => 1024,

While one could do this, if the goal is to be a safety valve you might
this check to be enforced at the core Boolean level instead of by each
extension.

But my instinct would to figure out a way simply make the search work,
rather than throwing it out as an exception. Supporting a search for
"a* lovelace" seems reasonable, and shouldn't actually be that
expensive if implemented lazily.

If one was to have a limit, it should probably be on the total length
of the records that need to be searched, not on the number of terms
involved.

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On Jan 25, 2008, at 12:11 PM, Nathan Kurz wrote:

> Does one really want a search for "speling OR spelling" to prefer
> the mis-speling?

Heh. Perhaps one way to solve this would be via a special TermQuery
subclass that overrides the IDF calculation.

Or maybe the default TermQuery class can do flat scoring and
TFIDFTermQuery would override? I imagine that would make you happy. ;)

> In my opinion, one wants the parser to have access to the TF
> information and to (optionally) use it when creating the query.

IDF is known when compiling the Query to a Weight to a Scorer, but TF
is per-document. You aren't going to have access to TF at the Scorer-
compilation stage.

> And
> one wants the the IDF information to be available to the scorer for
> it's optional use. But the scorer should not care directly about TF,
> only about the weight that has been input for each query term.

Well, this is certainly do-able in theory.

> The core should provide rock solid Boolean components, and a
> means of plugging in alternate parsers and scorers.

Nicely put. I concur.

>> TF/IDF is hard to beat as a default system.
>
> TF/IDF is an excellent means for sorting a large database of full text
> news articles by relevance based on naively entered keywords. To a
> reasonable approximation, web search can be viewed in this light. But
> its utility in other situations varies :).

Heh. :)

TF/IDF needs to continue to be the IR model you get when you fire up
standard KS. But the idea of focusing on pure boolean components is
attractive. It would be killer if we could abstract TF/IDF to a
higher level.

KinoSearch's Query/Weight/Scorer compile phase, though less
convoluted than the Lucene model, is still complex. Perhaps we can
divide things up into layers and make the bottom layer boolean-only
and simpler. To get the good TF/IDF scoring we still have to do some
complex weighting, but maybe we can move that up into
KinoSearch::Search::TFIDF::TFIDFTermQuery and such.

>> However, I'd like to make it possible to override, not just at
>> search time,
>> but at index time.
>
> I'm not sure I understand this. Is this in the sense of making
> certain parts of the index optional, or does it go deeper than this?

I don't think I'd stated it that well. ;)

KinoSearch is, at its heart, a segmented inverted indexer. Right
now, each segment has four main components:

* Doc Storage
* Postings
* Lexicons
* Term Vectors

I'd like to make it possible to add other components.

Consider the problem of determining whether a particular lat/lon
pairing falls within a given square. You can use doubled-up
RangeFilters for that, but it's not efficient. You have to take the
intersection of two slices of the index: EVERYTHING with a $lat
between $lat_min and $lat_max, and EVERYTHING with a $lon between
$lon_min and $lon_max -- thus you end up churning through a lot of
docs that are *way* outside the box.

R-trees are a more efficient data structure for geospatial
searching. However, there's no RTreeWriter writing R-tree data to
each segment in KS by default. I'd like to write one and make it
easy to integrate via InvIndexer/SegWriter.

A fellow actually hacked R-trees into Lucene -- the project is called
"GeoLucene" -- but for a variety of reasons, it wasn't very easy or
elegant.

http://www.doc.ic.ac.uk/~es106/thesis/
MultidimensionalIndexingInLucene.pdf
https://sourceforge.net/projects/geolucene/
http://www.gossamer-threads.com/lists/lucene/java-dev/53378

I think it's possible to make KS quite friendly to such extensions --
much more friendly than Lucene, with its crazy file format tightly
coupled to a gigantic core code base.

>> That's the rationale behind the introduction of the abstract
>> base classes KinoSearch::Index::Reader and
>> KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
>> first distro to use these capabilities.
>
> I've been watching the commits, but haven't really had an idea of
> where you are going. Could you offer an overview when you have the
> time?

Well, a lot of what's gone in during the last month or so has been
straightforward porting of Perl code to C code. Figuring out an
elegant way for abstract methods to call back to the host language
from C, allowing them to be overridden via *either* Perl OR C, was a
real breakthrough. I'd like to finish the porting task and get some
experimental Java bindings up and running. One motivation is to make
it possible to benchmark KS use the Lucene benchmarking contrib code
directly -- eliminating the need to port it.

But beyond that, a central goal is to make KS as extensible as
possible, so that it is reasonably easy for motivated hackers -- like
you, like the GeoLucene guy -- to try out IR models that are suitable
for use within the context of a segmented inverted index.

Powerful search engines (Google being the archetype) don't rely on
one IR mechanism alone, but rather balance a slew of them. You can
do this in KS at a basic level by indexing both stemmed and unstemmed
versions of the same text, increasing the size of the index and your
search-time costs, but improving search accuracy: a search for
"horsing" still matches "horse", "horses", etc but *prefers* the
exact match of "horsing".

In order to improve search accuracy beyond the limits of TF/IDF,
especially when dealing with large collections, we need to be able to
scale up both by spreading to multiple machines AND by layering
different IR models on top of each other. That's where KS is headed,
and as things progress, I'm more and more confident that it's going
to work out well.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On Jan 25, 2008, at 12:32 PM, Nathan Kurz wrote:

> But my instinct would to figure out a way simply make the search work,
> rather than throwing it out as an exception. Supporting a search for
> "a* lovelace" seems reasonable, and shouldn't actually be that
> expensive if implemented lazily.

If you want correct results, you have to cruise through all the docs
that match "a*" no matter what, because you won't know what the top
scorers are until you've seen everything.

> If one was to have a limit, it should probably be on the total length
> of the records that need to be searched, not on the number of terms
> involved.

Or perhaps by introducing search timeouts.

https://issues.apache.org/jira/browse/LUCENE-997

Unfortunately, it's not easy to integrate a bulletproof timeout
mechanism into KS. I think the most efficient approach would be to
use threads: have a timer thread that checks back every once in a
while to see if the query finishes and throws an exception if time
runs out. However, KS doesn't support threads.

I don't think we should get hung up on this detail, though. For
small collections, the cost won't be high enough to matter.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On 1/27/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> IDF is known when compiling the Query to a Weight to a Scorer, but TF
> is per-document. You aren't going to have access to TF at the Scorer-
> compilation stage.

Sometimes I worry that my arguments would be more persuasive I was
able to use common terms correctly. :) What I meant to say was that
the globals information doesn't need to be known by the query, only by
the Scorer. The Query would deal with only the per-document data.
This seems to be how you correctly interpreted it, despite my
mangling.

> Or maybe the default TermQuery class can do flat scoring and
> TFIDFTermQuery would override? I imagine that would make you happy. ;)

Given the smileys, I'm not sure if this is a joke or not. To be
clear, this solution would make me ill. My desire is to separate the
query from the scoring, so having a different Query class for each
possible scoring option is the antithesis of what I want. What I want
is to have a number of independent Scorers that can be plugged into a
Scorer-agnostic set of Queries: simple Queries, simple Scorers,
complex combinations.

> TF/IDF needs to continue to be the IR model you get when you fire up
> standard KS. But the idea of focusing on pure boolean components is
> attractive. It would be killer if we could abstract TF/IDF to a
> higher level.

Yes, yes, exactly this. Although I do worry that I mean a different
thing by 'this' than you. :( But regardless of how it is abstracted,
I applaud the desire.

> R-trees are a more efficient data structure for geospatial
> searching. However, there's no RTreeWriter writing R-tree data to
> each segment in KS by default. I'd like to write one and make it
> easy to integrate via InvIndexer/SegWriter.

This is a beautiful concrete example. If KinoSearch was flexible enough to
accommodate this smoothly, it seems likely it would be able to
accommodate a very wide range of other uses as well.

> In order to improve search accuracy beyond the limits of TF/IDF,
> especially when dealing with large collections, we need to be able to
> scale up both by spreading to multiple machines AND by layering
> different IR models on top of each other. That's where KS is headed,
> and as things progress, I'm more and more confident that it's going
> to work out well.

This seems like a wonderful goal!

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Wildcards [ In reply to ]
On Jan 25, 2008, at 11:34 AM, Marvin Humphrey wrote:

> Using regexes to modify regexes is probably not something I would
> have thought to do, but that's all the better. My goal is to
> provide the scaffolding. I'm focused on getting Lexicon and
> PostingList right, so that you can use or abuse them as you wish. :)


I¢m trying to implement my RegexpTermQuery class right now. I¢m stuck
on one thing. I¢d like to use the TermScorer, so that it will be
scored the same way as a single term. But looking at TermWeight, I see
that $plist->make_scorer is called inside sub scorer. I can¢t call
that method on a posting list because I don¢t have just one. Am I
using the right approach? Or should I be subclassing Scorer? If that
is the case, what methods should I override? (I¢m still not sure
exactly what the scorer is doing.)

I suppose the answers to these questions are precisely what you are
working on. :-)

Anyway, the attached patch shows what I¢ve been trying to do so far
(completely untested).
Re: Wildcards [ In reply to ]
Father Chrysostomos:

> I suppose the answers to these questions are precisely what you are
> working on. :-)

Please take a look at the newly committed
KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
<http://xrl.us/bfust>

(We'll have to expose Scorer and Tally as public classes, plus all the methods
overridden in the cookbook examples.)

> Anyway, the attached patch shows what I've been trying to do so far
> (completely untested).

I didn't see this because my email machine just crashed, I've had to restore
from backup, and the list archive didn't preserve it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 8, 2008, at 3:59 AM, Marvin Humphrey wrote:

> Father Chrysostomos:
>
>> I suppose the answers to these questions are precisely what you are
>> working on. :-)
>
> Please take a look at the newly committed
> KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
> <http://xrl.us/bfust>

Thank you. That¢s very helpful.

> + $tally{$id} = KinoSearch::Search::Tally->new;
> + $tally{$id}->set_score(1.0); # fixed score of 1.0

One question: Is this the place where the weight¢s value should be
specified? (I.e., ->set_score($weight->get_value) )

>> Anyway, the attached patch shows what I've been trying to do so far
>> (completely untested).
>
> I didn't see this because my email machine just crashed, I've had to
> restore
> from backup, and the list archive didn't preserve it.

It¢s probably not much use to you now, but here it is anyway:
Re: Re: Wildcards [ In reply to ]
On 2/8/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> Please take a look at the newly committed
> KinoSearch::Docs::Cookbook::WildcardQuery and let me know how it goes:
> <http://xrl.us/bfust>

Seems like a wonderful sort of example to have. I didn't read though
closely, but saw a small typo: the abstract refers to 'leading
wildcards' instead of 'trailing wildcards'.

--nate

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 8, 2008, at 8:45 AM, Father Chrysostomos wrote:
>
>> + $tally{$id} = KinoSearch::Search::Tally->new;
>> + $tally{$id}->set_score(1.0); # fixed score of 1.0
>
> One question: Is this the place where the weight’s value should be
> specified? (I.e., ->set_score($weight->get_value) )

It would be, except that the example code never even bothers to
calculate such a value.

I wrote up the tutorial in the spirit of designing a public API from
first principles (and in fact, made some changes to the API while
writing it up). I think what's in those docs is pretty coherent, and
manages to fulfill the high-level requirements. The Query-Weight-
Scorer hierarchy is here to stay, methinks, and that's what I wanted
to cover.

The stuff that got left out, though, is a lot less coherent right
now. It's spread out over Weight, Similarity, and Posting, and I'm
not sure exactly how it should be refactored.

As a starting point... Weight should probably have a single crucial
method which takes a Searchable as its main argument:

sub compute {
my ( $self, $searchable ) = @_;
$value{$$self} = $self->get_parent->get_boost;
}

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 11, 2008, at 6:01 PM, Marvin Humphrey wrote:

> The stuff that got left out, though, is a lot less coherent right
> now. It's spread out over Weight, Similarity, and Posting, and I'm
> not sure exactly how it should be refactored.

If you could explain to me in a few words what each method does, and
who calls it, I might be able to do some brainstorming. Right now most
of it is still voodoo to me.


Father Chrysostomos

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 11, 2008, at 8:41 PM, Father Chrysostomos wrote:

> If you could explain to me in a few words what each method does, and
> who calls it, I might be able to do some brainstorming. Right now
> most of it is still voodoo to me.

I'm working on writing up a coherent explanation.

So far, I'd be happy to tell you about Similarity::query_norm. It
used to do something not very useful, as explained here:

http://xrl.us/bf4q3 (Link to mail-archives.apache.org)

Now it's gone. I zapped it.

More to come.

PS: Weight's docs have been improved a bit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On 2/11/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> I think what's in those docs is pretty coherent, and
> manages to fulfill the high-level requirements. The Query-Weight-
> Scorer hierarchy is here to stay, methinks, and that's what I wanted
> to cover.

Hey Marvin ---

I think you are going along a good path here, and that fleshing out a
couple of these examples would be very useful, both as tutorials and
to determine if parts of the API need refinement. For a simple (and
self-serving) example that would make me happy, I'd love to see a
setup that scores Boolean queries purely by the weights given by the
user: Or's return the highest subscore, And's multiply.

As to the fitness of the Query-Weight-Scorer system, you're probably
right to want to keep it. But I think you can do a lot to tidy it up
and make it more comprehensible to newcomers. Instead of writing a
detailed description of each method on each object, I think you'd get
more benefit out of a solid high level overview. I'm a firm believer
that an easy to explain architecture will be both easier to maintain
and easier to improve.

Father C (and lurkers), I think it would be great if you could write
up your overview as well. Even if you haven't poked around all the
innards in depth, you're much closer to the way it works than most
users will ever be. So without reference to how it actually works,
write up something describing how it should work. I think that
understanding how potential users intuitively view a problem is great
step toward providing them a solution.

Nathan Kurz
nate@verse.com

ps. Marvin, I apologize if I'm coming across as too much of an
armchair critic lately. I'll send you some ice cream one of these
days. While I haven't been writing much code, the ice creams getting
to be really tasty.

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:

> Father C (and lurkers), I think it would be great if you could write
> up your overview as well. Even if you haven't poked around all the
> innards in depth, you're much closer to the way it works than most
> users will ever be. So without reference to how it actually works,
> write up something describing how it should work.

Quickly...

I'm pretty close to a coherent API for Weight.

I just refactored so that boosts are dealt with only at construction
time. They now propagate from Query to Weight in a very
straightforward way: simple Query types (TermQuery, PhraseQuery) just
copy the value, while compound query types (BooleanQuery) multiply in
their own boost of their sub-queries. Here's a snip from
BooleanQuery.pm:

# iterate over the clauses, creating a Weight for each one
my $boost = $self->get_boost;
my @sub_weights;
for my $clause ( @{ $self->get_parent->get_clauses->to_perl } ) {
my $sub_query = $clause->get_query;
my $sub_boost = $boost * $sub_query->get_boost;
my $sub_weight = $sub_query->make_weight(
searchable => $searchable,
boost => $sub_boost,
);
push @sub_weights, $sub_weight;
}
$sub_weights{$$self} = \@sub_weights;

What's left to refactor is to divide the remaining methods into two
tasks: calculate a raw value, and normalize.

In the end, we'll have something like this:

sub get_value {
my $self = shift;
my $value = $self->get_raw_value;
$value *= $self->get_boost;
$value *= $self->get_norm_factor;
return $value;
}

Methods like like sum_of_squared_weights, etc, have esoteric meanings
related to cosine similarity measures and other IR theory. It might
be kind of hard to write them up if you aren't up-to-speed on the
relevant topics. Also, the architecture inherited from Lucene was a
spaghettified mess -- the code is hard to follow. While it would be
cool to see writeups, *I* have a hard time with this part of the code
base -- a lot was cargo-culted then verified only by comparing KS
scores against Lucene scores.

Lemme finish revising Weight *before* anybody writes it up. Then I
look forward to making a second leap forward after some feedback.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 13, 2008, at 9:03 PM, Marvin Humphrey wrote:

>
> On Feb 13, 2008, at 8:26 PM, Nathan Kurz wrote:
>
>> Father C (and lurkers), I think it would be great if you could write
>> up your overview as well. Even if you haven't poked around all the
>> innards in depth, you're much closer to the way it works than most
>> users will ever be. So without reference to how it actually works,
>> write up something describing how it should work.

I¢m actually quite clueless in this regard. Yes, I¢ve looked at the
code (at least what¢s in Perl), but I still don¢t know what¢s going
on. As for how it should work, I have no idea...just as long as I can
use it. :-)

> Quickly...
>
> I'm pretty close to a coherent API for Weight.
>
> [... blah blah blah...]
>
> What's left to refactor ....

Could you include a way for a set of terms to be treated as a single
term with regard to scoring, i.e., as if ¡fool¢ and ¡food¢ (in a
wildcard foo* match, for instance) were simple stored as ¡foo¢ in the
index (the way word stemming works)? (If I¢m not making myself clear,
please let me know.) If you don¢t want to include this in core
KinoSearch, could you at least bear this in mind? This would, I
believe, affect the way doc_freq is calculated.


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:

>> Could you include a way for a set of terms to be treated as a
>> single term with regard to scoring, i.e., as if ‘fool’ and
>> ‘food’ (in a wildcard foo* match, for instance) were simple stored
>> as ‘foo’ in the index (the way word stemming works)?

The way the index is laid out, each term gets its own posting list
with its own set of ascending document numbers. Scorers have to
iterate through document numbers in ascending order. The only way to
combine multiple posting lists is to interleave the doc num sets.
The only search-time options are 1) run through each set and build up
a superset before the Scorer starts iterating, or 2) put multiple
PostingList objects into a priority queue sorted by ascending doc num.

Another approach is to break all terms into all possible substrings at
index time and store them in a separate "substrings" field. The size
of the index will explode, but then "foo*" becomes a simple term query
for "foo".

> (If I’m not making myself clear, please let me know.) If you don’t
> want to include this in core KinoSearch, could you at least bear
> this in mind? This would, I believe, affect the way doc_freq is
> calculated.

Yes, doc_freq is a difficult problem to solve with wildcards.

It's particularly hard when you get to dealing with several indexes
across multiple machines.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 15, 2008, at 1:16 PM, Marvin Humphrey wrote:

>
> On Feb 14, 2008, at 8:36 PM, Father Chrysostomos wrote:
>
>>> Could you include a way for a set of terms to be treated as a
>>> single term with regard to scoring, i.e., as if ‘fool’ and
>>> ‘food’ (in a wildcard foo* match, for instance) were simple stored
>>> as ‘foo’ in the index (the way word stemming works)?
>
> The way the index is laid out, each term gets its own posting list
> with its own set of ascending document numbers. Scorers have to
> iterate through document numbers in ascending order. The only way
> to combine multiple posting lists is to interleave the doc num
> sets. The only search-time options are 1) run through each set and
> build up a superset before the Scorer starts iterating, or 2) put
> multiple PostingList objects into a priority queue sorted by
> ascending doc num.
>
> Another approach is to break all terms into all possible substrings
> at index time and store them in a separate "substrings" field. The
> size of the index will explode, but then "foo*" becomes a simple
> term query for "foo".

I don’t think this latter approach is good, because it’s too specific.
I have other uses for this besides wildcards (Greek word-stemming—not
an easy task).

>> (If I’m not making myself clear, please let me know.) If you don’t
>> want to include this in core KinoSearch, could you at least bear
>> this in mind? This would, I believe, affect the way doc_freq is
>> calculated.
>
> Yes, doc_freq is a difficult problem to solve with wildcards.
>
> It's particularly hard when you get to dealing with several indexes
> across multiple machines.

I’ve written some code that follows approach #1 above, namely, it
iterates through the posting lists one after the other, keeping a list
of doc nums that have been seen. It counts them afterwards, to get an
accurate ‘doc_freq’. Is this something you would be willing to include
in core, so I don’t have to repeat it in multiple subclasses?


Father Chrysostomos

P.S.: I have a couple of other projects that are going to get in the
way for probably more than a week, so I won’t be very responsive.


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: Wildcards [ In reply to ]
On Feb 15, 2008, at 10:25 PM, Father Chrysostomos wrote:

> I’ve written some code that follows approach #1 above, namely, it
> iterates through the posting lists one after the other, keeping a
> list of doc nums that have been seen. It counts them afterwards, to
> get an accurate ‘doc_freq’.

PostingList objects have a get_doc_freq() method, so you can just do
this:

my $doc_freq = 0;
$doc_freq += $_->get_doc_freq for @posting_lists;

> Is this something you would be willing to include in core, so I
> don’t have to repeat it in multiple subclasses?


That approach, which is the one used in
KinoSearch::Docs::Cookbook::WildCardQuery, really isn't a very good
option -- it's just makes for the simplest and shortest code sample.

First, you lose all the information other than document numbers. When
iterating over a PostingList, you'd typically want to access info like
the number of times the term appears in the document. Here's a crude,
unweighted scoring technique using TF:

my @hits;
while ( $posting_list->next ) {
my $posting = $posting_list->get_posting;
push @hits, {
doc_num => $posting->get_doc_num,
freq => $posting->get_freq,
};
}
@hits = sort { $b->{freq} <=> $a->{freq} } @hits;

That snippet uses Perl APIs which aren't public yet, but will be
eventually. It would be short-sighted to put an API into core that
yields only doc numbers and discards other info.

The more flexible option would be to provide a CompositePostingList
class which interleaves the iterators of disparate PostingList objects
using a priority queue. ORScorer more or less works this way; it adds
up scores in a way that's analogous to how a CompositePostingList
object would add up term freq.

If you snoop the code for ORScorer (and its support class
ScorerDocQueue), though, you'll see that it's a little involved. It
would be overkill to add a large, complex CompositePostingList class
to KS right now, in order to avoid short-term code duplication.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch