On Jan 23, 2008, at 11:47 PM, Nathan Kurz wrote:
> Don't punt on the scoring!
Well, here's the problem, which afflicts the current implementation
of wildcards in Lucene. If we transform the wildcard into an array
of TermQuery objects, then each of them has an individual IDF -- so
in a search for "pet*", the rare term "petard" will contribute more
than the more common term "pets". Should it? The consensus is that
such behavior is sub-optimal.
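
To make the skew concrete, here is a minimal sketch in Python. It is not
KinoSearch or Lucene code: the idf() helper uses one common formulation,
and the collection size and document frequencies are made-up numbers
chosen purely for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """One common IDF formulation (an assumption, not any library's API)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical collection statistics, invented for this example.
num_docs = 10_000
doc_freqs = {"pets": 1200, "petard": 3}

# Expanding "pet*" into one term query per matching term gives each
# term its own IDF weight.
weights = {term: idf(df, num_docs) for term, df in doc_freqs.items()}

for term, weight in sorted(weights.items()):
    print(f"{term}: idf weight {weight:.2f}")
```

With these numbers the rare term gets the larger weight, so a document
matching only "petard" would outrank one matching only "pets" at equal
term frequency, even though the user asked for neither term specifically.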
> From my naive point of view, a wildcard just looks like another way of
> specifying a boolean OR. Why not expand it out with the parser level?
> Sure it might be really big, but there's nothing wrong with providing
> support for industrial strength boolean queries.
No matter how any particular WildcardQuery gets implemented, it will need
some sort of safety valve to prevent "a*" from swamping the server.
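
One possible shape for such a safety valve, sketched in Python under
stated assumptions: the term dictionary is available as a sorted list,
and expansion aborts once a cap is exceeded. The names expand_prefix,
TooManyTerms and max_terms are hypothetical and do not correspond to
anything in KinoSearch or Lucene.

```python
from bisect import bisect_left


class TooManyTerms(Exception):
    """Raised when a prefix expands to more terms than the cap allows."""


def expand_prefix(sorted_terms, prefix, max_terms=1024):
    """Return all terms starting with `prefix`, refusing to exceed max_terms.

    `sorted_terms` is assumed to be a lexicographically sorted list
    standing in for the index's term dictionary.
    """
    matches = []
    # Binary search to the first term that could match the prefix.
    start = bisect_left(sorted_terms, prefix)
    for i in range(start, len(sorted_terms)):
        term = sorted_terms[i]
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        if len(matches) >= max_terms:
            raise TooManyTerms(
                f"prefix {prefix!r} matches more than {max_terms} terms"
            )
        matches.append(term)
    return matches


terms = sorted(["apple", "pet", "petard", "pets", "zebra"])
print(expand_prefix(terms, "pet"))
```

Raising an exception is only one policy; truncating the expansion or
falling back to a cheaper unscored match would be alternatives, each
with its own trade-offs.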
> Of course, I say
> that because I'm going to want them one day for my own nefarious
> purposes, and with flexible scoring at that.
Another reason for core KS to concentrate on providing a plugin
scaffolding on which you can hang various KSx extensions, rather than
a smorgasbord of Query subclasses.
>> Actually, if we iterate up front, we could find out the IDF of the
>> fragment and then use that to assess a crude score.
>
> I will be so appreciative some day if you move away from architectures
> that presumes IDF is always going to be the way that things are
> scored.
TF/IDF is hard to beat as a default system. However, I'd like to
make it possible to override, not just at search time, but at index
time. That's the rationale behind the introduction of the abstract
base classes KinoSearch::Index::Reader and
KinoSearch::Index::Writer. My hope is to write KSx::RTree as the
first distro to use these capabilities.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch