On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill <firstname.lastname@example.org> wrote:
>> I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge
>> variability in the length of the content -- i can't think of any context where a "short" email is
>> definitively better/worse than a "long" email. more traditional TF/IDF seems like it would make more
>> sense there.
> I was coming to a similar conclusion.
>> well ... hopefully the Similarity docs and the docs on Lucene scoring have filled in most of those
>> blanks before you drill down into the specifics of how SSS works. if not, then any concrete
>> improvements you can suggest would certainly be appreciated...
> Thanks for the links.
> The first thing I notice is that what is listed at the top of Similarity is totally changed. Great stuff about the object interaction. For example, I didn't understand how the Weight object fit in until reading that.
> But I see I got what I asked for. Someone thought describing the object interaction was more important than the scoring formula itself. I'll chew on it (but I'm currently using the 3.4 code).
> My only thought is that the new stuff seems to be at the expense of the formulas listed in the old class overview for Similarity.
What was previously Similarity in older releases has moved to
TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
with the same formulas in the javadocs: https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
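For readers who want the formulas in front of them, here is a rough sketch of the per-term factors that the classic vector-space model combines. It is plain Java with no Lucene dependency; the class and method names (ClassicTfIdfSketch, termScore) are made up for illustration and are not Lucene API. It assumes the default factor definitions (square-root tf, log-based idf, inverse-square-root length norm) and leaves out coord, queryNorm, boosts, and the lossy one-byte norm encoding.

```java
// Illustrative only: mirrors the shape of the classic TF-IDF factors,
// not the actual TFIDFSimilarity/DefaultSimilarity classes.
final class ClassicTfIdfSketch {

    // Term frequency factor: grows with the square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Length normalization: fields with more terms are scaled down.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // One term's contribution, ignoring coord, queryNorm and boosts.
    static double termScore(int freq, long docFreq, long numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTerms);
    }
}
```

Note how lengthNorm keeps shrinking as the field grows; that monotonic penalty is the behavior that matters when content length varies a lot.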
The difference is that in 4.0, the idea is to support other scoring
models beyond the vector space model: that's why if you start looking
at other subclasses of Similarity you will find more options (e.g.
This change is described in CHANGES.txt (below). I hope it's not
confusing: if you have ideas to improve the javadocs and present this
stuff better for migrating users, it would be very helpful.
* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
Query/Weight/Scorer. If you extended Similarity directly before, you should
extend TFIDFSimilarity instead. Similarity is now a lower-level API to
implement other scoring algorithms. See MIGRATE.txt for more details.
* LUCENE-2959: Added a variety of different relevance ranking systems to Lucene.
- Added Okapi BM25, Language Models, Divergence from Randomness, and
Information-Based Models. The models are pluggable, support all of lucene's
features (boosts, slops, explanations, etc) and queries (spans, etc).
- All models default to the same index-time norm encoding as
DefaultSimilarity, so you can easily try these out/switch back and
forth/run experiments and comparisons without reindexing. Note: most of
the models do rely upon index statistics that are new in Lucene 4.0, so
for existing 3.x indexes it's a good idea to upgrade your index to the
new format with IndexUpgrader first.
- Added a new subclass SimilarityBase which provides a simplified API
for plugging in new ranking algorithms without dealing with all of the
nuances and implementation details of Lucene.
- For example, to use BM25 for all fields:
If you instead want to apply different similarities (e.g. ones with
different parameter values or different algorithms entirely) to different
fields, implement PerFieldSimilarityWrapper with your per-field logic.
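To give a feel for how BM25 differs from the vector-space model on fields of varying length, here is a minimal sketch of the textbook Okapi BM25 per-term score. It is standalone Java and does not use Lucene; Bm25Sketch and its methods are hypothetical names, and the constants k1 = 1.2 and b = 0.75 are the commonly used defaults, assumed here rather than taken from this thread.

```java
// Illustrative only: textbook BM25 per-term score, not Lucene's BM25Similarity.
final class Bm25Sketch {
    static final double K1 = 1.2;  // controls how quickly term frequency saturates
    static final double B = 0.75;  // controls how strongly field length is penalized

    // Smoothed inverse document frequency; stays positive for any docFreq <= numDocs.
    static double idf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score of one term in one field, given the field's length relative to the average.
    static double termScore(int freq, long docFreq, long numDocs,
                            double fieldLength, double avgFieldLength) {
        double lengthFactor = K1 * (1.0 - B + B * fieldLength / avgFieldLength);
        return idf(docFreq, numDocs) * (freq * (K1 + 1.0)) / (freq + lengthFactor);
    }
}
```

Two properties are worth noticing: length is measured relative to the collection average rather than in absolute terms, and the term-frequency part is bounded by K1 + 1, so repeating a term many times in a long document cannot grow the score without limit.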