Mailing List Archive

Term pollution from binary data
Hi All,

We are experiencing OOMs when binary data contained in text files
(e.g., a base64 section of a text file) is indexed. We have extensive
recognition of file types but have encountered binary sections inside
otherwise normal text files.

We are using the default value of 128 for termIndexInterval. The
problem arises because binary data generates a large set of random
tokens, leading to totalTerms/termIndexInterval terms stored in memory.
Increasing -Xmx is not viable, as it is already maxed out.
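The arithmetic above can be made concrete with a quick sketch (the 512-million term count is a hypothetical illustration, not a number from this report):

```java
public class TermIndexMemory {
    /** Terms held in RAM by the term index: one per termIndexInterval on-disk terms. */
    public static long residentIndexTerms(long totalTerms, int termIndexInterval) {
        return totalTerms / termIndexInterval;
    }

    public static void main(String[] args) {
        // A polluted index: binary noise has driven the unique term count to 512 million.
        long resident = residentIndexTerms(512000000L, 128);
        System.out.println(resident + " index terms resident in RAM"); // 4000000 entries
    }
}
```

Each resident term carries its text plus a TermInfo, so a few million extra entries can easily exhaust a heap that is already near its limit.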

Does anybody know of a better solution to this problem than writing some
kind of binary section recognizer/filter?

It appears that termIndexInterval is factored into the stored index and
thus cannot be changed dynamically to work around the problem after an
index has become polluted. Other than identifying the documents
containing binary data, deleting them, and then optimizing the whole
index, has anybody found a better way to recover from this problem?

Thanks for any insights or suggestions,

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Term pollution from binary data
I think the binary section recognizer is probably your best bet.

If you write an analyzer that ignores terms that consist only of
hexadecimal digits and contain embedded digits, you will probably
reduce the pollution quite a bit. Such a check is trivial to write and
not too expensive to run.
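A minimal sketch of that check, assuming the hex-digits-plus-embedded-digit reading of the heuristic (our own untested code, not a Lucene filter; in Lucene 2.x it would sit inside a TokenFilter's next() method):

```java
public class BinaryTermHeuristic {

    /** True if a term is plausibly binary noise: every character is a hex digit
     *  and at least one is a numeric digit, so ordinary words like "deed" survive. */
    public static boolean looksBinary(String term) {
        if (term.isEmpty()) return false;
        boolean hasDigit = false;
        for (int i = 0; i < term.length(); i++) {
            char c = Character.toLowerCase(term.charAt(i));
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
            if (!hex) return false;              // a non-hex character: treat as normal text
            if (c >= '0' && c <= '9') hasDigit = true;
        }
        return hasDigit;                         // hex-only AND contains an embedded digit
    }
}
```

A base64 section also contains letters outside a-f, so a production filter would likely combine this with a token-length or mixed-alphanumeric test.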


On Nov 6, 2007, at 6:56 PM, Chuck Williams wrote:

> Hi All,
>
> We are experiencing OOM's when binary data contained in text files
> (e.g., a base64 section of a text file) is indexed. We have
> extensive recognition of file types but have encountered binary
> sections inside of otherwise normal text files.


Re: Term pollution from binary data
Chuck Williams wrote:
> It appears that termIndexInterval is factored into the stored index and
> thus cannot be changed dynamically to work around the problem after an
> index has become polluted. Other than identifying the documents
> containing binary data, deleting them, and then optimizing the whole
> index, has anybody found a better way to recover from this problem?

Hadoop's MapFile is similar to Lucene's term index, and supports a
feature where only a subset of the index entries are loaded (determined
by io.map.index.skip). It would not be difficult to add such a feature
to Lucene by changing TermInfosReader#ensureIndexIsRead().
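The heart of such a change is just subsampling while the term index is read; a standalone sketch of the idea (names are ours, not the actual patch):

```java
import java.util.ArrayList;
import java.util.List;

public class IndexSubsample {
    /** Keep every indexDivisor-th entry of the on-disk term index,
     *  cutting the resident index size by roughly that factor. */
    public static <T> List<T> subsample(List<T> indexEntries, int indexDivisor) {
        List<T> kept = new ArrayList<T>();
        for (int i = 0; i < indexEntries.size(); i++) {
            if (i % indexDivisor == 0) kept.add(indexEntries.get(i));
        }
        return kept;
    }
}
```

With a divisor of 1 the behavior is unchanged; a divisor of N trades N times more sequential scanning per term lookup for N times less resident index.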

Here's a (totally untested) patch.

Doug
Re: Term pollution from binary data
I like this approach: it means, at search time, you can choose to
further subsample the already subsampled (during indexing) set of
terms for the TermInfosReader index. So you can easily turn the
knob to trade off memory usage vs IO cost/latency during searching.

I'll open an issue and work through this patch.

One thing is: I'd prefer not to use a system property for this, since
it's so global, but I'm not sure how better to do it.

A static int on the class would likewise be global. Passing down an
argument to the ctor would be good, except it would have to be
threaded up into SegmentReader, IndexReader, etc., multiplying the
ctors these classes already have.

We can't add a "setIndexDivisor(...)" method, because the terms are
already loaded (consuming too much RAM) during the ctor.

This would be the perfect time to use optional named/keyword
arguments, but Java does not support them (grrrr).

What if, instead, we passed down a Properties instance to IndexReader
ctors? Or alternatively a dedicated class, e.g.,
"IndexReaderInitParameters"? The advantage of a dedicated class is
that it's strongly typed at compile time, and you could put things in
there like an optional DeletionPolicy instance as well. I think there
is a growing list of these sorts of "advanced optional parameters
used during init" that could be handled with such an approach.

Any other options here?

Mike

"Doug Cutting" <cutting@apache.org> wrote:
> Hadoop's MapFile is similar to Lucene's term index, and supports a
> feature where only a subset of the index entries are loaded (determined
> by io.map.index.skip). It would not be difficult to add such a feature
> to Lucene by changing TermInfosReader#ensureIndexIsRead().
>
> Here's a (totally untested) patch.
>
> Doug

Re: Term pollution from binary data
Michael McCandless wrote:
> One thing is: I'd prefer to not use system property for this, since
> it's so global, but I'm not sure how to better do it.

I agree. That was the quick-and-dirty hack. Ideally it should be a
method on IndexReader. I can think of two ways to do that:

1. Add a generic method like IndexReader#setProperty(String,String).
2. Add a specific method like IndexReader#setTermIndexDivisor(int).

I slightly prefer the former, as it permits IndexReader
implementations to support arbitrary properties, at the expense of
being untyped, but that might be overkill. Thoughts?

> We can't add a "setIndexDivisor(...)" method because the terms are
> already loading (consuming too much ram) during the ctor.

Aren't indexes loaded lazily? That's an important optimization for
merging, no? For performance reasons, opening an IndexReader shouldn't
do much more than open files. However, if we build a more generic
mechanism, we should not rely on that.

> What if, instead, we passed down a Properties instance to IndexReader
> ctors? Or alternatively a dedicated class, eg,
> "IndexReaderInitParameters"? The advantage of a dedicated class is
> it's strongly typed at compile time, and, you could put things in
> there like an optional DeletionPolicy instance as well. I think there
> are a growing list of these sorts of "advanced optional parameters
> used during init" that could be handled with such an approach?

(I probably should have read your entire message before starting to
respond... But it's nice to see that we think alike!) This is similar
to my (2) approach, but attempts to solve the typing issue, although I'm
not sure how...

The way we handle it in Hadoop is to pass around a <String,String> map
in the abstract kernel, then have concrete implementation classes
provide static methods that access it. So this might look something like:

public class LuceneProperties extends Properties {
  // utility methods to handle conversion of values to and from Strings
  void setInt(String prop, int value);
  int getInt(String prop);
  void setClass(String prop, Class value);
  Class getClass(String prop);
  Object newInstance(String prop);
  ...
}

public class SegmentReaderProperties {
  private static final String DIVISOR_PROP =
    "org.apache.lucene.index.SegmentReader.divisor";
  public static void setTermIndexDivisor(LuceneProperties props, int i) {
    props.setInt(DIVISOR_PROP, i);
  }
}

Then the IndexReader constructor methods could accept a
LuceneProperties. No point in making this IndexReader specific, since
it might be useful for, e.g., IndexWriter, Searchers, Directories, etc.

An advantage of a <String,String> map over a <String,Object> map for
Hadoop is that it's trivial to serialize.
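For concreteness, here is a compilable rendering of that sketch (the class names are hypothetical; none of this exists in Lucene):

```java
import java.util.Properties;

// Generic string-keyed settings holder, as in the Hadoop-style sketch above.
class LuceneProperties extends Properties {
    void setInt(String prop, int value) {
        setProperty(prop, Integer.toString(value));
    }
    int getInt(String prop, int dflt) {
        String v = getProperty(prop);
        return v == null ? dflt : Integer.parseInt(v);
    }
}

// Typed static accessors live next to the class that consumes the property,
// so new implementations can add properties without touching the core.
class SegmentReaderProperties {
    private static final String DIVISOR_PROP =
        "org.apache.lucene.index.SegmentReader.divisor";
    static void setTermIndexDivisor(LuceneProperties props, int i) {
        props.setInt(DIVISOR_PROP, i);
    }
    static int getTermIndexDivisor(LuceneProperties props) {
        return props.getInt(DIVISOR_PROP, 1); // default: load every index term
    }
}
```

The untyped map stays generic; type safety is recovered at the call site through the static setters and getters.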

Is this what you had in mind?

Doug

Re: Term pollution from binary data
I think it would be better to have IndexReaderProperties, and
IndexWriterProperties.

It just seems an easier API to maintain. It is more logical, as it
keeps related items together.

On Nov 8, 2007, at 12:04 PM, Doug Cutting wrote:



Re: Term pollution from binary data
robert engels wrote:
> I think it would be better to have IndexReaderProperties, and
> IndexWriterProperties.

What methods would these have?

The notion of a termIndexDivisor is specific to a particular IndexReader
implementation, so it probably shouldn't be handled by a generic
IndexReaderProperties. The case is even stronger for things like merge-
and deletion-policy-specific properties. We'd like folks who implement
a new policy or a new IndexReader to be able to add new properties
without having to add new methods to IndexReaderProperties.

So if all that's in IndexReaderProperties is generic stuff like
#getInt(String), with nothing IndexReader-specific, then why not use a
generic class like LuceneProperties, with all the specific setters and
getters on the classes they're specific to? Does that make sense?

> Just seems an easier API for maintenance. It is more logical, as it
> keeps related items together.

Without more examples, I don't see that this is the case. My goal is
that if someone adds a new parameter for a new implementation that's
not in Lucene's core, they should be able to add the setters and
getters there, without altering the core, keeping related items
together.

Doug

Re: Term pollution from binary data
I was thinking more along the lines of the Java ImageIO
ImageReadParam/ImageWriteParam approach:

class IndexReaderParam {
  get/set UseLargeBuffers()
  get/set UseReadAhead()
  ... etc. other "standard" options; a particular index reader is free
  to ignore them ...
}

A custom IndexReader would create a custom CustomIndexReaderParam class:

class CustomIndexReader {
  void setIndexReaderParam(IndexReaderParam p) {
    CustomIndexReaderParam cp;
    if (p instanceof CustomIndexReaderParam) {
      cp = (CustomIndexReaderParam) p;
    } else {
      cp = new CustomIndexReaderParam(p);
    }
    ... set params ...
  }
}

class CustomIndexReaderParam extends IndexReaderParam {
  get/set CacheTerms
  ... etc. ...

  CustomIndexReaderParam() {}
  CustomIndexReaderParam(IndexReaderParam p) {
    // set super properties from p
  }
}


I don't really like the get/set name/value pair stuff. You end up
needing to define constants, etc. It makes reading code much more
difficult, and refactoring harder. It also makes generating the
javadoc for the API much more difficult.
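For comparison with the name/value approach, here is a compilable rendering of the typed-parameter pattern sketched above (hypothetical names):

```java
// Generic, "standard" options any reader may honor or ignore.
class IndexReaderParam {
    private boolean useLargeBuffers;
    boolean getUseLargeBuffers() { return useLargeBuffers; }
    void setUseLargeBuffers(boolean b) { useLargeBuffers = b; }
}

// Reader-specific options extend the generic set, ImageIO-style.
class CustomIndexReaderParam extends IndexReaderParam {
    private boolean cacheTerms;
    CustomIndexReaderParam() {}
    CustomIndexReaderParam(IndexReaderParam p) {
        // copy the generic settings into the custom param object
        setUseLargeBuffers(p.getUseLargeBuffers());
    }
    boolean getCacheTerms() { return cacheTerms; }
    void setCacheTerms(boolean b) { cacheTerms = b; }
}

class CustomIndexReader {
    CustomIndexReaderParam params = new CustomIndexReaderParam();
    void setIndexReaderParam(IndexReaderParam p) {
        // Accept any param object; upgrade generic ones to the custom type.
        params = (p instanceof CustomIndexReaderParam)
                ? (CustomIndexReaderParam) p
                : new CustomIndexReaderParam(p);
    }
}
```

Every option is a typed method, so the available settings show up directly in the javadoc and survive refactoring, at the cost of one param class per reader implementation.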

On Nov 8, 2007, at 1:17 PM, Doug Cutting wrote:



Re: Term pollution from binary data
"Doug Cutting" <cutting@apache.org> wrote:

> Aren't indexes loaded lazily? That's an important optimization for
> merging, no? For performance reasons, opening an IndexReader shouldn't
> do much more than open files. However, if we build a more generic
> mechanism, we should not rely on that.

Whoops, you are right! So in this case we could wait until after the
ctor to set the property. I will take that approach here, then, so we
can decouple it from the "generic properties" discussion. I think,
also, I will throw an IllegalStateException if you try to set this
after the index has already been loaded.
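The contract described here (settable only until the index is loaded) can be sketched as follows (our own minimal rendering, not Lucene code):

```java
public class TermIndexHolder {
    private int indexDivisor = 1;
    private boolean indexLoaded = false;

    /** Must be called before the term index is first loaded. */
    public void setIndexDivisor(int divisor) {
        if (indexLoaded) {
            throw new IllegalStateException("term index was already loaded");
        }
        indexDivisor = divisor;
    }

    public int getIndexDivisor() { return indexDivisor; }

    /** In the real reader this would be lazily triggered by the first term lookup. */
    public void loadIndex() { indexLoaded = true; }
}
```

Because the index is loaded lazily, the window between opening the reader and the first term lookup is when the divisor can still be changed.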

For other things, eg the DeletionPolicy instance & lock timeout for
IndexWriter, and infoStream for both IndexWriter & IndexReader, we
need to use them in the ctor but we don't want to explode the number
of ctors. Eg we now have setDefaultLockTimeout/setDefaultInfoStream
which we could deprecate if we can set this in generic properties
instead.

> > What if, instead, we passed down a Properties instance to IndexReader
> > ctors? Or alternatively a dedicated class, eg,
> > "IndexReaderInitParameters"? The advantage of a dedicated class is
> > it's strongly typed at compile time, and, you could put things in
> > there like an optional DeletionPolicy instance as well. I think there
> > are a growing list of these sorts of "advanced optional parameters
> > used during init" that could be handled with such an approach?
>
> (I probably should have read your entire message before starting to
> respond... But it's nice to see that we think alike!)

That is nice!

> This is similar to my (2) approach, but attempts to solve the typing
> issue, although I'm not sure how...
>
> The way we handle it in Hadoop is to pass around a <String,String> map
> in the abstract kernel, then have concrete implementation classes
> provide static methods that access it. So this might look something
> like:
>
> public class LuceneProperties extends Properties {
> // utility methods to handle conversion of values to and from Strings
> void setInt(String prop, int value);
> int getInt(String prop);
> void setClass(String prop, Class value);
> Class getClass(String prop);
> Object newInstance(String prop)
> ...
> }
>
> public class SegmentReaderProperties {
> private static final String DIVISOR_PROP =
> "org.apache.lucene.index.SegmentReader.divisor";
> public static setTermIndexDivisor(LuceneProperties props, int i) {
> props.setInt(DIVISOR_PROP, i);
> }
> }
>
> Then the IndexReader constructor methods could accept a
> LuceneProperties. No point in making this IndexReader specific, since
> it might be useful for, e.g., IndexWriter, Searchers, Directories, etc.
>
> An advantage of a <String,String> map over a <String,Object> map for
> Hadoop is that it's trivial to serialize.
>
> Is this what you had in mind?

I like that approach! I think I'd prefer <String,Object>, so we could
put InfoStream, DeletionPolicy and other class instances in there
(without requiring that they have zero-arg ctors). Unless there would
be some reason for Lucene to also need serialization?

(Actually, for infoStream I think eventually we should switch to a
logging framework).

Hmmm, one wrinkle: when would we "look at" a property? I guess it's
per-property. E.g., infoStream we could "look at" every time we needed
to print something to it. But say we have "deletionPolicy" in
there, and you suddenly change it in your properties; when are
we supposed to notice that and re-init it? That is a downside vs.
putting set/get on the class directly, because with set/get the class
obviously knows when the property is being changed.

OK, I'm no longer sure this is [yet] necessary for Lucene! What
"properties" would we actually want to put here and NOT in the ctors
or set/gets on the class itself? It feels like a vanishing set.

Mike

Re: Term pollution from binary data
On Thursday, November 8, 2007, Michael McCandless wrote:
> OK, I'm no longer sure this is [yet] necessary for Lucene! What
> "properties" would we actually want to put here and NOT in the ctors
> or set/gets on the class itself? It feels like a vanishing set.

And from my point of view as a heavy user of the Lucene API, I generally do
not like generic property settings, because they make the API undocumented.
The javadoc around the property's setter and getter is about as useful as:

/**
 * Set a property.
 *
 * @param prop the property to set
 * @param value the value to bind to the property
 */
public void setProperty(String prop, Object value)

Then you get quite lost, because you cannot find an exhaustive list of the
properties you can set. Maybe you, the Lucene developers, can ensure today
that the javadoc around this setter is exhaustive enough to be usable. But
tomorrow, a developer adding a new property must not forget to update the
documentation of the generic setter. Even though I think Lucene developers
are a lot more careful than in some other open source projects, nobody is
perfect ;) And having some fields in a Java class is not that much harder
to maintain, I think.

I do not know much about Hadoop, but such an interface might be interesting
there because the configuration is sent to different remote servers, so a
generic class avoids duplicating the serialization code. I don't think
Lucene should do that kind of thing.

just my 2c.

Nicolas

Re: Term pollution from binary data
Nicolas Lalevée wrote:
> And from my point of view as a heavy user of the Lucene API, I generally do
> not like generic property settings, because they make the API undocumented.
> The javadoc around the property's setter and getter is about as useful as:
> /**
>  * Set a property.
>  *
>  * @param prop the property to set
>  * @param value the value to bind to the property
>  */
> public void setProperty(String prop, Object value)

That wouldn't be the method that users would call. That method would
only be used by implementors. Users would call something like
SegmentReader#setTermIndexDivisor(LuceneProperties props, int).

> Then you get quite lost, because you cannot find an exhaustive list of the
> properties you can set.

The documentation would live with the facility in question. So if you
want to know what settings SegmentReader supports, you'd look at the
SegmentReader javadoc.

Perhaps one could use annotations and a doclet to generate a page
listing all available options if that was desired.

In any case, I think Michael is opting to skip this proposal for now.

Doug

Re: Term pollution from binary data
"Doug Cutting" <cutting@apache.org> wrote:

> In any case, I think Michael is opting to skip this proposal for now.

At least for the time being, yes. I think Lucene doesn't (yet) need
this and we should stick with straightforward
setters/args-to-constructors for now.

Mike

Re: Term pollution from binary data
Doug Cutting wrote on 11/07/2007 09:26 AM:
> Hadoop's MapFile is similar to Lucene's term index, and supports a
> feature where only a subset of the index entries are loaded
> (determined by io.map.index.skip). It would not be difficult to add
> such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead().
>
> Here's a (totally untested) patch.

Doug, thanks for this suggestion and your quick patch.

I fleshed this out in the version of Lucene we are using, a bit after
2.1. There was an off-by-1 bug plus a few missing pieces. The attached
patch is for 2.1+, but might be useful as it at least contains the
corrections and missing elements. It also contains extensions to the
tests to exercise the patch.

I tried integrating this into 2.3, but enough has changed that it was
not straightforward (primarily for the test case extensions -- the
implementation seems like it will apply with just a bit of manual
merging). Unfortunately, I have so many local changes that it has
become difficult to track the latest Lucene. The task of syncing up
will come soon. I'll post a proper patch against the trunk in Jira at
a future date if the issue is not already resolved before then.

Michael McCandless wrote on 11/08/2007 12:43 AM:
> I'll open an issue and work through this patch.
>
Michael, I did not see the issue, or I would have posted this there.
Unfortunately, I'm pretty far behind on lucene mail these days.
> One thing is: I'd prefer to not use system property for this, since
> it's so global, but I'm not sure how to better do it.
>

Agree strongly that this should not be global. Whether via ctors or an
index-specific properties object or whatever, it is important to be able
to set this on some indexes and not others in a single application.

Thanks for picking this up!

Chuck
Re: Term pollution from binary data
"Chuck Williams" <chuck@manawiz.com> wrote:
> Doug Cutting wrote on 11/07/2007 09:26 AM:
> > Hadoop's MapFile is similar to Lucene's term index, and supports a
> > feature where only a subset of the index entries are loaded
> > (determined by io.map.index.skip). It would not be difficult to add
> > such a feature to Lucene by changing TermInfosReader#ensureIndexIsRead().
> >
> > Here's a (totally untested) patch.
>
> Doug, thanks for this suggestion and your quick patch.
>
> I fleshed this out in the version of Lucene we are using, a bit after
> 2.1. There was an off-by-1 bug plus a few missing pieces. The attached
> patch is for 2.1+, but might be useful as it at least contains the
> corrections and missing elements. It also contains extensions to the
> tests to exercise the patch.

Thanks Chuck, I will start from your patch & get it working on trunk.

> I tried integrating this into 2.3, but enough has changed so that it was
> not straightforward (primarily for the test case extensions -- the
> implementation seems it will apply with just a bit of manual merging).
> Unfortunately, I have so many local changes that is has become difficult
> to track the latest Lucene. The task of syncing up will come soon.
> I'll post a proper patch against the trunk in jira at a future date if
> the issue is not already resolved before then.
>
> Michael McCandless wrote on 11/08/2007 12:43 AM:
> > I'll open an issue and work through this patch.
> >
> Michael, I did not see the issue, else would have posted this there.
> Unfortunately, I'm pretty far behind on lucene mail these days.

Sorry, I haven't yet gotten to opening the issue. I will try to do so
soon!

> > One thing is: I'd prefer to not use system property for this, since
> > it's so global, but I'm not sure how to better do it.
> >
>
> Agree strongly that this is not global. Whether ctors or an
> index-specific properties object or whatever, it is important to be able
> to set this on some indexes and not others in a single application.
>
> Thanks for picking this up!

Will do! Sorry for the delay.

Mike
