Mailing List Archive

Field compression too slow
Hello all,

I am experiencing some performance problems indexing large(ish) amounts of
text using the Field.Store.COMPRESS option when creating a Field in
Lucene.

I have a sample document which has about 4.5MB of text to be stored as
compressed data within the field, and the indexing of this document seems to
take an inordinate amount of time (over 10 minutes!). When debugging I can
see that it's stuck on the deflate() calls of the Deflater used by Lucene.

I noted that Lucene uses the Deflater.BEST_COMPRESSION compression level
by default when writing a compressed field.

I'm not sure if it would help my particular situation, but is there any way
to provide the option of specifying the compression level? The level used
by Lucene (level 9) is the maximum possible compression level. Ideally I
would like to be able to alter the compression level on the basis of the
field size. This way I can smooth out the compression times across the
various document sizes. This way I can smooth out the compression times
across the various document sizes. I am more interested in consistent time
than in consistent compression.
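
(To illustrate what I mean: something like the sketch below, which compresses
the text itself with java.util.zip.Deflater at a chosen level and stores the
raw bytes in a binary stored field, so Lucene doesn't deflate it again. The
field name, the BEST_SPEED choice and the "largeText" variable are just for
illustration, and error handling is omitted.)

  import java.io.ByteArrayOutputStream;
  import java.util.zip.Deflater;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  // Compress at a level I choose (BEST_SPEED here) instead of level 9.
  byte[] input = largeText.getBytes("UTF-8");
  Deflater deflater = new Deflater(Deflater.BEST_SPEED);
  deflater.setInput(input);
  deflater.finish();

  ByteArrayOutputStream out = new ByteArrayOutputStream(input.length / 4);
  byte[] buf = new byte[8192];
  while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
  }
  deflater.end();

  // Store the already-compressed bytes as a binary stored field.
  Document doc = new Document();
  doc.add(new Field("contents", out.toByteArray(), Field.Store.YES));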

Or... could there be some other reason my document takes this long to index
(and holds up all other threads)?

Thanks.
Re: Field compression too slow
> I'm not sure if it would help my particular situation, but is there any way
> to provide the option of specifying the compression level? The level used
> by Lucene (level 9) is the maximum possible compression level. Ideally I
> would like to be able to alter the compression level on the basis of the
> field size. This way I can smooth out the compression times across the
> various document sizes. I am more interested in consistent time than in
> consistent compression.

I agree, we should make the compression level configurable. It's
disturbing that it takes minutes to compress a 4.5 MB document! I'll
open a Jira issue for this.

> Or... could there some other reason my document takes this long to index?
> (and hold up all other threads).

You might want to try running the command-line "zip" utility with best
compression on that text, to see how long it takes. Lucene just uses the
java.util.zip.* APIs (the same DEFLATE compression that "zip" uses).
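
For example, a quick timing loop like this (just a sketch; "data" is assumed
to hold your ~4.5 MB of text as a byte[]) would show how the levels compare
on your text:

  import java.util.zip.Deflater;

  // Time Deflater at each level from BEST_SPEED (1) to BEST_COMPRESSION (9).
  for (int level = Deflater.BEST_SPEED; level <= Deflater.BEST_COMPRESSION; level++) {
      Deflater d = new Deflater(level);
      d.setInput(data);
      d.finish();
      byte[] buf = new byte[65536];
      int compressedSize = 0;
      long start = System.currentTimeMillis();
      while (!d.finished()) {
          compressedSize += d.deflate(buf);
      }
      d.end();
      System.out.println("level " + level + ": "
          + (System.currentTimeMillis() - start) + " ms, "
          + compressedSize + " bytes");
  }

That should tell you pretty quickly whether dropping below level 9 buys you
much on your particular text.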

One correction: this compression should not block other threads. It
runs outside of "synchronized" code, so if you have other threads
adding documents, they can do so fully in parallel with the one thread
that's doing the slow compression.

Mike

Re: Field compression too slow
> I agree, we should make the compression level configurable. It's
> disturbing that it takes minutes to compress a 4.5 MB document! I'll
> open a Jira issue for this.

OK I created https://issues.apache.org/jira/browse/LUCENE-648 to track
this issue.

Mike

Re: Field compression too slow
Thanks for the Jira issue...

one question on your synchronization comment...

I have "assumed" I can't have two threads writing to the index concurrently,
so have implemented my own read/write locking system. Are you saying I
don't need to bother with this? My reading of the doco suggests that you
shouldn't have two IndexWriters open on the same index.

I know that if I try a search from a different JVM while the index is being
written I get the odd "FileNotFound" exception, so I had assumed writing
concurrently would be a bigger problem.

Of course there is a difference between multiple threads in a single JVM
and threads in multiple JVMs (which is my situation). But I may be able to
re-architect so that a single JVM reads and writes the one index, if that
will allow me to drop my own locking/unlocking system.

As it turns out I have devised an alternative strategy. Storing large amounts
of data in the index (compressed or not) seems to have the secondary effect
of slowing down retrieval of results... and it even led to OutOfMemoryErrors
for me (presumably because the Hits.doc(n) call loads all the stored fields
into memory?).

I needed to store the contents of all fields so that when I re-index the
document (as some fields change) I don't lose this data (my kingdom for the
ability to "update" a field!). I decided to store the "large" data
outside the index (where I can store/compress it asynchronously)
and pull it back in when I need to re-index.

Thanks again for the response.

Re: Field compression too slow
> I have "assumed" I can't have two threads writing to the index
> concurrently,
> so have implemented my own read/write locking system. Are you saying I
> don't need to bother with this? My reading of the doco suggests that you
> shouldn't have two IndexWriters open on the same index.
>
> I know that if I try a search from a different JVM while the index is being
> written I get the odd "FileNotFound" exception, so I had assumed writing
> concurrently would be a bigger problem.
>
> Of course there is a difference between multiple threads in a single JVM
> and threads in multiple JVMs (which is my situation). But I may be able to
> re-architect so that a single JVM reads and writes the one index, if that
> will allow me to drop my own locking/unlocking system.

You are right: only one "writer" (an IndexWriter adding docs, or an
IndexReader deleting docs) may be open at a time, but you can have
multiple threads (within one JVM) sharing that writer, and they should
parallelize nicely. It sounds like your situation can't take advantage
of multiple threads on one writer...
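
For example, within one JVM something like this works fine (just a sketch;
the index path, field name, thread count and the makeText() helper are all
made up, and error handling is mostly elided):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // One IndexWriter, shared by several threads in the same JVM.
  final IndexWriter writer =
      new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

  Thread[] threads = new Thread[4];
  for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread() {
          public void run() {
              try {
                  Document doc = new Document();
                  doc.add(new Field("contents", makeText(),
                      Field.Store.COMPRESS, Field.Index.TOKENIZED));
                  // addDocument is safe to call concurrently on the shared writer
                  writer.addDocument(doc);
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      };
      threads[i].start();
  }
  for (int i = 0; i < threads.length; i++) {
      threads[i].join();
  }
  writer.close();

Each thread builds its own Document; only the IndexWriter is shared.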

Separately, multiple IndexSearchers can be instantiated in different JVMs.
Within one JVM, multiple threads should share a single IndexSearcher, and
they will also parallelize nicely.

However, IndexSearchers & IndexWriters must synchronize to make sure
you don't hit that FileNotFoundException. Basically, every time a
new IndexReader (used by IndexSearcher) is instantiated, it needs to
ensure no IndexWriter is in the process of committing (writing a new
segments file).

This is currently implemented with file-based locks, but this method
of locking has known bugs on remotely mounted filesystems (it sounds
like this may be your use case?).

If you need to share an index on a remote filesystem, you either need
to do your own locking (it sounds like you've done this), or take an
approach like the Solr project, where you take "known safe" snapshots
of the index and each searcher cuts over to the latest snapshot when it's
ready.

> As it turns out I have devised an alternate strategy. Storing large
> amounts
> of data in the index (compressed or not) seems to have the secondary effect
> of slowing down retrieval of results... and even led to OutOfMemory errors
> for me (presumably because the hits.doc(n) call loads the stored fields
> into
> memory?).
>
> I needed to store the contents of all fields, so when I re-index the
> document (as some fields change) I don't lose this data (my kingdom for the
> ability to "update" a field!). I decided to store the "large" data
> elsewhere outside the index (where I can store/compress it asynchronously)
> and pull it out from here when I need to re-index.

Yes, when you load a doc through Hits.doc(n) it will load all stored
fields. There have been some good recent fixes in this area (after the
2.0 release, I believe), including the ability to mark fields for
"lazy loading", and the ability to load a document while specifying a
subset of the fields that you actually want. See
http://issues.apache.org/jira/browse/LUCENE-545 for juicy details.
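
Once you're on a version with that change, loading just the fields you need
looks roughly like this (a sketch against the trunk API from LUCENE-545, so
class names could still shift; the field names and docId are made up):

  import java.util.HashSet;
  import java.util.Set;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.SetBasedFieldSelector;
  import org.apache.lucene.index.IndexReader;

  // Load "title" eagerly, leave the big "contents" field lazy.
  Set eager = new HashSet();
  eager.add("title");
  Set lazy = new HashSet();
  lazy.add("contents");

  IndexReader reader = IndexReader.open("/path/to/index");
  Document doc = reader.document(docId, new SetBasedFieldSelector(eager, lazy));
  String title = doc.get("title");  // loaded now
  // "contents" is only read from disk if and when its value is asked for
  reader.close();

(I believe you have to go through the IndexReader directly for this;
Hits.doc(n) doesn't take a selector.)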

> Thanks again for the response.

You're welcome!

Mike

Re: Field compression too slow
> I have a sample document which has about 4.5MB of text to be stored as
> compressed data within the field, and the indexing of this document
> seems to
> take an inordinate amount of time (over 10 minutes!). When debugging I can
> see that it's stuck on the deflate() calls of the Deflater used by Lucene.

Would it be possible to get a copy of this document's text (only if
you're able to share it)? I'd like to run some tests to work out the
tradeoff (time taken vs. % deflated) of the different levels we can pass
to the zip library. If not, that's fine; I'll just run on various random
text sources I can find.

Thanks.

Mike

Re: Field compression too slow
I can share the data.. but it would be quicker for you to just pull out some
random text from anywhere you like.

The issue is that the text was in an email, which was one of about 2,000 and
I don't know which one. I got the 4.5MB figure from the number of bytes in
the byte array reported in the debugger... and didn't bother to record the
email file it was contained in. Anyway.. I think it was text extracted from
a PDF extracted from a ZIP... so it would take me a while to locate!

It's worth noting that the time I quoted is somewhat misleading. I killed
my process after 10 minutes because I realised there was a problem and any
further time was irrelevant. But... the length of time is partly due to
the overall load on the machine.

I am processing multiple files concurrently, and in so doing am performing a
bunch of CPU intensive tasks (text extraction, encryption etc). Most of
this happens in separate threads, but they are all competing for CPU time.

The only way to really benchmark the compression performance is to vary both
the compression level and the number of threads, and see how it scales.

I'm confident that the compression mechanism used in Lucene is fine (I had a
look at the code... all seems pretty good), so I would guess that Lucene's
performance is comparable to "vanilla" compression using the standard
Java libraries.

I'm betting you get non-linear scalability no matter what the compression
level (due to the max throughput of the CPU, bus speed etc); but you may
find scalability tends towards a linear curve (oxymoron?) the lower the
compression level.

This is really what I am looking for.

Also.. upon reflection I'm not certain using compression inside the index is
really a valuable process without lazy loading anyway. The time-cost of
decompression when iterating hits reduces the overall effectiveness of the
index. This is obviously solved by lazy loading (for searches) and I am
excited about this feature being added. Obviously it depends on the
use-case, but in mine I realised that storing large amounts of data in the
index is just not the right way to do things. So I changed my architecture
so that the larger amounts of data are stored (and compressed) elsewhere,
then brought back in when I need to update a document.

Of course all my problems would be solved if I had lazy loading AND field
updating :)

Re: Field compression too slow
> I can share the data.. but it would be quicker for you to just pull out
> some
> random text from anywhere you like.

OK, I hear you. I'll pull together some test data ... thanks.

> Also.. upon reflection I'm not certain using compression inside the
> index is
> really a valuable process without lazy loading anyway. The time-cost of
> decompression when iterating hits reduces the overall effectiveness of the
> index. This is obviously solved by lazy loading (for searches) and I am
> excited about this feature being added. Obviously it depends on the
> use-case, but in mine I realised that storing large amounts of data in the
> index is just not the right way to do things. So I changed my architecture
> so that the larger amounts of data are stored (and compressed) elsewhere,
> then brought back in when I need to update a document.
>
> Of course all my problems would be solved if I had lazy loading AND field
> updating :)

Completely agreed! Lazy loading & specific field selection on loading
a doc have been addressed ... but field updating in the presence of
compressed fields hasn't yet been addressed (I think). I'll raise this
use case on the java-dev list.

Mike

Re: Field compression too slow
Mike, which version of Lucene supports lazy loading? Thanks.

Re: Field compression too slow
SVN HEAD does; it has not been released yet.

See http://issues.apache.org/jira/browse/LUCENE-545
and
http://issues.apache.org/jira/browse/LUCENE-609

for some of the issues with it.


--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886



