Mailing List Archive

possible bug with indexing with term vectors
On Fri, 28 Sep 2007, Andi Vajda wrote:

> I found a bug with indexing documents that contain fields with Term Vectors.
> The indexing fails with 'read past EOF' errors in what seems to be the index
> optimizing phase during addIndexes(). (I index first into a RAMDirectory,
> then addIndexes() into an FSDirectory).
>
> I have not filed the bug yet formally as I need to isolate the code. If I
> turn indexing with term vectors off, indexing completes fine.

I tried all morning to isolate the problem, but I seem to be unable to
reproduce it in a simple unit test. In my application, I've been able to get
errors by doing even less: just creating an FSDirectory and adding documents
with fields with term vectors fails when optimizing the index with the error
below. I even tried to add the same documents, in the same order, in the unit
test, but to no avail. It just works.

What is different about my environment? Well, I'm running PyLucene, but the
new one, the one using Apple's Java VM, the same VM I'm using to run the
unit test. And I'm not doing anything special like calling back into Python or
something, I'm just calling regular Lucene APIs, adding documents into an
IndexWriter on an FSDirectory using a StandardAnalyzer. If I stop using term
vectors, all is working fine.

I'd like to get to the bottom of this but could use some help. Does the
stacktrace below ring a bell? Is there a way to run the whole indexing and
optimizing in a single thread?

Thanks!

Andi..

Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: possible bug with indexing with term vectors
"Andi Vajda" <vajda@osafoundation.org> wrote:
>
> I tried all morning to isolate the problem but I seem to be unable
> to reproduce it in a simple unit test. In my application, I've been
> able to get errors by doing even less: just creating a FSDirectory
> and adding documents with fields with term vectors fails when
> optimizing the index with the error below. I even tried to add the
> same documents, in the same order, in the unit test but to no
> avail. It just works.

Are you trying your unit test first in Python (using PyLucene)?

> What is different about my environment ? Well, I'm running PyLucene,
> but the new one, the one using a Apple's Java VM, the same VM I'm
> using to run the unit test. And I'm not doing anything special like
> calling back into Python or something, I'm just calling regular
> Lucene APIs adding documents into an IndexWriter on an FSDirectory
> using a StandardAnalyzer. If I stop using term vectors, all is
> working fine.

Spooky. It's definitely possible something is broken (there is a lot
of new code in 2.3).

Are your documents irregular wrt term vectors? (I.e., some docs have
none, others store the terms but not positions/offsets, etc.) Any
interesting changes to Lucene's defaults (autoCommit=false, etc.)?

> I'd like to get to the bottom of this but could use some help. Does
> the stacktrace below ring a bell ? Is there a way to run the whole
> indexing and optimizing in one single thread ?

You can easily turn off the concurrent (background) merges by doing
this:

    writer.setMergeScheduler(new SerialMergeScheduler());

though that probably isn't punched through to Python in PyLucene. You
can also build a Lucene JAR with a small change to IndexWriter.java to
do the same thing.

That stacktrace is happening while merging term vectors during an
optimize. It's specifically occurring when loading the term vectors
for a given doc X: we read a position from the index stream (tvx) just
fine, but then when we try to read the first vInt from the document
stream (tvd) we hit the EOF exception. So that position was too large
or the tvd file was somehow truncated. Weird.
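For readers unfamiliar with the vInt mentioned here: it is Lucene's variable-length integer encoding, seven payload bits per byte, least-significant group first, with the high bit set on every byte except the last. A small standalone sketch in Python (illustrative only, not Lucene's code; `write_vint`/`read_vint` are hypothetical names) shows the encoding and why asking for a vInt at or past the end of a stream surfaces as exactly this kind of EOF error:

```python
import io

def write_vint(out: io.BytesIO, value: int) -> None:
    # 7 payload bits per byte, low-order first; high bit = "more bytes follow"
    while value & ~0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    out.write(bytes([value]))

def read_vint(inp: io.BytesIO) -> int:
    result, shift = 0, 0
    while True:
        b = inp.read(1)
        if not b:
            # the analogue of what BufferedIndexInput.refill reports
            raise EOFError("read past EOF")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7

buf = io.BytesIO()
write_vint(buf, 300)       # encodes as two bytes: 0xAC 0x02
buf.seek(0)
print(read_vint(buf))      # prints 300
try:
    read_vint(buf)         # nothing left to read: fails like the merge did
except EOFError as e:
    print(e)               # read past EOF
```

The first byte of a vInt is mandatory, so a reader positioned at (or beyond) end-of-file fails immediately on that very first read.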

Can you call "writer.setInfoStream(System.out)" and get the error to
occur and then post the resulting log? It may shed some light
here....

Mike

Re: possible bug with indexing with term vectors
On Fri, 28 Sep 2007, Michael McCandless wrote:

>> I tried all morning to isolate the problem but I seem to be unable
>> to reproduce it in a simple unit test. In my application, I've been
>> able to get errors by doing even less: just creating a FSDirectory
>> and adding documents with fields with term vectors fails when
>> optimizing the index with the error below. I even tried to add the
>> same documents, in the same order, in the unit test but to no
>> avail. It just works.
>
> Are you trying your unit test first in Python (using PyLucene)?

No, I wrote it in Java to begin with. But it's a good idea. I should try it
from PyLucene too.

>> What is different about my environment ? Well, I'm running PyLucene,
>> but the new one, the one using a Apple's Java VM, the same VM I'm
>> using to run the unit test. And I'm not doing anything special like
>> calling back into Python or something, I'm just calling regular
>> Lucene APIs adding documents into an IndexWriter on an FSDirectory
>> using a StandardAnalyzer. If I stop using term vectors, all is
>> working fine.
>
> Are your documents irregular wrt term vectors? (Ie some docs have
> none, others store the terms but not positions/offsets, etc?). Any
> interesting changes to Lucene's defaults (autoCommit=false, etc)?

All default config values, no config changes. All documents follow the same
pattern of having five fields, one with term vectors.

>> I'd like to get to the bottom of this but could use some help. Does
>> the stacktrace below ring a bell ? Is there a way to run the whole
>> indexing and optimizing in one single thread ?
>
> You can easily turn off the concurrent (background) merges by doing
> this:
>
> writer.setMergeScheduler(new SerialMergeScheduler())
>
> though that probably isn't punched through to Python in PyLucene. You
> can also build a Lucene JAR w/ a small change to IndexWriter.java to
> do the same thing.

The new PyLucene is built with a code generator and all public APIs and
classes are made available to Python. SerialMergeScheduler is available.

> That stacktrace is happening while merging term vectors during an
> optimize. It's specifically occuring when loading the term vectors
> for a given doc X; we read a position from the index stream (tvx) just
> fine, but then when we try to read the first vInt from the document
> stream (tvd) we hit the EOF exception. So that position was too large
> or the tvd file was somehow truncated. Weird.
>
> Can you call "writer.setInfoStream(System.out)" and get the error to
> occur and then post the resulting log? It may shed some light
> here....

I called writer.setMergeScheduler(SerialMergeScheduler()) just after creating
the writer and called writer.setInfoStream(System.out) just before calling
optimize(). Below is what I get:

IW [Thread-0]: flush: segment=_7f docStoreSegment=_7f docStoreOffset=0
flushDocs=true flushDeletes=false flushDocStores=true numDocs=372
numBufDelTerms=0
IW [Thread-0]: index before flush

closeDocStore: 5 files to flush to segment _7f

flush postings as segment _7f numDocs=372
oldRAMSize=75040 newFlushedSize=21018 docs/MB=18,558.867 new/old=28.009%
IW [Thread-0]: checkpoint: wrote segments file "segments_9j"
Deleter [Thread-0]: now checkpoint "segments_9j" [1 segments ; isCommit =
true]
Deleter [Thread-0]: IncRef "_7f.fnm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.frq": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.prx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tis": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tii": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.nrm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tvx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tvf": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tvd": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.fdx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9i"
Deleter [Thread-0]: DecRef "segments_9i": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9i"
IW [Thread-0]: checkpoint: wrote segments file "segments_9k"
Deleter [Thread-0]: now checkpoint "segments_9k" [1 segments ; isCommit =
true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9j"
Deleter [Thread-0]: DecRef "_7f.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fnm"
Deleter [Thread-0]: DecRef "_7f.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.frq"
Deleter [Thread-0]: DecRef "_7f.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.prx"
Deleter [Thread-0]: DecRef "_7f.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tis"
Deleter [Thread-0]: DecRef "_7f.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tii"
Deleter [Thread-0]: DecRef "_7f.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.nrm"
Deleter [Thread-0]: DecRef "_7f.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvx"
Deleter [Thread-0]: DecRef "_7f.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvf"
Deleter [Thread-0]: DecRef "_7f.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvd"
Deleter [Thread-0]: DecRef "_7f.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdx"
Deleter [Thread-0]: DecRef "_7f.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdt"
Deleter [Thread-0]: DecRef "segments_9j": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9j"
IW [Thread-0]: optimize: index now _7f:c372
IW [Thread-0]: flush: segment=null docStoreSegment=null docStoreOffset=0
flushDocs=false flushDeletes=false flushDocStores=false numDocs=0
numBufDelTerms=0
IW [Thread-0]: index before flush _7f:c372
IW [Thread-0]: at close: _7f:c372
<DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
IW [Thread-0]: flush: segment=_7g docStoreSegment=_7g docStoreOffset=0
flushDocs=true flushDeletes=false flushDocStores=true numDocs=5
numBufDelTerms=0
IW [Thread-0]: index before flush _7f:c372

closeDocStore: 5 files to flush to segment _7g

flush postings as segment _7g numDocs=5
oldRAMSize=76960 newFlushedSize=437 docs/MB=11,997.437 new/old=0.568%
IW [Thread-0]: checkpoint: wrote segments file "segments_9l"
Deleter [Thread-0]: now checkpoint "segments_9l" [2 segments ; isCommit =
true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.fnm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.frq": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.prx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tis": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tii": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.nrm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tvx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tvf": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tvd": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.fdx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9k"
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "segments_9k": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9k"
IW [Thread-0]: checkpoint: wrote segments file "segments_9m"
Deleter [Thread-0]: now checkpoint "segments_9m" [2 segments ; isCommit =
true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9l"
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "_7g.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fnm"
Deleter [Thread-0]: DecRef "_7g.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.frq"
Deleter [Thread-0]: DecRef "_7g.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.prx"
Deleter [Thread-0]: DecRef "_7g.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tis"
Deleter [Thread-0]: DecRef "_7g.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tii"
Deleter [Thread-0]: DecRef "_7g.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.nrm"
Deleter [Thread-0]: DecRef "_7g.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvx"
Deleter [Thread-0]: DecRef "_7g.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvf"
Deleter [Thread-0]: DecRef "_7g.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvd"
Deleter [Thread-0]: DecRef "_7g.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdx"
Deleter [Thread-0]: DecRef "_7g.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdt"
Deleter [Thread-0]: DecRef "segments_9l": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9l"
IW [Thread-0]: optimize: index now _7f:c372 _7g:c5
IW [Thread-0]: add merge to pendingMerges: _7f:c372 _7g:c5 [optimize] [total 1
pending]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.cfs": pre-incr count is 1
IW [Thread-0]: now merge
merge=_7f:c372 _7g:c5 into _7h [optimize]
index=_7f:c372 _7g:c5
IW [Thread-0]: merging _7f:c372 _7g:c5 into _7h [optimize]
IW [Thread-0]: merge: total 377 docs
IW [Thread-0]: hit exception during merge; now refresh deleter on segment _7h
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.fdt"
Deleter [Thread-0]: delete "_7h.fdt"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.fdx"
Deleter [Thread-0]: delete "_7h.fdx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.fnm"
Deleter [Thread-0]: delete "_7h.fnm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.frq"
Deleter [Thread-0]: delete "_7h.frq"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.nrm"
Deleter [Thread-0]: delete "_7h.nrm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.prx"
Deleter [Thread-0]: delete "_7h.prx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.tii"
Deleter [Thread-0]: delete "_7h.tii"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.tis"
Deleter [Thread-0]: delete "_7h.tis"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.tvd"
Deleter [Thread-0]: delete "_7h.tvd"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.tvf"
Deleter [Thread-0]: delete "_7h.tvf"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced
file "_7h.tvx"
Deleter [Thread-0]: delete "_7h.tvx"
IW [Thread-0]: hit exception during merge
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "_7g.cfs": pre-decr count is 2
java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1668)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1604)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)

IW [Thread-0]: flush: segment=null docStoreSegment=null docStoreOffset=0
flushDocs=false flushDeletes=false flushDocStores=false numDocs=0
numBufDelTerms=0
IW [Thread-0]: index before flush _7f:c372 _7g:c5
IW [Thread-0]: close: wrote segments file "segments_9n"
Deleter [Thread-0]: now checkpoint "segments_9n" [2 segments ; isCommit =
true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.cfs": pre-incr count is 1
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9m"
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "_7g.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "segments_9m": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9m"
IW [Thread-0]: at close: _7f:c372 _7g:c5

Andi..


Re: possible bug with indexing with term vectors
Hmmm, not sure, but in looking at DocumentsWriter, it seems like
lines around 553 might be at issue:
    if (tvx != null) {
      tvx.writeLong(tvd.getFilePointer());
      if (numVectorFields > 0) {
        tvd.writeVInt(numVectorFields);
        for(int i=0;i<numVectorFields;i++)
          tvd.writeVInt(vectorFieldNumbers[i]);
        assert 0 == vectorFieldPointers[0];
        tvd.writeVLong(tvf.getFilePointer());
        long lastPos = vectorFieldPointers[0];
        for(int i=1;i<numVectorFields;i++) {
          long pos = vectorFieldPointers[i];
          tvd.writeVLong(pos-lastPos);
          lastPos = pos;
        }
        tvfLocal.writeTo(tvf);
        tvfLocal.reset();
      }
    }

Specifically, the exception being thrown suggests that the reader is
trying to read in a vInt that contains the number of fields that have
vectors. However, DocumentsWriter only writes out this vInt
if numVectorFields is > 0.

I think you might try:

    if (numVectorFields > 0) {
      ....
    } else {
      tvd.writeVInt(0);
    }
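To make the failure mode concrete, here is a toy simulation in Python (hypothetical names and a deliberately simplified layout: single bytes stand in for vInts, 8-byte big-endian longs for the tvx pointers; real Lucene headers and per-field data are omitted). The buggy writer records a tvx pointer for every doc but writes nothing to tvd for a doc with no vector fields, so that doc's pointer can land at the end of tvd; always writing the count, even zero, avoids that:

```python
import io

def write_doc(tvx: io.BytesIO, tvd: io.BytesIO, field_numbers, fixed: bool) -> None:
    # tvx gets one pointer per doc, recorded unconditionally...
    tvx.write(len(tvd.getvalue()).to_bytes(8, "big"))
    if field_numbers:
        tvd.write(bytes([len(field_numbers)]))   # field count ("vInt", simplified)
        tvd.write(bytes(field_numbers))          # field numbers
    elif fixed:
        tvd.write(bytes([0]))                    # the fix: record an explicit 0 count
    # ...but the buggy path writes nothing to tvd for a vector-less doc

def read_field_count(tvx_data: bytes, tvd_data: bytes, doc: int) -> int:
    pointer = int.from_bytes(tvx_data[doc * 8:(doc + 1) * 8], "big")
    if pointer >= len(tvd_data):
        raise EOFError("read past EOF")          # what TermVectorsReader runs into
    return tvd_data[pointer]

for fixed in (False, True):
    tvx, tvd = io.BytesIO(), io.BytesIO()
    write_doc(tvx, tvd, [3], fixed)              # doc 0: one vector field
    write_doc(tvx, tvd, [], fixed)               # doc 1: no vector fields
    try:
        print(fixed, read_field_count(tvx.getvalue(), tvd.getvalue(), 1))
    except EOFError as e:
        print(fixed, e)                          # buggy run fails, fixed run prints 0
```

In the buggy run the second doc's pointer equals the length of tvd, reproducing the "read past EOF"; and if more vector-carrying docs follow instead, the vector-less doc would silently read the next doc's vectors, since both pointers coincide.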

In the old TermVectorsWriter, it used to be:

  private void writeDoc() throws IOException {
    if (isFieldOpen())
      throw new IllegalStateException("Field is still open while writing document");
    //System.out.println("Writing doc pointer: " + currentDocPointer);
    // write document index record
    tvx.writeLong(currentDocPointer);

    // write document data record
    final int size = fields.size();

    // write the number of fields
    tvd.writeVInt(size);

    // write field numbers
    for (int i = 0; i < size; i++) {
      TVField field = (TVField) fields.elementAt(i);
      tvd.writeVInt(field.number);
    }

http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup



On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:

>
> On Fri, 28 Sep 2007, Andi Vajda wrote:
>
>> I found a bug with indexing documents that contain fields with
>> Term Vectors. The indexing fails with 'reading past EOF' errors in
>> what seems the index optimizing phase during addIndexes(). (I
>> index first into a RAMDirectory, then addIndexes() into an
>> FSDIrectory).
>>
>> I have not filed the bug yet formally as I need to isolate the
>> code. If I turn indexing with term vectors off, indexing completes
>> fine.
>
> I tried all morning to isolate the problem but I seem to be unable
> to reproduce it in a simple unit test. In my application, I've been
> able to get errors by doing even less: just creating a FSDirectory
> and adding documents with fields with term vectors fails when
> optimizing the index with the error below. I even tried to add the
> same documents, in the same order, in the unit test but to no
> avail. It just works.
>
> What is different about my environment ? Well, I'm running
> PyLucene, but the new one, the one using a Apple's Java VM, the
> same VM I'm using to run the unit test. And I'm not doing anything
> special like calling back into Python or something, I'm just
> calling regular Lucene APIs adding documents into an IndexWriter on
> an FSDirectory using a StandardAnalyzer. If I stop using term
> vectors, all is working fine.
>
> I'd like to get to the bottom of this but could use some help. Does
> the stacktrace below ring a bell ? Is there a way to run the whole
> indexing and optimizing in one single thread ?
>
> Thanks !
>
> Andi..
>
> Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy
> $MergeException: java.io.IOException: read past EOF
> at org.apache.lucene.index.ConcurrentMergeScheduler
> $MergeThread.run(ConcurrentMergeScheduler.java:263)
> Caused by: java.io.IOException: read past EOF
> at org.apache.lucene.store.BufferedIndexInput.refill
> (BufferedIndexInput.java:146)
> at org.apache.lucene.store.BufferedIndexInput.readByte
> (BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt
> (IndexInput.java:76)
> at org.apache.lucene.index.TermVectorsReader.get
> (TermVectorsReader.java:207)
> at org.apache.lucene.index.SegmentReader.getTermFreqVectors
> (SegmentReader.java:692)
> at org.apache.lucene.index.SegmentMerger.mergeVectors
> (SegmentMerger.java:279)
> at org.apache.lucene.index.SegmentMerger.merge
> (SegmentMerger.java:122)
> at org.apache.lucene.index.IndexWriter.mergeMiddle
> (IndexWriter.java:2898)
> at org.apache.lucene.index.IndexWriter.merge
> (IndexWriter.java:2647)
> at org.apache.lucene.index.ConcurrentMergeScheduler
> $MergeThread.run(ConcurrentMergeScheduler.java:232)
> java.io.IOException: background merge hit exception: _5u:c372
> _5v:c5 into _5w [optimize]
> at org.apache.lucene.index.IndexWriter.optimize
> (IndexWriter.java:1621)
> at org.apache.lucene.index.IndexWriter.optimize
> (IndexWriter.java:1571)
> Caused by: java.io.IOException: read past EOF
> at org.apache.lucene.store.BufferedIndexInput.refill
> (BufferedIndexInput.java:146)
> at org.apache.lucene.store.BufferedIndexInput.readByte
> (BufferedIndexInput.java:38)
> at org.apache.lucene.store.IndexInput.readVInt
> (IndexInput.java:76)
> at org.apache.lucene.index.TermVectorsReader.get
> (TermVectorsReader.java:207)
> at org.apache.lucene.index.SegmentReader.getTermFreqVectors
> (SegmentReader.java:692)
> at org.apache.lucene.index.SegmentMerger.mergeVectors
> (SegmentMerger.java:279)
> at org.apache.lucene.index.SegmentMerger.merge
> (SegmentMerger.java:122)
> at org.apache.lucene.index.IndexWriter.mergeMiddle
> (IndexWriter.java:2898)
> at org.apache.lucene.index.IndexWriter.merge
> (IndexWriter.java:2647)
> at org.apache.lucene.index.ConcurrentMergeScheduler
> $MergeThread.run(ConcurrentMergeScheduler.java:232)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



Re: possible bug with indexing with term vectors
You are right Grant -- good catch!!! I have a unit test showing it
now. Thank you :)

So, this case is tickled if you have a doc (or docs) with some fields
that have term vectors enabled, but then, later in the same buffered
set of docs, you have one or more docs that have no fields with term
vectors enabled.

I'll fix it.

The thing is, from Andi's description I'm not sure this is the case
he's hitting? He said all docs have 5 fields, one of them with term
vectors enabled ... hmmm.

Mike

On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll" <grant.ingersoll@gmail.com> said:
> Hmmm, not sure, but in looking at DocumentsWriter, it seems like
> lines around 553 might be at issue:
> if (tvx != null) {
> tvx.writeLong(tvd.getFilePointer());
> if (numVectorFields > 0) {
> tvd.writeVInt(numVectorFields);
> for(int i=0;i<numVectorFields;i++)
> tvd.writeVInt(vectorFieldNumbers[i]);
> assert 0 == vectorFieldPointers[0];
> tvd.writeVLong(tvf.getFilePointer());
> long lastPos = vectorFieldPointers[0];
> for(int i=1;i<numVectorFields;i++) {
> long pos = vectorFieldPointers[i];
> tvd.writeVLong(pos-lastPos);
> lastPos = pos;
> }
> tvfLocal.writeTo(tvf);
> tvfLocal.reset();
> }
> }
>
> Specifically, the exception being thrown seems to be that it is
> trying to read in a vInt that contains the number of fields that have
> vectors. However, in DocumentsWriter, it only writes out this vInt
> if the numVectorFields is > 0.
>
> I think you might try:
> if (numVectorFields > 0){
> ....
> }
> else{
> tvd.writeVInt(0)
> }
>
> In the old TermVectorsWriter, it used to be:
> private void writeDoc() throws IOException {
> if (isFieldOpen())
> throw new IllegalStateException("Field is still open while
> writing document");
> //System.out.println("Writing doc pointer: " + currentDocPointer);
> // write document index record
> tvx.writeLong(currentDocPointer);
>
> // write document data record
> final int size = fields.size();
>
> // write the number of fields
> tvd.writeVInt(size);
>
> // write field numbers
> for (int i = 0; i < size; i++) {
> TVField field = (TVField) fields.elementAt(i);
> tvd.writeVInt(field.number);
> }
>
> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/
> org/apache/lucene/index/TermVectorsWriter.java?view=markup
>
>
>
> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
>
> >
> > On Fri, 28 Sep 2007, Andi Vajda wrote:
> >
> >> I found a bug with indexing documents that contain fields with
> >> Term Vectors. The indexing fails with 'reading past EOF' errors in
> >> what seems the index optimizing phase during addIndexes(). (I
> >> index first into a RAMDirectory, then addIndexes() into an
> >> FSDIrectory).
> >>
> >> I have not filed the bug yet formally as I need to isolate the
> >> code. If I turn indexing with term vectors off, indexing completes
> >> fine.
> >
> > I tried all morning to isolate the problem but I seem to be unable
> > to reproduce it in a simple unit test. In my application, I've been
> > able to get errors by doing even less: just creating a FSDirectory
> > and adding documents with fields with term vectors fails when
> > optimizing the index with the error below. I even tried to add the
> > same documents, in the same order, in the unit test but to no
> > avail. It just works.
> >
> > What is different about my environment ? Well, I'm running
> > PyLucene, but the new one, the one using a Apple's Java VM, the
> > same VM I'm using to run the unit test. And I'm not doing anything
> > special like calling back into Python or something, I'm just
> > calling regular Lucene APIs adding documents into an IndexWriter on
> > an FSDirectory using a StandardAnalyzer. If I stop using term
> > vectors, all is working fine.
> >
> > I'd like to get to the bottom of this but could use some help. Does
> > the stacktrace below ring a bell ? Is there a way to run the whole
> > indexing and optimizing in one single thread ?
> >
> > Thanks !
> >
> > Andi..
> >
> > Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy
> > $MergeException: java.io.IOException: read past EOF
> > at org.apache.lucene.index.ConcurrentMergeScheduler
> > $MergeThread.run(ConcurrentMergeScheduler.java:263)
> > Caused by: java.io.IOException: read past EOF
> > at org.apache.lucene.store.BufferedIndexInput.refill
> > (BufferedIndexInput.java:146)
> > at org.apache.lucene.store.BufferedIndexInput.readByte
> > (BufferedIndexInput.java:38)
> > at org.apache.lucene.store.IndexInput.readVInt
> > (IndexInput.java:76)
> > at org.apache.lucene.index.TermVectorsReader.get
> > (TermVectorsReader.java:207)
> > at org.apache.lucene.index.SegmentReader.getTermFreqVectors
> > (SegmentReader.java:692)
> > at org.apache.lucene.index.SegmentMerger.mergeVectors
> > (SegmentMerger.java:279)
> > at org.apache.lucene.index.SegmentMerger.merge
> > (SegmentMerger.java:122)
> > at org.apache.lucene.index.IndexWriter.mergeMiddle
> > (IndexWriter.java:2898)
> > at org.apache.lucene.index.IndexWriter.merge
> > (IndexWriter.java:2647)
> > at org.apache.lucene.index.ConcurrentMergeScheduler
> > $MergeThread.run(ConcurrentMergeScheduler.java:232)
> > java.io.IOException: background merge hit exception: _5u:c372
> > _5v:c5 into _5w [optimize]
> > at org.apache.lucene.index.IndexWriter.optimize
> > (IndexWriter.java:1621)
> > at org.apache.lucene.index.IndexWriter.optimize
> > (IndexWriter.java:1571)
> > Caused by: java.io.IOException: read past EOF
> > at org.apache.lucene.store.BufferedIndexInput.refill
> > (BufferedIndexInput.java:146)
> > at org.apache.lucene.store.BufferedIndexInput.readByte
> > (BufferedIndexInput.java:38)
> > at org.apache.lucene.store.IndexInput.readVInt
> > (IndexInput.java:76)
> > at org.apache.lucene.index.TermVectorsReader.get
> > (TermVectorsReader.java:207)
> > at org.apache.lucene.index.SegmentReader.getTermFreqVectors
> > (SegmentReader.java:692)
> > at org.apache.lucene.index.SegmentMerger.mergeVectors
> > (SegmentMerger.java:279)
> > at org.apache.lucene.index.SegmentMerger.merge
> > (SegmentMerger.java:122)
> > at org.apache.lucene.index.IndexWriter.mergeMiddle
> > (IndexWriter.java:2898)
> > at org.apache.lucene.index.IndexWriter.merge
> > (IndexWriter.java:2647)
> > at org.apache.lucene.index.ConcurrentMergeScheduler
> > $MergeThread.run(ConcurrentMergeScheduler.java:232)
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
>

Re: possible bug with indexing with term vectors
There are a couple of JIRA issues related to TVs as well, mostly edge
cases, but Andi might want to take a look at them to see if they
describe his situation.

-Grant

On Sep 29, 2007, at 8:35 AM, Michael McCandless wrote:

>
> You are right Grant -- good catch!!! I have a unit test showing it
> now. Thank you :)
>
> So, this case is tickled if you have a doc (or docs) that have some
> fields with term vectors enabled, but then later as part of the same
> buffered set of docs you have 1 or more docs that have no fields with
> term vectors enabled.
>
> I'll fix it.
>
> The thing is, from Andi's description I'm not sure this is the case
> he's hitting? He said all docs have 5 fields, one of them with term
> vectors enabled ... hmmm.
>
> Mike
>
> On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll"
> <grant.ingersoll@gmail.com> said:
>> Hmmm, not sure, but in looking at DocumentsWriter, it seems like
>> lines around 553 might be at issue:
>> if (tvx != null) {
>>   tvx.writeLong(tvd.getFilePointer());
>>   if (numVectorFields > 0) {
>>     tvd.writeVInt(numVectorFields);
>>     for (int i = 0; i < numVectorFields; i++)
>>       tvd.writeVInt(vectorFieldNumbers[i]);
>>     assert 0 == vectorFieldPointers[0];
>>     tvd.writeVLong(tvf.getFilePointer());
>>     long lastPos = vectorFieldPointers[0];
>>     for (int i = 1; i < numVectorFields; i++) {
>>       long pos = vectorFieldPointers[i];
>>       tvd.writeVLong(pos - lastPos);
>>       lastPos = pos;
>>     }
>>     tvfLocal.writeTo(tvf);
>>     tvfLocal.reset();
>>   }
>> }
>>
>> Specifically, the exception seems to come from reading a vInt that
>> holds the number of fields with term vectors. However, DocumentsWriter
>> only writes out that vInt when numVectorFields is > 0.
>>
>> I think you might try:
>> if (numVectorFields > 0) {
>>   ....
>> } else {
>>   tvd.writeVInt(0);
>> }
>>
>> In the old TermVectorsWriter, it used to be:
>> private void writeDoc() throws IOException {
>>   if (isFieldOpen())
>>     throw new IllegalStateException("Field is still open while writing document");
>>   //System.out.println("Writing doc pointer: " + currentDocPointer);
>>   // write document index record
>>   tvx.writeLong(currentDocPointer);
>>
>>   // write document data record
>>   final int size = fields.size();
>>
>>   // write the number of fields
>>   tvd.writeVInt(size);
>>
>>   // write field numbers
>>   for (int i = 0; i < size; i++) {
>>     TVField field = (TVField) fields.elementAt(i);
>>     tvd.writeVInt(field.number);
>>   }
>>
>> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
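[Editor's note] The mismatch Grant describes (a writer that emits the per-document field-count vInt only when there is something to write, while the reader always expects one count per document) can be modeled in a few lines of plain Java. This is a toy sketch of the vInt framing only, not Lucene's actual file format or code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class VIntMismatchDemo {

    // Lucene-style variable-length int: 7 bits per byte,
    // high bit set means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static int readVInt(ByteArrayInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) throw new EOFException("read past EOF");
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            if (b == -1) throw new EOFException("read past EOF");
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    public static void main(String[] args) throws IOException {
        // Buggy writer: emits the field count only when a doc has vector fields.
        ByteArrayOutputStream tvd = new ByteArrayOutputStream();
        writeVInt(tvd, 2);   // doc 0: two fields with term vectors
        /* doc 1: zero vector fields -> nothing written at all */

        // Reader: always expects one count per document.
        ByteArrayInputStream in = new ByteArrayInputStream(tvd.toByteArray());
        System.out.println("doc 0 field count: " + readVInt(in)); // -> 2
        try {
            readVInt(in);    // doc 1: no bytes left
        } catch (EOFException e) {
            System.out.println("doc 1: " + e.getMessage()); // -> read past EOF
        }

        // Grant's proposed fix: always write the count, even when it is zero.
        ByteArrayOutputStream fixedTvd = new ByteArrayOutputStream();
        writeVInt(fixedTvd, 2);
        writeVInt(fixedTvd, 0);
        ByteArrayInputStream in2 = new ByteArrayInputStream(fixedTvd.toByteArray());
        System.out.println("fixed: doc 0 = " + readVInt(in2)
                + ", doc 1 = " + readVInt(in2)); // -> 2 and 0
    }
}
```

Always writing the count keeps the reader in sync, which matches what the old TermVectorsWriter did with its unconditional tvd.writeVInt(size).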
>>
>>
>>
>> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
>>
>>>
>>> On Fri, 28 Sep 2007, Andi Vajda wrote:
>>>
>>>> I found a bug with indexing documents that contain fields with
>>>> Term Vectors. The indexing fails with 'reading past EOF' errors in
>>>> what seems the index optimizing phase during addIndexes(). (I
>>>> index first into a RAMDirectory, then addIndexes() into an
>>>> FSDirectory).
>>>>
>>>> I have not filed the bug yet formally as I need to isolate the
>>>> code. If I turn indexing with term vectors off, indexing completes
>>>> fine.
>>>
>>> I tried all morning to isolate the problem but I seem to be unable
>>> to reproduce it in a simple unit test. In my application, I've been
>>> able to get errors by doing even less: just creating a FSDirectory
>>> and adding documents with fields with term vectors fails when
>>> optimizing the index with the error below. I even tried to add the
>>> same documents, in the same order, in the unit test but to no
>>> avail. It just works.
>>>
>>> What is different about my environment ? Well, I'm running
>>> PyLucene, but the new one, the one using Apple's Java VM, the
>>> same VM I'm using to run the unit test. And I'm not doing anything
>>> special like calling back into Python or something, I'm just
>>> calling regular Lucene APIs adding documents into an IndexWriter on
>>> an FSDirectory using a StandardAnalyzer. If I stop using term
>>> vectors, all is working fine.
>>>
>>> I'd like to get to the bottom of this but could use some help. Does
>>> the stacktrace below ring a bell ? Is there a way to run the whole
>>> indexing and optimizing in one single thread ?
>>>
>>> Thanks !
>>>
>>> Andi..
>>>
>>
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



Re: possible bug with indexing with term vectors
On Sat, 29 Sep 2007, Michael McCandless wrote:

>> The new PyLucene is built with a code generator and all public APIs and
>> classes are made available to Python. SerialMergeScheduler is available.
>
> Wild! Does this mean PyLucene will track tightly to Lucene releases
> going forward?

Yes, even more tightly than before since I don't have to patch the Lucene
sources anymore.
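[Editor's note] Since SerialMergeScheduler is exposed, the single-thread question from the original post can be answered by routing merges through the indexing thread, so any merge exception surfaces synchronously instead of in a background thread. A rough sketch, assuming the 2.2/2.3-era trunk API; the index path is illustrative, not from the thread:

```java
// Sketch only: run indexing, merging, and optimizing in a single thread
// by swapping in SerialMergeScheduler.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;

public class SingleThreadedIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/test-index"),  // hypothetical path
            new StandardAnalyzer(), /*create=*/ true);
        writer.setMergeScheduler(new SerialMergeScheduler()); // merges now run inline
        // ... writer.addDocument(...) for each document ...
        writer.optimize();  // any "read past EOF" now surfaces on this thread
        writer.close();
    }
}
```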

> What happened prior to this first optimize call? Did you just create
> the writer, switch to SerialMergeScheduler, add N docs, then call
> setInfoStream(...) and writer.optimize()?

Yes, that's almost exactly it. I create a new writer (with create=true), then
close it and its directory. Then I reopen it and add N docs.

> The debug output starts with an optimize() call, which first flushes
> 372 docs to segment _7f; this is the first segment in the index. Had
> you opened this writer with create=true?

I open the writer with create=true when the app creates its initial repository.
After that, the writer is reopened without create=true and added to.

> This optimize() does nothing because the index has only one segment
> (_7f) in compound file format, so it's already optimized. Then the
> writer is closed.
>
> Then this is printed:
>
> <DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
>
> Which is odd because 191 != 372. Can't explain that difference...

That's because an item can have several attributes that get indexed, each
becoming a Lucene document (an item is a Chandler object).

> Then another index writer is opened, 5 docs are added, then optimize()
> is called, which flushes 5 docs to segment _7g and converts it to
> compound file format.
>
> Finally we try to merge _7f and _7g for optimize, and we hit the EOF
> exception trying to read the term vector for a doc from one of these
> two segments.

Ok, this could explain why the test is passing. In the test I only do one
batch of indexing, not several like here. I missed that difference. My
apologies. I'm going to change my test now and report back...

Thank you for the explanations.

Andi..

Re: possible bug with indexing with term vectors
On Sat, 29 Sep 2007, Andi Vajda wrote:

> Ok, this could explain why the test is passing. In the test I only do one
> batch of indexing, not several like here. I missed that difference. My
> apologies. I'm going to change my test now and report back...

I finally isolated the bug into a simple Java unit test. It indeed had to do
with doing multiple batches of document additions.

Michael, I'm going to send you the test and the data privately, as it contains
data that probably shouldn't end up on JIRA, such as personal email snippets of
mine.

Andi..


Re: possible bug with indexing with term vectors
On Sat, 29 Sep 2007, Andi Vajda wrote:

> I finally isolated the bug into a simple Java unit test. It indeed had to do
> with doing multiple batches of document additions.

I forgot to say that I ran this with the most recent fixes for bug 1008.
To be precise, Lucene svn rev 580605.

Andi..
