Mailing List Archive: Fast way to get the start of document

Fast way to get the start of document

Jun 22, 2012, 12:23 PM

Post #1 of 5 (1091 views)

Our hit highlighting (using the older Highlighter) is wired with a "too huge" limit so that we can skip the multi-million-character files. This is not just highlighter.setMaxDocCharsToAnalyze: if a document is really above the too-huge limit, we don't even try, and just produce a fragment from the front of the document. This results in almost reasonable response times, even for result sets of crazy huge documents (or ones with just 1 huge doc). I think this is all pretty normal. Tell me if I'm wrong.

Given the above, while timing what was going on, I realized that in the skip-highlighting case I was reading in the entire body of the text just to grab the first 100 or so characters.
I was doing:

String text = fieldable.stringValue(); // Oh my!

Is there a way to _not_ read in the whole multi-million characters and only _start_ reading the contents of a large field? See the code below, which got me no better results.
Some details:

1. Using Lucene 3.4

2. Storing the (Tika) parse text of documents

a. These are human-produced documents (PDF, Word, etc.), often 10K characters, sometimes 100Ks, but very occasionally a few million.

3. At this time, we store positions, but not offsets.

4. We are using the old Highlighter, not the FastVectorHighlighter (because of #3 above).

5. A basic search result is a page of 10 documents, each with a short "blurb" (one fragment that shows a good hit).

I would be willing to live with a token stream to generate the intro blurb, but the following code, when run on the too-large code path (forget the highlighting), can add 0.5 seconds (compared to not reading anything, which is not a solution, just a comparison).
So here is my code:
    Fieldable textFld = doc.getFieldable(TEXT);
    if (fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT) {
        // Normal path: let the Highlighter pick the best fragment.
        blurb = highlightBlurb(scoreDoc, document, textFld, workingBlurbLen);
    } else {
        // Too-huge path: skip highlighting and build the blurb from the leading tokens.
        logger.debug("----------- didn't call highlighter textLength = " + fullTextLength);
        TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, scoreDoc.doc, TEXT, document, analyzer);
        OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class); // not used below
        CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder blurbB = new StringBuilder();
        while (tokenStream.incrementToken() && blurbB.length() < workingBlurbLen) {
            blurbB.append(charTerm.toString());
            blurbB.append(" ");
        }
        blurb = blurbB.toString();
    }
What could I do in the else branch that is faster? Is not having offsets affecting this code path?
While you're answering the above, I will be running some stats to show management why we SHOULD store offsets, so that we can use FastVectorHighlighter, but I'm afraid I might still want the too-huge-to-highlight path.

-Paul

Re: Fast way to get the start of document [ In reply to ]

jack at basetechnology

Jun 23, 2012, 3:17 PM

Post #2 of 5 (1071 views)

Simply have two fields, "full_body" and "limited_body". The former would
index but not store the full document text from Tika (the "content"
metadata.) The latter would store but not necessarily index the first 10K or
so characters of the full text. Do searches on the full body field and
highlighting on the limited body field.
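
A minimal sketch of how the text for the "limited_body" field might be derived at indexing time, using only the JDK. The class name BodySplitter, the method limitedBody, and the 10,000-character default are illustrative assumptions, not part of Lucene or Tika. In Lucene 3.x the full text would then presumably be added with Field.Store.NO and the truncated text with Field.Store.YES.

```java
public final class BodySplitter {
    /** Hypothetical default, taken from the "first 10K or so characters" suggestion. */
    public static final int DEFAULT_LIMIT = 10_000;

    private BodySplitter() {}

    /**
     * Returns at most maxChars leading characters of fullText, cut back to the
     * last whitespace inside that window so that a word is not split in half.
     */
    public static String limitedBody(String fullText, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        if (fullText.length() <= maxChars) {
            return fullText; // small document: both fields carry the same text
        }
        int cut = maxChars;
        // Walk back to the nearest whitespace; keep the hard cut if none is found.
        while (cut > 0 && !Character.isWhitespace(fullText.charAt(cut))) {
            cut--;
        }
        if (cut == 0) {
            cut = maxChars;
        }
        return fullText.substring(0, cut).trim();
    }
}
```

Searches would still run against the full body; only the shorter string is ever loaded when building the blurb.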

-- Jack Krupansky

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Fast way to get the start of document [ In reply to ]

sokolov at ifactory

Jun 23, 2012, 7:16 PM

Post #3 of 5 (1068 views)

I got the sense from Paul's post that he wanted a solution that didn't
require changing his index, although I'm not sure there is one. Paul, if
you're willing to re-index, you could also store the length of the text
as a numeric field, retrieve that, and use it to drive the decision about
whether to highlight.
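
One way to picture the length-driven decision, as a JDK-only sketch. The class BlurbPolicy, the enum BlurbStrategy, and the method choose are hypothetical names; in a real index the length would be read back from the stored numeric field rather than passed in by hand.

```java
public final class BlurbPolicy {
    /** What to do for one search hit. */
    public enum BlurbStrategy { HIGHLIGHT, INTRO_ONLY }

    private final long tooHugeLimit;

    public BlurbPolicy(long tooHugeLimit) {
        if (tooHugeLimit < 0) {
            throw new IllegalArgumentException("tooHugeLimit must be >= 0");
        }
        this.tooHugeLimit = tooHugeLimit;
    }

    /**
     * storedTextLength is the value of the stored length field; the large
     * body field itself does not need to be loaded to make this choice.
     */
    public BlurbStrategy choose(long storedTextLength) {
        return storedTextLength <= tooHugeLimit
                ? BlurbStrategy.HIGHLIGHT
                : BlurbStrategy.INTRO_ONLY;
    }
}
```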

-Mike Sokolov

RE: Fast way to get the start of document [ In reply to ]

paul at metajure

Jun 25, 2012, 10:03 AM

Post #4 of 5 (1066 views)

Mike and Jack,

Thanks for the suggestions.

As Mike suggested, I already have the pre-stored length field.
I do NOT read in the whole doc just to make the "too huge" decision, but I DO read it to _obtain_ the trivial intro fragment that stands in for an excellent highlighted fragment. I wanted a (memory-saving) stream, so that I could read just a little (the first buffer).
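
To illustrate what reading "just a little" from a stream would look like, here is a minimal JDK-only sketch. It assumes the text is available as a java.io.Reader (for example at indexing time); it does not show a Lucene stored-field API, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;

public final class LeadingText {
    private LeadingText() {}

    /** Reads at most maxChars characters from the front of the reader, then stops. */
    public static String readStart(Reader reader, int maxChars) throws IOException {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        char[] buf = new char[maxChars];
        int filled = 0;
        // read() may return fewer chars than requested, so loop until full or EOF.
        while (filled < maxChars) {
            int n = reader.read(buf, filled, maxChars - filled);
            if (n < 0) {
                break; // end of stream before the buffer filled up
            }
            filled += n;
        }
        return new String(buf, 0, filled);
    }
}
```

Memory use is bounded by maxChars no matter how long the underlying text is, which is the property stringValue() lacks.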

I am willing to change the index, so one solution is to store not an additional "reasonable_body_for_highlight_fragment_generation" field, but a smaller "just the 1st page" field, only for too-huge documents, which I would use only when I want the intro fragment (possibly with no highlights).
But Jack's suggestion makes me think I should consider adding as many initial pages as I can get away with for too-huge documents; then I might just luck out and find a decent highlightable section.
(Adding 10 pages only to the 1-in-5000 document that is too huge doesn't seem like much overhead for an index.)

Our choice is that we'd like to hit-highlight docs that are as huge as possible, because we work with customers who tend to be verbose, very verbose on occasion, and who would love to find the perfect quote in Appendix Q of a 95-page report (but maybe I need to have a talk with product management about this). It is a tradeoff: we can try to educate the users, telling them we are sorry that their query is slow, but that if they want a faster response they should try using less common words and a few more of them, and then they won't run into their too-huge documents unless they really want to see them.

So is there NO way to read the "all_text" field and only read _the_start_ of it?
Otherwise, I'm thinking I'll go with an extra 1st page field for the too-huge documents.
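
The "extra first-page field only for too-huge documents" idea could be sketched at indexing time roughly as follows, using only the JDK. The names IntroField and introFor and both numeric parameters are assumptions for illustration; an empty result means no extra field would be added for that document.

```java
import java.util.Optional;

public final class IntroField {
    private IntroField() {}

    /**
     * Returns the leading introChars characters when the document exceeds
     * tooHugeLimit, or Optional.empty() when the document is small enough
     * to be highlighted from its full stored text.
     */
    public static Optional<String> introFor(String fullText, int tooHugeLimit, int introChars) {
        if (tooHugeLimit < 0 || introChars < 0) {
            throw new IllegalArgumentException("limits must be >= 0");
        }
        if (fullText.length() <= tooHugeLimit) {
            return Optional.empty();
        }
        int end = Math.min(introChars, fullText.length());
        return Optional.of(fullText.substring(0, end));
    }
}
```

Raising introChars to cover several pages corresponds to the "as many initial pages as I can get away with" variant.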

-Paul

Re: Fast way to get the start of document [ In reply to ]

sokolov at ifactory

Jun 25, 2012, 10:16 AM

Post #5 of 5 (1063 views)

I should also mention FastVectorHighlighter - are you using that? I
believe it would find a highlight at the end of a huge document much
faster. It would still read the whole doc into memory, but wouldn't
have to analyze it. There are also some limiting parameters there which
prevent blowups for very large docs (hl.phraseLimit; see LUCENE-3234)

-Mike
