I am currently hacking BooleanScorer2 so that the docID() of each leaf
scorer (most likely a TermScorer) stays the same as that of the top-level
Scorer. The reason is this: when I collect a doc, I want to know which terms
matched (especially for a BooleanClause whose Occur is SHOULD). We have
discussed some solutions, such as adding bit masks to the disjunction
scorers. With that method, when we find a matching doc, we can recursively
find which leaf scorers matched. But we think it is neither very efficient
nor convenient to use (this was my proposal, but the others on our team did
not agree with it). So we came up with another one: modifying
DisjunctionSumScorer.

We analysed the code and found that the only Scorer used by BooleanScorer2
that makes its child scorers' docID() differ from the parent's is an
anonymous class derived from DisjunctionSumScorer. All the others, including
SingleMatchScorer, countingConjunctionSumScorer (anonymous),
dualConjunctionSumScorer, ReqOptSumScorer and ReqExclScorer, fit our need.

DisjunctionSumScorer uses a heap to find the smallest doc. After it finds a
matching doc, currentDoc is that doc, and nextDoc() has already been called
on every scorer in the heap that matched it, so those scorers' current docID
is the doc after currentDoc. If DisjunctionSumScorers are nested N levels
deep, a leaf scorer's current doc can be the N-th next docID after the
current doc of the root of the scorer tree.
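
To make the effect concrete, here is a minimal sketch that does not use
Lucene at all. Leaf and EagerDisjunction are made-up names: Leaf is a toy
array-backed stand-in for a TermScorer, and EagerDisjunction only imitates
the heap-based "move matching children past the current doc" idea described
above, under the assumption that this is what the real scorer does.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class ChildAheadDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a TermScorer: walks a sorted array of doc IDs. */
    static class Leaf {
        final int[] docs;
        int pos = -1;

        Leaf(int... docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() { pos++; return docID(); }
    }

    /** Toy disjunction: reports the smallest doc, then moves matching leaves past it. */
    static class EagerDisjunction {
        final PriorityQueue<Leaf> heap =
            new PriorityQueue<Leaf>(Comparator.comparingInt(Leaf::docID));
        int currentDoc = -1;

        EagerDisjunction(Leaf... leaves) {
            // Position every leaf on its first doc before it enters the heap.
            for (Leaf leaf : leaves) {
                if (leaf.nextDoc() != NO_MORE_DOCS) heap.add(leaf);
            }
        }

        int docID() { return currentDoc; }

        int nextDoc() {
            if (heap.isEmpty()) return currentDoc = NO_MORE_DOCS;
            currentDoc = heap.peek().docID();
            // Every leaf sitting on currentDoc is advanced, so it ends up ahead of the parent.
            while (!heap.isEmpty() && heap.peek().docID() == currentDoc) {
                Leaf top = heap.poll();
                if (top.nextDoc() != NO_MORE_DOCS) heap.add(top);
            }
            return currentDoc;
        }
    }

    public static void main(String[] args) {
        Leaf a = new Leaf(1, 2, 6);
        Leaf b = new Leaf(2, 4);
        EagerDisjunction d = new EagerDisjunction(a, b);

        System.out.println(d.nextDoc() + " " + a.docID() + " " + b.docID()); // 1 2 2
        System.out.println(d.nextDoc() + " " + a.docID() + " " + b.docID()); // 2 6 4
    }
}
```

In both printed lines the parent reports a doc that leaf a, which matched
it, has already left behind. That is why a collector cannot simply ask the
leaves for their docID() to learn which of them matched.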

So we modified DisjunctionSumScorer to behave as we expected. I then wrote
some test cases and it works well. I also generated random TermScorers and
compared the nextDoc(), score() and advance(int) methods of the original
DisjunctionSumScorer with those of the modified one. nextDoc() and score()
are exactly the same. But for advance(int target), we found some
interesting and strange things.

At first I assumed that if target is less than the current docID, advance
would just return the current docID and do nothing. That assumption made my
algorithm go wrong. Then I read the code of TermScorer and found that every
call to TermScorer.advance(int) will call nextDoc(), no matter whether the
current docID is already larger than target or not.

So I was confused, and then read the javadoc of DocIdSetIterator:

----------- javadoc of DocIdSetIterator.advance(int target) -----------
int org.apache.lucene.search.DocIdSetIterator.advance(int target) throws IOException

Advances to the first beyond (see NOTE below) the current whose document
number is greater than or equal to target. Returns the current document
number or NO_MORE_DOCS if there are no more docs in the set.

Behaves as if written:

    int advance(int target) {
      int doc;
      while ((doc = nextDoc()) < target) {
      }
      return doc;
    }

Some implementations are considerably more efficient than that.

NOTE: when target < current implementations may opt not to advance beyond
their current docID().

NOTE: this method may be called with NO_MORE_DOCS for efficiency by some
Scorers. If your implementation cannot efficiently determine that it should
exhaust, it is recommended that you check for that value in each call to
this method.

NOTE: after the iterator has exhausted you should not call this method, as
it may result in unpredicted behavior.
--------------------------------------
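
The "behaves as if written" loop is easy to try outside Lucene. The sketch
below is only an illustration: ArrayIterator is a made-up array-backed
iterator, not a Lucene class, and its advance is a literal copy of the loop
from the javadoc. It shows that this reference version always moves at least
one doc forward, even when target is not beyond the current doc.

```java
class ReferenceAdvanceDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical iterator over a sorted array of doc IDs. */
    static class ArrayIterator {
        private final int[] docs;
        private int pos = -1;

        ArrayIterator(int... docs) { this.docs = docs; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() { pos++; return docID(); }

        /** Literal copy of the loop in the javadoc: nextDoc() runs at least once. */
        int advance(int target) {
            int doc;
            while ((doc = nextDoc()) < target) {
            }
            return doc;
        }
    }

    public static void main(String[] args) {
        ArrayIterator it = new ArrayIterator(1, 2, 6);
        System.out.println(it.advance(2));                 // 2
        System.out.println(it.advance(2));                 // 6: target == current, still moves on
        System.out.println(it.advance(0) == NO_MORE_DOCS); // true: target < current, moves again
    }
}
```

An implementation that uses the first NOTE and stays on its current doc
would return 2 from the second call instead of 6. Both behaviors are allowed
by the javadoc, so a caller cannot rely on either one.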

Then I modified my algorithm again and found that
DisjunctionSumScorer.advance(int target) has some strange behavior. In most
cases it returns currentDoc if target <= currentDoc, but in some boundary
conditions it does not. It's not a bug, but it made me sad: I thought my
algorithm had a bug because its advance method did not behave exactly the
same as the original DisjunctionSumScorer's.

---- code of DisjunctionSumScorer ----
@Override
public int advance(int target) throws IOException {
  if (scorerDocQueue.size() < minimumNrMatchers) {
    return currentDoc = NO_MORE_DOCS;
  }
  if (target <= currentDoc) {
    return currentDoc;
  }
  ....
-------------------

In most cases, if target <= currentDoc, it returns currentDoc. But if the
previous advance exhausted the sub-scorers, it may return NO_MORE_DOCS
instead.

An example:

currentDoc=-1
minimumNrMatchers=1
subScorers:
  TermScorer: docIds: [1, 2, 6]
  TermScorer: docIds: [2, 4]

After the first call, advance(5):
  currentDoc=6
  the second scorer is exhausted, and the first one has already been moved
  past 6 and is exhausted too, so the heap is empty:
  scorerDocQueue.size()==0

Then call advance(6):
  because scorerDocQueue.size() < minimumNrMatchers, it just returns
  NO_MORE_DOCS
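
The example can be reproduced with a small toy model. This is a sketch, not
the Lucene source: ToyIterator and ToyDisjunction are made-up classes, only
the first two checks in ToyDisjunction.advance are taken from the snippet
quoted above, and the rest of the method (the part elided with "....") is my
own assumption about how the heap is processed.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a TermScorer: walks a sorted array of doc IDs. */
class ToyIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    ToyIterator(int... docs) { this.docs = docs; }

    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int nextDoc() { pos++; return docID(); }

    int advance(int target) {
        int doc;
        while ((doc = nextDoc()) < target) { }
        return doc;
    }
}

/** Toy disjunction; only the first two checks in advance() mirror the quoted code. */
class ToyDisjunction {
    static final int NO_MORE_DOCS = ToyIterator.NO_MORE_DOCS;
    private final PriorityQueue<ToyIterator> queue =
        new PriorityQueue<ToyIterator>(Comparator.comparingInt(ToyIterator::docID));
    private final int minimumNrMatchers;
    private int currentDoc = -1;

    ToyDisjunction(int minimumNrMatchers, ToyIterator... subs) {
        if (minimumNrMatchers < 1) throw new IllegalArgumentException("need at least 1 matcher");
        this.minimumNrMatchers = minimumNrMatchers;
        for (ToyIterator s : subs) {
            if (s.nextDoc() != NO_MORE_DOCS) queue.add(s);
        }
    }

    int queueSize() { return queue.size(); }

    int advance(int target) {
        if (queue.size() < minimumNrMatchers) return currentDoc = NO_MORE_DOCS;
        if (target <= currentDoc) return currentDoc;
        while (true) { // assumed continuation of the elided part
            if (queue.peek().docID() >= target) {
                return advanceAfterCurrent() ? currentDoc : (currentDoc = NO_MORE_DOCS);
            }
            ToyIterator top = queue.poll();
            if (top.advance(target) != NO_MORE_DOCS) {
                queue.add(top);
            } else if (queue.size() < minimumNrMatchers) {
                return currentDoc = NO_MORE_DOCS;
            }
        }
    }

    /** Takes the smallest doc as current and moves every sub-iterator on it past it. */
    private boolean advanceAfterCurrent() {
        while (true) {
            currentDoc = queue.peek().docID();
            int nrMatchers = 0;
            while (!queue.isEmpty() && queue.peek().docID() == currentDoc) {
                ToyIterator top = queue.poll();
                nrMatchers++;
                if (top.nextDoc() != NO_MORE_DOCS) queue.add(top);
            }
            if (nrMatchers >= minimumNrMatchers) return true;
            if (queue.size() < minimumNrMatchers) return false;
        }
    }
}

class AdvanceBoundaryDemo {
    public static void main(String[] args) {
        ToyDisjunction d =
            new ToyDisjunction(1, new ToyIterator(1, 2, 6), new ToyIterator(2, 4));
        System.out.println(d.advance(5));   // 6
        System.out.println(d.queueSize());  // 0: both sub-iterators are exhausted
        System.out.println(d.advance(6) == ToyDisjunction.NO_MORE_DOCS); // true

        // Same call sequence, but the first iterator has one more doc left.
        ToyDisjunction e =
            new ToyDisjunction(1, new ToyIterator(1, 2, 6, 9), new ToyIterator(2, 4));
        System.out.println(e.advance(5));   // 6
        System.out.println(e.advance(6));   // 6: target <= currentDoc, nothing moves
    }
}
```

In this model the only difference between the two runs is whether anything
is left in the heap after doc 6 has been reported, because the size check
runs before the target <= currentDoc check.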

My question is: why is the advance(int target) method defined like this? Is
it for efficiency, or for some other reason?