Mailing List Archive

Indexing with Semantics
I'm using Lucene's term frequency vectors to calculate cosine similarity between
documents. Say my documents contain these three terms: "owe", "owed", "owing". Lucene
treats these as three separate terms, but all three mean the same thing, "owe". Is there
any functionality in Lucene that can index by semantics, so that
it indexes "owe", "owed", and "owing" as the single term "owe" with term frequency = 3?

If not, I'd welcome any suggestions for achieving this.

--
Regards

Kasun Perera
Re: Indexing with Semantics
Use a stemmer.
"Semantics" is a big word; be careful how you use it.
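
To illustrate the stemming suggestion, here is a minimal sketch in plain Java with no Lucene dependency. The class name StemTermFrequency and the stem() rules are hypothetical and only cover the example from the question; this is a toy suffix stripper, not the Porter algorithm. In Lucene itself the same effect is normally obtained by putting a stemming token filter such as PorterStemFilter into the analysis chain, or by using an analyzer that includes one, such as EnglishAnalyzer; check the Javadoc of your Lucene version for the exact package.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class StemTermFrequency {

    // Toy suffix stripper (an assumption for illustration, NOT the Porter algorithm).
    // It removes "ing" or "ed", then a trailing "e".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ing")) {
            w = w.substring(0, w.length() - 3);
        } else if (w.length() > 3 && w.endsWith("ed")) {
            w = w.substring(0, w.length() - 2);
        }
        if (w.length() > 2 && w.endsWith("e")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Counts term frequencies after stemming, so that all variants
    // of a word are accumulated under one key.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            tf.merge(stem(token), 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // prints {ow=3}
        System.out.println(termFrequencies("owe owed owing"));
    }
}
```

Note that the merged key is the stem "ow" rather than the dictionary word "owe". A stemmer only guarantees that related forms collapse to the same string, which is all the term frequency vector needs.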

On Sat, Apr 28, 2012 at 11:02 AM, Kasun Perera <kasunp@opensource.lk> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Indexing with Semantics
Hi,
What you are looking for is lemmatization: http://en.wikipedia.org/wiki/Lemmatisation
I don't think Lucene has a built-in lemmatizer, but you can use GATE, which is an open source project:
http://gate.ac.uk
http://gate.ac.uk/gate/doc/plugins.html

Enjoy!
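
For contrast with stemming, lemmatization can be sketched as a dictionary lookup that maps each inflected form to its dictionary headword. The snippet below is a minimal plain-Java illustration under stated assumptions: the class name LemmaLookup and the tiny hand-written table are hypothetical, and this is not the GATE API. A real lemmatizer ships a large lexicon and usually takes part of speech into account.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class LemmaLookup {

    // Hypothetical hand-written table, for illustration only.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("owed", "owe");
        LEMMAS.put("owing", "owe");
        LEMMAS.put("owes", "owe");
        LEMMAS.put("went", "go");   // irregular form that suffix stripping cannot handle
        LEMMAS.put("gone", "go");
    }

    // Returns the lemma if the word is in the table, otherwise the lowercased word.
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(w, w);
    }

    // Counts term frequencies keyed by lemma.
    static Map<String, Integer> lemmaFrequencies(String text) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            tf.merge(lemma(token), 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // prints {owe=3}
        System.out.println(lemmaFrequencies("owe owed owing"));
    }
}
```

Unlike the stemming approach, the merged key here is the real word "owe", and irregular forms such as "went" can be mapped as well, at the cost of needing a lexicon.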



-----Original Message-----
From: Kasun Perera [mailto:kasunp@opensource.lk]
Sent: Saturday, April 28, 2012 6:03 AM
To: java-user@lucene.apache.org
Subject: Indexing with Semantics