Finding a single field's terms with Lucene (PyLucene)

I'm fairly new to Lucene's term vectors and want to make sure my term gathering is as efficient as it can be. I'm getting the unique terms and then retrieving the docFreq() of each term to perform faceting.

I'm gathering all documents' terms from the index using:

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
terms = ireader.terms() #Returns TermEnum

This works fine, but is there a way to only return terms for specific fields (across all documents) - wouldn't that be more efficient?

Such as:

 ireader.terms(Field="country")
Drury answered 1/2, 2012 at 4:59 Comment(1)
Comment from Drury: I think this may be the solution... wiki.apache.org/lucene-java/…

IndexReader.terms() accepts an optional Term object. A Term is constructed from two arguments, the field name and the value, which Lucene calls the "term field" and the "term text".

By providing a Term with an empty value for the term text, we can start the term iteration at the first term of the field we are concerned with. Terms are ordered by field and then by text, so the iteration can stop as soon as the field name changes.

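# Assumes lucene.initVM() has already been called
from lucene import SimpleFSDirectory, File, IndexReader, Term

lindex = SimpleFSDirectory(File(indexdir))
ireader = IndexReader.open(lindex, True)
# Position the enumeration at the first term of the field "field_name"
terms = ireader.terms(Term("field_name", ""))
facets = {'other': 0}
while True:
    # terms(Term) is already positioned on the first match,
    # so read the current term before calling next()
    term = terms.term()
    if term is None or term.field() != "field_name":  # We've got every value
        break
    print "Field Name:", term.field()
    print "Field Value:", term.text()
    print "Matching Docs:", terms.docFreq()
    facets[term.text()] = terms.docFreq()
    if not terms.next():
        break
terms.close()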
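
To illustrate why seeking to an empty term text works, here is a minimal pure-Python sketch that does not use Lucene at all. It models the term dictionary as a sorted list of (field, text) pairs, which mirrors how Lucene orders terms (by field, then by text). The names `TERMS`, `DOC_FREQ`, and `field_facets` are hypothetical and exist only for this illustration.

```python
from bisect import bisect_left

# Hypothetical stand-in for the index's term dictionary:
# (field, text) pairs kept in sorted order.
TERMS = sorted([
    ("city", "Paris"),
    ("country", "France"),
    ("country", "Germany"),
    ("country", "Japan"),
    ("title", "lucene"),
])

# Hypothetical document frequencies for each term.
DOC_FREQ = {
    ("city", "Paris"): 4,
    ("country", "France"): 3,
    ("country", "Germany"): 1,
    ("country", "Japan"): 2,
    ("title", "lucene"): 5,
}


def field_facets(terms, doc_freq, field):
    """Return {text: doc_freq} for every term belonging to `field`."""
    facets = {}
    # Seek to the first term >= (field, ""); an empty text sorts
    # before any real value of the same field.
    start = bisect_left(terms, (field, ""))
    for term_field, text in terms[start:]:
        if term_field != field:  # left the field: we have every value
            break
        facets[text] = doc_freq[(term_field, text)]
    return facets
```

Calling `field_facets(TERMS, DOC_FREQ, "country")` visits only the three "country" terms and stops at the first "title" term, which is the same early exit the PyLucene loop relies on.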

Hopefully others searching for how to perform faceting in PyLucene will come across this post. The key is to index the field values as-is (not analyzed). For completeness, this is how the field values should be indexed:

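# Assumes lucene.initVM() has already been called
from lucene import (SimpleFSDirectory, File, StandardAnalyzer, Version,
                    IndexWriter, Document, Field)

# Named index_dir to avoid shadowing the built-in dir()
index_dir = SimpleFSDirectory(File(indexdir))
analyzer = StandardAnalyzer(Version.LUCENE_30)
# True = create a new index, overwriting any existing one
writer = IndexWriter(index_dir, analyzer, True, IndexWriter.MaxFieldLength(512))
print "Currently there are %d documents in the index..." % writer.numDocs()
# `docs` (the source records) and `terms` (the values to index) are assumed to be defined elsewhere
print "Adding %s Documents to Index..." % docs.count()
for val in terms:
    doc = Document()
    # Store the field as-is (NOT_ANALYZED), with term vectors.
    doc.add(Field("field_name", val, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.YES))
    writer.addDocument(doc)

writer.optimize()
writer.close()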
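
As a rough illustration of why indexing as-is matters for faceting, the pure-Python sketch below compares the term counts produced by an analyzed field with those of an untouched field. `crude_analyze` is a hypothetical and much simplified stand-in for a real analyzer (it only lowercases and splits on whitespace), and `term_doc_freqs` and `VALUES` are likewise made up for this example.

```python
from collections import Counter


def crude_analyze(value):
    # Hypothetical stand-in for an analyzer: lowercase, split on whitespace.
    return value.lower().split()


def term_doc_freqs(values, analyzed):
    """Count how many documents contain each term of a single field.

    Each entry in `values` is the field value of one document.
    """
    counts = Counter()
    for value in values:
        terms = crude_analyze(value) if analyzed else [value]
        counts.update(set(terms))  # a document counts once per distinct term
    return dict(counts)


# Hypothetical field values, one per document.
VALUES = ["United States", "United Kingdom", "United States", "France"]

as_is = term_doc_freqs(VALUES, analyzed=False)
tokenized = term_doc_freqs(VALUES, analyzed=True)
```

Here `as_is` keeps "United States" as a single facet value with a count of 2, while `tokenized` splits it into "united" and "states", which is usually not what a facet list should show.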
Drury answered 3/3, 2012 at 23:15 Comment(0)
