Lucene 4.4: How to get term frequency over the whole index?

I'm trying to compute the tf-idf value of each term in a document. To do that, I iterate through the terms in the document, and for each term I want the frequency of the term in the whole corpus and the number of documents in which the term appears. Here is my code:

// @param index  path to the index directory
// @param docNbr the document number in the index
public void readingIndex(String index, int docNbr) throws IOException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));

    Document doc = reader.document(docNbr);
    System.out.println("Processing file: " + doc.get("id"));

    // Term vector of the "contents" field for this one document
    Terms termVector = reader.getTermVector(docNbr, "contents");
    TermsEnum itr = termVector.iterator(null);
    BytesRef term = null;

    while ((term = itr.next()) != null) {
        String termText = term.utf8ToString();
        long termFreq = itr.totalTermFreq();   // FIXME: this only returns the frequency in this doc
        long docCount = itr.docFreq();         // FIXME: docCount = 1 in all cases

        System.out.println("term: " + termText + ", termFreq = " + termFreq + ", docCount = " + docCount);
    }

    reader.close();
}

Although the documentation says totalTermFreq() returns the total number of occurrences of this term across all documents, in my tests it only returns the frequency of the term in the document given by docNbr, and docFreq() always returns 1.

How can I get the frequency of a term across the whole index?

Update: Of course, I could create a map from each term to its frequency, then iterate through every document to count the total number of times a term occurs. However, I thought Lucene would have a built-in method for that purpose. Thank you.

Loud answered 13/12, 2013 at 20:17

IndexReader.totalTermFreq(Term) will provide this for you. Your calls to the similar methods on the TermsEnum do report statistics over all documents in the enumeration, but an enumeration obtained from a term vector covers only that one document. Calling the methods on the reader instead should give you the statistics for all the documents in the index itself. Something like:

String termText = term.utf8ToString();
Term termInstance = new Term("contents", term);                              
long termFreq = reader.totalTermFreq(termInstance);
long docCount = reader.docFreq(termInstance);

System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
Swarm answered 13/12, 2013 at 21:16 Comment(3)
Great! It works. I saw this method before but was not sure how to convert a BytesRef back to a Term. BTW, do you have any insight into why Lucene has itr.next() return a BytesRef and not a Term? And why have docFreq() on TermsEnum if it only returns 1? Thanks. - Loud
Yes, you could have a TermsEnum iterating over terms from multiple documents, or from an entire index, in which case it would be a more useful statistic. As for why it passes back a BytesRef, I was wondering that myself. In 3.X it passed a Term back from term(), but that changed in 4.0 to pass back the BytesRef instead. It could be that it was redesigned in such a way that the TermsEnum itself doesn't really store what field the term was found in. Just a guess though, not really sure. - Swarm
Yes. Great answer. - Audrieaudris
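
For completeness, here is a minimal sketch of how the three statistics could be combined into the tf-idf value the question is after. It is plain Java with no Lucene dependency. The class name TfIdfSketch, the method tfIdf, and the numbers in main are hypothetical and used only for illustration. The assumption is that termFreqInDoc comes from the term-vector enumeration (itr.totalTermFreq(), which the question observed to be the per-document count), docFreq comes from reader.docFreq(termInstance), and numDocs comes from reader.numDocs(). The formula is the plain textbook variant tf * ln(N / df); Lucene's own scoring uses a different variant, so treat this as one possible choice rather than the library's method.

```java
final class TfIdfSketch {

    /**
     * Textbook tf-idf: termFreqInDoc * ln(numDocs / docFreq).
     * Returns 0.0 when any input is non-positive, so a term that is
     * absent from the document or the index contributes no weight.
     */
    static double tfIdf(long termFreqInDoc, int docFreq, int numDocs) {
        if (termFreqInDoc <= 0 || docFreq <= 0 || numDocs <= 0) {
            return 0.0;
        }
        // Cast before dividing to avoid integer division.
        double idf = Math.log((double) numDocs / docFreq);
        return termFreqInDoc * idf;
    }

    public static void main(String[] args) {
        // Hypothetical values: the term occurs 2 times in the document
        // and appears in 1 of the 100 documents in the index.
        double weight = tfIdf(2L, 1, 100);
        System.out.println("tf-idf = " + weight);
    }
}
```

A term that occurs in every document gets a weight of zero under this variant, because ln(N / N) = 0. If that is not wanted, a smoothed idf is a common alternative.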
