Total Number documents in Corpus is simply the number of documents you have in your corpus. So if you have 20 documents, then this value is 20.
Number of Documents matching term is the number of documents in which the term t occurs. So if you have 20 documents in total and the term t occurs in 15 of them, then the value for Number of Documents matching term is 15.
The value for this example would thus be IDF(t,D) = log(20/15) ≈ 0.1249, using a base-10 logarithm.
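To check the arithmetic, here is a minimal sketch in Python. The function name `idf` and its parameter names are my own, and it assumes a base-10 logarithm, which is what gives the 0.1249 figure (a natural logarithm would give a different number).

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency using a base-10 logarithm (assumed base)."""
    return math.log10(total_docs / docs_with_term)


# 20 documents in the corpus, term t occurs in 15 of them
value = idf(20, 15)
print(round(value, 4))  # 0.1249
```

Note that a term occurring in every document gets an idf of 0, so it contributes nothing to the ranking.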
Now, if I understand correctly, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category. Each category-document should hold all texts which are labelled with this category. You can then perform tf*idf on these documents.
A simple way of categorizing a new document could then be to treat it as a query: for each category, sum the tf*idf values that the query's terms have in that category-document. The category with the highest sum is then ranked 1st.
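The summing approach could look roughly like the sketch below. The toy category texts, the whitespace tokenization, and the helper names `tfidf_per_category` and `rank_categories` are all made up for illustration, and it uses raw term counts with a base-10 idf, which is only one of several possible weightings.

```python
import math
from collections import Counter

# Hypothetical toy data: one "category-document" per category
category_docs = {
    "sports": "football match goal team football league",
    "cooking": "recipe oven bake flour recipe sugar",
    "travel": "flight hotel beach team tour",
}


def tfidf_per_category(docs):
    """Return {category: {term: tf * idf}} for the category-documents."""
    tokenized = {cat: text.split() for cat, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many category-documents each term occurs
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for cat, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[cat] = {t: tf[t] * math.log10(n_docs / df[t]) for t in tf}
    return weights


def rank_categories(query, weights):
    """Sum the tf*idf values of the query terms per category, best first."""
    terms = query.split()
    scores = {
        cat: sum(w.get(t, 0.0) for t in terms) for cat, w in weights.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


weights = tfidf_per_category(category_docs)
ranking = rank_categories("football team goal", weights)
print(ranking[0][0])  # sports
```

Query terms that a category-document doesn't contain simply add 0 to that category's score.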
Another possibility is to create a vector for the query using the idf of each term in the query. All terms which don't occur in the query are given the value of 0. The query-vector can then be compared for similarity to each category-vector using, for example, cosine similarity.
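A rough sketch of the vector comparison, using sparse dictionaries so that terms with the value 0 are just left out. The idf values and category vectors below are invented numbers for illustration, not results from a real corpus, and `cosine_similarity` is my own helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical idf values and tf*idf category vectors (made-up numbers)
idf = {"football": 0.48, "team": 0.18, "recipe": 0.48}
category_vectors = {
    "sports": {"football": 0.96, "team": 0.18},
    "cooking": {"recipe": 0.96},
}

# Query vector: idf of each query term; all other terms are implicitly 0
query_terms = ["football", "team"]
query_vector = {t: idf[t] for t in query_terms if t in idf}

similarities = {
    cat: cosine_similarity(query_vector, vec)
    for cat, vec in category_vectors.items()
}
best = max(similarities, key=similarities.get)
print(best)  # sports
```

Because cosine similarity normalizes by vector length, a category-document that is simply longer than the others doesn't win by size alone.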
Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
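As an illustration of why smoothing helps: with the plain formula, a term that occurs in 0 documents causes a division by zero. The sketch below uses add-one smoothing, which is just one common variant and an assumption on my part; the name `smoothed_idf` is made up.

```python
import math


def smoothed_idf(total_docs: int, docs_with_term: int) -> float:
    """Base-10 idf with add-one smoothing (one variant among several)."""
    # Adding 1 to both counts keeps the denominator non-zero for unseen terms
    return math.log10((total_docs + 1) / (docs_with_term + 1))


unseen = smoothed_idf(20, 0)   # term from the query that is not in the corpus
common = smoothed_idf(20, 15)  # term that occurs in 15 of 20 documents
print(unseen > common)  # True
```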
I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.