TF-IDF and previously unseen terms

TF-IDF (term frequency–inverse document frequency) is a staple of information retrieval. It's not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents contain new terms, especially high-frequency ones? Under traditional cosine matching, those terms would have no impact on the total match.
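For reference, the weighting I have in mind is the usual one. A quick Python sketch (one TF-IDF variant among several; treating an unseen term's weight as zero is just one pragmatic choice):

    import math

    def tfidf(tf, df, n_docs):
        # w(t, d) = tf(t, d) * log(N / df(t)), a standard TF-IDF variant.
        # For a term the corpus has never seen, df == 0 and the IDF is
        # undefined; that is the gap this question is about.
        return tf * math.log(n_docs / df) if df else 0.0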

Coblenz answered 21/10, 2008 at 18:53 Comment(2)
There's a relevant Facebook tech talk in which Peter Norvig discusses this. In the part about segmentation (around 5:30) he glosses over the issue, saying "And you have to do a little bit of a trick if you're missing a word - if it's a word you've never seen before". That hints this is a known problem with a not entirely trivial solution, since he doesn't tell us what the trick is. – Flaviaflavian
Using BPE (byte pair encoding) would be one way to handle out-of-vocabulary items: you'd use subwords instead of, or as well as, whole words. It's not a new technique, but it has recently been popularized by LLMs. – Flaviaflavian
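To illustrate the BPE suggestion, a minimal sketch of applying learned merge rules to an out-of-vocabulary word (the merge table here is made up, the kind a training pass over a corpus containing "gopher" might produce):

    def apply_merges(word, merges):
        # Greedily apply learned BPE merge rules to a character sequence.
        word = list(word)
        for a, b in merges:
            i, out = 0, []
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            word = out
        return word

    # Hypothetical merge table learned from a corpus that contains "gopher":
    merges = [("g", "o"), ("go", "p"), ("gop", "h"), ("e", "r"), ("goph", "er")]
    print(apply_merges("gopherlike", merges))
    # ['gopher', 'l', 'i', 'k', 'e']  (the unseen word still yields usable subwords)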

Er, nope, doesn't break down.

Say I have two documents, A "weasel goat" and B "cheese gopher". If we actually represented these as vectors, they might look something like:

A [1,1,0,0]
B [0,0,1,1]

and if we had actually allocated these vectors in an index file, then yes, we'd have a problem when it comes time to add a new term. But the trick is that the vector never exists; the key is the inverted index (rough sketch below).
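A sketch of that, assuming a simple term-to-postings mapping (illustrative, not a real engine):

    from collections import defaultdict

    index = defaultdict(dict)             # term -> {doc_id: term_frequency}

    def add_document(doc_id, text):
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1

    add_document("A", "weasel goat")
    add_document("B", "cheese gopher")
    add_document("C", "weasel marmoset")  # new term: just a new key, nothing re-dimensioned
    print(index["marmoset"])              # {'C': 1}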

As far as new terms not affecting a cosine match, that may be true, depending on what you mean. If I search my corpus of (A, B) with the query "marmoset kungfu", neither marmoset nor kungfu exists in the corpus, so the vector representing my query is orthogonal to every document in the collection and gets a cosine similarity score of zero. But given that none of the terms match, that seems pretty reasonable.
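To make that concrete, here's a small sketch scoring over raw term counts (TF only, no IDF weighting, for brevity):

    import math
    from collections import Counter

    def cosine_score(query, doc):
        q, d = Counter(query.split()), Counter(doc.split())
        dot = sum(q[t] * d[t] for t in q)          # terms absent from d contribute 0
        norm = (math.sqrt(sum(v * v for v in q.values()))
                * math.sqrt(sum(v * v for v in d.values())))
        return dot / norm if norm else 0.0

    print(cosine_score("marmoset kungfu", "weasel goat"))  # 0.0: fully orthogonal
    print(cosine_score("goat kungfu", "weasel goat"))      # 0.5: partial match still scores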

Scorcher answered 31/10, 2008 at 22:11 Comment(0)

When you talk about "break down" I think you mean that the new terms have no impact on the similarity measure, because they do not have any representation in the vector space defined by the original vocabulary.

One approach to this smoothing problem is to fix the vocabulary in advance to a smaller set and treat every word rarer than some threshold as an instance of a special _UNKNOWN_ token (sketch below).
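A minimal sketch of that idea (the threshold and the token name are illustrative):

    from collections import Counter

    def build_vocab(corpus_tokens, min_count=2):
        # Keep words seen at least min_count times; the rest map to _UNKNOWN_.
        counts = Counter(corpus_tokens)
        return {w for w, c in counts.items() if c >= min_count}

    def normalize(tokens, vocab):
        return [t if t in vocab else "_UNKNOWN_" for t in tokens]

    corpus = "weasel goat weasel cheese gopher goat".split()
    vocab = build_vocab(corpus)           # {'weasel', 'goat'}
    print(normalize("marmoset goat kungfu".split(), vocab))
    # ['_UNKNOWN_', 'goat', '_UNKNOWN_']: rare and unseen terms share one slot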

However, I don't think your definition of "break down" is very clear; could you spell out what you mean there? Once that's cleared up, we can discuss ways to work around the problem.

Oleomargarine answered 31/12, 2008 at 22:15 Comment(0)
