What is the use of the Brown Corpus in measuring semantic similarity based on WordNet?

Asked 9/9, 2013 at 19:45 Answered 16/9, 2013 at 15:27

I came across several methods for measuring semantic similarity that use the structure and hierarchy of WordNet, e.g. the Jiang and Conrath measure (JCN), the Resnik measure (RES), the Lin measure (LIN), etc.

With NLTK, they are computed like this:

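from nltk.corpus import wordnet as wn, wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
# entry1 and entry2 are WordNet synsets
sim2 = wn.jcn_similarity(entry1, entry2, brown_ic)
sim3 = entry1.res_similarity(entry2, brown_ic)
sim4 = entry1.lin_similarity(entry2, brown_ic)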

If WordNet is the basis for calculating semantic similarity, what role does the Brown Corpus play here?

Casi answered 9/9, 2013 at 19:45 Comment(0)

Take a look at the explanation in the NLTK howto for WordNet.

Specifically, the ic in *_ic names (such as brown_ic) stands for information content.

synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.

A bit more info on information content from here:

The conventional way of measuring the IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with statistics on their actual usage in text as derived from a large corpus
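To make the quoted idea concrete, here is a minimal sketch of how hierarchy and corpus counts combine into information content. The taxonomy, the counts, and all function names are hypothetical toy values, not real WordNet or Brown data, and the sketch ignores details such as smoothing and multiple parents that a real IC file would have to deal with.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not real WordNet data.
PARENT = {
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "entity",
}

# Hypothetical counts of how often each concept itself occurs in a corpus.
DIRECT_COUNTS = {"entity": 0, "animal": 2, "dog": 10, "cat": 8, "car": 20}


def ancestors(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def cumulative_counts(direct_counts):
    """Credit each occurrence of a concept to the concept and all its ancestors."""
    totals = {concept: 0 for concept in direct_counts}
    for concept, count in direct_counts.items():
        for node in ancestors(concept):
            totals[node] += count
    return totals


def information_content(direct_counts, root="entity"):
    """IC(c) = -log p(c), where p(c) = cumulative count of c / count at the root."""
    totals = cumulative_counts(direct_counts)
    return {c: math.log(totals[root] / n) for c, n in totals.items()}


def resnik(c1, c2, ic):
    """Resnik similarity: IC of the most informative ancestor shared by c1 and c2."""
    shared = [a for a in ancestors(c1) if a in ancestors(c2)]
    return max(ic[a] for a in shared)


IC = information_content(DIRECT_COUNTS)
print(resnik("dog", "cat", IC))  # shared ancestor "animal" carries some information
print(resnik("dog", "car", IC))  # only the root is shared, so the score is zero
```

The hierarchy decides which ancestor two senses share, while the corpus decides how informative that ancestor is. Swapping in counts from a different corpus changes the scores even though the taxonomy stays the same.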

Dastardly answered 9/9, 2013 at 20:43 Comment(2)

Can we say, then, that even though wn_ic=wn.ic(wn) could be used, a valid similarity measurement should rely on information content computed from a text other than WordNet (e.g. Brown)? I ask because the paper you refer to says: "We feel that WordNet can also be used as a statistical resource with no need for external ones." – Thurlow 3/10, 2017 at 22:42

The paper suggests a method based on number of hyponyms. – Thurlow 3/10, 2017 at 22:48

The brown_ic in your code is loaded from the information content file ~/nltk_data/corpora/wordnet_ic/ic-brown.dat. For more detail on the format of ic-brown.dat, check out this thread from the NLTK-user group.

Overall, the ic-brown.dat file holds frequency counts gathered from the Brown corpus, keyed by WordNet synset; the information content values are derived from these counts.

The semantic measures by Jiang and Conrath, Resnik, and Lin all require a corpus in addition to WordNet. These measures combine WordNet's hierarchy with corpus statistics, and they have been shown to correlate better with human judgment than measures that use WordNet alone (Li 2006; Pedersen 2010).
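For reference, the three measures can be written as follows. The notation is my own shorthand: $c_1, c_2$ are the two synsets, $\mathrm{lcs}$ is their least common subsumer, and $p(c)$ is the corpus-estimated probability of encountering $c$ or any of its descendants. The Jiang and Conrath formula is given in its similarity form (the reciprocal of their original distance), which is how NLTK's jcn_similarity reports it as far as I know.

```latex
\begin{align*}
\mathrm{IC}(c) &= -\log p(c) \\
\mathrm{sim}_{\mathrm{res}}(c_1, c_2) &= \mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr) \\
\mathrm{sim}_{\mathrm{lin}}(c_1, c_2) &= \frac{2\,\mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)} \\
\mathrm{sim}_{\mathrm{jcn}}(c_1, c_2) &= \frac{1}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
\end{align*}
```

In every case WordNet supplies $\mathrm{lcs}(c_1, c_2)$, and the corpus (here Brown) supplies $p(c)$.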

Stableman answered 16/9, 2013 at 15:27 Comment(0)
