Understanding SemCor corpus structure

I'm learning NLP and am currently experimenting with word sense disambiguation (WSD). I plan to use the SemCor corpus as training data, but I have trouble understanding its XML structure. I searched online but could not find any resource describing the content structure of SemCor. Here is a sample sentence:

<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="group" lexsn="1:03:00::" pn="group" pos="NNP" rdf="group" wnsn="1">Fulton_County_Grand_Jury</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="friday" lexsn="1:28:00::" pos="NN" wnsn="1">Friday</wf>
<wf cmd="ignore" pos="DT">an</wf>
<wf cmd="done" lemma="investigation" lexsn="1:09:00::" pos="NN" wnsn="1">investigation</wf>
<wf cmd="ignore" pos="IN">of</wf>
<wf cmd="done" lemma="atlanta" lexsn="1:15:00::" pos="NN" wnsn="1">Atlanta</wf>
<wf cmd="ignore" pos="POS">'s</wf>
<wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
<wf cmd="done" lemma="primary_election" lexsn="1:04:00::" pos="NN" wnsn="1">primary_election</wf>
<wf cmd="done" lemma="produce" lexsn="2:39:01::" pos="VB" wnsn="4">produced</wf>
<punc>``</punc>
<wf cmd="ignore" pos="DT">no</wf>
<wf cmd="done" lemma="evidence" lexsn="1:09:00::" pos="NN" wnsn="1">evidence</wf>
<punc>''</punc>
<wf cmd="ignore" pos="IN">that</wf>
<wf cmd="ignore" pos="DT">any</wf>
<wf cmd="done" lemma="irregularity" lexsn="1:04:00::" pos="NN" wnsn="1">irregularities</wf>
<wf cmd="done" lemma="take_place" lexsn="2:30:00::" pos="VB" wnsn="1">took_place</wf>
<punc>.</punc>
</s>
  • I'm assuming wnsn is the WordNet sense number ('word sense'). Is that correct?
  • What does the attribute lexsn mean? How does it map to WordNet?
  • What does the attribute pn refer to? (third line)
  • How is the rdf attribute assigned? (again third line)
  • In general, what are the possible attributes?
Alike answered 3/1, 2011 at 10:27 Comment(1)
Have you figured this out? I need to convert this data for a WSD classification task. How can I do that? - Foltz

The format is described in the "doc/cxtfile.txt" file in the SemCor 1.6 archive; for some reason, documentation is not included in later versions.
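As a rough illustration of how the tagged words could be pulled out for a WSD task, here is a minimal Python sketch. It assumes the file is well-formed XML like the sample in the question (some SemCor distributions use SGML-style markup with unquoted attributes, which this parser would reject). It also assumes that joining `lemma` and `lexsn` with `%` gives a WordNet sense key, which matches the sense key layout in WordNet's sense index as far as I know. The names `SAMPLE` and `tagged_tokens` are made up for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sentence from the question (assumed well-formed XML).
SAMPLE = """<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" lemma="say" lexsn="2:32:00::" pos="VB" wnsn="1">said</wf>
<wf cmd="done" lemma="recent" lexsn="5:00:00:past:00" pos="JJ" wnsn="2">recent</wf>
<punc>.</punc>
</s>"""


def tagged_tokens(sentence_xml):
    """Return one dict per sense-tagged <wf> element in a sentence."""
    rows = []
    for wf in ET.fromstring(sentence_xml).iter("wf"):
        # Words with cmd="ignore" carry no lemma/lexsn/wnsn, so skip them.
        if wf.get("cmd") != "done" or wf.get("lexsn") is None:
            continue
        rows.append({
            "token": wf.text,
            "lemma": wf.get("lemma"),
            "pos": wf.get("pos"),
            # Kept as a string; not converted, in case a value is not a plain integer.
            "wnsn": wf.get("wnsn"),
            # Assumption: lemma + "%" + lexsn forms a WordNet sense key.
            "sense_key": f"{wf.get('lemma')}%{wf.get('lexsn')}",
        })
    return rows


rows = tagged_tokens(SAMPLE)
for row in rows:
    print(row)
```

Each resulting row pairs a surface token with a sense label, which is the usual shape for WSD training data. If NLTK and its WordNet data are installed, `nltk.corpus.wordnet.lemma_from_key` may be able to resolve these keys, provided the WordNet version matches the one SemCor was tagged against.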

Microsecond answered 4/1, 2011 at 2:50 Comment(2)
The wnsn is of the "word used" or its "lemmatised form", because they can be different. - Levigate
The above link does not work anymore. Here is the current SemCor 1.6 archive and from here you can download other SemCor versions. - Intensity
