I am looking for a free tagged corpus to train a Named Entity Recognition system on. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help?
Free Tagged Corpus for Named Entity Recognition [closed]
The same question was asked on opendata.stackexchange.com/q/7250/1652 (where it's not closed) –
Constrict
There's a list of corpora at http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html
The CoNLL 2003 corpus, which is on that list, is free and is available from http://www.cnts.ua.ac.be/conll2003/ner/ (annotations) and NIST (text).
Do we have to follow the procedure of filling in forms and sending an application to NIST to get the dataset, as stated in this link, or is there an alternative? –
Sharpen
CoNLL 2003 NIST text data is not "free" - it's only free for non-commercial, research use. –
Randolph
@Randolph The agreements are here: trec.nist.gov/data/reuters/reuters.html In my reading of them, there's no non-commercial restriction (but then IANAL). –
Notus
@tom hmm, yea, I read this as being for non-commercial research only - "The information may only be used for research and development of natural-language-processing, information-retrieval or document-understanding systems." - but I guess you're right, it doesn't say that. I'm not a lawyer either - I'll have to get one to look at it :) –
Randolph
The Python NLTK has access to the nltk.corpus.conll2000 corpus. Calling conll2000.iob_words() returns a list of (word, part-of-speech, IOB) triples, where IOB is a tag in the Inside-entity/Outside-entity/Beginning-of-entity format.
There are about 250k total words in a newswire-style context.
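A minimal sketch of loading the triples and writing them out in a plain column format that other tools can read. It assumes NLTK is installed and the corpus has been fetched once with nltk.download("conll2000"). The helper names load_conll2000_sentences and to_conll_text are made up for illustration. Note that the IOB tags in CoNLL-2000 look like B-NP, I-NP and O, that is, they mark chunks.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (word, part-of-speech tag, IOB tag)


def load_conll2000_sentences() -> List[List[Triple]]:
    """Load sentences of (word, pos, iob) triples from NLTK's CoNLL-2000 corpus.

    Assumption: NLTK is installed and nltk.download("conll2000") has been run.
    """
    from nltk.corpus import conll2000  # imported lazily so the rest works without NLTK

    return [list(sent) for sent in conll2000.iob_sents()]


def to_conll_text(sentences: Iterable[Iterable[Triple]]) -> str:
    """Render sentences as one token per line, with a blank line between sentences."""
    blocks = []
    for sent in sentences:
        blocks.append("\n".join(f"{word} {pos} {iob}" for word, pos, iob in sent))
    return "\n\n".join(blocks) + "\n"
```

To dump the whole corpus, something like open("conll2000.txt", "w", encoding="utf-8").write(to_conll_text(load_conll2000_sentences())) should do. The column order expected by a given downstream tagger may differ, so check its documentation first.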
Can we also dump the dataset for use with another tool, such as the tagger by GLample? –
Sharpen
CONLL2000 does not mark named entities. –
Sullivan
DBpedia is open and free.
DBpedia is built from Wikipedia and is a very large corpus. Build a Lucene index on the triples involving rdfs:label from the DBpedia titles dump.
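A rough sketch of the indexing idea, using a plain in-memory Python dict as a stand-in for Lucene. It assumes the labels dump is in N-Triples format, one rdfs:label triple per line; the function name build_label_index is hypothetical. The result is a lookup table of entity names (a gazetteer), not annotated text.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

# Matches lines of the form: <subject-uri> <rdfs:label> "literal"@lang .
LABEL_LINE = re.compile(
    r'^<(?P<uri>[^>]+)>\s+'
    + re.escape(RDFS_LABEL)
    + r'\s+"(?P<label>(?:[^"\\]|\\.)*)"(?:@(?P<lang>[A-Za-z-]+))?\s*\.\s*$'
)


def build_label_index(lines: Iterable[str], lang: str = "en") -> Dict[str, Set[str]]:
    """Map each lower-cased label to the set of resource URIs that carry it.

    Lines that are not rdfs:label triples are skipped. Literals tagged with a
    different language are skipped; untagged literals are kept (an assumption).
    Escape sequences inside the literal are left as-is in this sketch.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = LABEL_LINE.match(line.strip())
        if match is None:
            continue
        if match.group("lang") not in (None, lang):
            continue
        index[match.group("label").lower()].add(match.group("uri"))
    return dict(index)
```

Passing an open file handle as lines streams the dump line by line. For the full dump, an on-disk index such as Lucene, as suggested above, is likely the better fit.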
As one of the other answers states, DBpedia is not a tagged corpus. –
Notus
In 2012 (and today) my comment was true, but this could change in the future. If you're interested in DBpedia based corpuses you might want to follow the Open Extraction Challenge (wiki.dbpedia.org/textext) to generate NIF output for DBpedia from Wikipedia text. –
Notus