Free Tagged Corpus for Named Entity Recognition [closed]

I am looking for a free tagged corpus to train a Named Entity Recognition system on. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help?

Lucienlucienne asked 25/7, 2010 at 17:27 Comment(1)
The same question was asked on opendata.stackexchange.com/q/7250/1652 (where it's not closed) – Constrict

There's a list of corpora at http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html

The CoNLL 2003 corpus, which is on that list, is free and is available from http://www.cnts.ua.ac.be/conll2003/ner/ (annotations) and NIST (text).
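
The annotation files use a simple column format: one token per line with word, part-of-speech, chunk, and NER tag columns, blank lines between sentences, and -DOCSTART- lines between documents. Here is a minimal sketch of a reader for that format, assuming you have already merged the annotations with the Reuters text into a file such as eng.train (the file name is a placeholder):

    def read_conll2003(path):
        """Yield sentences as lists of (token, pos, chunk, ner_tag) tuples."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Blank lines end a sentence; -DOCSTART- lines separate documents.
                if not line or line.startswith("-DOCSTART-"):
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                token, pos, chunk, ner = line.split()
                sentence.append((token, pos, chunk, ner))
        if sentence:
            yield sentence

    # Example usage (the path is hypothetical):
    # for sent in read_conll2003("eng.train"):
    #     print(sent)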

Notus answered 12/7, 2012 at 20:42 Comment(4)
Do we have to follow the procedure of filling out forms and sending an application to NIST to get the dataset, as stated in that link? Or is there an alternative? – Sharpen
CoNLL 2003 NIST text data is not "free" - it's only free for non-commercial, research use. – Randolph
@Randolph The agreements are here: trec.nist.gov/data/reuters/reuters.html. In my reading of them, there's no non-commercial restriction (but then IANAL). – Notus
@tom hmm, yea, I read this as being for non-commercial research only - "The information may only be used for research and development of natural-language-processing, information-retrieval or document-understanding systems." - but I guess you're right, it doesn't say that. I'm not a lawyer either - I'll have to get one to look at it :) – Randolph

The Python NLTK includes the nltk.corpus.conll2000 corpus. Calling conll2000.iob_words() returns a list of (word, part-of-speech, IOB tag) triples, where the IOB tag uses the Inside/Outside/Beginning format.

There are about 250k total words in a newswire-style context.
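
A minimal sketch of loading it, assuming NLTK is installed and the one-time corpus download is allowed (note that, as a comment below points out, these IOB tags mark chunks rather than named entities):

    import nltk

    nltk.download("conll2000", quiet=True)  # one-time fetch of the corpus files
    from nltk.corpus import conll2000

    triples = conll2000.iob_words()  # list of (word, POS, IOB tag) triples
    print(len(triples), "tagged words in total")
    for word, pos, iob in triples[:10]:
        print(word, pos, iob)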

Iridescent answered 20/3, 2011 at 23:0 Comment(2)
Can we also dump the dataset to use it with some other tool, such as the tagger by GLample? – Sharpen
CoNLL-2000 does not mark named entities. – Sullivan

DBpedia is open and free

DBpedia is built from Wikipedia and it is a very big corpus. Build a Lucene index over the rdfs:label triples from the DBpedia titles dump.
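
A minimal sketch of pulling (resource, label) pairs out of an N-Triples labels dump, so they can be fed to whatever indexer you prefer (Lucene or anything else); the file name labels_en.nt and the regular expression below are assumptions about the dump layout, not part of this answer:

    import re

    # Matches lines like:
    # <http://dbpedia.org/resource/Algeria> <http://www.w3.org/2000/01/rdf-schema#label> "Algeria"@en .
    LABEL_RE = re.compile(
        r'^<(?P<uri>[^>]+)>\s+'
        r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
        r'"(?P<label>.*)"(@\w+)?\s*\.\s*$'
    )

    def iter_labels(path):
        """Yield (resource_uri, label) pairs from an N-Triples labels dump."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                m = LABEL_RE.match(line)
                if m:
                    yield m.group("uri"), m.group("label")

    # Example usage (the file name is a placeholder):
    # for uri, label in iter_labels("labels_en.nt"):
    #     index.add(uri, label)  # replace with your Lucene/other indexing call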

Tad answered 25/7, 2010 at 17:35 Comment(2)
As one of the other answers states, DBpedia is not a tagged corpus. – Notus
In 2012 (and today) my comment was true, but this could change in the future. If you're interested in DBpedia-based corpora you might want to follow the Open Extraction Challenge (wiki.dbpedia.org/textext) to generate NIF output for DBpedia from Wikipedia text. – Notus
