Free Tagged Corpus for Named Entity Recognition [closed]

I am looking for a free tagged corpus to train a Named Entity Recognition system on. Most of the ones I find (like the New York Times one) are expensive and not open. Can anyone help?

Lucienlucienne asked 25/7, 2010 at 17:27 Comment(1)
The same question was asked on opendata.stackexchange.com/q/7250/1652 (where it's not closed) – Constrict

There's a list of corpora at http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html

The CoNLL 2003 corpus, which is on that list, is free and is available from http://www.cnts.ua.ac.be/conll2003/ner/ (annotations) and NIST (text).
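
The annotation files use a simple column format: one token per line with word, part-of-speech, chunk, and NER tag columns, blank lines between sentences, and -DOCSTART- lines between documents. Here is a minimal sketch of a reader for that format, assuming you have already merged the annotations with the Reuters text into a file such as eng.train (the file name is a placeholder):

    def read_conll2003(path):
        """Yield sentences as lists of (token, pos, chunk, ner_tag) tuples."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Blank lines end a sentence; -DOCSTART- lines separate documents.
                if not line or line.startswith("-DOCSTART-"):
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                token, pos, chunk, ner = line.split()
                sentence.append((token, pos, chunk, ner))
        if sentence:
            yield sentence

    # Example usage (the path is hypothetical):
    # for sent in read_conll2003("eng.train"):
    #     print(sent)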

Notus answered 12/7, 2012 at 20:42 Comment(4)
Do we have to follow the procedure of filling out forms and sending an application to NIST to get the dataset, as stated in that link? Or is there an alternative? – Sharpen
CoNLL 2003 NIST text data is not "free" - it's only free for non-commercial, research use. – Randolph
@Randolph The agreements are here: trec.nist.gov/data/reuters/reuters.html. In my reading of them, there's no non-commercial restriction (but then IANAL). – Notus
@tom hmm, yea, I read this as being for non-commercial research only - "The information may only be used for research and development of natural-language-processing, information-retrieval or document-understanding systems." - but I guess you're right, it doesn't say that. I'm not a lawyer either - I'll have to get one to look at it :) – Randolph

The Python NLTK includes the nltk.corpus.conll2000 corpus. Calling conll2000.iob_words() returns a list of (word, part-of-speech, IOB tag) triples, where the IOB tag uses the Inside/Outside/Beginning format.

There are about 250k total words in a newswire-style context.
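
A minimal sketch of loading it, assuming NLTK is installed and the one-time corpus download is allowed (note that, as a comment below points out, these IOB tags mark chunks rather than named entities):

    import nltk

    nltk.download("conll2000", quiet=True)  # one-time fetch of the corpus files
    from nltk.corpus import conll2000

    triples = conll2000.iob_words()  # list of (word, POS, IOB tag) triples
    print(len(triples), "tagged words in total")
    for word, pos, iob in triples[:10]:
        print(word, pos, iob)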

Iridescent answered 20/3, 2011 at 23:0 Comment(2)
Can we also dump the dataset to use it with some other tool, such as the tagger by GLample? – Sharpen
CoNLL-2000 does not mark named entities. – Sullivan

DBpedia is open and free

DBpedia is built from Wikipedia and it is a very big corpus. Build a Lucene index over the rdfs:label triples from the DBpedia titles dump.
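
A minimal sketch of pulling (resource, label) pairs out of an N-Triples labels dump, so they can be fed to whatever indexer you prefer (Lucene or anything else); the file name labels_en.nt and the regular expression below are assumptions about the dump layout, not part of this answer:

    import re

    # Matches lines like:
    # <http://dbpedia.org/resource/Algeria> <http://www.w3.org/2000/01/rdf-schema#label> "Algeria"@en .
    LABEL_RE = re.compile(
        r'^<(?P<uri>[^>]+)>\s+'
        r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
        r'"(?P<label>.*)"(@\w+)?\s*\.\s*$'
    )

    def iter_labels(path):
        """Yield (resource_uri, label) pairs from an N-Triples labels dump."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                m = LABEL_RE.match(line)
                if m:
                    yield m.group("uri"), m.group("label")

    # Example usage (the file name is a placeholder):
    # for uri, label in iter_labels("labels_en.nt"):
    #     index.add(uri, label)  # replace with your Lucene/other indexing call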

Tad answered 25/7, 2010 at 17:35 Comment(2)
As one of the other answers states, DBpedia is not a tagged corpus. – Notus
In 2012 (and today) my comment was true, but this could change in the future. If you're interested in DBpedia-based corpora you might want to follow the Open Extraction Challenge (wiki.dbpedia.org/textext) to generate NIF output for DBpedia from Wikipedia text. – Notus
