Large classification document corpus

2

5

Can anyone point me to a large corpus that I can use for classification?

By large I don't mean Reuters or 20 Newsgroups; I'm talking about a corpus that is gigabytes in size, not 20 MB or so.

I was only able to find Reuters and 20 Newsgroups, which are far too small for what I need.

Coherent answered 27/8, 2015 at 10:17 Comment(1)
Provided an answer. Please accept it, or comment if it was not helpful. - Seedling
6

The most popular datasets for text-classification evaluation, such as the Reuters and 20 Newsgroups sets you already found, do not meet the 'large' requirement. The datasets below might meet your criteria:

  • Common Crawl: you could build a large corpus by extracting articles that have specific keywords in their meta tags, and then use it for document classification.

  • Enron Email Dataset: you could run a variety of different classification tasks on it.

  • Topic Annotated Enron Dataset: not free, but already labelled, and it meets your large-corpus requirement.
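To make the Common Crawl suggestion a bit more concrete, here is a minimal sketch of the labelling step only, assuming you already have the raw HTML of each page as a string. The `LABEL_KEYWORDS` mapping and the function names are hypothetical, and many pages have no keywords meta tag at all, so treat this as a starting point and not a finished pipeline.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the values of <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            for word in attr.get("content", "").split(","):
                word = word.strip().lower()
                if word:
                    self.keywords.append(word)


# Hypothetical mapping from class label to trigger keywords (an assumption).
LABEL_KEYWORDS = {
    "sports": {"football", "tennis"},
    "music": {"jazz", "concert"},
}


def label_from_html(html):
    """Return the first label whose keywords appear in the page's meta tag, else None."""
    parser = MetaKeywordParser()
    parser.feed(html)
    found = set(parser.keywords)
    for label, words in LABEL_KEYWORDS.items():
        if found & words:
            return label
    return None
```

Pages for which `label_from_html` returns `None` would simply be skipped, so the resulting corpus is only as clean as the keyword lists you choose.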

You can browse other publicly available datasets here

Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.

Update:

I created the custom corpus builder module I mentioned above but forgot to link it: News Corpus Builder

Seedling answered 27/8, 2015 at 23:29 Comment(0)
1

Huge Reddit archive spanning 10/2007 to 5/2015

Sowens answered 27/8, 2015 at 10:57 Comment(4)
Thanks, but this doesn't seem to be a labeled, classification-ready dataset? - Coherent
What exactly do you mean by labeled? - Ussery
@Ussery I mean a corpus of documents where, for each document, you know which class it belongs to, for example sports, history, music, etc. - Coherent
The archive is in JSON format, so the text is easily parsed out, and being Reddit, it is well organized. The difference between r/Drugs and "drugs" is semantic, IMHO. It's not completely formatted for ML, but it's as close as any dataset I've seen, particularly one of this size and scope. Let us know if you find what you're looking for, as we all may have use for it too. - Sowens
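For anyone wanting to try the subreddit-as-label idea, here is a rough sketch. It assumes the archive is stored as one JSON object per line and that each object has `subreddit` and `body` fields; check both assumptions against the actual dump. The function name is hypothetical.

```python
import json


def labeled_comments(lines, wanted_subreddits):
    """Yield (text, label) pairs, using the subreddit name as the class label."""
    wanted = {name.lower() for name in wanted_subreddits}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting a long run
        if not isinstance(record, dict):
            continue
        label = str(record.get("subreddit", "")).lower()
        text = record.get("body", "")
        # Assumed field names; removed comments are dropped.
        if label in wanted and text and text != "[deleted]":
            yield text, label
```

Because it is a generator over lines, you can pass it an open file handle and stream a multi-GB dump without loading it into memory, e.g. `labeled_comments(open(path, encoding="utf-8"), ["sports", "history", "music"])`.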
