How to set up a categorized corpus reader in NLTK (Python) when the corpus texts are in one file, one text per line

I am getting familiar with NLTK and text categorization through Jacob Perkins's book "Python Text Processing with NLTK 2.0 Cookbook".

Each document in my corpus is a single paragraph of text, so each one sits on its own line of a file rather than in a separate file. There are about 2 million such lines, and therefore about 2 million machine learning instances.

Each line in my file (a paragraph of text combining a domain title, description, and keywords) is subject to feature extraction (tokenization, etc.) to turn it into an instance for a machine learning algorithm.

I have two such files, one with all the positives and one with all the negatives.

How can I load them into CategorizedCorpusReader? Is it possible?

I tried other solutions before, like scikit-learn, and finally picked NLTK, hoping it would be an easier starting point for getting a result.

Leakage answered 18/12, 2014 at 3:37 Comment(0)

Assuming that you have two files:

file_pos.txt, file_neg.txt

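# CategorizedCorpusReader is only a mixin; the concrete reader for plain text
# is CategorizedPlaintextCorpusReader.
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader(
    '/path/to/corpora/',             # corpus root directory
    r'file_.*\.txt',                 # regex selecting the corpus files
    cat_pattern=r'file_(\w+)\.txt')  # group 1 becomes the category name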

After this, you can apply the usual corpus methods to it, such as:

>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['file_neg.txt']

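A CategorizedPlaintextCorpusReader also provides words(), sents(), paras() and raw(), each of which accepts fileids or categories. tagged_sents and tagged_words are only available on tagged corpus readers such as CategorizedTaggedCorpusReader.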
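
Because each line of file_pos.txt and file_neg.txt is one document, a single document can be recovered by splitting the raw text of a category on newlines. The following is a minimal sketch of that idea, not something from the book or from NLTK itself: the helper names build_reader, iter_documents and get_document are hypothetical, and it assumes the file names and patterns used above. Note that raw() reads a whole file into memory, which may or may not be acceptable for about 2 million lines.

```python
from itertools import islice


def build_reader(root):
    """Build a categorized plaintext reader over file_pos.txt / file_neg.txt.

    NLTK is imported here so the helpers below only need an object that
    offers fileids(categories=...) and raw(fileids=...).
    """
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    return CategorizedPlaintextCorpusReader(
        root,
        r'file_.*\.txt',
        cat_pattern=r'file_(\w+)\.txt',
    )


def iter_documents(reader, category):
    """Yield every non-empty line (one document) of the given category."""
    for fileid in reader.fileids(categories=[category]):
        for line in reader.raw(fileids=[fileid]).splitlines():
            text = line.strip()
            if text:
                yield text


def get_document(reader, category, index):
    """Return the document at position `index` (0-based) in a category.

    Raises StopIteration if the category has fewer documents than that.
    """
    return next(islice(iter_documents(reader, category), index, None))
```

With this in place, get_document(build_reader('/path/to/corpora/'), 'pos', 0) would return the first positive paragraph as a string, which can then be tokenized and turned into a feature set.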

You might enjoy this tutorial about creating custom corpora: https://www.packtpub.com/books/content/python-text-processing-nltk-20-creating-custom-corpora

Primogeniture answered 4/3, 2015 at 10:39 Comment(2)
How do you retrieve a single document from the corpus in this case? You can't use corpus.sents()[0], corpus.paras()[0] or corpus.words()[0], because these methods will not give you the first document. - Saviour
That's a fair point. One thing you could do is split the pos and neg sentences into their own files. That way it might be a bit more straightforward to get them individually. Edit: by "their own files" I mean splitting them into pos_1..N.txt and neg_1..N.txt. - Primogeniture
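
The split suggested in the comment above could be sketched roughly as follows. This uses only the standard library; the function name split_into_files and its parameters are hypothetical, and UTF-8 input is assumed. Each non-empty line of a source file is written to its own file named <category>_<n>.txt.

```python
import os


def split_into_files(source_path, out_dir, category, encoding="utf-8"):
    """Write each non-empty line of source_path to out_dir/<category>_<n>.txt.

    Returns the number of document files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with open(source_path, encoding=encoding) as source:
        # Iterate lazily so the whole source file is never held in memory.
        for line in source:
            text = line.strip()
            if not text:
                continue  # skip blank lines instead of creating empty files
            count += 1
            target = os.path.join(out_dir, f"{category}_{count}.txt")
            with open(target, "w", encoding=encoding) as out:
                out.write(text + "\n")
    return count
```

After running it once for each file, for example split_into_files('file_pos.txt', 'corpus', 'pos'), a categorized reader over the output directory could use r'(pos|neg)_\d+\.txt' both as the fileids pattern and as cat_pattern, so that every fileid is one document. With about 2 million documents this creates about 2 million small files in one directory, which some file systems handle poorly, so it may be worth trying on a sample first.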
