I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:
import nltk
import pandas as pd

def fnDTM_Corpus(xCorpus):
    '''Create a Term-Document Matrix from an NLTK corpus.'''
    # one frequency distribution (term -> count) per file
    fd_list = []
    for fileid in xCorpus.fileids():
        fd_list.append(nltk.FreqDist(xCorpus.words(fileid)))
    # one row per document, one column per term; absent terms become NaN, then 0
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    return DTM.T  # transpose to terms x documents
To run it:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
x = fnDTM_Corpus(newcorpus)
It works well for a few small files in the corpus, but gives me a MemoryError when I run it on a corpus of 4,000 files (about 2 KB each).
Am I missing something?
I am using 32-bit Python (on Windows 7, 64-bit OS, quad-core CPU, 8 GB RAM). Do I really need 64-bit Python for a corpus of this size?
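For comparison, here is a minimal sketch of building the same terms-by-documents matrix in sparse form with scipy.sparse instead of a dense DataFrame. The function name fnDTM_sparse and the two-pass structure are illustrative, not part of the code above, and it assumes SciPy is available.

import nltk
from scipy.sparse import lil_matrix

def fnDTM_sparse(xCorpus):
    fileids = xCorpus.fileids()
    fd_list = []
    vocab = {}
    # first pass: count terms per file and give each new term an integer index
    for fid in fileids:
        fd = nltk.FreqDist(xCorpus.words(fid))
        fd_list.append(fd)
        for term in fd:
            vocab.setdefault(term, len(vocab))
    # second pass: fill a sparse terms-by-documents matrix
    dtm = lil_matrix((len(vocab), len(fileids)))
    for j, fd in enumerate(fd_list):
        for term, count in fd.items():
            dtm[vocab[term], j] = count
    return dtm, vocab, fileids

Keeping the result in a sparse format (e.g. dtm.tocsr()) avoids materialising one float per (term, document) pair, which is likely where the dense DataFrame version runs out of memory.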
Comments:

"gensim or similar libraries that have optimized their code for tf-idf? radimrehurek.com/gensim" – Item

"pd.get_dummies(df_column) could do the job. Maybe I am missing something about the document term matrix." – Awoke
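Following up on the gensim suggestion, a minimal sketch of how the same corpus could be fed to gensim (assuming gensim is installed; fnDTM_gensim is a hypothetical helper name, not an API from the question). gensim keeps each document as sparse (term_id, count) pairs rather than a dense matrix.

from gensim import corpora, models

def fnDTM_gensim(xCorpus):
    # one token list per file
    texts = [list(xCorpus.words(fid)) for fid in xCorpus.fileids()]
    dictionary = corpora.Dictionary(texts)              # term <-> integer id mapping
    bow = [dictionary.doc2bow(text) for text in texts]  # sparse (term_id, count) pairs per doc
    tfidf = models.TfidfModel(bow)                       # optional tf-idf weighting
    return bow, dictionary, tfidf

Applying tfidf[bow] would then yield tf-idf-weighted sparse vectors, which is what the first comment refers to.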