How can I improve the performance of NLTK? Alternatives?
I tried searching here and there but could not find a good solution, so I thought of asking the NLP experts. I am developing a text-similarity application for which I need to match thousands and thousands of documents (of around 1000 words each) against each other. For the NLP part, my best bet was NLTK, given its capabilities and the algorithm-friendliness of Python. But now that part-of-speech tagging alone takes so much time, I believe NLTK may not be the best fit. Java or C won't hurt me, so any solution will work for me.

Please note that I have already started migrating from MySQL to HBase in order to work with more freedom on such a large amount of data. But the question still stands: how do I run the algorithms? Mahout may be a choice, but it too is geared towards machine learning, not dedicated NLP (it may be good for speech recognition). What other options are available? In short, I need high-performance NLP (a step down from high-performance machine learning). (I am inclined a bit towards Mahout, with future usage in mind.)

It is about scaling NLTK.

Tabbatha answered 3/4, 2013 at 8:57 Comment(1)
NLTK is very slow; it's mostly useful for prototyping. Consider Gensim, which is much more scalable. – Gilligan
You can use Mahout to find which documents are most related to one another.

Here is a quick tutorial (link) that will teach you some of the concepts, but they are best explained in chapter 8 of the Mahout in Action book.

Basically, you first need to represent your data in Hadoop SequenceFile format, for which you can use the seqdirectory command. This might prove too slow, though, given that it wants each document as its own file (so with "thousands and thousands of documents", I/O will suffer). This post is related in that it talks about how to make a SequenceFile from a CSV file where each line is a document. Although, if I'm not mistaken, Mahout's trunk may have some functionality for this; you might want to ask on the Mahout user mailing list.
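A rough sketch of that first step, assuming a directory of plain-text files (the paths are placeholders; check mahout seqdirectory --help for the exact flags in your Mahout version):

    # Convert a directory of plain-text documents (one file per document)
    # into Hadoop SequenceFile format.
    mahout seqdirectory -i /path/to/text-docs -o /path/to/seqfiles -c UTF-8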

Then, once your documents are in Hadoop SequenceFile format, you need to apply the seq2sparse command to turn them into vectors. The full list of command-line options is in chapter 8 of the book, but you can also run the command with no arguments and it will print a help prompt listing them. One option you will need is -a, the class name of the Lucene text analyzer you want to use; this is where stop-word removal, stemming, punctuation stripping, etc. happen. The default analyzer is org.apache.lucene.analysis.standard.StandardAnalyzer.
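For instance, a sketch of the vectorization step (again, paths are placeholders and the flags may differ between Mahout versions, so verify against the help output):

    # Build sparse TF-IDF vectors from the SequenceFiles.
    # -a picks the Lucene analyzer, -wt the weighting scheme,
    # -nv keeps document names attached to the vectors.
    mahout seq2sparse \
      -i /path/to/seqfiles \
      -o /path/to/vectors \
      -a org.apache.lucene.analysis.standard.StandardAnalyzer \
      -wt tfidf \
      -nv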

Then you represent your data as a matrix with the rowid command.
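Something along these lines (the tfidf-vectors subdirectory is what seq2sparse typically produces; verify the path on your setup):

    # Replace the vector keys with integer row ids, producing a matrix
    # plus a docIndex that maps row ids back to document names.
    mahout rowid -i /path/to/vectors/tfidf-vectors -o /path/to/matrix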

After that, you use the rowsimilarity command to get the most similar documents for each row.
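For example (SIMILARITY_COSINE is one of the built-in similarity measures; --numberOfColumns must match the vector dimensionality from the previous steps, so 50000 here is only a placeholder, and it's worth double-checking these flags against the command's help output):

    # Compute pairwise document similarities over the matrix rows,
    # keeping the 10 most similar documents per row.
    mahout rowsimilarity \
      -i /path/to/matrix/matrix \
      -o /path/to/similarities \
      --similarityClassname SIMILARITY_COSINE \
      --numberOfColumns 50000 \
      -m 10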

Hope this helps.

Chloechloette answered 3/4, 2013 at 9:42 Comment(3)
Thanks @Julian, I am definitely going to try it. I was wondering if we could combine the NLP power of dedicated libraries like OpenNLP/NLTK with the scalability of Mahout? For example, if in the future I need to analyze all the POS tags in sentences, I may want to use NLTK, but if the data set is large, Mahout is the obvious choice. Here is the confusion. – Tabbatha
I don't know about those libraries. But on a side note, if you need speed, you might want to look at something like minhashing. – Chloechloette
Thanks, I'm looking at it. One important thing that may solve the problem for various people: I have developed a clustering algorithm in NLTK, and I have retrieved various POS tags and other pretty stuff with it. Now comes the scalability. How can we achieve scalability with NLTK? Is it Mahout, or something else? That is the real question. – Tabbatha
