Is there a way to improve performance of nltk.sentiment.vader Sentiment analyser?

My text is derived from a social network, so you can imagine its nature; I think the text is about as clean and minimal as I can make it after performing the following sanitization:

  • no urls, no usernames
  • no punctuation, no accents
  • no numbers
  • no stopwords (I think vader does this anyway)
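
For reference, a rough sketch of that cleanup could look like the following (the helper name and regexes are illustrative, not my exact code; it assumes the NLTK stopword list has been downloaded with nltk.download('stopwords')):

    import re
    import unicodedata

    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words('english'))

    def sanitize(text):
        # drop urls and @usernames
        text = re.sub(r'(https?://\S+|www\.\S+|@\w+)', ' ', text)
        # drop accents, keep ascii only
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
        # drop punctuation and numbers
        text = re.sub(r'[^A-Za-z\s]', ' ', text)
        # drop stopwords (vader may handle these anyway)
        return ' '.join(w for w in text.lower().split() if w not in STOPWORDS)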

I think the run time is linear, and I don't intend to do any parallelization because of the amount of effort needed to change the available code. For example, for around 1000 texts ranging from ~50 KB to ~150 KB each, the running time is around 10 minutes on my machine.

Is there a better way of feeding the algorithm to speed up the cooking time? The code is as simple as SentimentIntensityAnalyzer is intended to be used; here is the main part:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import gc

sid = SentimentIntensityAnalyzer()

c.execute("select body, creation_date, group_id from posts "
          "where (substring(lower(body) from (%s))=(%s)) and language='en' "
          "order by creation_date desc", (s, s))
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()

textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
del dump_fetched
gc.collect()
texts = textsSql['body'].values
# here, some data manipulation: the sanitization steps listed above
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
Envelop answered 25/7, 2017 at 7:41 Comment(6)
This is the perfect scenario for async (if you are downloading data) or multiprocessing (if you want to post process downloaded data). Are you sure you don't want to go that way?Leander
If you mean asynchronizing the data retrieval from the database and the data processing, I don't think it will improve things a lot, because the select statement is very fast compared to the processing. On the other hand, I know this case is theoretically embarrassingly parallel, as the sentiment of one part of the data has no impact on another part, and is a subject for an async map or some other parallelization module, but in my case I don't really want to mess with that part. I only want to work on the data pre-processing; I mentioned 4 steps, can we imagine more?Envelop
What is gc.collect()? I guess you could get a small performance increase if you don't use DataFrames, which seem to be overkill here. However, you are not going to get much more performance without parallelizing your code or buying a more powerful CPU (and I mean in terms of the power of a core, not the number of cores).Bahena
Yes Adonis, what I was pointing at is not strictly programming efficiency; I'm more focused on the nature (possible tuning) of the input data for sentiment analysis in the case of social networks, in other words for voluminous data. I'm doing this for scientific studies, not in a production environment.Envelop
That's why I said: is there a better way of feeding the algorithm to speed up the cooking time? I hope my question is clearer now.Envelop
Just from looking at the repo readme, it seems as though one of the recent releases was performance focused. Is the NLTK version the most recent? Here's the GitHub: github.com/cjhutto/vaderSentiment Chromatolysis

1. You need not remove the stopwords; nltk+vader already does that.

2. You need not remove the punctuation either, as that affects vader's polarity calculations too, apart from the processing overhead. So, go ahead and keep the punctuation.

    >>> txt = "this is superb!"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
    >>> txt = "this is superb"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}

3. You should introduce sentence tokenization too, as it would improve the accuracy, and then calculate the average polarity for a paragraph based on its sentences. Example here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517
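
A rough sketch of that idea (the helper name is mine; it assumes the punkt tokenizer data is available via nltk.download('punkt')):

    from nltk import sent_tokenize

    def paragraph_polarity(sid, paragraph):
        # average compound score over the paragraph's sentences
        sentences = sent_tokenize(paragraph)
        if not sentences:
            return 0.0
        return sum(sid.polarity_scores(s)['compound'] for s in sentences) / len(sentences)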

4. The polarity calculations are completely independent of each other and can use a multiprocessing pool of a small size, say 10, to provide a good boost in speed.

polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
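
For instance, a rough multiprocessing version of the same comprehension could look like this (the pool size and chunksize are arbitrary choices; texts is the list of post bodies from the question):

    from multiprocessing import Pool

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sid = SentimentIntensityAnalyzer()

    def compound(text):
        # score a single text; module-level so it can be pickled by the pool
        return sid.polarity_scores(text)['compound']

    if __name__ == '__main__':
        with Pool(processes=10) as pool:
            polarity_ = pool.map(compound, texts, chunksize=50)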

Catoptrics answered 9/8, 2017 at 13:21 Comment(3)
Thumbs up for the first two points. For the third, are you sure Vader is not doing that already, since it seems very basic? For the fourth, I've done multiprocessing, memory-aware computation, caching and all that stuff.Envelop
@Abderrahimben No, it doesn't. Vader recommends sentence tokenization in its documentation too. The best thing would be to try out both approaches, i.e. paragraph sentiment vs average sentence sentiment, to see how it goes for your data.Catoptrics
@DhruvPathak, where can I see that vader removes stop words?Lathery
