Is there a way to improve performance of nltk.sentiment.vader Sentiment analyser?

My text is derived from a social network, so you can imagine its nature; I think the text is about as clean and minimal as I can make it after performing the following sanitization:

  • no urls, no usernames
  • no punctuation, no accents
  • no numbers
  • no stopwords (I think vader does this anyway)
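
For reference, a rough sketch of that cleanup could look like the following (the helper name and regexes are illustrative, not my exact code; it assumes the NLTK stopword list has been downloaded with nltk.download('stopwords')):

    import re
    import unicodedata

    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words('english'))

    def sanitize(text):
        # drop urls and @usernames
        text = re.sub(r'(https?://\S+|www\.\S+|@\w+)', ' ', text)
        # drop accents, keep ascii only
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
        # drop punctuation and numbers
        text = re.sub(r'[^A-Za-z\s]', ' ', text)
        # drop stopwords (vader may handle these anyway)
        return ' '.join(w for w in text.lower().split() if w not in STOPWORDS)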

I think the run time is linear, and I don't intend to do any parallelization because of the amount of effort needed to change the available code. For example, for around 1000 texts ranging from ~50 KB to ~150 KB each, the running time is around 10 minutes on my machine.

Is there a better way of feeding the algorithm to speed up the cooking time? The code is as simple as SentimentIntensityAnalyzer is intended to be used; here is the main part:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import pandas as pd
import gc

sid = SentimentIntensityAnalyzer()

c.execute("select body, creation_date, group_id from posts "
          "where (substring(lower(body) from (%s))=(%s)) and language='en' "
          "order by creation_date desc", (s, s))
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()

textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
del dump_fetched
gc.collect()
texts = textsSql['body'].values
# here, some data manipulation: the sanitization steps listed above
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
Envelop answered 25/7, 2017 at 7:41 Comment(6)
This is the perfect scenario for async (if you are downloading data) or multiprocessing (if you want to post process downloaded data). Are you sure you don't want to go that way?Leander
If you mean asynchronizing the data retrieval from the database and the data processing, I don't think it will improve things a lot, because the select statement is very fast compared to the processing. On the other hand, I know this case is theoretically embarrassingly parallel, as the sentiment of one part of the data has no impact on another part, and is a subject for an async map or some other parallelization module, but in my case I don't really want to mess with that part. I only want to work on the data pre-processing; I mentioned 4 steps, can we imagine more?Envelop
What is gc.collect()? I guess you could get a small performance increase if you don't use DataFrames, which seem to be overkill here. However, you are not going to get much more performance without parallelizing your code or buying a more powerful CPU (and I mean in terms of the power of a core, not the number of cores).Bahena
Yes Adonis, what I was pointing at is not strictly programming efficiency; I'm more focused on the nature (possible tuning) of the input data for sentiment analysis in the case of social networks, in other words for voluminous data. I'm doing this for scientific studies, not in a production environment.Envelop
That's why I said: is there a better way of feeding the algorithm to speed up the cooking time? I hope my question is clearer now.Envelop
Just from looking at the repo readme, it seems as though one of the recent releases was performance focused. Is the NLTK version the most recent? Here's the GitHub: github.com/cjhutto/vaderSentiment Chromatolysis

1. You need not remove the stopwords; nltk+vader already does that.

2. You need not remove the punctuation either, as that affects vader's polarity calculations too, apart from the processing overhead. So, go ahead and keep the punctuation.

    >>> txt = "this is superb!"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
    >>> txt = "this is superb"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}

3. You should introduce sentence tokenization too, as it would improve the accuracy, and then calculate the average polarity for a paragraph based on its sentences. Example here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517
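
A rough sketch of that idea (the helper name is mine; it assumes the punkt tokenizer data is available via nltk.download('punkt')):

    from nltk import sent_tokenize

    def paragraph_polarity(sid, paragraph):
        # average compound score over the paragraph's sentences
        sentences = sent_tokenize(paragraph)
        if not sentences:
            return 0.0
        return sum(sid.polarity_scores(s)['compound'] for s in sentences) / len(sentences)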

4. The polarity calculations are completely independent of each other and can use a multiprocessing pool of a small size, say 10, to provide a good boost in speed.

polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
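
For instance, a rough multiprocessing version of the same comprehension could look like this (the pool size and chunksize are arbitrary choices; texts is the list of post bodies from the question):

    from multiprocessing import Pool

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sid = SentimentIntensityAnalyzer()

    def compound(text):
        # score a single text; module-level so it can be pickled by the pool
        return sid.polarity_scores(text)['compound']

    if __name__ == '__main__':
        with Pool(processes=10) as pool:
            polarity_ = pool.map(compound, texts, chunksize=50)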

Catoptrics answered 9/8, 2017 at 13:21 Comment(3)
Thumbs up for the first two points. For the third, are you sure Vader is not doing that already, since it seems very basic? For the fourth, I've done multiprocessing, memory-aware computation, caching and all that stuff.Envelop
@Abderrahimben No, it doesn't. Vader recommends sentence tokenization in its documentation too. The best thing would be to try out both approaches, i.e. paragraph sentiment vs average sentence sentiment, to see how it goes for your data.Catoptrics
@DhruvPathak, where can I see that vader removes stop words?Lathery
