My text is derived from a social network, so you can imagine its nature. I think the text is as clean and minimal as I could make it after performing the following sanitization (a rough sketch of these steps follows the list):
- no urls, no usernames
- no punctuation, no accents
- no numbers
- no stopwords (I think VADER does this anyway)
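To make the steps concrete, here is a minimal sketch of that sanitization, assuming plain regular expressions and a placeholder stopword set (the real list would come from whatever stopword source is in use):

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # placeholder set

def sanitize(text):
    # strip urls and @usernames
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    # drop accents by decomposing and discarding non-ASCII marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # drop punctuation and numbers
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # drop stopwords
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
```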
I think the run time is linear, and I don't intend to do any parallelization because of the amount of effort needed to change the available code. As an example, for around 1000 texts ranging from ~50 kB to ~150 kB each, the running time is around 10 minutes on my machine.
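To check that the scoring loop itself, rather than the query or the DataFrame step, accounts for most of those 10 minutes, a quick timing check could be wrapped around the polarity loop in the code below (this is only a sketch; `sid` and `texts` are the same names used there):

```python
import time

start = time.perf_counter()
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
elapsed = time.perf_counter() - start
# ~1000 texts in ~600 s would be roughly 0.6 s per text
print(f"scored {len(texts)} texts in {elapsed:.1f} s ({elapsed / len(texts):.3f} s each)")
```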
Is there a better way of feeding the algorithm to speed up the processing time? The code is as simple as SentimentIntensityAnalyzer is intended to be used; here is the main part:
```python
import gc
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # or vaderSentiment.vaderSentiment

sid = SentimentIntensityAnalyzer()
# s is the search term bound to both %s placeholders (defined elsewhere)
c.execute("select body, creation_date, group_id from posts "
          "where substring(lower(body) from %s) = %s and language = 'en' "
          "order by creation_date desc", (s, s))
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()
    textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
    del dump_fetched
    gc.collect()
    texts = textsSql['body'].values
    # here, some data manipulation: the sanitization steps listed above
    polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
    gc.collect()
```
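One cheap change is to skip the DataFrame entirely and feed the fetched rows straight to the analyzer. This is only a sketch, assuming the DataFrame does nothing but hold the query result; `sanitize()` stands in for the cleaning steps listed above, and `records` is a hypothetical name for keeping the metadata alongside the scores:

```python
# score rows straight from the cursor result, no intermediate DataFrame
polarity_ = []
records = []  # (created_at, group_id, compound), only if the metadata is still needed
for body, created_at, group_id in c.fetchall():
    compound = sid.polarity_scores(sanitize(body))['compound']
    polarity_.append(compound)
    records.append((created_at, group_id, compound))
```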
I guess you could get a small performance increase if you don't use DataFrames, which seem to be overkill here. However, you are not going to get much more performance without parallelizing your code or buying a more powerful CPU (and I mean in terms of the power of a single core, not the number of cores). – Bahena
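If parallelization ever does become worth the effort, a minimal sketch along the lines of that comment might look like the following; it assumes NLTK's VADER (the vaderSentiment package exposes the same class) and that `texts` already holds the cleaned strings from the query:

```python
from multiprocessing import Pool
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # or vaderSentiment.vaderSentiment

_sid = None

def _init_worker():
    # each worker process builds its own analyzer once
    global _sid
    _sid = SentimentIntensityAnalyzer()

def _score(text):
    return _sid.polarity_scores(text)['compound']

if __name__ == '__main__':
    texts = ["great service", "awful experience"]  # placeholder; use the cleaned texts from the query
    with Pool(initializer=_init_worker) as pool:
        polarity_ = pool.map(_score, texts, chunksize=64)
    print(polarity_)
```

Whether this pays off depends on how much of the 10 minutes is spent in `polarity_scores` versus the database query, which the timing check above should show.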