Memory error when performing sentiment analysis on large data

I am trying to perform sentiment analysis on a large dataset from a social network. The code works fine on small amounts of data.

Input files smaller than about 20 MB are processed without problems, but anything larger raises a MemoryError.

Environment: Windows 10, Anaconda 3.x with up-to-date packages.

Code:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# `path` is defined elsewhere in my script as the directory holding the CSV files

def captionsenti(F_name):
    print("reading from csv file")
    F1_name = "caption_senti.csv"
    df = pd.read_csv(path + F_name + ".csv")
    filename = path + F_name + "_" + F1_name
    df1 = df['tweetText']   # the tweet captions from the data5 file
    df1 = df1.fillna("h")   # fill NaN values with a placeholder
    df2 = pd.DataFrame()
    sid = SentimentIntensityAnalyzer()
    print("calculating sentiment")
    for sentence in df1:
        ss = sid.polarity_scores(sentence)  # calculate the sentiment scores
        df2 = df2.append(pd.DataFrame({'tweetText': sentence,
                                       'positive': ss['pos'],
                                       'negative': ss['neg'],
                                       'neutral': ss['neu'],
                                       'compound': ss['compound']}, index=[0]))

    df2 = df2.join(df.set_index('tweetText'), on='tweetText')  # join the two data frames
    df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
    df2 = df2.dropna(how='any')
    df2 = df2[['userID', 'tweetSource', 'tweetText', 'positive',
               'neutral', 'negative', 'compound', 'latitude', 'longitude']]
    print("Storing in csv file")
    df2.to_csv(filename, encoding='utf-8', header=True, index=True, chunksize=100)

What do I need to change to avoid the memory error? Thanks in advance for the help.

Secondary answered 29/9, 2017 at 14:2 Comment(0)

Some general tips that might help you:

1. Load only the columns that you need into memory

pd.read_csv provides a usecols parameter to specify which columns you want to read:

df = pd.read_csv(path+F_name+".csv", usecols=['col1', 'col2'])
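
Applied to the data in the question, that might look like the following (a sketch; the column names are taken from the question's final output, not verified against the actual file):

cols = ['userID', 'tweetSource', 'tweetText', 'latitude', 'longitude']
df = pd.read_csv(path + F_name + ".csv", usecols=cols)  # skip all other columns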

2. Delete unused variables

If you no longer need a variable, delete it with del variable_name, as sketched below.
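
A minimal sketch, assuming df is the frame from the question; gc.collect() is optional and only asks CPython to reclaim freed memory sooner:

import gc

df1 = df['tweetText'].fillna("h")  # intermediate Series from the question
# ... work with df1 ...
del df1        # drop the reference once you are done with it
gc.collect()   # optional: ask the garbage collector to reclaim memory now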

3. Use a memory profiler

Profile the memory usage with memory_profiler. Citing the example's memory log from the documentation, you get a line-by-line profile like the following:

Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4      5.97 MB    0.00 MB   def my_func():
     5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB -152.59 MB       del b
     8     13.61 MB    0.00 MB       return a
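
To get a profile like that for the function in the question, decorate it and run the script through the profiler. A sketch, assuming the script is saved as example.py; "data5" is a placeholder input name:

# example.py -- profile the question's function line by line
from memory_profiler import profile

@profile
def captionsenti(F_name):
    ...  # body exactly as in the question

if __name__ == "__main__":
    captionsenti("data5")  # placeholder input name

Run it with python -m memory_profiler example.py and look for the lines with the largest Increment.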
Acadian answered 29/9, 2017 at 14:7 Comment(4)
Great! Let me try how that works and I will update back. Thanks – Secondary
@SitzBlogz Actually yes, you are using it – Acadian
@SitzBlogz I will have a closer look; otherwise I will delete my answer – Acadian
@SitzBlogz I updated my answer to give some general tips. Try memory_profiler to see which step is taking the most memory – Acadian

You don't need anything extra; you need less. Why do you load all the tweets into memory at once? If you just deal with one tweet at a time, you can process terabytes of data with less memory than you'll find in a bottom-end smartphone.

import csv
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# input_filename / output_filename are placeholders for your CSV paths
reader = csv.DictReader(open(input_filename, newline=''))
fieldnames = ["userID", "tweetSource", "tweetText", "positive",
              "negative", "neutral", "compound", "latitude", "longitude"]
writer = csv.DictWriter(open(output_filename, "w", newline=''),
                        fieldnames=fieldnames, extrasaction='ignore')
writer.writeheader()

for row in reader:
    sentence = row["tweetText"]
    ss = sid.polarity_scores(sentence)  # score one tweet at a time
    row['positive'] = ss['pos']
    row['negative'] = ss['neg']
    row['neutral'] = ss['neu']
    row['compound'] = ss['compound']
    writer.writerow(row)

Or something like that. I didn't bother to close your filehandles, but you should; a minimal with-block variant follows below. There are all sorts of tweaks and adjustments you can make, but the point is: there's no reason to blow up your memory when you're analyzing one tweet at a time.
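
For example, the same loop with both files closed automatically (a sketch; input_filename and output_filename are the same placeholders as above):

# Same streaming loop, with the file handles closed automatically on exit
with open(input_filename, newline='') as fin, \
     open(output_filename, "w", newline='') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        ss = sid.polarity_scores(row["tweetText"])  # one tweet at a time
        row.update(positive=ss['pos'], negative=ss['neg'],
                   neutral=ss['neu'], compound=ss['compound'])
        writer.writerow(row)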

Banausic answered 29/9, 2017 at 17:41 Comment(6)
Please can you elaborate on the code? I think I am making some mistake in adapting it to my code – Secondary
Sorry, I can't and I won't. I don't even know the details of your code, or which part you are unsure about. I suggest you read the csv.DictWriter documentation (it can be hard to please sometimes), do your best to understand my code sample, then put together your best attempt and ask a new question. Feel free to comment here and ask me to look at it. Anyway, I guarantee you this approach will forever solve your memory footprint problem – Banausic
Actually, the code in the question is pretty much the entire block of code – Secondary
Then rewrite it as I recommend, and ask a question about what you can't figure out – Banausic
Or at least make your attempt available somewhere, and I'll take one shot at helping you fix it. If that's not enough, I'll tell you again to ask a new question – Banausic
Thank you so much. I did try to edit the code, but I got it messed up, hence asking for elaborated code help after so many hours – Secondary
