Why is my NLTK function slow when processing the DataFrame?

I am trying to run a dataset with a few million rows through a function.

  1. I read the data from a CSV into a DataFrame.
  2. I use a drop list to drop columns I don't need.
  3. I pass it through an NLTK function in a for loop.

Code:

from nltk.corpus import stopwords
import string

def nlkt(val):
    val=repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string

Now I am calling the above function in a for loop to run through a few million records. Even though I am on a heavyweight server with a 24-core CPU and 88 GB of RAM, I see that the loop is taking too much time and not using the computational power that is available.

I am calling the above function like this:

import pandas as pd

data = pd.read_excel(scrPath + "UserData_Full.xlsx", encoding='utf-8')
droplist = ['Submitter', 'Environment']
data.drop(droplist, axis=1, inplace=True)

# Merging the Company and Detailed_Description columns

data['Anylize_Text'] = data['Company'].astype(str) + ' ' + data['Detailed_Description'].astype(str)

finallist = []

for eachlist in data['Anylize_Text']:
    z = nlkt(eachlist)
    finallist.append(z)

The above code works perfectly OK, it is just too slow when we have a few million records. This is just sample data in Excel; the actual data will be in a DB and will run to a few hundred million rows. Is there any way I can speed up the operation and pass the data through the function faster, using more of the computational power?

Horripilate answered 12/12, 2017 at 10:1 Comment(12)
"Lac" is a regionalism and is not generally understood outside of the Indian cultural sphere.Balas
It's not clear what you are doing with finallist. Does it really need to contain all the sentences, or could you process one at a time? The nlkt function contains a large number of temporary variables so it will consume like 10x the memory while it's processing one call, though it will be freed when it's done.Balas
finallist is used for tokenizing and then passing through neural network model..... removing lacs words and editing accordinglyHorripilate
# prepare tokenizer t = Tokenizer() t.fit_on_texts(finallist) vocab_size = len(t.word_index) + 1 # integer encode the documents encoded_docs = t.texts_to_sequences(finallist) padded_docs = pad_sequences(encoded_docs, maxlen=max_Len, padding='post')Horripilate
I am not worried about memory right now. My main concern is how do i speed up the process of passing the lines of the nlkt functions and speed up the loop. My server is hardly using 1 % of CPU or Memory currently while passing through the for loopHorripilate
Not consuming memory needlessly will help make this faster. Without benchmarks it's hard to tell whether you are exhausting memory (swapping would really kill throughput) but it's an easy thing to simplify.Balas
OK sure anything to speed up...please help me with your suggested change in function...In the memory used section - i have 34GB of memory free though - but please suggest any change. free -m total used free shared buffers cached Mem: 85632 51278 34354Horripilate
Take a look at kaggle.com/alvations/basic-nlp-with-nltk. You should use set() in cases where word order is not important.Announcer
Thank You - great stuffs. A little more explanation will help please - sorry did not understand where do i use set(). Well the idea is we are trying to predict an assigned group from the text of the ticket. Now not sure if sequence is important. Also what I thought was if we can do the functions of nltk within the dataframe itself (data) and not go line by line in for loop - will speed up the process... any direction will be helpfulHorripilate
i have changed to set(stopwords.words('english')) now. Thanks for the suggestion - no mark difference though yetHorripilate
See stackoverflow.com/questions/41674573/…Announcer
It's slow because you're looping through each row 3 times!!Announcer

Your original nlkt() loops through each row 3 times.

def nlkt(val):
    val=repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string

Also, each time you're calling nlkt(), you're re-initializing these again and again.

  • stopwords.words('english')
  • string.punctuation

These should be global.

stoplist = stopwords.words('english') + list(string.punctuation)
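
To get a feel for the cost, here is a rough sketch (my addition, not part of the original answer; absolute timings will vary by machine) comparing a membership test that rebuilds the stopword list on every call against one that reuses a prebuilt set:

import timeit

# rebuild the stopword list on every membership test, which is
# effectively what nlkt() does once per word
rebuilt = timeit.timeit(
    "'the' in stopwords.words('english')",
    setup="from nltk.corpus import stopwords",
    number=1000)

# build the set once, then do O(1) membership tests against it
reused = timeit.timeit(
    "'the' in stoplist",
    setup="from nltk.corpus import stopwords; "
          "stoplist = set(stopwords.words('english'))",
    number=1000)

print(rebuilt, reused)  # expect the prebuilt set to be dramatically faster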

Going through things line by line:

val=repr(val)

I'm not sure why you need to do this. But you could easily cast a column to the str type. This should be done outside of your preprocessing function.

Hopefully this is self-explanatory:

>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, 2], [2, 'xyz', 4], [5, 'abc', 'def']])
>>> df
   0    1    2
0  0    1    2
1  2  xyz    4
2  5  abc  def
>>> df[1]
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> df[1].astype(str)
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> list(df[1])
[1, 'xyz', 'abc']
>>> list(df[1].astype(str))
['1', 'xyz', 'abc']

Now going to the next line:

clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]

Using str.split() is awkward; you should use a proper tokenizer. Otherwise, your punctuation might be stuck to the preceding word, e.g.

>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize
>>> import string
>>> stoplist = stopwords.words('english') + list(string.punctuation)
>>> stoplist = set(stoplist)

>>> text = 'This is foo, bar and doh.'

>>> [word for word in text.split() if word.lower() not in stoplist]
['foo,', 'bar', 'doh.']

>>> [word for word in word_tokenize(text) if word.lower() not in stoplist]
['foo', 'bar', 'doh']

Also, the .isdigit() check can be done in the same comprehension:

>>> text = 'This is foo, bar, 234, 567 and doh.'
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]
['foo', 'bar', 'doh']

Putting it all together, your nlkt() should look like this:

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]

And you can use .apply on the column (a Series), assigning the result back if you want to keep it:

data['Anylize_Text'] = data['Anylize_Text'].apply(preprocess)
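
The question also asks about using more of the machine's cores. As a minimal sketch on top of the answer above (my addition, not part of the original answer; it assumes preprocess and stoplist are defined at module level so that worker processes can import them), you could split the column across a process pool:

from multiprocessing import Pool

if __name__ == '__main__':
    # one worker per CPU core by default; each row is pickled to a
    # worker, tokenized by preprocess, and the tokens sent back
    with Pool() as pool:
        finallist = pool.map(preprocess, data['Anylize_Text'])

Whether this pays off depends on how much time is spent tokenizing versus shuttling rows between processes; fixing the per-call stopword rebuild above is the bigger win.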
Announcer answered 13/12, 2017 at 8:41 Comment(2)
See also stackoverflow.com/questions/41674573/… – Announcer
Very useful and nicely explained answer. Thank you! – Goodly
