NLTK-based text processing with pandas

The punctuation, number, and lowercase removal are not working while using nltk.

My code

import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample Input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected Output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung
Toxinantitoxin answered 1/1, 2018 at 10:59 Comment(5)
Welcome to Stackoverflow. Please take a look at stackoverflow.com/help/how-to-ask – Thermomotor
Also take a look at stackoverflow.com/questions/47769818/… – Thermomotor
@Thermomotor I'd be interested in comparing the performance of the stuff in your post vs mine :-) – Clamshell
I think yours is faster. Too lazy to benchmark things, and the necessary steps depend heavily on the task and how "noisy" the data source is. – Thermomotor
If you're interested in benchmarking more preprocessing scripts, I've another from kaggle.com/alvations/basic-nlp-with-nltk which is surely slower, but the main goal there is to explain the steps rather than to be fast =) – Thermomotor

Your function is slow and incomplete. First, the issues -

  1. You're not lowercasing your data.
  2. You're not getting rid of digits and punctuation properly.
  3. You're not returning a string (you should join the list using str.join and return it)
  4. Furthermore, a list comprehension stuffed with text processing is a prime way to introduce readability issues, not to mention possible redundancies (you may call a function multiple times, once for each if condition it appears in). A minimal corrected sketch follows this list.
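
For instance, a minimal patch of your original function that addresses points 1 and 3 (lowercase the text up front, return a joined string) could look like this. This is only an illustrative sketch, not the final fix - it keeps your original token filtering, so mixed tokens such as 23FLOOR still slip through, which is why the regex-based rewrite further down is preferable -

from nltk.tokenize import word_tokenize

def preprocess_fixed(text):                   # name is illustrative only
    text = text.lower()                       # 1. lowercase once, up front
    tokens = word_tokenize(text)
    kept = [w for w in tokens
            if w not in new_stop_words and not w.isdigit()]
    return ' '.join(kept)                     # 3. return a string, not a list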

Next, there are a few glaring inefficiencies in your function, especially in the stopword removal code.

  1. Your stopwords structure is a list, and in checks on lists are slow (linear time). The first thing to do is convert it to a set, making the not in check constant time (see the quick check after this list).

  2. You're using nltk.word_tokenize, which is unnecessarily slow here; a plain str.split does the job for this kind of data.

  3. Lastly, you shouldn't always rely on apply, even if you are working with NLTK where there's rarely any vectorised solution available. There are almost always other ways to do the exact same thing. Oftentimes, even a python loop is faster. But this isn't set in stone.
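
As a quick, illustrative way to verify points 1 and 2 yourself (this is not part of the original answer, and it assumes the NLTK stopword and punkt data are downloaded), you can time the membership test and the tokenizer directly -

import timeit
import nltk
from nltk import word_tokenize

stop_list = nltk.corpus.stopwords.words('english')   # list: linear-time membership
stop_set = set(stop_list)                             # set: constant-time membership

print(timeit.timeit("'kong' in stop_list", globals=globals(), number=100000))
print(timeit.timeit("'kong' in stop_set", globals=globals(), number=100000))

sent = "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL"
print(timeit.timeit("word_tokenize(sent)", globals=globals(), number=10000))
print(timeit.timeit("sent.split()", globals=globals(), number=10000))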

First, create your enhanced stopwords as a set -

user_defined_stop_words = ['st','rd','hong','kong'] 

i = nltk.corpus.stopwords.words('english')
j = list(string.punctuation) + user_defined_stop_words

stopwords = set(i).union(j)

The next fix is to get rid of the list comprehension and convert this into a multi-line function. This makes things much easier to work with. Each line of your function should be dedicated to solving a particular task (for example, getting rid of digits/punctuation, getting rid of stopwords, or lowercasing) -

import re

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())            # get rid of noise (digits/punctuation)
    x = [w for w in x.split() if w not in stopwords]  # remove stopwords (set lookup)
    return ' '.join(x)                                # join the list back into a string

This would then be applied to your column -

df['Clean_addr'] = df['Adj_Addr'].apply(preprocess)

As an alternative, here's an approach that doesn't rely on apply. This should work well for short sentences.

Load your data into a series -

v = miss_data['Adj_Addr']
v

0            23FLOOR 9 DES VOEUX RD WEST     HONG KONG
1    PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT...
2    C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIE...
Name: Adj_Addr, dtype: object

Now comes the heavy lifting.

  1. Lowercase with str.lower
  2. Remove noise using str.replace
  3. Split words into separate cells using str.split
  4. Apply stopword removal using pd.DataFrame.isin + pd.DataFrame.where
  5. Finally, join the words in each row back into a string using agg.

v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

v.where(~v.isin(stopwords) & v.notnull(), '')\
 .agg(' '.join, axis=1)\
 .str.replace(r'\s+', ' ', regex=True)\
 .str.strip()

0                                 floor des voeux west
1    pag consulting flat aia central connaught central
2           co city lost studios flat f hillier sheung
dtype: object

To use this on multiple columns, place this code in a function preprocess2 and call apply -

def preprocess2(v):
    v = v.str.lower().str.replace(r'[^a-z\s]', '', regex=True).str.split(expand=True)

    return v.where(~v.isin(stopwords) & v.notnull(), '')\
            .agg(' '.join, axis=1)\
            .str.replace(r'\s+', ' ', regex=True)\
            .str.strip()

c = ['Col1', 'Col2', ...]  # columns to operate on
df[c] = df[c].apply(preprocess2, axis=0)

You'll still need an apply call, but with a small number of columns, it shouldn't scale too badly. If you dislike apply, then here's a loopy variant for you -

for _c in c:
    df[_c] = preprocess2(df[_c])

Let's see the difference between our non-loopy version and the original -

s = miss_data['Adj_Addr']                       # the 3-row sample from the question
s = pd.concat([s] * 100000, ignore_index=True)

s.size
300000

First, a sanity check -

preprocess2(s).eq(s.apply(preprocess)).all()
True

Now come the timings.

%timeit preprocess2(s)   
1 loop, best of 3: 13.8 s per loop

%timeit s.apply(preprocess)
1 loop, best of 3: 9.72 s per loop

This is surprising, because apply is seldom faster than a non-loopy solution. But it makes sense here: we've optimised preprocess quite a bit, and pandas string operations are not truly vectorised (under the hood they loop over Python objects, so the gain over a plain Python loop isn't as large as you'd expect).
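
You can check that last claim for yourself (illustrative only, not part of the original benchmark) by timing a single pandas string method against the equivalent list comprehension on the same enlarged Series -

%timeit s.str.lower()
%timeit pd.Series([x.lower() for x in s], index=s.index)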

Let's see if we can do better by bypassing apply and using np.vectorize -

import numpy as np

preprocess3 = np.vectorize(preprocess)

%timeit preprocess3(s)
1 loop, best of 3: 9.65 s per loop

This is essentially identical to apply, but happens to be a bit faster because of the reduced overhead around the "hidden" loop.

Clamshell answered 1/1, 2018 at 11:52 Comment(3)
Just to clarify - converting the list comprehension to a multi-line function has nothing to do with speed, and is more for code readability, right? Because from what I understand, list comprehensions - put very simplistically - are an abstraction of for loops in C, which are much faster than for loops in Python. – Stribling
@Stribling Yes, mainly readability in this case. One slight correction: Python for loops and the list comprehension syntax are directly implemented in C code (at least for the CPython implementation) and so are quite fast. – Clamshell
hi @cs95, I read your answer, which is quite accurate, but I have another question in the same scenario: what if I have more than one text column in the dataframe, how can we tokenize and pad_sequences those columns and reassign them to the dataframe again? Here is my question posted: stackoverflow.com/questions/67769093/… – Fewell
