How to apply pos_tag_sents() to pandas dataframe efficiently

Asked 16/1, 2017 at 10:46 Answered 7/2, 2017 at 4:29

Solved python python-3.x pandas nltk pos-tagger

In situations where you wish to POS tag a column of text stored in a pandas dataframe, with one sentence per row, the majority of implementations on SO use the apply method:

dfData['POSTags'] = dfData['SourceText'].apply(
                 lambda row: pos_tag(word_tokenize(row)))

The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.

Does that apply to this example, and if so, would the code be as simple as changing pos_tag to pos_tag_sents, or does NLTK mean text sources of paragraphs?

As mentioned in the comments, pos_tag_sents() aims to avoid reloading the perceptron model each time, but the issue is how to do this and still produce a column in a pandas dataframe?

Link to Sample Dataset 20kRows

Middlebrow answered 16/1, 2017 at 10:46 Comment(6)

how many rows do you have? – Tola 2/2, 2017 at 22:2

20,000 rows would be the number of rows – Middlebrow 2/2, 2017 at 22:12

That's not a problem. Just extract the column as a list of strings, process it and then add the column back to the dataframe. – Tola 2/2, 2017 at 22:14

Could you provide a coded example? – Middlebrow 2/2, 2017 at 22:20

Could you provide the data example? Just dump your dataframe.head() into a csv file ;P – Tola 2/2, 2017 at 22:53

Added a link to a sample csv file now – Middlebrow 6/2, 2017 at 18:52

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ...

In Long:

First, you can extract the Text column to a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function:

map(word_tokenize, texts)

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist()

Then you can use pos_tag_sents:

pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )

Then you add the column back to the DataFrame:

df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
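The steps above can be sketched as one small helper. This is only an illustration, not part of NLTK or pandas: the names tag_texts, toy_tokenize and toy_tag_sents are hypothetical, and the toy tokenizer and tagger are stand-ins so that the snippet runs without NLTK or its model data. The point it shows is that the batch tagger is called once and returns one tagged sentence per input row, in the same order, which is why the result can be assigned straight back as a column.

```python
from typing import Callable, Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def tag_texts(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    tag_sents: Callable[[List[List[str]]], List[TaggedSentence]],
) -> List[TaggedSentence]:
    """Tokenize every text, then tag all sentences in a single batch call."""
    tokenized = [tokenize(text) for text in texts]  # list of lists of strings
    tagged = list(tag_sents(tokenized))             # one call for all rows
    # One tagged sentence per input row, in the same order, so the result
    # lines up with the rows the texts came from.
    if len(tagged) != len(tokenized):
        raise ValueError("tagger returned a different number of sentences")
    return tagged


# Hypothetical stand-ins so the sketch runs without NLTK or its model data.
def toy_tokenize(text: str) -> List[str]:
    return text.split()


def toy_tag_sents(sentences: List[List[str]]) -> List[TaggedSentence]:
    # Tags capitalized tokens "NNP" and everything else "X"; not a real tagger.
    return [
        [(tok, "NNP" if tok[:1].isupper() else "X") for tok in sent]
        for sent in sentences
    ]


texts = [
    "cozily married practical athletics Mr. Brown flat",
    "active married expensive soccer Mr. Chang flat",
]
result = tag_texts(texts, toy_tokenize, toy_tag_sents)
print(len(result))    # 2
print(result[0][-2])  # ('Brown', 'NNP')
```

With NLTK installed, the equivalent of the line above would be df['POS'] = tag_texts(df['Text'].tolist(), word_tokenize, pos_tag_sents).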

Tola answered 7/2, 2017 at 4:29 Comment(2)

Your 'TL;DR' is longer than the 'In Long' version :) – Zorn 7/12, 2018 at 23:14

@Louis Yang -- Funny! And, yes, longer than the 'In Long' section. But I just stepped through it and it works exactly. – Circumfuse 6/11, 2019 at 5:43

By applying pos_tag on each row, the Perceptron model will be loaded each time (costly operation, as it reads a pickle from disk).

If you instead get all the rows and send them to pos_tag_sents (which takes list(list(str))), the model is loaded once and used for all.

See the source.
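To make the difference concrete, here is a toy sketch of the two calling patterns. It does not use NLTK: ToyTagger, tag_one and tag_many are hypothetical names, and the "model load" is simulated by a counter rather than by reading a real pickle. It only illustrates why tagging row by row pays the loading cost once per row, while a batch call pays it once.

```python
from typing import List, Tuple


class ToyTagger:
    """Hypothetical stand-in for a tagger whose model is costly to load."""

    loads = 0  # counts how many times the "model" has been loaded

    def __init__(self) -> None:
        ToyTagger.loads += 1  # pretend this reads a large pickle from disk

    def tag(self, tokens: List[str]) -> List[Tuple[str, str]]:
        return [(tok, "X") for tok in tokens]


def tag_one(tokens: List[str]) -> List[Tuple[str, str]]:
    # Per-sentence style: a fresh tagger (and model load) on every call.
    return ToyTagger().tag(tokens)


def tag_many(sentences: List[List[str]]) -> List[List[Tuple[str, str]]]:
    # Batch style: load once, reuse the same tagger for every sentence.
    tagger = ToyTagger()
    return [tagger.tag(tokens) for tokens in sentences]


sentences = [["a", "b"], ["c"], ["d", "e", "f"]]

ToyTagger.loads = 0
per_row = [tag_one(s) for s in sentences]
loads_per_row = ToyTagger.loads

ToyTagger.loads = 0
batched = tag_many(sentences)
loads_batched = ToyTagger.loads

print(loads_per_row, loads_batched)  # 3 1
print(per_row == batched)            # True
```

Both patterns produce the same tags; only the number of loads differs, and that difference grows with the number of rows.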

Mcbrayer answered 16/1, 2017 at 10:57 Comment(2)

Would you be able to provide an example to use pos_tag_sents() with a pandas dataframe column as the source and overall destination so that the sentence and tagged output are on the same row? – Middlebrow 16/1, 2017 at 11:12

I would be stabbing in the dark, as I'm not that familiar with Pandas. Maybe something like pos_tag_sents(map(word_tokenize, dfData['SourceText'].values)). – Mcbrayer 16/1, 2017 at 11:21

Assign this to your new column instead:

dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())

Washedout answered 3/2, 2017 at 21:56 Comment(0)
