How to apply the polyglot Detector function to a dataframe

Assuming I have a column called df.Text which contains text (more than one sentence), and I want to use polyglot's Detector to detect the language and store the value in a new column df['Text-Lang'], how do I ensure I also capture the other details, like code and confidence?

testEng ="This is English"
lang = Detector(testEng)
print(lang.language)

returns

name: English code: en confidence: 94.0 read bytes: 1920

but

df['Text-Lang','Text-LangConfidence']= df.Text.apply(Detector)

ends with

AttributeError: 'float' object has no attribute 'encode'

and Detector is not able to detect the language reliably.

Am I applying the Detector function incorrectly or storing the output incorrectly or something else?

Ludly answered 24/7, 2018 at 16:8 Comment(5)
Change the datatype of your column to str and try again. – Morphinism
That didn't resolve it; my datatypes are object already. This code: from polyglot.detect import Detector; testEng = "This is English"; lang = Detector(testEng); print(lang) will produce this output: Prediction is reliable: True Language 1: name: English code: en confidence: 94.0 read bytes: 1920 Language 2: name: un code: un confidence: 0.0 read bytes: 0 Language 3: name: un code: un confidence: 0.0 read bytes: 0 – Ludly
Is there a way to incorporate that type of output into my dataframe? – Ludly
It looks like a dictionary, so it can work; please place it in the edited question for a proper look, and also clarify what exactly you are trying to do. – Morphinism
Can you include a sample of the contents of df.Text? – Pippo

First, if you only need polyglot for language detection, you'd better use pycld2 directly; that is what is used behind the scenes, and it has a much cleaner API.
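
For example, a minimal sketch of calling pycld2 directly (the exact numbers in the comments are illustrative, not guaranteed):

import pycld2 as cld2

is_reliable, bytes_found, details = cld2.detect("This is English")
print(is_reliable)  # True when the detection is trusted
print(details[0])   # e.g. ('ENGLISH', 'en', 95, ...) - language name, code, percent, score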

That said, the error you state comes from one of the values in your Text column, which is a real number (a float). So you will have to convert values like that into strings.

The next problem you will stumble upon is minimal text length: polyglot will throw an exception if the text is too short. You have to silence the exception by passing quiet=True.
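
As a small illustration (assuming Detector is imported as above; the exception is caught generically here because the exact class is polyglot's own):

try:
    Detector("Hi")  # very short text: polyglot typically raises because detection is unreliable
except Exception as exc:
    print(exc)

lang = Detector("Hi", quiet=True)  # same call with quiet=True: no exception, language may come back as 'un'
print(lang.language)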

Now, applying Detector will return an object, so you will have to parse it to extract the information you want. To extract language names, you will have to import the icu module (it is a dependency of polyglot, so you already have it installed):

import icu
from polyglot.detect import Detector

df.Text = df.Text.astype(str)  # floats/NaN would otherwise break Detector
df['poly_obj'] = df.Text.apply(lambda x: Detector(x, quiet=True))  # quiet=True silences the short-text exception
df['Text-Lang'] = df['poly_obj'].apply(lambda x: icu.Locale.getDisplayName(x.language.locale))
df['Text-LangConfidence'] = df['poly_obj'].apply(lambda x: x.language.confidence)

After that you can drop the poly_obj column.
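
For instance, using the standard pandas drop call:

df = df.drop(columns=['poly_obj'])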

Thirlage answered 5/8, 2018 at 15:28 Comment(3)
How do I deal with this error: input contains invalid UTF-8 around byte 191 (of 248)? I presume this is because I have Chinese/Japanese characters? I thought polyglot would need those to determine the language. – Cutlor
@Cutlor: Did you ever get an answer to that question? I have the same issue and can't get around it. – Singly
The problem is that some of the characters are not valid symbols (probably control characters). You need to sanitize them. Replace the df.Text = df.Text.astype(str) line with df.Text = df.Text.astype(str).apply(lambda x: ''.join([ch for ch in x if ch.isprintable()])) – Thirlage

You can try this:

testEng ="This is English"
lang = Detector(testEng)
df['Text-Lang']=lang.language.code
df['Text-LangConfidence']=leng.language.confidence
Inhibition answered 31/10, 2019 at 9:30 Comment(0)
