How do I detect what language a text is written in using NLTK?
The examples I've seen use nltk.detect, but now that I've installed NLTK on my Mac, I cannot find this package.
Have you come across the following code snippet?
import nltk  # assumes nltk.download('words') has been run

# text is assumed to be a list of tokens, e.g. from nltk.word_tokenize()
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
Or the following demo file?
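As a rough sketch of the word-list idea above (my own extension, not from the thread: looks_english and the 0.5 threshold are hypothetical choices), you could flag a text as English when most of its alphabetic tokens appear in NLTK's English word list:
import nltk

# One-time setup: nltk.download('words') and nltk.download('punkt')

def looks_english(text, threshold=0.5):
    # Treat the text as English if the fraction of alphabetic tokens
    # missing from NLTK's English word list stays below the threshold.
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    tokens = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    if not tokens:
        return False
    unknown = [w for w in tokens if w not in english_vocab]
    return len(unknown) / len(tokens) < threshold

print(looks_english("This is an ordinary English sentence."))       # expected: True
print(looks_english("Dies ist ein ganz normaler deutscher Satz."))  # expected: False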
This library is not from NLTK either, but it certainly helps.
$ sudo pip install langdetect
It supports Python versions 2.6, 2.7, and 3.x.
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
https://pypi.python.org/pypi/langdetect?
P.S.: Don't expect this to always work correctly:
>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
detect("You made it home!")
is giving me "fr". I'm wondering if there is anything better. –
Chibouk >>> detect_langs("Hello, I'm christiane amanpour.") [it:0.8571401485770536, en:0.14285811674731527] >>> detect_langs("Hello, I'm christiane amanpour.") [it:0.8571403121803622, fr:0.14285888197332486] >>> detect_langs("Hello, I'm christiane amanpour.") [it:0.999995562246093]
–
Chibouk import DetectorFactory DetectorFactory.seed = 0
–
Coelom Although this is not in the NLTK, I have had great results with another Python-based library :
https://github.com/saffsd/langid.py
This is very simple to import and includes a large number of languages in its model.
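For illustration, a minimal sketch of how langid.py is typically used (assuming pip install langid; the test phrase is my own example):
import langid

# classify() returns a (language code, score) tuple;
# the score is an unnormalized model score, not a probability
lang, score = langid.classify("Questa e una frase di prova in italiano.")
print(lang)  # expected: 'it'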
Super late, but you could use the textcat classifier in nltk. This paper discusses the algorithm.
It returns a language code in ISO 639-3, so I would use pycountry to get the full name.
For example, load the libraries:
import nltk
import pycountry
from nltk.stem import SnowballStemmer
Now let's look at two phrases and guess their language:
phrase_one = "good morning"
phrase_two = "goeie more"
# Note: TextCat relies on the Crubadan corpus; run nltk.download('crubadan') once if needed
tc = nltk.classify.textcat.TextCat()
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)
guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)
English
Afrikaans
You could then pass them into other nltk functions, for example:
stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
walk
Disclaimer: obviously this will not always work, especially for sparse data. An extreme example:
guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
Konkani (individual language)
polyglot.detect can detect the language:
from polyglot.detect import Detector
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
print(Detector(foreign).language)
name: Spanish code: es confidence: 98.0 read bytes: 865
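A sketch of a related polyglot feature (assuming polyglot and its pycld2 dependency are installed; the mixed sentence is my own example): Detector also exposes the top candidate languages, which is handy for mixed-language text:
from polyglot.detect import Detector

mixed = "China (simplified Chinese: 中国) is a country in East Asia."
for language in Detector(mixed).languages:
    print(language)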
The langid and langdetect libraries do the trick and are super easy to use: github.com/hb20007/hands-on-nltk-tutorial/blob/master/… – Miguelmiguela
langdetect is not very reliable (e.g. check github.com/Mimino666/langdetect/issues/51) and langid choked on a test Japanese string when I tested it. YMMV. In 2019, if you are not tied to NLTK, I'd recommend you take a look at cld2, cld3, or fastText instead. – Discordant
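For reference, a minimal sketch of the fastText route mentioned above, assuming you've installed the fasttext package and downloaded the pretrained lid.176.ftz model from fasttext.cc/docs/en/language-identification.html:
import fasttext

# Load the pretrained language-identification model (path is an assumption)
model = fasttext.load_model("lid.176.ftz")

# predict() returns a tuple of labels and a parallel array of probabilities
labels, probs = model.predict("Este libro ha sido uno de los mejores libros que he leido.")
print(labels[0], probs[0])  # expected: '__label__es' with a high probability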