NLTK available languages for stopwords
A

3

23

I'm wondering where I can find the full list of supported languages (and their keys) for the NLTK stopwords.

I found a list at https://pypi.org/project/stop-words/ but it does not contain the keys for each country, so it is not clear whether you can retrieve a list by simply calling stopwords.words("Bulgarian"). In fact, that throws an error.

I checked the NLTK site and there are 4 documents matching "stopwords", but none of them describes this. https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default

And nothing is said about it in their book: http://www.nltk.org/book/ch02.html#stopwords_index_term

So, do you know where I can find the list of keys?

Annecorinne answered 7/2, 2019 at 12:55 Comment(1)
Falsehoods programmers believe about languages: A "language" is somehow connected to a "country". Somehow the fact that languages spoken in the USA include English, Spanish, Navajo, Cherokee, etc doesn't register, let alone the fact that there are no languages named "Belgian" or "Belizese".Cumberland
M
9

First, check whether you have downloaded the NLTK data packages.
If not, you can download them with:

import nltk
nltk.download()

After this, you can find the stopword language files under the following path:

C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords

There are 21 languages supported (I installed NLTK a few days ago, so this number should be up to date). You can pass a file name as the parameter to

nltk.corpus.stopwords.words('language')
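If only the stopword lists are needed, the single `stopwords` package can be fetched instead of opening the interactive downloader. The following is a minimal sketch, assuming network access on the first run; `ensure_stopwords` and `main` are hypothetical helper names, not part of NLTK, and the loader and downloader are passed in as arguments purely for illustration.

```python
def ensure_stopwords(load, download):
    """Return load(); if the corpus is missing, download it once and retry.

    `load` and `download` are illustrative parameters (not NLTK names) so the
    retry logic can be exercised without NLTK being installed.
    """
    try:
        return load()
    except LookupError:  # NLTK raises LookupError for a missing resource
        download("stopwords")
        return load()


def main():
    # Imported lazily so that defining the helper does not require NLTK.
    import nltk
    from nltk.corpus import stopwords

    fileids = ensure_stopwords(stopwords.fileids, nltk.download)
    print(len(fileids), "languages available")
    print(stopwords.words(fileids[0])[:5])
```

Calling `main()` prints how many language files are installed and the first few stopwords of the first one; the count depends on the installed version of the corpus.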

Mella answered 7/2, 2019 at 13:14 Comment(1)
Great! Thanks, I didn't know about the location. I was able to use some languages but not others :)Annecorinne
R
18

When you import the stopwords using:

from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')  # 'english' is the fileid (language)

you are retrieving the stopwords based upon the fileid (language). In order to see all available stopword languages, you can retrieve the list of fileids using:

from nltk.corpus import stopwords
print(stopwords.fileids())

In the case of NLTK v3.4.5, this returns 23 languages:

['arabic', 
 'azerbaijani', 
 'danish', 
 'dutch', 
 'english', 
 'finnish', 
 'french', 
 'german', 
 'greek',
 'hungarian', 
 'indonesian', 
 'italian', 
 'kazakh', 
 'nepali', 
 'norwegian', 
 'portuguese', 
 'romanian', 
 'russian', 
 'slovene', 
 'spanish', 
 'swedish', 
 'tajik', 
 'turkish']
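Since the keys are lowercase fileids, a name such as "Bulgarian" from the question matches nothing in this list. Here is a small sketch of a lookup that normalizes the name and reports the valid keys when there is no match; `resolve_language` and `load_stopwords` are hypothetical helpers, not part of NLTK.

```python
def resolve_language(name, available):
    """Map a user-supplied language name to a stopwords fileid.

    Raises ValueError listing the valid keys if there is no match.
    """
    key = name.strip().lower()
    if key not in available:
        raise ValueError(
            f"No stopword list for {name!r}; available: {sorted(available)}"
        )
    return key


def load_stopwords(name):
    # Imported lazily so that resolve_language can be used without NLTK.
    from nltk.corpus import stopwords

    return stopwords.words(resolve_language(name, stopwords.fileids()))
```

With the list above, `load_stopwords("English")` would resolve to the `'english'` fileid, while `load_stopwords("Bulgarian")` would raise a `ValueError` naming the keys that are installed.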
Rambler answered 9/2, 2020 at 18:14 Comment(0)
C
7
import os
os.listdir('/root/nltk_data/corpora/stopwords/')

['hungarian',
 'swedish',
 'kazakh',
 'norwegian',
 'finnish',
 'arabic',
 'indonesian',
 'portuguese',
 'turkish',
 'azerbaijani',
 'slovene',
 'spanish',
 'danish',
 'nepali',
 'romanian',
 'greek',
 'dutch',
 'README',
 'tajik',
 'german',
 'english',
 'russian',
 'french',
 'italian']
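The hard-coded path above only works for one user on one machine. A more portable sketch searches the directories NLTK itself consults (`nltk.data.path`) and skips non-language files such as `README`. The names `find_stopword_dirs` and `language_files` are hypothetical, and the sketch assumes the corpus has been unzipped into `corpora/stopwords`.

```python
from pathlib import Path


def find_stopword_dirs(search_paths):
    """Return every existing corpora/stopwords directory under search_paths."""
    candidates = (Path(p) / "corpora" / "stopwords" for p in search_paths)
    return [d for d in candidates if d.is_dir()]


def language_files(directory):
    """List language names in a stopwords directory, ignoring README."""
    return sorted(
        f.name
        for f in Path(directory).iterdir()
        if f.is_file() and f.name != "README"
    )


def main():
    # Imported lazily so the helpers above work without NLTK installed.
    import nltk

    for directory in find_stopword_dirs(nltk.data.path):
        print(directory, language_files(directory))
```

Calling `main()` prints each stopwords directory found and the language files in it. For most purposes `stopwords.fileids()` remains the simpler route, since it does not depend on the on-disk layout.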
Cand answered 25/9, 2019 at 21:59 Comment(2)
downvoted: this approach is not cross-platform, nor environment compatible.Thrifty
This isn't a good approach. Also, README isn't a language.Zebapda
