Does nltk contain Arabic stop words? If not, how can I add them?

Asked 6/3, 2017 at 11:58 Answered 26/7, 2022 at 18:1

I tried this, but it doesn't work:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
print(stopwords_list)

Update [January 2018]: The nltk data repository has included Arabic stopwords since October, 2017, so this issue no longer arises. The above code will work as expected.

Accumbent answered 6/3, 2017 at 11:58 Comment(2)

The declaration of the source code encoding has nothing to do with the data you use (load/import), it is completely unrelated to your problem. – Falcongentle 6/3, 2017 at 12:55

Yes, I know, but I need this for something else. – Accumbent 6/3, 2017 at 13:40

As of October 2017, nltk includes a collection of Arabic stopwords. If you ran nltk.download() after that date, this issue will not arise. If you have used nltk for some time and lack the Arabic stopwords, run nltk.download() to update your stopwords corpus.

If you call nltk.download() without arguments, you'll find that the stopwords corpus is shown as "out of date" (in red). Download the current version that includes Arabic.
Alternatively, you can update just the stopwords corpus by running the following code once from the interactive prompt:
```
>>> import nltk
>>> nltk.download("stopwords")
```

Note:

Looking words up in a list is really slow. Use a set, not a list. E.g.,

arb_stopwords = set(nltk.corpus.stopwords.words("arabic"))
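To see the difference for yourself, here is a rough timing sketch. It uses a made-up word list (`words`) as a stand-in for the nltk list, so nothing needs to be downloaded; the exact numbers will vary by machine.

```python
import timeit

# Hypothetical stand-in for stopwords.words("arabic"): 1,000 dummy words.
words = ["word%d" % i for i in range(1000)]
words_set = set(words)

# Worst case for the list: the item being looked up is at the very end.
probe = "word999"

list_time = timeit.timeit(lambda: probe in words, number=2000)
set_time = timeit.timeit(lambda: probe in words_set, number=2000)

print("list lookup: %.4f s" % list_time)
print("set lookup:  %.4f s" % set_time)
```

The list has to be scanned item by item, while the set uses a hash lookup, so the gap grows with the size of the list.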

Original answer (still applicable to languages that are not included)

Why don't you just check what the stopwords collection contains:

>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish',
 'turkish']

So no, there's no list for Arabic. I'm not sure what you mean by "add it", but the stopwords lists are just lists of words. They don't even do morphological analysis, or other things you might want in an inflecting language. So if you have (or can put together) a list of Arabic stopwords, just put them in a set() and you're one step ahead of where you'd be if your code worked.

Void answered 6/3, 2017 at 13:45 Comment(3)

Sorry what you mean "put them in set()"..? – Accumbent 6/3, 2017 at 14:14

Make a list my_stopwords_list, then write stopwords = set(my_stopwords_list). And look up set() in the Python docs. – Void 6/3, 2017 at 22:55
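Spelled out, that suggestion might look like the sketch below. The three words in `my_stopwords_list` and the sample sentence are only placeholders; substitute your own list.

```python
# Placeholder list; in practice, load your full Arabic stopword list here.
my_stopwords_list = ["في", "من", "هذا"]
stopwords = set(my_stopwords_list)

# Drop every token that appears in the stopword set.
tokens = "هذا كتاب من المكتبة".split()
content_words = [t for t in tokens if t not in stopwords]
print(content_words)
```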

Hi @alexis. The stopwords corpus now includes Arabic stop words, if you want to update your answer. Best regards. – Astounding 1/1, 2018 at 9:40

There's an Arabic stopword list here:

https://github.com/mohataher/arabic-stop-words/blob/master/list.txt

If you save this file in your nltk_data directory with the filename arabic you will then be able to call it with nltk using your code above, which was:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')

(Note that the possible locations of your nltk_data directory can be seen by typing nltk.data.path in your Python interpreter).

You can also use alexis' suggestion to check if it is found.

Do heed his advice to convert the stopwords list to a set: stopwords_set = set(stopwords.words('arabic')), as it can make a real difference to performance.

Delude answered 6/3, 2017 at 16:3 Comment(6)

IOError: No such file or directory: u'C:\\Users\\Lamiaa\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\arabic' i get this error – Accumbent 6/3, 2017 at 18:41

One at a time, try putting it in every one of the directories listed when you type nltk.data.path – Delude 6/3, 2017 at 20:4

If that doesn't work, try putting this at the top of your file: import nltk nltk.data.path.append(u'C:\Users\Lamiaa\AppData\Roaming\nltk_data\corpora\stopwords') – Delude 6/3, 2017 at 20:8

Nice that you found a stopwords list, but 1) Don't drop the file into the nltk corpus area, read it from your own folder with nltk.corpus.WordListCorpusReader. (Adapt this answer). 2) Write your path as a "raw" string. You've got embedded newlines. – Void 6/3, 2017 at 22:53
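On point 2, a short illustration of why the raw string matters. The path here is a made-up example, chosen because it contains `\n`, which Python reads as an escape sequence in an ordinary string.

```python
# Hypothetical Windows path, used only to show the escaping problem.
plain = 'C:\nltk_data\new'    # each "\n" is read as a newline character
raw = r'C:\nltk_data\new'     # raw string: backslashes are kept literally

print('\n' in plain)  # True: the path contains newline characters
print('\n' in raw)    # False
print(len(plain), len(raw))
```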

@Void Could you explain why it's a bad idea to put additional stopword files in the nltk corpus area? Are they in danger of being overwritten when nltk is updated? – Delude 7/3, 2017 at 15:54

Yes, among other things. The downloader will show you the stopwords corpus as "out of date" (or used to) because of the extra files. But mainly it's for the same reason that you shouldn't hack the nltk source itself to add new corpora: Keep your code in your project folders, and let libraries manage their own resources. – Void 7/3, 2017 at 23:43
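A minimal sketch of reading the list from your own folder with WordListCorpusReader, assuming the word list is a UTF-8 text file with one word per line. The folder and the filename `arabic.txt` are placeholders (a temporary directory is created here so the example runs on its own); in a real project, point `root` at your own data folder.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Placeholder data folder; in practice this is a folder inside your project.
root = tempfile.mkdtemp()
with open(os.path.join(root, "arabic.txt"), "w", encoding="utf-8") as f:
    f.write("في\nمن\nهذا\n")

# Read the list from your own folder instead of the nltk_data area.
reader = WordListCorpusReader(root, ["arabic.txt"])
arb_stopwords = set(reader.words("arabic.txt"))

print("هذا" in arb_stopwords)
```

This way the file stays with your project and is not affected when the nltk downloader updates its own corpora.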

You can use the Arabic-Stopwords library. Install it with pip:

pip install Arabic-Stopwords

Once it is installed, import it with:

import arabicstopwords.arabicstopwords as stp

It is much better than the list in nltk.

Unlearn answered 26/7, 2022 at 18:1 Comment(1)

You're correct. NLTK didn't even remove the word هذا. – Levitt 24/7, 2023 at 18:48
