Does nltk contain Arabic stop words? If not, how can I add them?
A

3

6

I tried this, but it doesn't work:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
print(stopwords_list)

Update [January 2018]: The nltk data repository has included Arabic stopwords since October, 2017, so this issue no longer arises. The above code will work as expected.

Accumbent answered 6/3, 2017 at 11:58 Comment(2)
The declaration of the source code encoding has nothing to do with the data you use (load/import); it is completely unrelated to your problem. Falcongentle
Yes, I know, but I need it for something else. Accumbent
V
7

As of October, 2017, the nltk includes a collection of Arabic stopwords. If you ran nltk.download() after that date, this issue will not arise. If you have been a user of nltk for some time and you now lack the Arabic stopwords, use nltk.download() to update your stopwords corpus.

  1. If you call nltk.download() without arguments, you'll find that the stopwords corpus is shown as "out of date" (in red). Download the current version that includes Arabic.

  2. Alternatively, you can simply update the stopwords corpus by running the following code once, from the interactive prompt:

    >>> import nltk
    >>> nltk.download("stopwords")
    

Note:

Looking words up in a list is really slow. Use a set, not a list. E.g.,

arb_stopwords = set(nltk.corpus.stopwords.words("arabic"))
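To illustrate the point about sets, here is a minimal sketch of filtering tokens against a stopword collection. The function name `remove_stopwords` and the two sample words are illustrative assumptions, not part of nltk; in practice you would pass in the words returned by `stopwords.words("arabic")`.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword collection."""
    # Convert once up front: a membership test on a set takes constant
    # time on average, while a test on a list scans the list.
    stopword_set = set(stopwords)
    return [tok for tok in tokens if tok not in stopword_set]


# Tiny hand-picked sample for illustration only; this is NOT the nltk list.
sample_stopwords = ["في", "من"]
tokens = ["الكتاب", "في", "المكتبة"]
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

The conversion to a set happens once per call, so for repeated filtering it is cheaper to build the set once and reuse it.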

Original answer (still applicable to languages that are not included)

Why don't you just check what the stopwords collection contains:

>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish',
 'turkish']

So no, there's no list for Arabic. I'm not sure what you mean by "add it", but the stopwords lists are just lists of words. They don't even do morphological analysis, or other things you might want in an inflecting language. So if you have (or can put together) a list of Arabic stopwords, just put them in a set() (see the note above) and you're one step ahead of where you'd be if your code worked.
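For instance, if your words are stored in a plain UTF-8 text file with one word per line, a sketch like the following would build the set. The helper name `load_stopwords` and the file name `arabic_stopwords.txt` are hypothetical; adjust them to your own project.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line from a UTF-8 text file into a set."""
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return {line.strip() for line in text.splitlines() if line.strip()}


# Hypothetical usage; the file name is an assumption:
# my_stopwords = load_stopwords("arabic_stopwords.txt")
# print("في" in my_stopwords)
```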

Void answered 6/3, 2017 at 13:45 Comment(3)
Sorry, what do you mean by "put them in set()"? Accumbent
Make a list my_stopwords_list, then write stopwords = set(my_stopwords_list). And look up set() in the Python docs. Void
Hi @alexis. The stopwords corpus now includes Arabic stop words, if you want to update your answer. Best regards. Astounding
D
5

There's an Arabic stopword list here:

https://github.com/mohataher/arabic-stop-words/blob/master/list.txt

If you save this file in the corpora/stopwords folder of your nltk_data directory, with the filename arabic, you will then be able to load it with nltk using your code above, which was:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')

(Note that the possible locations of your nltk_data directory can be seen by typing nltk.data.path in your Python interpreter).
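As a rough sketch of how to find which of those locations actually holds the stopwords files, the helper below checks each search path for a `corpora/stopwords` folder. The function name `existing_stopwords_dirs` is an assumption made for illustration; it uses only the standard library, and you would pass it `nltk.data.path`.

```python
import os


def existing_stopwords_dirs(search_paths):
    """Return the corpora/stopwords folders that exist under the given paths."""
    candidates = (os.path.join(p, "corpora", "stopwords") for p in search_paths)
    return [c for c in candidates if os.path.isdir(c)]


# Hypothetical usage, assuming nltk is installed:
# import nltk
# print(existing_stopwords_dirs(nltk.data.path))
```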

You can also use alexis' suggestion to check if it is found.

Do heed his advice to convert the stopwords list to a set: stopwords_set = set(stopwords.words('arabic')), as it can make a real difference to performance.

Delude answered 6/3, 2017 at 16:3 Comment(6)
I get this error: IOError: No such file or directory: u'C:\\Users\\Lamiaa\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\arabic' Accumbent
One at a time, try putting it in every one of the directories listed when you type nltk.data.path. Delude
If that doesn't work, try putting this at the top of your file: import nltk; nltk.data.path.append(u'C:\Users\Lamiaa\AppData\Roaming\nltk_data\corpora\stopwords') Delude
Nice that you found a stopwords list, but 1) Don't drop the file into the nltk corpus area, read it from your own folder with nltk.corpus.WordListCorpusReader. (Adapt this answer). 2) Write your path as a "raw" string. You've got embedded newlines. Void
@Void Could you explain why it's a bad idea to put additional stopword files in the nltk corpus area? Are they in danger of being overwritten when nltk is updated? Delude
Yes, among other things. The downloader will show you the stopwords corpus as "out of date" (or used to) because of the extra files. But mainly it's for the same reason that you shouldn't hack the nltk source itself to add new corpora: Keep your code in your project folders, and let libraries manage their own resources. Void
U
2

You should use the library called Arabic-Stopwords. Install it with pip:

pip install Arabic-Stopwords

Once it is installed, import it with:

import arabicstopwords.arabicstopwords as stp

It is much better than the list in nltk.

Unlearn answered 26/7, 2022 at 18:1 Comment(1)
You're correct. NLTK didn't even remove the word هذا. Levitt
