Stopword removal with NLTK
A

6

79

I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but stopword removal also strips words like 'and', 'or', and 'not'. I want these words to remain after stopword removal because they are operators that are needed later, when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.

Anoint answered 2/10, 2013 at 5:29 Comment(2)
possible duplicate of "Stop words" list for English?Terza
If you don't know which words can be operators, there's no way to specify a list of stopwords. Otherwise, you should remove the stopwords you want to keep from the nltk list in @Terza 's answer and that should do it.Burlington
C
73

I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word
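
Put together, the idea might look like the sketch below. The small base_stopwords set is a hypothetical stand-in for whatever list you actually load (for example stopwords.words('english') from NLTK), and filter_query is an illustrative helper name, not part of NLTK.

```python
# Hypothetical stand-in for a real stopword list such as
# nltk.corpus.stopwords.words('english').
base_stopwords = {"a", "an", "the", "is", "in", "of", "and", "or", "not"}

# Words that act as query operators and must survive filtering.
operators = {"and", "or", "not"}

# Set difference: every stopword except the operators.
stop = base_stopwords - operators


def filter_query(text):
    """Drop stopwords from a whitespace-tokenized query but keep the operators."""
    return [word for word in text.split() if word.lower() not in stop]


print(filter_query("the cat AND the dog NOT in a house"))
# ['cat', 'AND', 'dog', 'NOT', 'house']
```

Because the operators are subtracted once up front, the filtering loop stays a single set lookup per word.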
Califate answered 8/6, 2014 at 13:45 Comment(0)
T
144

NLTK has a built-in stopword list made up of 2,400 stopwords for 11 languages (Porter et al.); see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop] 
['foo', 'bar', 'sentence']

I recommend looking into tf-idf for removing stopwords; see "Effects of Stemming on the term frequency?"
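
As a rough illustration of the statistical route, the sketch below flags terms with a low inverse document frequency as stopword candidates. It is not taken from the linked question: the helper name low_idf_terms, the whitespace tokenization, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math
from collections import Counter


def low_idf_terms(documents, max_idf=0.5):
    """Return terms whose inverse document frequency is at most max_idf."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    return {
        term
        for term, df in doc_freq.items()
        if math.log(n_docs / df) <= max_idf
    }


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the fish swam in the pond",
]
print(sorted(low_idf_terms(docs)))
# ['the']
```

A term that occurs in every document has an idf of 0, so it carries no discriminating information; the threshold decides how aggressively near-ubiquitous terms are dropped.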

Terza answered 2/10, 2013 at 8:41 Comment(3)
@alves I am already using the above method for my task. I just wanted to know which words in the stopword list might act as operators.Anoint
The "ideal" stopword list depends on the nature of the task, so you have to ask yourself what the ultimate goal of your task is, and then ask a linguist what to filter out to achieve that goal. Otherwise you can stick with statistical methods, e.g. a tf-idf filter.Terza
By the way, using stop as a list may be slow. I suggest converting it into a set so that not in will be much cheaper.Headmaster
A
35

@alvas's answer does the job but it can be done way faster. Assuming that you have documents: a list of strings.

from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) # remove it if you need punctuation 

for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]

Notice that because the lookup here is in a set (not in a list), each membership test takes roughly constant time instead of scanning the list, so the speedup is theoretically on the order of len(stop_words)/2. That is significant if you need to process many documents.

For 5000 documents of approximately 300 words each the difference is between 1.8 seconds for my example and 20 seconds for @alvas's.
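
To reproduce the comparison on your own machine, a minimal timing sketch could look like the following. The stopword list and the tokens are synthetic placeholders (assumptions for the example, not NLTK data), and the measured times will vary with hardware.

```python
import timeit

# Hypothetical stopword list; a real one may be smaller or larger.
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)

# Synthetic "document": 300 tokens, most of which are not stopwords.
tokens = ["token%d" % i for i in range(280)] + stop_list[:20]


def filter_with(container):
    """Keep the tokens that are not members of the given container."""
    return [t for t in tokens if t not in container]


# Both containers must give the same result; only the lookup cost differs.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=200)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=200)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```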

P.S. In most cases you split the text into words in order to perform some other classification task for which tf-idf is used, so it would most probably be better to use a stemmer as well:

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

and use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside the loop.

Acis answered 9/9, 2015 at 1:27 Comment(0)
L
14

@alvas has a good answer. But again, it depends on the nature of the task. For example, if in your application you want to treat all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. the, a, some, most, every, no) as stop words and consider all other parts of speech legitimate, then you might want to look into this solution, which uses the part-of-speech tagset to discard words (check table 5.1):

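import nltk

# pos_tag returns Penn Treebank tags (DT, CC, ...) by default; requesting the
# universal tagset yields coarse tags such as 'DET' and 'CONJ' instead.
STOP_TYPES = {'DET', 'CONJ'}

text = "some data here "
tokens = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]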
Lobster answered 13/6, 2014 at 21:37 Comment(0)
D
6

You can combine string.punctuation with the built-in NLTK stopwords list:

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

def tokenize(text):
    # Split into sentences first, then flatten the per-sentence tokens into one list.
    return [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]

def removeStopWords(words):
    # The NLTK list is lowercase, so compare lowercased tokens.
    customStopWords = set(stopwords.words('english') + list(punctuation))
    return [word for word in words if word.lower() not in customStopWords]

text = "This is a foo bar sentence."  # example input
words = tokenize(text)
wordsWOStopwords = removeStopWords(words)

NLTK stopwords complete list

Dwaindwaine answered 28/2, 2018 at 17:56 Comment(0)
J
0

STOPWORDS REMOVAL FROM STRING

Here I have also added a custom stopword list.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords                    # Stop words

stop_words = set(stopwords.words('english'))
stop_words.update(list(set(['zero'    , 'one'     , 'two'      ,
               'three'   , 'four'    , 'five'     ,
               'six'     , 'seven'   , 'eight'    ,
               'nine'    , 'ten'     ,
               
               'may'     , 'also'    , 'across'   ,
               'among'   , 'beside'  , 'however'  ,
               'yet'     , 'within'  ,
               
               'jan'     ,  'feb'    , 'mar'      ,
               'apr'     ,  'may'    , 'jun'      ,
               'jul'     ,  'aug'    , 'sep'      ,
               'oct'     ,  'nov'    , 'dec'      ,
               
               'january' , 'february', 'march'    ,
               'april'   , 'may'     , 'june'     ,
               'july'    , 'august'  , 'september',
               'october' , 'november', 'december' ,
               
               'summer'  , 'winter'  , 'fall'     ,
               'spring'  ,

               "a"         , "about"     ,   "above"  , "after"   ,
               "again"     , "against"   ,   "ain"    , "aren't"  ,
               "all"       , "am"        ,   "an"     , "and"     ,
               "any"       , "are"       ,   "aren"   ,  "as"     ,
               "at"        ,
               
               "be"        , "because"   ,   "been"   , "before"  ,
               "being"     , "below"     ,   "between", "both"    ,
               "but"       , "by"        ,                  
               
               "can"       , "couldn"    , "couldn't" , "could"   ,
               
               "d"         , "did"       , "didn"     , "didn't"  ,
               "do"        , "does"      , "doesn"    , "doesn't" ,
               "doing"     , "don"       , "don't"    , "down"    ,
               "during"    ,
               
               "each"      ,  
               
               "few"       , "for"      , "from"      , "further" ,
               
               "had"       , "hadn"     , "hadn't"    , "has"     ,
               "hasn"      , "hasn't"   , "have"      , "haven"   ,
               "haven't"   , "having"   , "he"        , "her"     ,
               "here"      , "hers"     , "herself"   , "him"     ,
               "himself"   , "his"      , "how"       ,
               "he'd"      , "he'll"    , "he's"      , "here's"  ,
               "how's"     ,
               
               "i"         , "if"       , "in"        , "into"    ,
               "is"        , "isn"      , "isn't"     , "it"      ,
               "it's"      , "its"      , "itself"    , "i'd"     ,
               "i'll"      , "i'm"      , "i've"      ,
               
               "just"      ,
               
               "ll"        , "let's"    ,
               
               "m"         , "ma"       ,"me"         ,
               "mightn"    , "mightn't" , "more"      , "most"    ,
               "mustn"     , "mustn't"  , "my"        , "myself"  ,
               "needn"     , "needn't"  , "no"        , "nor"     ,
               "not"       , "now"      ,
               
               "o"         , "of"       , "off"       , "on"      ,
               "once"      , "only"     , "or"        , "other"   ,
               "our"       , "ours"     , "ourselves" , "out"     ,
               "over"      , "own"      , "ought"     ,
               
               "re"        ,
               
               "s"         , "same"     , "shan"      , "shan't"   ,
               "she"       , "she's"    , "should"    , "should've",
               "shouldn"   , "shouldn't", "so"        , "some"     ,
               "such"      , "she'd"    , "she'll"    ,
               
               "t"         , "than"     , "that"      , "that'll"  ,
               "the"       , "their"    , "theirs"    , "them"     ,
               "themselves", "then"     , "there"     , "these"    ,
               "they"      , "this"     , "those"     , "through"  ,
               "to"        , "too"      , "that's"    , "there's"  ,
               "they'd"    , "they'll"  , "they're"   , "they've"  ,
               
               "under"     , "until"    , "up"        ,
               
               "ve"        , "very"     ,
               
               "was"       , "wasn"     , "wasn't"    , "we"       ,
               "were"      , "weren"    , "weren't"   , "what"     ,
               "when"      , "where"    , "which"     , "while"    ,
               "who"       , "whom"     , "why"       , "will"     ,
               "with"      , "won"      , "won't"     , "wouldn"   ,
               "wouldn't"  , "we'd"     , "we'll"     , "we're"    ,
               "we've"     , "what's"   , "when's"    , "where's"  ,
               "who's"     , "why's"    , "would"     ,
               
               "y"         , "you"      , "you'd"     , "you'll"   ,
               "you're"    , "you've"   , "your"      , "yours"    , "yourself",
               "yourselves",
               
               'a',"able", "abst", "accordance", "according", "accordingly", "across", "act", "actually"          ,
               "added", "adj", "affected", "affecting", "affects", "afterwards", "ah",      "almost"          ,
               "alone", "along", "already", "also", "although", "always", "among", "amongst", "anyone"        ,  
               "announce", "another", "anybody", "anyhow", "anymore",  "anything", "anyway", "anyways"        ,
               "anywhere", "apparently", "approximately", "arent", "arise", "around", "aside", "ask"          ,
               "asking", "auth", "available", "away", "awfully", "a's", "ain't", "allow", "allows", "apart"   ,
               "appear", "appreciate", "appropriate", "associated"                                            ,
               
               "b", "back", "became", "become", "becomes", "becoming", "beforehand", "begin", "beginning"     ,
               "beginnings", "begins", "behind", "believe", "beside", "besides", "beyond", "biol", "brief"    ,
               "briefly"                                                                                      ,
               
               "c", "ca", "came", "cannot", "can't", "cause", "causes", "certain", "certainly", "co", "com"   ,
               "come", "comes", "contain", "containing", "contains", "couldnt"                                ,
               
               'd',"date", "different", "done", "downwards", "due"                                                ,
               
               "e", "ed", "edu", "effect", "eg", "eight", "eighty", "either", "else", "elsewhere", "end"      ,
               "ending", "enough", "especially", "et", "etc", "even", "ever", "every", "everybody","except"   ,
               "everyone", "everything", "everywhere", "ex"                                                   ,  
               
               "f", "far", "ff", "fifth", "first", "five", "fix", "followed", "following", "follows", "four"  ,
               "former", "formerly", "forth", "found",  "furthermore"                                         ,
               
               "g", "gave", "get", "gets", "getting", "give", "given", "gives",  "go", "goes", "got","gone"   ,  
               "gotten", "giving"                                                                             ,
               
               "h", "happens", "hardly", "hed", "hence", "hereafter", "hereby", "herein", "heres", "however"  ,
               "hereupon", "hes", "hi", "hid", "hither", "home", "howbeit",  "hundred"                        ,
               
               "id", "ie", "im", "immediately", "importance", "important", "inc", "indeed", "itd", "index"    ,
               'i',"information", "instead", "invention",   "it'll", "inward", "immediate"                        ,
               
               "j",
               
               "k", "keep", "keeps", "kept", "kg", "km", "know", "known", "knows"                             ,
               
               "l", "largely", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "ltd",    
               "let", "lets", "like", "liked", "likely", "line", "little", "'ll", "look", "looking", "looks"  ,  
               
               'm',"made", "mainly", "make", "makes", "many", "maybe", "mean", "means", "meantime", "merely", "mg",
               "might", "million", "miss", "ml", "moreover", "mostly", "mr", "mrs", "much", "mug", "must"     ,
               "meanwhile", "may"                                                                             ,
               
               "n", "na", "name", "namely", "nay", "nd", "near", "nearly", "necessarily", "necessary", "need" ,
               "needs", "neither", "never", "nevertheless", "new", "next", "nine", "ninety", "nobody", "non"  ,
               "none", "nonetheless", "noone", "normally", "nos", "noted", "nothing", "nowhere", "n2", "nc"   ,
               "nd", "ne", "ng", "ni", "nj", "nl", "nn", "nr", "ns", "nt", "ny"                               ,
               
               'o',"obtain", "obtained", "obviously", "often", "oh", "ok", "okay", "old", "omitted", "one", "ones",
               "onto", "ord", "others", "otherwise", "outside", "overall", "owing",  "oa", "ob", "oc", "od"   ,
               "of", "og", "oi", "oj", "ol", "om", "on", "oo", "oq", "or", "os", "ot", "ou", "ow", "ox", "oz" ,
               
               "p", "page", "pages", "part", "particular", "particularly", "past", "per", "perhaps", "placed" ,
               "please", "plus", "poorly", "possible", "possibly", "potentially", "pp", "predominantly"       ,
               "present", "previously", "primarily", "probably", "promptly", "proud", "provides", "put"       ,
               "p1", "p2", "p3", "pc", "pd", "pe", "pf", "ph", "pi", "pj", "pk", "pl", "pm", "pn", "po", "pq" ,
               "pr", "ps", "pt", "pu", "py"                                                                   ,
               
               "q", "que", "quickly", "quite", "qv",  "qj", "qu"                                              ,
               
               'r',"readily", "really", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards" ,
               "related", "relatively", "research", "respectively", "resulted", "resulting", "results", "run" ,
               "right",  "r2", "ra", "rc", "rd", "rf", "rh", "ri", "rj", "rl", "rm", "rn", "ro", "rq", "rr"   ,
               "rs", "rt", "ru", "rv", "ry", "r", "ran", "rather", "rd"                                       ,
               
               's',"said", "saw", "say", "saying", "says", "sec", "section", "see", "seeing", "seem", "seemed"    ,
               "seeming", "seems", "seen", "self", "selves", "sent", "seven", "several", "shall", "shed"      ,
               "shes", "show", "showed", "shown", "showns", "shows", "significant", "significantly"           ,
               "similar", "similarly", "since", "six", "slightly", "somebody", "somehow", "someone", "soon"   ,
               "somewhat", "somewhere", "specifically", "specified", "specify", "specifying", "still", "stop" ,
               "strongly", "sub", "substantially", "successfully", "sufficiently", "suggest", "sup", "sure"   ,
               "s2", "sa", "sc", "sd", "se", "sf", "si", "sj", "sl", "sm", "sn", "sp", "sq", "sr", "ss", "st" ,
               "sy", "sz",   "sorry", "sometime", "somethan", "something", "sometimes"                        ,
               
               't',"take", "taken", "taking", "tell", "tends", "thank", "thanx", "that've", "thence", "thereafter",
               "thereby", "therefore", "therein", "there'll", "thereof", "therere", "thereto", "thereupon"    ,
               "there've", "theyd", "theyre", "think", "thou", "though", "thoughh", "thousand", "throug"      ,
               "throughout", "thru", "thus", "til", "tip", "together", "took", "toward", "towards", "tried"   ,
               "tries", "truly", "try", "trying", "ts", "twice", "two", "thats",  "thanks",  "th",  "thered"  ,
               "theres", "t1", "t2", "t3", "tb", "tc", "td", "te", "tf", "th", "ti", "tj", "tl", "tm", "tn"   ,
               "tp", "tq", "tr", "ts", "tt", "tv", "tx"                                                       ,                                                                                        
               
               "u", "un", "unfortunately", "unless", "unlike", "unlikely", "unto", "upon", "ups", "us", "use" ,
               "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ue", "ui", "uj", "uk" ,
               "um", "un", "uo", "ur", "ut",
               
               "v", "value", "various", "'ve", "via", "viz", "vol", "vols", "vs", "va", "vd", "vj", "vo", "vq",
               "vt", "vu"                                                                                     ,
               
               "w", "want", "wants", "wasnt", "way", "wed", "welcome", "went", "werent", "whatever", "what'll",
               "whats", "whence", "whenever", "whereas", "whereby", "wherein", "wheres", "wherever", "whether",  
               "whim", "whither", "whod", "whoever", "whole", "who'll", "whomever", "whos", "whose", "widely" ,
               "whereupon", "willing", "wish", "within", "without", "wont", "words", "world", "wouldnt", "www",
               "wi", "wa", "wo",
               
               "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx",
               
               "yes", "yet", "youd", "youre", "y2", "yj", "yl", "yr", "ys", "yt",
               
               "z", "zero", "zi", "zz",
               
               "best", "better", "c'mon", "c's", "cant", "changes", "clearly", "concerning", "consequently", "consider", "considering", "corresponding", "course", "currently", "definitely", "described", "despite", "entirely", "exactly", "example", "going", "greetings", "hello", "help", "hopefully", "ignored", "inasmuch", "indicate", "indicated", "indicates", "inner", "insofar", "it'd", "keep", "keeps", "novel", "presumably", "reasonably", "second", "secondly", "sensible", "serious", "seriously", "sure", "t's", "third", "thorough", "thoroughly", "three", "well", "wonder", "a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", 
"moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas",                   "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "co", "op", "research-articl", "pagecount", "cit", "ibid", "les", "le", "au", "que", "est", "pas", "vol", "el", "los", "pp", "u201d", "well-b", "http", "volumtype", "par",
               "0o", "0s", "3a", "3b", "3d", "6b", "6o",
               "a1", "a2", "a3", "a4", "ab", "ac", "ad", "ae", "af", "ag", "aj", "al", "an", "ao", "ap", "ar", "av", "aw", "ax", "ay", "az",
               "b1", "b2", "b3", "ba", "bc", "bd", "be", "bi", "bj", "bk", "bl", "bn", "bp", "br", "bs", "bt", "bu", "bx",
               "c1", "c2", "c3", "cc", "cd", "ce", "cf", "cg", "ch", "ci", "cj", "cl", "cm", "cn", "cp", "cq", "cr", "cs", "ct", "cu", "cv", "cx", "cy", "cz",
               "d2", "da", "dc", "dd", "de", "df", "di", "dj", "dk", "dl", "do", "dp", "dr", "ds", "dt", "du", "dx", "dy",
               "e2", "e3", "ea", "ec", "ed", "ee", "ef", "ei", "ej", "el", "em", "en", "eo", "ep", "eq", "er", "es", "et", "eu", "ev", "ex", "ey",
               "f2", "fa", "fc", "ff", "fi", "fj", "fl", "fn", "fo", "fr", "fs", "ft", "fu", "fy",
               "ga", "ge", "gi", "gj", "gl", "go", "gr", "gs", "gy",
               "h2", "h3", "hh", "hi", "hj", "ho", "hr", "hs", "hu", "hy",
               "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ic", "ie", "ig", "ih", "ii", "ij", "il", "in", "io", "ip", "iq", "ir", "iv", "ix", "iy", "iz",
               "jj", "jr", "js", "jt", "ju",
               "ke", "kg", "kj", "km", "ko",
               "l2", "la", "lb", "lc", "lf", "lj", "ln", "lo", "lr", "ls", "lt",
               "m2", "ml", "mn", "mo", "ms", "mt", "mu",
               
               'i',  'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii','ix', 'x',
               'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx',
                'xxi', 'xxii', 'xxiii', 'xxiv', 'xxv', 'xxvi', 'xxvii', 'xxviii', 'xxix', 'xxx',
                'xxxi', 'xxxii', 'xxxiii', 'xxxiv', 'xxxv', 'xxxvi', 'xxxvii', 'xxxviii', 'xxxix', 'xl',
               'xli', 'xlii', 'xliii', 'xliv', 'xlv', 'xlvi', 'xlvii', 'xlviii', 'xlix', 'l',
               'li', 'lii', 'liii', 'liv', 'lv', 'lvi', 'lvii', 'lviii', 'lix', 'lx',
               'lxi', 'lxii', 'lxiii', 'lxiv', 'lxv', 'lxvi', 'lxvii', 'lxviii', 'lxix', 'lxx',
                'lxxi', 'lxxii', 'lxxiii', 'lxxiv', 'lxxv', 'lxxvi', 'lxxvii', 'lxxviii', 'lxxix', 'lxxx',
                'lxxxi', 'lxxxii', 'lxxxiii', 'lxxxiv', 'lxxxv', 'lxxxvi', 'lxxxvii', 'lxxxviii', 'lxxxix', 'xc',
                'xci', 'xcii', 'xciii', 'xciv', 'xcv', 'xcvi', 'xcvii', 'xcviii', 'xcix', 'c',
               
                "one", "first", "two", "second", "three", "third",
                "four", "fourth", "five", "fifth", "six",  "sixth", "seven",
                "seventh", "eight", "eighth", "nine", "ninth", "ten",
                "tenth", "eleven", "eleventh", "twelve", "twelfth", "thirteen",
                "thirteenth", "fourteen", "fourteenth", "fifteen", "fifteenth",
                "sixteen", "sixteenth",  "seventeen", "seventeenth", "eighteen",
                "eighteenth", "nineteen", "nineteenth", "twenty", "twentieth",
                "one", "22nd", "second", "nd", "st", "rd", "th",
               
                "1","2","3","4","5","6","7","8","9","10th","11th","12th","13th","14th","15th",
                "16th","17th","18th","19th","20th","21st","22nd","23rd","24th","25th","26th","27th",
                "28th","29th","30th","31st","32nd","33rd","34th","35th","36th","37th","38th","39th",
                "40th","41st","42nd","43rd","44th","45th","46th","47th","48th","49th","50th","51st",
                "52nd","53rd","54th","55th","56th","57th","58th","59th","60th","61st","62nd","63rd",
                "64th","65th","66th","67th","68th","69th","70th","71st","72nd","73rd","74th","75th",
                "76th","77th","78th","79th","80th","81st","82nd","83rd","84th","85th","86th","87th",
                "88th","89th","90th", "91st", "92nd", "93rd", "94th", "95th", "96th","97th", "98th",
                "99th","100th","thirty","forty","fifty","thirty","thirtieth","forty","fortieth",
                 "fifty", "fiftieth","sixty","sixtieth","seventy","seventieth", "eighty",
                "eightieth", "ninety", "ninetieth","one", "hundred", "100th", "hundredth",
                "order","state","page","file",
                
                "'d","'ll",  "'m",  "'re",  "'s",  "'ve",  'a',  
                'about',  'above',  'across',  'after',  'afterwards',  'again',  'against',  'all',  
                'almost',  'alone',  'along',  'already',  'also',  'although',  'always',  'am',  
                'among',  'amongst',  'amount',  'an',  'and',  'another',  'any',  'anyhow',  'anyone',  
                'anything',  'anyway',  'anywhere',  'are',  'around',  'as',  'at',  'back',  'be',
                'became',  'because',  'become',  'becomes',  'becoming',  'been',  'before',  'beforehand',
                'behind',  'being',  'below',  'beside',  'besides',  'between',  'beyond',  'both',
                'bottom',  'but',  'by',  'ca',  'call',  'can',  'cannot',  'could',  'did',  'do',  'does',
                'doing',  'done',  'down',  'due',  'during',  'each',  'eight',  'either',  'eleven',
                'else',  'elsewhere',  'empty',  'enough',  'even',  'ever',  'every',  'everyone',
                'everything',  'everywhere',  'except',  'few',  'fifteen',  'fifty',  'first',
                'five',  'for',  'former',  'formerly',  'forty',  'four',  'from',  'front',  'full',
                'further',  'get',  'give',  'go',  'had',  'has',  'have',  'he',  'hence',  'her',
                'here',  'hereafter',  'hereby',  'herein',  'hereupon',  'hers',  'herself',  'him',  'himself',
                'his',  'how',  'however',  'hundred',  'i',  'if',  'in',  'indeed',  'into',  'is',  'it',
                'its',  'itself',  'just',  'keep',  'last',  'latter',  'latterly',  'least',  'less',  'made',
                'make',  'many',  'may',  'me',  'meanwhile',  'might',  'mine',  'more',  'moreover',  'most',
                'mostly',  'move',  'much',  'must',  'my',  'myself',  "n't",  'name',  'namely',  'neither',
                'never',  'nevertheless',  'next',  'nine',  'no',  'nobody',  'none',  'noone',  'nor',  'not',
                'nothing',  'now',  'nowhere',  'n‘t',  'n’t',  'of',  'off',  'often',  'on',  'once',  'one',
                'only',  'onto',  'or',  'other',  'others',  'otherwise',  'our',  'ours',  'ourselves',  'out',
                'over',  'own',  'part',  'per',  'perhaps',  'please',  'put',  'quite',  'rather',  're',  'really',
                'regarding',  'same',  'say',  'see',  'seem',  'seemed',  'seeming',  'seems',  'serious',  'several',
                'she',  'should',  'show',  'side',  'since',  'six',  'sixty',  'so',  'some',  'somehow',  'someone',
                'something',  'sometime',  'sometimes',  'somewhere',  'still',  'such',  'take',  'ten',  'than',
                'that',  'the',  'their',  'them',  'themselves',  'then',  'thence',  'there',  'thereafter',
                'thereby',  'therefore',  'therein',  'thereupon',  'these',  'they',  'third',  'this',  'those',
                'though',  'three',  'through',  'throughout',  'thru',  'thus',  'to',  'together',  'too',  'top',
                'toward',  'towards',  'twelve',  'twenty',  'two',  'under',  'unless',  'until',  'up',  'upon',  'us',
                'used',  'using',  'various',  'very',  'via',  'was',  'we',  'well',  'were',  'what',  'whatever',  'when',
                'whence',  'whenever',  'where',  'whereafter',  'whereas',  'whereby',  'wherein',  'whereupon',  'wherever',
                'whether',  'which',  'while',  'whither',  'who',  'whoever',  'whole',  'whom',  'whose',  'why',  'will',
                'with',  'within',  'without',  'would',  'yet',  'you',  'your',  'yours',  'yourself',  'yourselves',  '‘d',
                '‘ll',  '‘m',  '‘re',  '‘s',  '‘ve',  '’d',  '’ll',  '’m',  '’re',  '’s',  '’ve'

                       
                       ])))



import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

# Reuse the extended stop_words set built above; reassigning it to
# stopwords.words("english") here would discard the custom entries.

sentence = "PDF.co is a website that contains different tools to read, write and process PDF documents"
words = word_tokenize(sentence)

# Compare lowercased tokens, since the stopword entries are mostly lowercase.
sentence_wo_stopwords = [word for word in words if word.lower() not in stop_words]

print(" ".join(sentence_wo_stopwords))
Jailhouse answered 28/4, 2021 at 7:57 Comment(0)
