spaCy and scikit-learn vectorizer

I wrote a lemma tokenizer using spaCy for scikit-learn, based on their example; it works fine standalone:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

However, using it in GridSearchCV gives errors; a self-contained example is below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in older scikit-learn versions

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)

### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'

The error does not appear when I load spaCy outside the tokenizer's constructor; then GridSearchCV runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc

But this means that each of the n_jobs workers in GridSearchCV will access and call the same spacynlp object; it is shared among these jobs, which leaves the questions:

  1. Is the spacynlp object from spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
  2. Is this the correct way to implement calls to spacy inside a tokenizer for scikit-learn?
Bibbs asked 19/7, 2017 at 16:38

You are wasting time by running spaCy for each parameter setting in the grid, and the memory overhead is also significant. You should run all data through spaCy once and save it to disk, then use a simplified vectoriser that reads in the pre-lemmatised data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectoriser.
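
A minimal sketch of that workflow (the texts list, the lemmas.json path and the pass-through functions are made up for illustration, and it assumes a current spaCy model such as en_core_web_sm): lemmatise once, cache the token lists, and give the vectoriser identity tokenizer/preprocessor functions so it consumes them as-is.

import json
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_sm')

# one-off pass: lemmatise every document exactly once
texts = ['Apples and oranges are tasty.', 'Bananas are tasty too.']
lemmas = [[tok.lemma_ for tok in doc if not tok.is_punct]
          for doc in nlp.pipe(texts)]

# cache to disk so the grid search never has to touch spaCy again
with open('lemmas.json', 'w') as f:
    json.dump(lemmas, f)

with open('lemmas.json') as f:
    lemmas = json.load(f)

# pass-through tokenizer/preprocessor: the vectoriser receives ready-made token lists
vect = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x,
                       lowercase=False, token_pattern=None)
print(vect.fit(lemmas).vocabulary_)

Since spaCy no longer lives inside the pipeline, the grid search also no longer needs to pickle or share an nlp object across its n_jobs workers.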

Lenny answered 20/7, 2017 at 10:51
These are good points, and this might very well be what to do instead. However, I would ultimately like to have the spaCy tokenization with different options (such as POS) as part of the hyper-parameter grid search, hence my questions.Bibbs
You can do that too. Store your data as a list of dicts like this: [{"token": "cats", "lemma": "cat"}, {...}]. That's basically what spaCy sentences are, converted to JSON. Write a pipeline step that takes this as input and has a parameter to output either a token or a lemma, and there you have it: tokenisation is part of your grid search (a sketch of such a step follows these comments).Lenny
"You are wasting your time", "there are plenty of examples". This answer is not really that useful.Miquelmiquela
Feel free to suggest an improvement. The edit button is at the bottom of the post.Lenny
"There are plenty of examples on Stack Overflow that show how to build a custom vectoriser": it would be helpful to link to at least one of these examples.Columbous

Based on the comments on the answer above, I run all my documents (a pandas Series) through spaCy once for tokenization and lemmatization and save the result to disk. Then I load the lemmatized spaCy Doc objects, extract a list of tokens for every document, and supply it as input to a pipeline consisting of a simplified TfidfVectorizer and a DecisionTreeClassifier. Finally, I run the pipeline with GridSearchCV and extract the best estimator and its parameters.

See an example:

import numpy as np
import spacy
from spacy.tokens import DocBin
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

nlp = spacy.load("de_core_news_sm")  # define your language model

# df: an existing pandas DataFrame whose 'articleDocument' column holds the raw texts

# adjust attributes to your liking:
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)

for doc in nlp.pipe(df['articleDocument'].str.lower()):
    doc_bin.add(doc)

# either save DocBin to a bytes object, or...
#bytes_data = doc_bin.to_bytes()

# save DocBin to a file on disk
file_name_spacy = 'output/preprocessed_documents.spacy'
doc_bin.to_disk(file_name_spacy)

# load DocBin at a later time or on a different system, from disk or from the bytes object
#doc_bin = DocBin().from_bytes(bytes_data)
doc_bin = DocBin().from_disk(file_name_spacy)

docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs))

tokenized_lemmatized_texts = [[token.lemma_ for token in doc
                               if not token.is_stop and not token.is_punct and not token.is_space
                               and not token.like_url and not token.like_email]
                              for doc in docs]

# classifier to use
clf = tree.DecisionTreeClassifier()

# just some random target response
y = np.random.randint(2, size=len(docs))

# identity tokenizer: the inputs are already tokenized and lemmatized lists
vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=False,
                             tokenizer=lambda x: x, max_features=3000)

pipeline = Pipeline([('vect', vectorizer), ('dectree', clf)])
parameters = {'dectree__max_depth': [4, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
gs_clf.fit(tokenized_lemmatized_texts, y)
print(gs_clf.best_estimator_.get_params()['dectree'])


Kathrinkathrine answered 24/12, 2021 at 13:4