GridSearch for doc2vec model built using gensim

def train_doc2vec( self, X: List[List[str]], epochs: int=10, learning_rate: float=0.0002) -> gensim.models.doc2vec: tagged_documents = list() for idx, w in enumerate(X): td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)]) tagged_documents.append(td) model = Doc2Vec(**self.params_doc2vec) model.build_vocab(tagged_documents) for epoch in range(epochs): model.train(tagged_documents, total_examples=model.corpus_count, epochs=model.epochs) # decrease the learning rate model.alpha -= learning_rate # fix the learning rate, no decay model.min_alpha = model.alpha return model

Independently by the correctness of the code, I will try to answer to your question on how to perform a tuning of hyper-parameters. You have to start defining a set of hyper-parameters that will define your hyper-parameter grid search. For each set of hyper-parameters

Hset1=(par1Value1,par2Value1,...,par3Value1)

you train your model on the training set and you use an independent validation set to measure your accuracy (or whatever metrics you wish to use). You store this value (e.g. A_Hset1). When you do this for all the possible set of hyper-parameters you will have a set of measures

(A_Hset1,A_Hset2,A_Hset3...A_HsetK).

Each one of those measure tells you how good is your model for each set of hyper-parameters so your set of of optimal hyper-parameters

H_setOptimal= HsetX | A_setX=max(A_Hset1,A_Hset2,A_Hset3...A_HsetK)

In order to have a fair comparisons you should train the model always on the same data and use always the same validation set.

I'm not an advanced Python user so probably you can find better suggestions around, but what I would do is to create a list of dictionaries, where each dictionary contain a set of hyper-parameters that you want to test:

grid_search=[{"par1":"val1","par2":"val1","par3":"val1",..., "res"=""},
             {"par1":"val2","par2":"val1","par3":"val1",..., "res"=""},
             {"par1":"val3","par2":"val1","par3":"val1",..., "res"=""},
             ,...,
             {"par1":"valn","par2":"valn","par3":"valn",..., "res"=""}]

So that you can store your results in the "res" field of the corresponding dictionary and track the performances for each set of parameter.

for set in grid_search:
  #insert here your training and accuracy evaluation using the
  #parameters in set
  
  set["res"]= the_Accuracy_for_HyperPar_in_set

I hope it helps.

Recommended topics

Hot tags