I am trying to find the best hyperparameters for my trained doc2vec gensim model, which takes a document as input and creates its document embedding. My training data consists of text documents, but it doesn't have any labels, i.e. I just have 'X' but no 'y'.
I found some related questions here, but all of the proposed solutions are for supervised models and none for unsupervised ones like mine.
Here is the code where I am training my doc2vec model:
from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import to_unicode

def train_doc2vec(
        self,
        X: List[List[str]],
        epochs: int = 10,
        learning_rate: float = 0.0002) -> Doc2Vec:
    tagged_documents = []
    for idx, w in enumerate(X):
        td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)])
        tagged_documents.append(td)
    model = Doc2Vec(**self.params_doc2vec)
    model.build_vocab(tagged_documents)
    for epoch in range(epochs):
        model.train(tagged_documents,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= learning_rate
        # fix the learning rate, no decay
        model.min_alpha = model.alpha
    return model
I need suggestions on how to proceed and find the best hyperparameters for my trained model, using GridSearch or any other technique. Help is much appreciated.
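For illustration, here is a rough sketch of what an unsupervised grid search could look like. Since there is no 'y', each parameter combination is scored with an intrinsic check instead: the self-recognition test from the gensim doc2vec tutorial, where you re-infer a vector for every training document and count how often the document's own tag comes back as its nearest neighbour. This assumes gensim 4.x (where docvecs was renamed dv); the function names and grid values are made-up placeholders, not part of the question:

from itertools import product

from gensim.models.doc2vec import Doc2Vec

def self_recognition_rate(model, tagged_docs):
    # Fraction of documents whose own tag is the nearest neighbour
    # of their re-inferred vector (higher is better).
    hits = 0
    for doc in tagged_docs:
        inferred = model.infer_vector(doc.words)
        top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
        hits += (top_tag == doc.tags[0])
    return hits / len(tagged_docs)

def grid_search_doc2vec(tagged_docs, param_grid):
    # Train one model per parameter combination and keep the best scorer.
    best_score, best_params = -1.0, None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = Doc2Vec(**params)
        model.build_vocab(tagged_docs)
        # Single train() call; gensim handles the alpha decay itself.
        model.train(tagged_docs,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        score = self_recognition_rate(model, tagged_docs)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical grid; widen or shrink to taste.
grid = {
    'vector_size': [50, 100],
    'window': [5, 10],
    'min_count': [2, 5],
    'epochs': [20, 100],
}

best_params, best_score = grid_search_doc2vec(tagged_documents, grid)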
Calling train() multiple times is very broken, and will only get more broken once you start trying different combinations of epochs, alpha, and learning_rate. Where did you copy this logic from? – Garnett
Call train() only once, with your desired number of epochs. The current code is a mess that (among other things) is actually doing 10*10 training passes and sends the learning-rate all-over-the-place (down and up again) during training. If it's helping, it's pure dumb luck – and something like (possibly) just using 100 epochs in non-broken code would do better. – Garnett
You could also increase min_count to discard more rare words. – Garnett
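For reference, a non-broken version of the training step along the lines the comments suggest (a sketch, not the original code): a single train() call with the full epoch count, letting gensim decay the learning rate internally from alpha down to min_alpha. The 100 epochs and min_count=2 here are only placeholders:

from typing import List

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec_once(X: List[List[str]], epochs: int = 100) -> Doc2Vec:
    # X is a list of token lists, as in the question.
    tagged = [TaggedDocument(words, [str(idx)]) for idx, words in enumerate(X)]
    model = Doc2Vec(vector_size=100, min_count=2, epochs=epochs)
    model.build_vocab(tagged)
    # One train() call: gensim linearly decays alpha to min_alpha
    # across all epochs, so no manual learning-rate bookkeeping.
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    return model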