BERT performing worse than word2vec

I am trying to use BERT for a document ranking problem. My task is pretty straightforward. I have to do a similarity ranking for an input document. The only issue here is that I don’t have labels - so it’s more of a qualitative analysis.

I am on my way to try a bunch of document representation techniques - word2vec, para2vec and BERT mainly.

For BERT, i came across Hugging face - Pytorch library. I fine tuned the bert-base-uncased model, with around 150,000 documents. I ran it for 5 epochs, with a batch size of 16 and max seq length 128. However, if I compare the performance of Bert representation vs word2vec representations, for some reason word2vec is performing better for me right now. For BERT, I used the last four layers for getting the representation.

I am not too sure why the fine tuned model didn’t work. I read up this paper, and this other link also that said that BERT performs well when fine tuned for a classification task. However, since I don’t have the labels, I fined tuned it as it's done in the paper - in an unsupervised manner.

Also, my documents vary a lot in their length. So I’m sending them sentence wise right now. In the end I have to average over the word embeddings anyway to get the sentence embedding. Any ideas on a better method? I also read here - that there are different ways of pooling over the word embeddings to get a fixed embedding. Wondering if there is a comparison of which pooling technique works better?

Any help on training BERT better or a better pooling method will be greatly appreciated!

Recommended topics

Hot tags