Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Asked 26/12, 2014 at 17:25 Answered 3/12, 2018 at 6:21

python character-encoding gensim word2vec kaggle

I am trying to do the following kaggle assignmnet. I am using gensim package to use word2vec. I am able to create the model and store it to disk. But when I am trying to load the file back I am getting the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I find similar question. But I was unable to solve the problem. My prog_w2v.py is as below.

import gensim
import time
start = time.time()    
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
end = time.time()   
print end-start,"   seconds"

I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.

Ashbey answered 26/12, 2014 at 17:25 Comment(0)

You are not loading the file correctly. You should use load() instead of load_word2vec_format(). The latter is used when you train a model using the C code, and save the model in a binary format. However you are not saving the model in a binary format, and are training it using python. So you can simply use the following code and it should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')

Carlyn answered 12/5, 2015 at 21:52 Comment(0)

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')

Then loading word2vec with load_word2vec_format method would cause the issue. To make it work, you should use:

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')

The same thing also happens when you save the model with:

 model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

And then, want to load with KeyedVectors.load method. In this case, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.bin', binary=False)

Bisect answered 6/1, 2018 at 20:13 Comment(0)

As per the other answers, knowing the way you save the file is important because there are specific ways to load it as well. But, you can simply use the flag unicode_errors='ignore' to skip this issue and load the model as you want.

import gensim  

model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')

By default, this flag is set to 'strict': unicode_errors='strict'.

According to the documentation, the following is given as the reason as to why errors like this occur.

unicode_errors : str, optional default 'strict', is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.

All of the above answers are helpful, if we really can keep track of how each model was saved. But what if we have a bunch of models, that we need to load, and create a general method for it? We can use the above flag to do so.

I myself have experienced instances where I train multiple models using the original word2vec.c file, but when I try to load it into gensim, some models will load successfully, and some would give the unicode errors, I have found the above flag to be helpful and convenient.

Wallacewallach answered 3/12, 2018 at 6:21 Comment(1)

Are there any downsides of using the unicode_errors parameter with ignore? – Daren 19/1 at 15:6

If you saved your model with save(), you must use load()

load_word2vec_format is for the model generated by google, not for the model generated by gensim

Biquarterly answered 20/1, 2015 at 18:43 Comment(0)

Recommended topics

Hot tags