SpaCy: how to load Google news word2vec vectors?
I've tried several methods of loading the Google News word2vec vectors (https://code.google.com/archive/p/word2vec/):

en_nlp = spacy.load('en',vector=False)
en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')

The above gives:

MemoryError: Error assigning 18446744072820359357 bytes
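
A note on the number in that error: it sits just below 2**64. One possible reading, which nothing in this thread confirms, is that the loader computed a negative size from a file layout it did not expect and the value was then reinterpreted as an unsigned 64-bit integer. The helper name as_signed_64 below is made up for illustration; the sketch only shows the arithmetic.

```python
# Byte count copied from the MemoryError above.
reported = 18446744072820359357


def as_signed_64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return value - 2**64 if value >= 2**63 else value


signed = as_signed_64(reported)
print(signed)  # -889192259
```

A negative size like this would be consistent with the file not being in the binary layout that load_vectors_from_bin_loc expects, but that interpretation is an assumption.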

I've also tried the .gz-compressed vectors, and loading the vectors with gensim and saving them in a new format:

from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews2.txt')

This file then contains one word and its vector per line. I tried to load it with:

en_nlp.vocab.load_vectors('googlenews2.txt')

but it returns "0".

What is the correct way to do this?

Update:

I can load a file I created myself into spaCy. I use a test.txt file with "string 0.0 0.0 ...." on each line, then compress it with bzip2 to get test.txt.bz2. Then I create a spaCy-compatible binary file:

spacy.vocab.write_binary_vectors('test.txt.bz2', 'test.bin')

That I can load into spacy:

nlp.vocab.load_vectors_from_bin_loc('test.bin')

This works! However, when I do the same process for the googlenews2.txt, I get the following error:

lib/python3.6/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1279)()

OSError: 
Cymric answered 7/2, 2017 at 15:50 Comment(0)
C
26

For spaCy 1.x, load the Google News vectors into gensim and convert them to a text format (each line of the .txt contains a single word followed by its vector):

from gensim.models import KeyedVectors
# Load the original binary vectors, then write them out in word2vec text format
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews.txt')

Remove the first line of the .txt:

tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt

Compress the txt as .bz2:

bzip2 googlenews.txt
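
The two shell steps above (dropping the header line and compressing to .bz2) can also be sketched in Python with only the standard library. The function name strip_header_and_compress and the header check are illustrative, not part of spaCy or gensim; the sketch assumes the first line is the usual word2vec text header of two integers (vector count and dimension).

```python
import bz2


def strip_header_and_compress(src_path, dst_path):
    """Copy a word2vec text file into a .bz2 archive, skipping the header line.

    Returns the (count, dimension) pair read from the header.
    """
    with open(src_path, "rt", encoding="utf-8") as src, \
            bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        header = src.readline().split()
        # Assumption: a word2vec text header is exactly two integers.
        if len(header) != 2 or not all(part.isdigit() for part in header):
            raise ValueError("first line does not look like a 'count dim' header")
        for line in src:
            dst.write(line)
    return int(header[0]), int(header[1])
```

For example, strip_header_and_compress('googlenews.txt', 'googlenews.txt.bz2') would produce the archive used in the next step. Unlike the bzip2 command, this leaves the original .txt file in place.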

Create a SpaCy compatible binary file:

spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')

Move the googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin of your python environment.

Then load the wordvectors:

import spacy
nlp = spacy.load('en',vectors='en_google')

or load them later:

nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')
Cymric answered 8/2, 2017 at 14:9 Comment(5)
Make sure you call it "vec.bin", like so: /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/vec.bin - Conventual
Actually, it loads, but I cannot see any difference between the original GloVe vectors and the ones created this way. Something isn't right (using the same code). - Conventual
Only loading them later made a difference. - Conventual
The spaCy methods used here are deprecated in spaCy 2.0. See github.com/explosion/spaCy/issues/1046 - Extremely
Just wanted to chime in and say that you should NOT use the method described in this answer with later spaCy versions; it will not work. Please check the init vectors section of the docs: spacy.io/api/cli#init-vectors - Horsefly
P
12

I know that this question has already been answered, but I am going to offer a simpler solution. It loads the Google News vectors into a blank spaCy nlp object.

import gensim
import spacy

# Path to the Google News vectors (raw string, so the backslashes are not read as escapes such as \t or \n)
google_news_path = r"path\to\google\news\GoogleNews-vectors-negative300.bin.gz"

# Load the Google News vectors in gensim
model = gensim.models.KeyedVectors.load_word2vec_format(google_news_path, binary=True)

# Init blank english spacy nlp object
nlp = spacy.blank('en')

# Collect the word for every row index; the order of keys matches the rows
# of the Google embedding matrix.
# (These are gensim 3.x names; gensim 4 renamed index2word to index_to_key
# and syn0 to vectors.)
keys = []
for idx in range(len(model.index2word)):
    keys.append(model.index2word[idx])

# Set the vectors for our nlp object to the google news vectors
nlp.vocab.vectors = spacy.vocab.Vectors(data=model.syn0, keys=keys)

>>> nlp.vocab.vectors.shape
(3000000, 300)
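
For readers curious about what load_word2vec_format is parsing, here is a minimal standard-library sketch of a reader. It assumes the layout written by the original word2vec C tool: a text header line "count dim", then for each entry the word bytes, a single space, and dim little-endian float32 values, optionally followed by a newline. The name read_word2vec_bin and the limit parameter are made up for this illustration; it is not a replacement for gensim's loader.

```python
import struct


def read_word2vec_bin(path, limit=None):
    """Yield (word, vector) pairs from a word2vec-format binary file."""
    with open(path, "rb") as fh:
        vocab_size, dim = (int(x) for x in fh.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            chars = bytearray()
            while True:
                ch = fh.read(1)
                if not ch:
                    raise EOFError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline that may separate entries
                    chars.extend(ch)
            # Each vector is stored as dim 4-byte floats.
            vector = struct.unpack("<%df" % dim, fh.read(4 * dim))
            yield chars.decode("utf-8", errors="ignore"), vector
```

Passing a small limit reads only the first few entries, which is a cheap way to check that a downloaded file is intact before loading all of it.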
Priam answered 29/4, 2018 at 20:48 Comment(2)
Thanks for this. I used keys=model.vocab.keys() since the original order doesn't matter to me. - Consensus
Why do you create the list keys? I think this is simpler: model_spacy.vocab.vectors = spacy.vocab.Vectors(data=model_google.syn0, keys=model_google.index2word) - Sagittarius
C
2

I am using spaCy v2.0.10.

Create a SpaCy compatible binary file:

spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')

I want to highlight that this code from the accepted answer no longer works. I encountered "AttributeError: ..." when I ran it.

write_binary_vectors was removed in spaCy v2. According to the spaCy documentation, the current way to do this is:

$ python -m spacy init-model en /path/to/output -v /path/to/vectors.bin.tar.gz
Cleavage answered 1/5, 2018 at 6:41 Comment(2)
This does not work for me. I get "AssertionError: f" - Priam
I believe this is due to the positional argument "freqs_loc" (link to documentation). - Priam
M
2

It is much easier to use the gensim API to download the compressed word2vec model published by Google; it will be stored in /home/"your_username"/gensim-data/word2vec-google-news-300/. Load the vectors and play ball. I have 16 GB of RAM, which is more than enough to handle the model.

import gensim.downloader as api

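model = api.load("word2vec-google-news-300")  # download the model (cached after the first call) and return it ready for use
word_vectors = model  # api.load already returns the word vectors (a KeyedVectors object) for this dataset, so no .wv lookup is needed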
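
As a rough check on the RAM remark, the size of the raw vector table can be estimated from the (3000000, 300) shape reported elsewhere in this thread. The sketch assumes 4-byte float32 values and ignores the vocabulary strings and any Python overhead, so the real footprint will be higher; the function name vector_table_bytes is illustrative.

```python
def vector_table_bytes(n_vectors, dim, bytes_per_value=4):
    """Size in bytes of a dense n_vectors x dim table (float32 by default)."""
    return n_vectors * dim * bytes_per_value


size = vector_table_bytes(3000000, 300)
print(size)                    # 3600000000
print(round(size / 2**30, 2))  # 3.35 (GiB)
```

Under those assumptions the table alone takes about 3.6 GB, which fits in 16 GB but can be tight on machines with 4 GB.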
Misdo answered 3/10, 2018 at 14:10 Comment(0)
