Load pretrained GloVe vectors in Python

I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file, and I am unable to load and access it. It is easy to load and access a word vector binary file using gensim, but I don't know how to do it when the file is in text format.

Dissolute answered 13/6, 2016 at 15:1 Comment(0)

GloVe model files are in a word-to-vector text format: each line starts with the word, followed by the components of its vector, all separated by spaces. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:

import numpy as np

def load_glove_model(File):
    print("Loading Glove Model")
    glove_model = {}
    with open(File,'r') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model

You can then access the word vectors through the returned dictionary:

glove_model = load_glove_model("glove.6B.50d.txt")  # path to your GloVe file
print(glove_model['hello'])

Scriabin answered 6/7, 2016 at 17:40 Comment(6)
I'm wondering if there is a faster way of doing this. I'm using code similar to the above, but it would take around 27 hours to process the whole 6-billion-token embeddings. Any ideas on how to do this faster?Bike
@EdwardBurgin, it takes me 1 minute to process the whole file. Please share the "similar code" that you are referring to in your comment.Twoseater
$ python test_glove.py Loading Glove Model Done. 400000 words loaded! Traceback (most recent call last): File "test_glove.py", line 16, in <module> print(model['hello']) NameError: name 'model' is not definedSpeaker
@MonaJalal Do model = load_glove_model("filename.txt") and then the print statement will work fine.Lachus
This doesn't work for me on Python 3 using the 2.8B Twitter pretrained GloVe vectors because Python doesn't handle "\xC2\x85" properly.Gasbag
@Gasbag add f = open(gloveFile,'r', encoding='utf-8') to read the glove file and it will workGrating

You can do it much faster with pandas:

import pandas as pd
import csv

glove_data_file = "glove.6B.300d.txt"  # example path to your GloVe file
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

Then to get the vector for a word:

def vec(w):
    return words.loc[w].to_numpy()  # .as_matrix() was removed in newer pandas versions

And to find the closest word to a vector:

import numpy as np

words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name
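
A quick usage sketch (the word-analogy query and its result are just an illustration; the nearest neighbour depends on which vectors you loaded and is often the query word itself):

v = vec('king') - vec('man') + vec('woman')
print(find_closest_word(v))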
Buseck answered 26/8, 2017 at 9:36 Comment(4)
Although the time to load the model is reduced by almost half, the access time increases by about 1000x (.loc versus dict access). Personally, I would prefer the lower access time, because that affects training time. Since building the model is a one-time effort, it's better to invest the time there and save it once and for all. Do correct me if I'm wrong.Twoseater
You should use a couple more arguments in read_table: na_values=None, keep_default_na=False. Otherwise it will consider many valid strings (e.g. 'null', 'NA', etc) as nan floating point values.Rex
read_table is deprecated. Use read_csv with the same parameters instead.Pronouncement
Access time speeds up after converting it to a dict.Prisage

I suggest using gensim to do everything. You can read the file, and you also benefit from the large number of methods already implemented in this great package.

Suppose you generated GloVe vectors using the C++ program and that your "-save-file" parameter is "vectors". The glove executable will generate two files, "vectors.bin" and "vectors.txt".

Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format:

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")

Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

Now you can use gensim word2vec methods (for example, similarity) as you'd like.
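
In newer gensim versions (4.0+), the glove2word2vec script is deprecated; as far as I know, you can skip the conversion step and load the GloVe text file directly by passing no_header=True (a minimal sketch, assuming gensim >= 4.0):

from gensim.models import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True)
print(glove_model.most_similar("frog", topn=5))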

Tuberous answered 24/11, 2017 at 1:45 Comment(2)
It looks like glove2word2vec gives a warning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function. I guess the gensim function needs to be updated.Sonde
This warning is gone in version 3.8.3 of gensim. glove2word2vec() is 1000% the way to go.Derosa

I found this approach faster.

import pandas as pd

df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)  # quoting=3 is csv.QUOTE_NONE
glove = {key: val.values for key, val in df.T.items()}

Save the dictionary:

import pickle
with open('glove.840B.300d.pkl', 'wb') as fp:
    pickle.dump(glove, fp)
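
To load the pickled dictionary back later (a minimal sketch, reusing the file name from above):

import pickle
with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)
print(glove['hello'])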
Kilt answered 29/8, 2018 at 5:45 Comment(1)
Yes, a much faster approach compared to the others.Prisage

Here's a one-liner if all you want is the embedding matrix:

np.loadtxt(path, usecols=range(1, dim+1), comments=None)

where path is the path to your downloaded GloVe file and dim is the dimension of the word embedding.

If you want both the words and corresponding vectors you can do

glove = np.loadtxt(path, dtype='str', comments=None)

and separate the words and vectors as follows:

words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
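
If you also want dictionary-style lookups as in the other answers, you can zip the two arrays together (a small sketch):

glove_dict = dict(zip(words, vectors))
print(glove_dict['hello'])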
Bottali answered 20/3, 2018 at 16:34 Comment(0)

Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes quite a while (147.2 s on my machine).

What helps is converting the text file first into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file that contains the embedding vectors as a NumPy array (e.g. embeddings.npy).

Once converted, it takes me only 4.96 s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you loaded it from the text file. It is just as efficient in access time and does not require any additional frameworks, but it is a lot faster to load.

With this code you convert your embedding text file to the two new files:

import codecs
import numpy as np

def convert_to_binary(embedding_path):
    wv = []
    with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
         codecs.open(embedding_path + ".vocab", 'w', encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])

    np.save(embedding_path + ".npy", np.array(wv))

And with this method you load it efficiently into your memory:

def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]

    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]

    return word_embedding_map
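
A usage sketch (assuming your original file is glove.42B.300d.txt in the working directory; note that the prefix is passed without the .txt suffix):

convert_to_binary("glove.42B.300d")        # one-time conversion, writes .vocab and .npy
embeddings = load_word_emb_binary("glove.42B.300d")
print(embeddings['hello'])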

Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.

Koralle answered 10/1, 2020 at 11:17 Comment(0)

Python 3 version that also handles multi-word tokens such as bigrams and trigrams: because those tokens contain spaces, everything before the last vector_size fields of each line is joined back together to form the word.

import numpy as np


def load_glove_model(glove_file):
    print("Loading Glove Model")
    model = {}
    vector_size = 300  # adjust to the dimensionality of your file
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model
Chavez answered 20/12, 2018 at 1:29 Comment(1)
could you add a short description about how it handles the bigrams, please?Woe
import os
import numpy as np

EMBEDDING_DIM = 100  # 50, 100, 200 or 300, depending on which glove.6B file you downloaded

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
# enter the path where you unzipped the glove file
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM), encoding='utf-8') as f:
    # the file is just a space-separated text file in the format:
    # word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))
Lentz answered 1/7, 2019 at 12:8 Comment(0)

This code takes some time to store the GloVe embeddings in a shelf, but loading them afterwards is quite fast compared to the other approaches.

import os
import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path,shelf):
    print('Loading Glove')
    with open(glove_file_path, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path,shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path,shelf_file_name+'.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here , which is a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name+'.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))
Directive answered 10/6, 2020 at 7:24 Comment(0)

Each corpus needs to start with a line containing the vocab size and the vector size, in that order.

Open the .txt file of the GloVe model and insert a new first line containing these two numbers.

For example, for glove.6B.50d.txt, just add "400000 50" as the first line.

Then use gensim to transform that raw .txt vector file to gensim vector format:

import gensim

word_vectors = gensim.models.KeyedVectors.load_word2vec_format('path/glove.6B.50d.txt', binary=False)
word_vectors.save('path/glove_gensim.txt')  # saved in gensim's native format, despite the .txt extension
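
To reuse the saved vectors later (a minimal sketch; the path simply mirrors the one used above):

import gensim
word_vectors = gensim.models.KeyedVectors.load('path/glove_gensim.txt')
print(word_vectors.most_similar('computer', topn=5))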
Borroff answered 15/11, 2021 at 17:12 Comment(0)

Some of the other approaches here required more storage space (e.g. to split files) or were quite slow to run on my personal laptop. I tried a shelve db, but it seemed to blow up in storage size. Here's an "in-place" approach with a one-time file-read cost and very low additional storage cost. We treat the original text file as a database and just store the position in the file for each word. This works really well when you're, e.g., investigating properties of word vectors.

import pickle
import numpy as np
from functools import lru_cache
from tqdm import tqdm

# First create a map from words to position in the file
def get_db_mapping(fname):
    char_ct = 0    # cumulative position in file
    pos_map = dict()

    with open(fname + ".txt", 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            new_len = len(line)     # len of line

            # get the word
            splitlines = line.split()
            word = splitlines[0].strip()

            # store and increment counter
            pos_map[word] = char_ct
            char_ct += new_len

    # write dict
    with open(fname + '.db', 'wb') as handle:
        pickle.dump(pos_map, handle)


class Embedding:
    """Small wrapper so that we can use [] notation to fetch word vectors.
    It would be better to just have the file pointer and the pos_map as part
    of this class, but that's not how I wrote it initially."""
    def __init__(self, emb_fn):
        self.emb_fn = emb_fn

    def __getitem__(self, item):
        return self.emb_fn(item)


def load_db_mapping(fname, cache_size=1000) -> Embedding:
    """Creates a function closure that wraps access to the db mapping
    and the text file that functions as db. Returns them as an
    Embedding object"""
    # get the two state objects: mapping and file pointer
    with open(fname + '.db', 'rb') as handle:
        pos_map = pickle.load(handle)
    f = open(fname + ".txt", 'r', encoding='utf-8')

    @lru_cache(maxsize=cache_size)
    def get_vector(word: str):
        pos = pos_map[word]
        f.seek(pos, 0)

        # special logic needed because of small count errors
        fail_ct = 0
        read_word = ""
        while fail_ct < 5 and read_word != word:
            fail_ct += 1
            l = f.readline()
            try:
                splitlines = l.split()
                read_word = splitlines[0].strip()
            except:
                continue
        if read_word != word:
            raise ValueError('word not found')

        # actually return
        return np.array([float(val) for val in splitlines[1:]])

    return Embedding(get_vector)

# to run
k_glove_vector_name = 'glove.42B.300d'   # omit .txt
get_db_mapping(k_glove_vector_name)      # run only once; creates .db
word_embedding = load_db_mapping(k_glove_vector_name)
word_embedding['hello']
Megrims answered 4/12, 2021 at 20:0 Comment(0)

A tool with an easy implementation of GloVe is zeugma:

https://pypi.org/project/zeugma/

from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')

The implementation is really easy to use.

Reiss answered 5/8, 2022 at 21:11 Comment(0)
import numpy as np

def create_embedding_matrix(word_to_index):
    # word_to_index is a dictionary containing "word: token" pairs
    nb_words = len(word_to_index) + 1

    embeddings_index = {}
    with open('C:/Users/jayde/Desktop/IISc/DLNLP/Assignment1/glove.840B.300d/glove.840B.300d.txt', encoding="utf-8", errors='ignore') as f:
        for line in f:
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((nb_words, 300))

    for word, i in word_to_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

emb_matrix = create_embedding_matrix(vocab_to_int)

Entomophilous answered 12/10, 2022 at 4:56 Comment(0)
import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE, encoding='utf-8'))

all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()
# tokenizer, max_features and embed_size are assumed to come from your own
# preprocessing (e.g. a Keras Tokenizer and your model configuration)
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
Mahlstick answered 6/2, 2018 at 16:26 Comment(2)
Please provide a comment with your answer. Why is it better than the already accepted one?Wiggle
This is coming from Kaggle and it blows up on some GloVe files, e.g. 840B.300d.Nationalize
