I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word-vector binary file using gensim, but I don't know how to do it when the file is in text format.
GloVe model files are in a word-vector format: each line holds a word followed by the components of its vector. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:
import numpy as np

def load_glove_model(glove_file):
    print("Loading GloVe model")
    glove_model = {}
    with open(glove_file, 'r') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model
You can then access the word vectors by simply indexing into the returned dictionary:

model = load_glove_model("filename.txt")
print(model['hello'])

Call model = load_glove_model("filename.txt") first; then the print statement will work fine. – Lachus

My file contains bytes such as "\xC2\x85" that the default codec cannot decode properly. – Gasbag

Use f = open(glove_file, 'r', encoding='utf-8') to read the glove file and it will work. – Grating
Grating You can do it much faster with pandas:
import csv
import numpy as np
import pandas as pd

# glove_data_file is the path to your GloVe .txt file
words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
Then to get the vector for a word:
def vec(w):
    # .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
    return words.loc[w].to_numpy()
And to find the closest word to a vector:
words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name
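As a quick usage sketch, the classic analogy test (assuming all three query words are in your vocabulary):

# king - man + woman should land near "queen"
print(find_closest_word(vec('king') - vec('man') + vec('woman')))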
When using read_table, pass na_values=None, keep_default_na=False. Otherwise it will consider many valid strings (e.g. 'null', 'NA', etc.) as NaN floating-point values. – Rex

read_table is deprecated. Use read_csv with the same parameters instead. – Pronouncement

I suggest using gensim to do everything. You can read the file, and you also benefit from the large number of methods already implemented in this great package.
Suppose you generated GloVe vectors using the C++ program and that your "-save-file" parameter is "vectors". The glove executable will generate two files, "vectors.bin" and "vectors.txt".
Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")
Finally, read the word2vec txt to a gensim model using KeyedVectors:
from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)
Now you can use gensim word2vec methods (for example, similarity) as you'd like.
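For example, a quick sanity check (the query words are just an illustration; most_similar and similarity are standard KeyedVectors methods):

print(glove_model.most_similar('frog', topn=3))
print(glove_model.similarity('woman', 'man'))

Note that in gensim 4.0+ the glove2word2vec script is deprecated; load_word2vec_format can read the raw GloVe text file directly if you pass no_header=True:

from gensim.models import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True)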
I get "This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function". I guess the gensim function needs to be updated. – Sonde

glove2word2vec() is 1000% the way to go. – Derosa

I found this approach faster:
import pandas as pd
# quoting=3 is csv.QUOTE_NONE, so quote characters inside tokens are kept as-is
df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
# one dict entry per row: word -> numpy vector
glove = {key: val.values for key, val in df.T.items()}
Save the dictionary:
import pickle
with open('glove.840B.300d.pkl', 'wb') as fp:
pickle.dump(glove, fp)
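Loading the dictionary back is then near-instant (a small sketch):

import pickle
with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)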
Here's a one-liner if all you want is the embedding matrix:
np.loadtxt(path, usecols=range(1, dim+1), comments=None)
where path is the path to your downloaded GloVe file and dim is the dimension of the word embeddings.
If you want both the words and corresponding vectors you can do
glove = np.loadtxt(path, dtype='str', comments=None)
and separate the words and vectors as follows:
words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
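If you then want dictionary-style lookup, you can zip the two arrays together (a small sketch):

glove_dict = dict(zip(words, vectors))
print(glove_dict['hello'])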
Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes quite a while (147.2s on my machine).

What helps is converting the text file first into two new files: a text file that contains the words only (e.g. embeddings.vocab) and a binary file which contains the embedding vectors as a numpy structure (e.g. embeddings.npy).

Once converted, it takes me only 4.96s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you load it from the text file. It is just as efficient in access time and does not require any additional frameworks, but is a lot faster in loading time.
With this code you convert your embedding text file to the two new files:
import codecs
import numpy as np

def convert_to_binary(embedding_path):
    wv = []
    with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
         codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])
    np.save(embedding_path + ".npy", np.array(wv))
And with this method you load it efficiently into your memory:
def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]

    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]
    return word_embedding_map
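A quick usage sketch (the file prefix is whatever you pass to convert_to_binary, without the .txt suffix):

convert_to_binary("glove.42B.300d")               # run once
embeddings = load_word_emb_binary("glove.42B.300d")
print(embeddings['hello'].shape)                  # (300,)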
Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.
Python3 version which also handles bigrams and trigrams:
import numpy as np

def load_glove_model(glove_file):
    print("Loading GloVe model")
    model = {}
    vector_size = 300
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            # everything except the last vector_size fields is the (possibly multi-word) token
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model
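A quick usage sketch (the filename is an example; vector_size = 300 above must match the dimensionality of the file you load):

model = load_glove_model("glove.840B.300d.txt")
print(len(model["hello"]))  # 300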
import os
import numpy as np

EMBEDDING_DIM = 300  # must match the GloVe file you downloaded (50, 100, 200 or 300)

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM)) as f:  # enter the path where you unzipped the glove file
    # the file is just a space-separated text file in the format:
    # word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))
This code takes some time to store the GloVe embeddings on a shelf, but loading from the shelf is much faster than the other approaches.
import os
import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path, shelf):
    print('Loading GloVe')
    with open(os.path.join(glove_file_path)) as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path, shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path, shelf_file_name + '.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here, which behaves like a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name + '.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))
Each corpus needs to start with a line containing the vocab size and the vector size, in that order. Open the .txt file of the GloVe model and insert a new first line with these two numbers.

For example, for glove.6B.50d.txt, just add 400000 50 as the first line.
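If you'd rather not edit the file by hand, here is a small sketch that writes a new copy with the header prepended (the file names and the 50 are examples for glove.6B.50d.txt):

num_lines = sum(1 for _ in open('path/glove.6B.50d.txt', encoding='utf-8'))
with open('path/glove.6B.50d.txt', encoding='utf-8') as fin, \
        open('path/glove.6B.50d.w2v.txt', 'w', encoding='utf-8') as fout:
    fout.write('{} {}\n'.format(num_lines, 50))
    for line in fin:
        fout.write(line)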
Then use gensim to convert that raw .txt vector file to the gensim vector format:

import gensim

word_vectors = gensim.models.KeyedVectors.load_word2vec_format('path/glove.6B.50d.txt', binary=False)
word_vectors.save('path/glove_gensim.txt')  # despite the .txt extension, save() writes gensim's own format
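To reload the saved vectors later without re-parsing the text file (a small sketch):

import gensim

word_vectors = gensim.models.KeyedVectors.load('path/glove_gensim.txt')
print(word_vectors.most_similar('king', topn=3))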
Some of the other approaches here required more storage space (e.g. to split files) or were quite slow to run on my personal laptop. I tried the shelve approach, but it seemed to blow up in storage size. Here's an "in-place" approach with a one-time file-read cost and very low additional storage cost. We treat the original text file as a database and just store the position in the file for each word. This works really well when you're, e.g., investigating properties of word vectors.
import pickle
import numpy as np
from functools import lru_cache
from tqdm import tqdm

# First create a map from words to position in the file
def get_db_mapping(fname):
    char_ct = 0  # cumulative position in file
    pos_map = dict()
    with open(fname + ".txt", 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            # note: len() counts characters, not bytes, so positions can
            # drift slightly on multi-byte chars; load_db_mapping compensates
            new_len = len(line)
            # get the word
            splitlines = line.split()
            word = splitlines[0].strip()
            # store and increment counter
            pos_map[word] = char_ct
            char_ct += new_len
    # write dict
    with open(fname + '.db', 'wb') as handle:
        pickle.dump(pos_map, handle)
class Embedding:
    """Small wrapper so that we can use [] notation to fetch word vectors.
    It would be better to just have the file pointer and the pos_map as part
    of this class, but that's not how I wrote it initially."""

    def __init__(self, emb_fn):
        self.emb_fn = emb_fn

    def __getitem__(self, item):
        return self.emb_fn(item)
def load_db_mapping(fname, cache_size=1000) -> Embedding:
    """Creates a function closure that wraps access to the db mapping
    and the text file that functions as db. Returns them as an
    Embedding object"""
    # get the two state objects: mapping and file pointer
    with open(fname + '.db', 'rb') as handle:
        pos_map = pickle.load(handle)
    f = open(fname + ".txt", 'r', encoding='utf-8')

    @lru_cache(maxsize=cache_size)
    def get_vector(word: str):
        pos = pos_map[word]
        f.seek(pos, 0)
        # special logic needed because the character-based positions can be
        # off by a few bytes; scan forward a handful of lines to compensate
        fail_ct = 0
        read_word = ""
        while fail_ct < 5 and read_word != word:
            fail_ct += 1
            line = f.readline()
            try:
                splitlines = line.split()
                read_word = splitlines[0].strip()
            except IndexError:  # empty line
                continue
        if read_word != word:
            raise ValueError('word not found')
        # actually return
        return np.array([float(val) for val in splitlines[1:]])

    return Embedding(get_vector)
# to run
k_glove_vector_name = 'glove.42B.300d' # omit .txt
get_db_mapping(k_glove_vector_name) # run only once; creates .db
word_embedding = load_db_mapping(k_glove_vector_name)
word_embedding['hello']
A tool with an easy implementation of GloVe is zeugma: https://pypi.org/project/zeugma/

from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')

The implementation is really very easy.
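For instance, a quick sketch of embedding a few texts (EmbeddingTransformer follows the scikit-learn transformer interface, so transform should work like this; the sentences are just examples):

texts = ['what a wonderful world', 'hello dear friend']
embeddings = glove.transform(texts)
print(embeddings.shape)  # one aggregated vector per text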
import numpy as np

def create_embedding_matrix(word_to_index):
    # word_to_index is a dictionary containing "word: token" pairs
    nb_words = len(word_to_index) + 1
    embeddings_index = {}
    with open('glove.840B.300d.txt', encoding="utf-8", errors='ignore') as f:  # path to your GloVe file
        for line in f:
            values = line.split()
            # the last 300 fields are the vector; anything before is the (possibly multi-word) token
            word = ' '.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs
    embedding_matrix = np.zeros((nb_words, 300))
    for word, i in word_to_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_matrix = create_embedding_matrix(vocab_to_int)
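As a follow-up sketch, the resulting matrix can seed a frozen Keras Embedding layer (assuming you're building a Keras model; weights is the standard way to pass pre-trained weights):

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(input_dim=emb_matrix.shape[0],
                            output_dim=300,
                            weights=[emb_matrix],
                            trainable=False)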
import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))

all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# word_index, max_features and embed_size come from your tokenizer / model setup
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

# initialize unseen words from the empirical distribution of the embeddings
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector