Trouble loading the GloVe 840B 300d vectors

Asked 3/3, 2018 at 11:54 Answered 14/2, 2019 at 6:37

Every line seems to have the format 'word number number ...', so it should be easy to split. I split the lines with the script below:

import numpy as np
def loadGloveModel(gloveFile):
    print "Loading Glove Model"
    f = open(gloveFile,'r')
    model = {}
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print "Done.",len(model)," words loaded!"
    return model

When I load glove.840B.300d.txt, I get an error. Printing splitLine for the failing lines gives:

['contact', '[email protected]', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]

['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]

Note that this script works fine with glove.6B.*

Rosarosabel answered 3/3, 2018 at 11:54 Comment(6)

Looks like a problem with the downloaded file. See this answer as an example - https://mcmap.net/q/637034/-using-word2vec-to-classify-words-in-categories – Ovolo 3/3, 2018 at 12:54

Actually, I found all of the lines that cause the error. Apart from the '.'*n lines, the others are ['in', 'emailing', 'Email', 'email', 'At', 'at', 'by', 'to', 'in', 'or', '•', 'Contact','contact', 'is', 'on'] – Rosarosabel 3/3, 2018 at 13:36

Right, and I don't see this line in my GloVe text file – Ovolo 3/3, 2018 at 14:37

Could you please tell me the size of your file, either the zip or just the txt? – Rosarosabel 3/3, 2018 at 17:47

glove.6B.zip is 862182613 bytes – Ovolo 3/3, 2018 at 17:49

Do you have the glove.840B version? This script works fine with the 6B version. – Rosarosabel 3/3, 2018 at 17:52

The code works fine for the files glove.6B.*d.txt and glove.42B.*d.txt, but not for glove.840B.300d.txt. This is because glove.840B.300d.txt contains words with spaces in them. For example, it has a word like '. . .', with spaces between the dots. I solved this problem by changing this line:

splitLine = line.split()

into

splitLine = line.split(' ')

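So your code should look like this:

import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    # open(..., encoding=...) requires Python 3, so print is a function here
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            # split(' ') splits on the ASCII space only; split() splits on any whitespace
            splitLine = line.rstrip('\n').split(' ')
            word = splitLine[0]
            embedding = np.asarray(splitLine[1:], dtype='float32')
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model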
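
To see why the two calls behave differently, here is a minimal sketch. It assumes the whitespace inside the multi-token words is something other than a plain ASCII space. A non-breaking space (U+00A0) is used purely as an illustration, and the sample line and helper names are made up rather than taken from the file. split() with no argument splits on any whitespace, while split(' ') splits only on the ASCII space.

```python
# Hypothetical GloVe-style line: the "word" is three dots joined by
# non-breaking spaces (U+00A0 is an assumption, not verified against the file).
line = ".\u00a0.\u00a0. 0.1 0.2 0.3\n"

def split_default(text):
    # No argument: splits on any Unicode whitespace, so the word is broken up.
    return text.split()

def split_on_space(text):
    # Explicit ' ': only the ASCII space separates fields, so the word survives.
    return text.rstrip("\n").split(" ")

print(split_default(line))
print(split_on_space(line))
```

With the first call the dots end up as separate tokens and float() later fails on them; with the second the first field is the whole word.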

Kazantzakis answered 10/5, 2018 at 12:35 Comment(0)

I think the following may help:

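import numpy as np

def process_glove_line(line, dim):
    word = None
    embedding = None

    try:
        splitLine = line.split()
        # everything before the last `dim` fields is the word, even if it contains spaces
        word = " ".join(splitLine[:len(splitLine)-dim])
        embedding = np.array([float(val) for val in splitLine[-dim:]])
    except ValueError:
        # a field in the vector part could not be parsed as a float
        print(line)

    return word, embedding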

def load_glove_model(glove_filepath, dim):
    with open(glove_filepath, encoding="utf8" ) as f:
        content = f.readlines()
        model = {}
        for line in content:
            word, embedding = process_glove_line(line, dim)
            if embedding is not None:
                model[word] = embedding
        return model

model= load_glove_model("glove.840B.300d.txt", 300)
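
A variant of the same idea, as a sketch only: rsplit(' ', dim) takes the last dim fields as the vector and leaves everything before them as the word, and a generator avoids reading the whole file into memory with readlines(). The name iter_glove and the sample lines are hypothetical, and the sketch assumes the fields are separated by single ASCII spaces.

```python
import io
import numpy as np

def iter_glove(lines, dim):
    """Yield (word, vector) pairs, treating the last `dim` fields as the vector."""
    for line in lines:
        # Split from the right at most `dim` times: [word, v1, ..., v_dim]
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # too few fields: skip the malformed line
        yield parts[0], np.asarray(parts[1:], dtype="float32")

# Tiny in-memory stand-in for the real file (made-up values, dim=3).
sample = io.StringIO(". . . 0.1 0.2 0.3\nhello 1.0 2.0 3.0\n")
model = dict(iter_glove(sample, dim=3))
print(list(model))
```

For the real file this would be called as dict(iter_glove(f, 300)) inside a with open(..., encoding="utf8") block. Unlike split() followed by " ".join(...), it keeps the word exactly as written in the file.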

Budding answered 14/2, 2019 at 6:37 Comment(0)
