Trouble loading the GloVe 840B 300d vectors

It seems that every line has the format 'word number number ...', so it should be easy to split. I split the lines with the script below:

import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    # Each line is expected to look like: word value value ...
    with open(gloveFile, 'r') as f:
        for line in f:
            splitLine = line.split()  # splits on any whitespace
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
    print("Done.", len(model), " words loaded!")
    return model

I load glove.840B.300d.txt but get an error. When I print splitLine for the failing lines, I get:

['contact', '[email protected]', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]

or

['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]

Note that this script works fine with the glove.6B.* files.

Rosarosabel answered 3/3, 2018 at 11:54 Comment(6)
Looks like a problem with the downloaded file. See this answer as an example: https://mcmap.net/q/637034/-using-word2vec-to-classify-words-in-categories - Ovolo
Actually, I found all of the lines that cause the error. Apart from the '.'*n cases, the first tokens are ['in', 'emailing', 'Email', 'email', 'At', 'at', 'by', 'to', 'in', 'or', '•', 'Contact', 'contact', 'is', 'on']. - Rosarosabel
Right, and I don't see this line in my GloVe text file. - Ovolo
Could you please tell me the size of your file, either the zip or just the txt? - Rosarosabel
glove.6B.zip is 862182613 bytes. - Ovolo
Do you have the glove.840B version? This script works fine with the 6B version. - Rosarosabel

The code works fine for glove.6B.*d.txt and glove.42B.*d.txt, but not for glove.840B.300d.txt. This is because glove.840B.300d.txt contains words with whitespace inside them. For example, it has a word like '. . .', with spaces between the dots. Calling split() with no argument splits on every whitespace character, so such a word is broken into several pieces. I solved this problem by changing this line:

splitLine = line.split()

into

splitLine = line.split(' ')

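split(' ') splits only on the plain ASCII space, so other whitespace characters inside a word (a non-breaking space, for example) stay part of the word. So your code should look like this: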

import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            # Drop the trailing newline, then split on the plain space only
            splitLine = line.rstrip().split(' ')
            word = splitLine[0]
            embedding = np.asarray(splitLine[1:], dtype='float32')
            model[word] = embedding
    print("Done.", len(model), " words loaded!")
    return model
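
To see the difference between the two calls, here is a small sketch on a made-up line. The non-breaking space (U+00A0) inside the word is an assumption for illustration; I have not checked which whitespace characters glove.840B.300d.txt actually contains.

```python
# Hypothetical line: a word containing a non-breaking space, then 3 values.
line = "at\u00a0name 0.1 0.2 0.3\n"

# No argument: splits on every whitespace character, including U+00A0.
default_split = line.split()

# Explicit ' ': splits only on the plain ASCII space.
space_split = line.rstrip("\n").split(" ")

print(default_split)  # the word is broken into two pieces
print(space_split)    # the word stays in one piece
```

With split() the second field is 'name', which cannot be converted to a float, and that is the kind of error the question shows.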
Kazantzakis answered 10/5, 2018 at 12:35 Comment(0)

I think the following may help:

import numpy as np

def process_glove_line(line, dim):
    word = None
    embedding = None

    try:
        splitLine = line.split()
        # The last `dim` fields are the vector; everything before is the word
        word = " ".join(splitLine[:len(splitLine)-dim])
        embedding = np.array([float(val) for val in splitLine[-dim:]])
    except ValueError:
        # Print any line whose last `dim` fields are not all numeric
        print(line)

    return word, embedding

def load_glove_model(glove_filepath, dim):
    model = {}
    with open(glove_filepath, encoding="utf8") as f:
        # Iterate over the file directly instead of reading it all into memory
        for line in f:
            word, embedding = process_glove_line(line, dim)
            if embedding is not None:
                model[word] = embedding
    return model

model = load_glove_model("glove.840B.300d.txt", 300)
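
A shorter variant of the same idea, as a sketch only: split from the right with rsplit so that exactly dim values are taken and whatever is left over is the word. The name parse_line is made up for this example, and it assumes every line ends with exactly dim numeric fields separated by single plain spaces.

```python
import numpy as np

def parse_line(line, dim):
    """Split from the right so the word itself may contain spaces."""
    # rsplit with maxsplit=dim yields at most dim + 1 parts:
    # the word (possibly with spaces) followed by dim value strings.
    parts = line.rstrip("\n").rsplit(" ", dim)
    word = parts[0]
    vector = np.array([float(v) for v in parts[1:]], dtype=np.float32)
    return word, vector

# Made-up 3-dimensional line whose word is '. . .'
word, vector = parse_line(". . . 0.5 -1.0 2.0\n", 3)
print(repr(word), vector)
```

Unlike the join-based version, this keeps the whitespace inside the word exactly as it appears in the file.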
Budding answered 14/2, 2019 at 6:37 Comment(0)
