How to use GloVe word-embeddings file on Google colaboratory

4

8

I have downloaded the data with wget

!wget http://nlp.stanford.edu/data/glove.6B.zip
 - ‘glove.6B.zip’ saved [862182613/862182613]

It is saved as a zip archive, and I would like to use the glove.6B.300d.txt file from it. What I want to achieve is:

embeddings_index = {}
with io.open('glove.6B.300d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:],dtype='float32')
        embeddings_index[word] = coefs

Of course, I get this error:

IOError                                   Traceback (most recent call last)
<ipython-input-47-d07cafc85c1c> in <module>()
      1 embeddings_index = {}
----> 2 with io.open('glove.6B.300d.txt', encoding='utf8') as f:
      3     for line in f:
      4         values = line.split()
      5         word = values[0]

IOError: [Errno 2] No such file or directory: 'glove.6B.300d.txt'

How can I unzip the archive and use that file in the code above on Google Colab?

Solifluction answered 27/4, 2018 at 10:16 Comment(1)
It's better to download and unzip from here: nlp.stanford.edu/projects/glove - Churrigueresque
E
4

It's simple; check out this older post on SO.

import zipfile

# Replace both placeholders with real paths, e.g. "glove.6B.zip" and ".".
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)
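
If only one file is needed, as asked in the comments, zipfile can extract a single member instead of the whole archive. A minimal sketch, assuming the archive sits in the current directory; the helper name extract_one is hypothetical, not part of zipfile:

```python
import os
import zipfile


def extract_one(zip_path, member, dest_dir="."):
    """Extract a single member from a zip archive and return the path written."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        # Fail early with a clear message if the member name is wrong.
        if member not in zf.namelist():
            raise KeyError("%r not found in %r" % (member, zip_path))
        return zf.extract(member, path=dest_dir)


# Example (assumes glove.6B.zip was downloaded with wget as in the question):
# path = extract_one("glove.6B.zip", "glove.6B.300d.txt")
```

The other files in the archive are left untouched, which saves disk space on the Colab VM.
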
Ending answered 27/4, 2018 at 10:20 Comment(7)
I would like to do this on Google Colab. I do not think the GloVe zip is saved on my computer. - Solifluction
Assuming the zip file lands in the current directory, as the wget output indicates, just specify glove.6B.zip as the path. I think it should work. - Ending
File "<ipython-input-60-785ab10a0dbb>", line 2 zip_ref = zipfile.ZipFile(glove.6B.zip, 'r') ^ SyntaxError: invalid syntax - Solifluction
This needs to be corrected to zipfile.ZipFile("glove.6B.zip", 'r'); you have not put quotes (") around the file name. - Ending
Oh, thank you. And is the directory_to_extract_to line for extracting onto my computer? And instead of extractall, how can I specify one file? - Solifluction
Try zip_ref.extractall("."). Once done, you can use the os.listdir function to check which files have been extracted to the current directory. - Ending
Thanks a lot for your guidance! For now I am not getting any errors. I am accepting your answer as the correct answer. Have a nice day! - Solifluction
L
30

Another way to do it is as follows.

1. Download the zip file

!wget http://nlp.stanford.edu/data/glove.6B.zip

After downloading, the zip file is saved in the /content directory of Google Colab.

2. Unzip it

!unzip glove*.zip

3. Get the exact path of where the embedding vectors are extracted using

!ls
!pwd

4. Index the vectors

import numpy as np

print('Indexing word vectors.')

embeddings_index = {}
# Each line holds a word followed by its vector components.
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

5. Fuse with Google Drive

!pip install --upgrade pip
!pip install -U -q pydrive
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null

!apt-get -y install -qq google-drive-ocamlfuse fuse

from google.colab import auth
auth.authenticate_user()
# Generate creds for the Drive FUSE library.
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

!mkdir -p drive
!google-drive-ocamlfuse drive

6. Save the indexed vectors to Google Drive for re-use

import pickle
# The with block closes the file once the dump completes.
with open('drive/path/to/your/file/location', 'wb') as out_file:
    pickle.dump({'embeddings_index': embeddings_index}, out_file)
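
In a later session the saved index can be read back instead of re-parsing the text file. A rough sketch of the round trip; the helper names save_index and load_index are made up for illustration, and the path is whatever location was chosen in step 6:

```python
import pickle


def save_index(embeddings_index, path):
    """Pickle the word -> vector dict under the same key used in step 6."""
    with open(path, "wb") as out_file:
        pickle.dump({"embeddings_index": embeddings_index}, out_file)


def load_index(path):
    """Read the pickle back and return the word -> vector dict."""
    with open(path, "rb") as in_file:
        return pickle.load(in_file)["embeddings_index"]


# Example (the path is a placeholder, as in step 6):
# embeddings_index = load_index('drive/path/to/your/file/location')
```

Only unpickle files you created yourself, since pickle.load can run arbitrary code from the file.
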

If you have already downloaded the zip file to your local system, extract it, upload the file with the required dimension to Google Drive, mount Drive through FUSE as in step 5, and then open the file by its Drive path to use it or build an index from it.

Alternatively, if the file is already on your local system, you can upload it from code in Colab:

from google.colab import files
files.upload()

Select the file and use it as in step 3 onwards.

This is how you can work with GloVe word embeddings in Google Colaboratory. Hope it helps.

Lordship answered 3/9, 2018 at 10:42 Comment(1)
If I already have the file and try to upload it to Colab, it takes a long time, even for the 50d file. Is there another way? Also, if I download the GloVe files directly to Colab using the wget method, do I have to download them every time I open and close the notebook? - Tomato
D
2

If you have Google Drive, you can:

  1. Mount your Google Drive so that it can be used from a Colab notebook

    from google.colab import drive
    drive.mount('/content/gdrive')
    
  2. Download glove.6B.zip and extract it to a place of your choice on your Google Drive, for example

    "My Drive/Place/Of/Your/Choice/glove.6B.300d.txt"
    
  3. Open the file directly from your Colab notebook

    with io.open('/content/gdrive/My Drive/Place/Of/Your/Choice/glove.6B.300d.txt', encoding='utf8') as f:
        ...  # parse the lines as in the question
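
The body of that with block is the same parsing loop as in the question. A minimal sketch as a reusable function; load_glove is a hypothetical name, and the Drive path in the usage comment assumes the mount point from step 1 and the example folder from step 2:

```python
import io

import numpy as np


def load_glove(path):
    """Parse a GloVe text file into a dict mapping word -> float32 vector."""
    embeddings_index = {}
    with io.open(path, encoding="utf8") as f:
        for line in f:
            values = line.split()
            if not values:
                continue  # skip blank lines
            # First token is the word, the rest are the vector components.
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings_index


# Example (assumes Drive is mounted at /content/gdrive):
# index = load_glove('/content/gdrive/My Drive/Place/Of/Your/Choice/glove.6B.300d.txt')
```
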
    
Denni answered 9/11, 2018 at 14:36 Comment(0)
N
0

The top answer is fine.

Just one small addition from my side: if you get an error, add this import and it will start working.

import zipfile
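
With zipfile imported, the vectors can also be read straight out of the archive without extracting anything to disk. This is a sketch rather than what the answers above do; load_glove_from_zip is a hypothetical helper, and the archive and member names in the usage comment are the ones from the question:

```python
import io
import zipfile

import numpy as np


def load_glove_from_zip(zip_path, member):
    """Parse one GloVe text file inside a zip into a word -> vector dict."""
    embeddings_index = {}
    with zipfile.ZipFile(zip_path, "r") as zf:
        # zf.open yields bytes, so wrap it to decode lines as UTF-8 text.
        with zf.open(member) as raw, io.TextIOWrapper(raw, encoding="utf8") as f:
            for line in f:
                values = line.split()
                if not values:
                    continue  # skip blank lines
                embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings_index


# Example (assumes glove.6B.zip is in the current directory):
# embeddings_index = load_glove_from_zip("glove.6B.zip", "glove.6B.300d.txt")
```
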
Nyaya answered 9/6, 2022 at 9:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.