Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them to text and tried importing them into TensorFlow's live demo of the Embedding Projector. One problem: it didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions.
Equivalent to my code:

import gensim

corpus = [["words", "in", "sentence", "one"], ["words", "in", "sentence", "two"]]

model = gensim.models.Word2Vec(iter=5, size=64)
model.build_vocab(corpus)
# without this step the vectors stay at their random initialization
model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)

# save memory: keep only the word vectors
vectors = model.wv
del model

vectors.save_word2vec_format("vect.txt", binary=False)

That creates the model, saves the vectors, and then prints the results out nice and pretty in a tab-delimited file with values for all of the dimensions. I understand how to do what I'm doing; I just can't figure out what's wrong with the way I put it into TensorFlow, as the documentation on that is pretty scarce as far as I can tell.
One idea that has been presented to me is writing the appropriate TensorFlow code myself, but I don't know how to do that; I only know how to import files into the live demo.

Edit: I have a new problem now. The object holding my vectors is not iterable, because gensim apparently decided to make its own data structures that are incompatible with what I'm trying to do.
Ok, done with that too! Thanks for your help!

Buerger answered 23/5, 2018 at 15:50 Comment(3)
Would it be possible to add the tensorflow part of the code? ThanksLeighannleighland
Not sure. Never tried, and besides, I can’t figure out how to create an embedding projector visualization using tensorflow. Although I will edit the question accordingly.Buerger
If one of the answers below answered your question, the way this site works, you'd "accept" the answer, more here: What should I do when someone answers my question?. But only if your question really has been answered. If not, consider adding more details to the question.Larcher

What you are describing is possible. The key thing to keep in mind is that TensorBoard reads from saved TensorFlow binaries which represent your variables on disk.

More information on saving and restoring a TensorFlow graph and variables is available here

The main task is therefore to get the embeddings as saved tf variables.

Assumptions:

  • in the following code, embeddings is a python dict {word: np.array} where each array has shape [embedding_size] (a sketch of building it from the question's gensim vectors follows this list)

  • python version is 3.5+

  • used libraries are numpy as np, tensorflow as tf

  • the directory to store the tf variables is model_dir/
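
If you are starting from the gensim KeyedVectors object of the question, a minimal sketch of building such a dict (assuming the gensim 3.x API, where KeyedVectors exposes index2word):

# hypothetical bridge from the question's `vectors` (a gensim
# KeyedVectors object) to the `embeddings` dict assumed below
embeddings = {word: vectors[word] for word in vectors.index2word}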


Step 1: Stack the embeddings to get a single np.array

embeddings_vectors = np.stack(list(embeddings.values()), axis=0)
# shape [n_words, embedding_size]

Step 2: Save the tf.Variable on disk

# Create the embedding variable.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')

# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, initialize the variables and save them to disk.
with tf.Session() as sess:
    sess.run(init_op)

    # Save the variables to disk.
    save_path = saver.save(sess, "model_dir/model.ckpt")
    print("Model saved in path: %s" % save_path)

model_dir should now contain the files checkpoint, model.ckpt.data-00000-of-00001, model.ckpt.index and model.ckpt.meta
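
Note that this snippet uses the TensorFlow 1.x API. If you are on TensorFlow 2.x, where tf.Session and tf.train.Saver are no longer in the default namespace, a hedged equivalent uses tf.train.Checkpoint:

import tensorflow as tf  # TensorFlow 2.x

emb = tf.Variable(embeddings_vectors, name='word_embeddings')

# tf.train.Checkpoint replaces tf.train.Saver in TF 2.x; this writes
# model_dir/model.ckpt-1.* plus a checkpoint file
ckpt = tf.train.Checkpoint(word_embeddings=emb)
ckpt.save('model_dir/model.ckpt')

Keep in mind that a TF 2.x checkpoint records the tensor under a different internal name than the TF 1.x saver, so check which name the Projector lists before attaching metadata to it.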


Step 3: Generate a metadata.tsv

To have a beautiful labeled cloud of embeddings, you can provide tensorboard with metadata as Tab-Separated Values (tsv) (cf. here).

import os

words = '\n'.join(list(embeddings.keys()))

# encoding="utf-8" avoids write errors on non-ASCII vocabulary
with open(os.path.join('model_dir', 'metadata.tsv'), 'w', encoding='utf-8') as f:
    f.write(words)

# .tsv file written to model_dir/metadata.tsv

Step 4: Visualize

Run $ tensorboard --logdir model_dir, then open the Projector tab.

To load the metadata, the magic happens in the Projector's left-hand panel, where you point it at metadata.tsv:

[screenshot: load_meta]
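
If you would rather not load the metadata by hand on every run, one option (a sketch, assuming the variable name word_embeddings from Step 2) is to write a projector_config.pbtxt next to the checkpoint, which TensorBoard picks up automatically:

import os

# TensorBoard reads this file from the logdir and links metadata.tsv
# to the checkpointed tensor; tensor_name must match the variable name
# (older TensorBoard versions may require an absolute metadata_path)
config = '''embeddings {
  tensor_name: "word_embeddings"
  metadata_path: "metadata.tsv"
}
'''
with open(os.path.join('model_dir', 'projector_config.pbtxt'), 'w') as f:
    f.write(config)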


As a reminder, some word2vec embedding projections are also available on http://projector.tensorflow.org/

Larcher answered 24/5, 2018 at 0:3 Comment(8)
Thanks! Will try when I get back on my PC!Buerger
Actually I do have a question. Does this method require column headers on all the features?Buerger
No it does not, all you need to feed is the stacked tensor of embeddingsLarcher
I meant the .tsv, but I suppose that also answers that question in a way.Buerger
There's a typo in your solution. Step 1 should be embeddings_vectors = np.stack(list(embeddings.values()), axis=0) or otherwise it won't run ( there will be syntax error due to a missing ")" ).Bunin
Don't forget to add the encoding="utf-8" when you write the 'metadata.tsv', otherwise you may get errors depending on the characters you have, so it should be: with open(os.path.join('model_dir', 'metadata.tsv'), 'w', encoding="utf-8") as f:Denmark
Thank you for posting all those steps to answer this post. I have a question about 'the directory to store the tf variables is model_dir/'. I am very new to this so I am not sure how to set up a directory to meet this requirement. Do you mind elaborating more on that?Zion
@Zion any directory can be used. I just chose to name it model_dir for the sake of example. You will see how model_dir is referenced in the different steps. If you do not like the name of it, feel free to change all references to model_dir to what suits you bestLarcher

Gensim actually has an official way to do this: the gensim.scripts.word2vec2tensor script.

Documentation about it
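
For example, a short sketch (assuming a word2vec-format file like the vect.txt saved in the question):

from gensim.scripts.word2vec2tensor import word2vec2tensor

# writes vect_tensor.tsv and vect_metadata.tsv, ready to upload to
# http://projector.tensorflow.org/; the CLI equivalent is
#   python -m gensim.scripts.word2vec2tensor -i vect.txt -o vect
word2vec2tensor('vect.txt', 'vect')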

Thunderstruck answered 20/8, 2019 at 1:41 Comment(4)
Thanks a lot. This worked. However, the metadata file it creates is basically just the vocabulary. In my case, I wanted to add a few covariates (movie genres in the MovieLens dataset), so I ended up reading its metadata.tsv and adding my own columns to it before loading it into tensorboard.Yeast
I'm glad it helped you!Thunderstruck
Marco, can you add an example? The documentation host works like Windows 3.11 (year 1993)Corry
Sorry. Are you able to reproduce the documentation, Gonzalo?Thunderstruck

The above answers didn't work for me. What I found pretty useful was this script (to be added to gensim in the future) Source

To transform the data to metadata:

from gensim.models import KeyedVectors

# in current gensim, load_word2vec_format lives on KeyedVectors,
# not on Word2Vec
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

# opening with encoding='utf-8' replaces the Python 2 style
# word.encode('utf-8'), which breaks under Python 3
with open(tensorsfp, 'w', encoding='utf-8') as tensors, \
        open(metadatafp, 'w', encoding='utf-8') as metadata:
    for word in model.index2word:
        metadata.write(word + '\n')
        vector_row = '\t'.join(map(str, model[word]))
        tensors.write(vector_row + '\n')

Or follow this gist

Corry answered 28/1, 2020 at 14:57 Comment(0)

Gensim provides a script that converts a word2vec model into TF Projector files:

python -m gensim.scripts.word2vec2tensor -i ~w2v_model_file -o output_folder

Then, on the projector website, upload the generated tensor and metadata files. [screenshot: loading the files into the Projector]

Whereabouts answered 30/11, 2021 at 11:22 Comment(0)
