How to vectorize whole text using fasttext?

To get the vector of a word, I can use:

model["word"]

But to get the vector of a sentence, I need to either sum or average the vectors of all its words.

Does FastText provide a method to do this?
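
For reference, the manual approach described above can be sketched as follows. This is only an illustration: `average_word_vectors` and the toy `lookup` dictionary are hypothetical stand-ins for a real model lookup such as `model["word"]`, and words missing from the lookup are simply skipped.

```python
import numpy as np

def average_word_vectors(sentence, lookup, dim):
    """Average the vectors of the whitespace-separated words in `sentence`.

    `lookup` is any mapping from word to vector (hypothetical stand-in
    for a trained model). Unknown words are skipped; if no word is known,
    a zero vector of length `dim` is returned.
    """
    vectors = [np.asarray(lookup[w], dtype=float)
               for w in sentence.split() if w in lookup]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional "embeddings", purely for illustration
toy = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
print(average_word_vectors("hello world", toy, dim=2))
```

Replacing `np.mean` with `np.sum` gives the summed variant instead of the average.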

Farther answered 17/4, 2017 at 16:6 Comment(1)
Do you have any idea about an implementation in Java? Orgiastic

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.

This is documented in the README of the fastText repo: https://github.com/facebookresearch/fastText
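
If you capture the command's output (for example by redirecting it to a file), you may want to load it back for further processing. Here is a minimal sketch, assuming each output line holds one vector as whitespace-separated numbers; `parse_sentence_vectors` is a hypothetical helper, not part of fastText.

```python
import numpy as np

def parse_sentence_vectors(output):
    """Turn captured text (one vector per line) into a 2-D NumPy array.

    Assumes each non-empty line is a list of whitespace-separated floats.
    """
    rows = [[float(x) for x in line.split()]
            for line in output.splitlines() if line.strip()]
    return np.array(rows)

# Made-up sample that mimics two 3-dimensional vectors
sample = "0.1 0.2 0.3 \n0.4 0.5 0.6 \n"
vectors = parse_sentence_vectors(sample)
print(vectors.shape)
```

Row `i` of the result then corresponds to line `i` of text.txt, provided the input had no lines that produced empty output.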

Wrecker answered 18/4, 2017 at 5:44 Comment(10)
Is there another implementation using Java? Orgiastic
AFAIK, fastText supports only a CLI for now. But I was able to find a library that provides a Python interface to fastText. You can google to see if you can find one in Java. Wrecker
I found one, github.com/vinhkhuc/JFastText, but I have the same question as @Andrey. It seems I should loop over the lines, then loop over the words in each line to get a vector for each one, but how do I get the total? I couldn't find anything like the command you posted. Orgiastic
jft.runCmd(new String[] { "supervised", "-input", "src/test/resources/data/labeled_data.txt", "-output", "src/test/resources/models/supervised.model" }); This snippet is taken from the library you mentioned. You can run the 'print-sentence-vectors' command in the same way, but you will have to figure out how to pass in the parameters, as I don't know much about running commands from Java code. Wrecker
Thanks for replying. I need to process the data line by line since I'm using this in real time, so I think it would be wrong to pass the whole file at once, right? Orgiastic
No, the purpose of the 'print-sentence-vectors' command is to give you the vectors of all the lines in a file. If you look at the command again, 'text.txt' is a file that contains preprocessed data (i.e. one paragraph per line). You just have to put all your sentences in a file in that format and pass that file to 'print-sentence-vectors' as input. Wrecker
You mean that each call to print-sentence-vectors handles a single line, not all the lines in the file at once? Orgiastic
Okay, this is getting really difficult to explain :P I'll try to explain in simpler words. When you call print-sentence-vectors, you provide it a file (your input file with lots of paragraphs or sentences, where one line of the file is treated as one paragraph). You can have as many paragraphs in a file as you like. You have to call print-sentence-vectors only once, and it will output the vectors of all the lines in the input file. I suggest you go through the fastText docs; everything is explained there nicely. :) Wrecker
Many thanks, Aanchal, for helping, and sorry for the late reply. I still want to make sure I understood correctly: I call print-sentence-vectors only once and it outputs vectors for all lines in the file, like a for loop? I really appreciate your help and patience. Orgiastic
@AanchalSharma Thanks a lot for your great answer. Please let me know if you know an answer for this: #46923566 Cribbs

You can also use the Python wrapper. Install it by following the official installation guide: https://fasttext.cc/docs/en/python-module.html#installation

After that:

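import fasttext
model = fasttext.load_model('model.bin')
# Vector for a single sentence
vect = model.get_sentence_vector("some string")
# One vector per line; `text` is an iterable of lines (e.g. an open file).
# Newlines are stripped because get_sentence_vector expects a single line.
vect2 = [model.get_sentence_vector(el.replace('\n', '')) for el in text]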
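
If you need the result as a single matrix, the list comprehension above can be wrapped in a small helper. This is a sketch under assumptions: `embed_lines` is a hypothetical name, and it only relies on the `get_sentence_vector` method shown above, so any object exposing that method will work.

```python
import numpy as np

def embed_lines(model, lines):
    """Return one sentence vector per non-empty line, stacked row-wise.

    `model` is any object with a get_sentence_vector(str) method,
    e.g. the model loaded with fasttext.load_model above.
    Surrounding whitespace (including the trailing newline) is stripped,
    and blank lines are skipped.
    """
    cleaned = (line.strip() for line in lines)
    return np.array([model.get_sentence_vector(t) for t in cleaned if t])

# Usage with a real model (not run here; 'text.txt' is a placeholder path):
# with open('text.txt', encoding='utf-8') as f:
#     matrix = embed_lines(model, f)
```

Skipping blank lines means the rows no longer line up one-to-one with the file if it contains empty lines, so drop that filter if you need the alignment.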
Meanie answered 17/6, 2020 at 11:35 Comment(1)
Note that this is pretty difficult to get working on Windows. I'd suggest using gensim instead. Unnamed

To get the vector for a sentence using fastText, try the following command:

$ echo "Your Sentence Here" | ./fasttext print-sentence-vectors model.bin

For an example, refer to "Learn Word Representations In Fasttext".

Camelback answered 7/9, 2017 at 12:16 Comment(0)