It took me a little time to set up an environment to test your code. What worked for me on Windows was installing fastText in Cygwin. I hope this answer is useful to someone with a similar issue.
Environment
Windows 10
CYGWIN_NT-10.0 DESKTOP-RR909JI 2.10.0(0.325/5/3) 2018-02-02 15:16 x86_64
gcc-g++: 7.3 | gcc-core 7.3
Python 2.7 | Python2-Cython 0.25.2 | python2-pip | Python2-devel
pip install fastText
Files
user@DESKTOP-RR909JI ~/projects
$ file *
data.txt: ASCII text
data.train.txt: Big-endian UTF-16 Unicode text
fasttext_ie.py: Python script, ASCII text executable
model.bin: data
wiki.simple.vec: UTF-8 Unicode text, with very long lines
fasttext_ie.py
#!/usr/bin/python
import fasttext
# Train a supervised classifier on data.txt, starting from the pre-trained vectors.
fasttext.supervised('data.txt', 'model', label_prefix='__label__', dim=300, epoch=50,
                    min_count=1, ws=3, minn=4, pretrained_vectors='wiki.simple.vec')
I downloaded the pre-trained word vectors (wiki.simple.vec) from here.
I copied your input example into data.txt and made a UTF-16 version of it, data.train.txt.
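For anyone who wants to reproduce the UTF-16 copy, here is a rough sketch of one way to do it in Python. The helper name `write_utf16_be_copy` is made up for illustration, and it assumes the source file is UTF-8 (plain ASCII qualifies); it is not necessarily how the file in this answer was produced.

```python
import codecs
import io


def write_utf16_be_copy(src_path, dst_path):
    """Write a big-endian UTF-16 copy (with BOM) of a UTF-8 text file.

    Hypothetical helper for illustration only.
    """
    # Assumption: the source file is UTF-8 (ASCII is a subset of UTF-8).
    with io.open(src_path, "r", encoding="utf-8") as src:
        text = src.read()
    with io.open(dst_path, "wb") as dst:
        # The "utf-16-be" codec does not emit a byte order mark,
        # so write it explicitly before the encoded text.
        dst.write(codecs.BOM_UTF16_BE)
        dst.write(text.encode("utf-16-be"))
```

Calling `write_utf16_be_copy("data.txt", "data.train.txt")` should give a file that `file` reports as big-endian UTF-16.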
After executing your code snippet, it took a while, but a model file was generated. This only happened with the ASCII text file:
user@DESKTOP-RR909JI ~/projects
$ ls -ltrh model.bin
-rw-r--r-- 1 user user 129M jun. 28 00:56 model.bin
It contains lots of strings:
qateel
olympiques
lesothosaurus
delillo
satrapi
conferencing
numan
echinodermata
haast
tangerines
duat
vesey
rotaviruses
velox
chepstow
capitale
rock/pop
belasco
sardanapalus
jadis
macintyre
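If you want to peek inside a binary file in a similar way without extra tools, a small sketch along the lines of the Unix `strings` utility is below. The function name `printable_runs` and the minimum length of 4 are my own choices for illustration, and it only looks for printable ASCII runs.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of printable ASCII bytes at least min_len long.

    Hypothetical helper, loosely modelled on the `strings` tool.
    """
    # Bytes 0x20-0x7e are the printable ASCII range (space through tilde).
    pattern = re.compile(
        br"[\x20-\x7e]{" + str(min_len).encode("ascii") + b",}"
    )
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

You could pass it the bytes read from model.bin (opened in `"rb"` mode) and print the first few results.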
When trying with UTF-16
It didn't generate the file, and the process never finished either; it just kept running without terminating.
So we can say it failed.
That is despite the fastText documentation stating that UTF-8 is supported:
where data.txt is a training file containing UTF-8 encoded text. By
default the word vectors will take into account character n-grams from
3 to 6 characters. At the end of optimization the program will save
two files: model.bin and model.vec. model.vec is a text file
containing the word vectors, one per line. model.bin is a binary file
containing the parameters of the model along with the dictionary and
all hyper parameters. The binary file can be used later to compute
word vectors or to restart the optimization.
It could be that the version I installed through Cygwin is somehow different.
Also, after reading this question on Stack Overflow, I would like to ask: have you tried converting the file to ASCII and testing what happens?
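In case it helps with that test, here is a minimal sketch of re-encoding a file to UTF-8 in Python (for text with only ASCII characters, the result is plain ASCII). It assumes the file starts with a byte order mark and falls back to treating it as UTF-8 otherwise. The names `detect_bom_encoding` and `convert_to_utf8` are invented for this example, and UTF-32 is not handled.

```python
import codecs
import io


def detect_bom_encoding(raw):
    """Return a codec name based on a leading BOM, or None if there is none.

    Hypothetical helper; only UTF-8 and UTF-16 BOMs are recognised.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # strips the BOM while decoding
    if raw.startswith(codecs.BOM_UTF16_BE) or raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"  # consumes the BOM and picks the byte order from it
    return None


def convert_to_utf8(src_path, dst_path, fallback="utf-8"):
    """Re-encode src_path as UTF-8 (no BOM) and return the codec used to read it."""
    with io.open(src_path, "rb") as src:
        raw = src.read()
    # Assumption: a file without a BOM is already in the fallback encoding.
    encoding = detect_bom_encoding(raw) or fallback
    text = raw.decode(encoding)
    with io.open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))
    return encoding
```

For example, `convert_to_utf8("data.train.txt", "data.utf8.txt")` would write a UTF-8 copy that can then be passed to the training call in place of the UTF-16 file; the output file name here is just a placeholder.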
All my files were in the same root directory.
I don't know fastText, but I wanted to run your code, and it works. I had issues with the gcc libraries; I had to install the same version of gcc-g++ and gcc-core.
"myfolder\\nfolder\\tfolder"
) or use a "raw" string with an "r" in front of the string constant (r"myfolder\nfolder\tfolder") – Adkisson