fasttext cannot load training txt file
W

2

6

I am trying to train a fastText classifier on Windows using the fasttext Python package. I have a UTF-8 file with lines like

__label__type1 sample sentence 1
__label__type2 sample sentence 2
__label__type1 sample sentence 3 

When I run

fasttext.supervised('data.train.txt','model', label_prefix='__label__', dim=300, epoch=50, min_count=1, ws=3, minn=4, pretrained_vectors='wiki.simple.vec')

I got the following error

File "fasttext\fasttext.pyx", line 256, in fasttext.fasttext.supervised (fasttext/fasttext.cpp:7265)
  File "fasttext\fasttext.pyx", line 182, in fasttext.fasttext.train_wrapper (fasttext/fasttext.cpp:5279)
ValueError: fastText: cannot load data.train.txt

When I check the file types in my directory, I get

__pycache__:     directory
data.train.txt:  UTF-8 Unicode text, with very long lines, with CRLF line terminators
train.py:        Python script, ASCII text executable, with CRLF line terminators
wiki.simple.vec: UTF-8 Unicode text, with very long lines, with CRLF line terminators

Also, when I train the same classifier with the same training file on macOS, it works fine. I am trying to understand why the txt file cannot be read on Windows.

Thanks!

Whangee answered 18/6, 2018 at 9:37 Comment(9)
Not sure what may be causing this behavior, but fastText was only built for Linux/macOS. It requires good C++ support. Check the official GitHub repo. You can also check out this answerKelt
Can you add more code and examples (maybe upload your files as well)? I've tried it with wiki.simple.vec and a sample of data.train.txt and it works perfectly: >>> fasttext.supervised('data.train.txt','model', label_prefix='label', dim=300, epoch=50, min_count=1, ws=3, minn=4, pretrained_vectors='wiki.simple.vec') <fasttext.model.SupervisedModel object at 0x05CE63D0>Mahon
Also, if you're not used to Windows, are you sure the cwd is pointing to the correct place? You could try using os.path.join() to construct a full, absolute path to your .txt file and see if that's still the problem. Python isn't natively a part of Windows the way it is on modern macOS, so it might behave differently than you are used to. techibee.com/python/get-current-directory-using-python/2790Hooked
@user8212173, thank you. I know it is not officially supported; I took the wheel from lfd.uci.edu/~gohlke/pythonlibs and it installed successfully on my Windows machine.Whangee
@stanjer thank you. On my laptop (macOS) it works perfectly as well, with the same txt file and the same pretrained vectors. I cannot upload the files as they belong to my company, but the format is the same as I have shown above. I also tried changing the label prefix to 'label' instead of '__label__' and still got the "cannot load" error for the txt file.Whangee
@JeffEllen thanks, it turned out that the issue was the paths. Previously I tried using full paths, which didn't work. The folder names started with n and t, so the absolute path contained "myfolder\nfolder\tfolder". The \n and \t were causing the issue; when I changed the initial letters, it was solved.Whangee
If you need to use a Windows path that includes \n and \t (or \b or \a, or a few others), here are two ways to do it: either double the backslashes ("myfolder\\nfolder\\tfolder") or use a "raw" string with an "r" in front of the string constant (r"myfolder\nfolder\tfolder").Adkisson
Thanks @DanielMartinWhangee
That will get you through this time, but I think use of the os module will be better for you in the long term. See my answer below.Hooked
H
3

TL;DR: Use the os module to safely construct paths, especially in Python 2

The error indicates that the file can't be loaded. Since the only difference between your environments is the operating system, the clue is that you're not properly locating the file, because each OS handles paths differently. I feel this is a mistake most Python programmers make at least once, because it's unexpected.

You can hardcode paths, but then you'll have a problem down the road if you ever run the code on more than one platform. In my case, I sometimes develop something quickly on Windows, but then deploy it at large scale on a *nix platform.

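I suggest instead getting used to the os module, because it works across platforms. The original poster said in a comment that they had a path of "myfolder\nfolder\tfolder", written by hand as a string literal instead of being built with the os module. In a regular Python string literal the backslash starts an escape sequence, so \n became a newline and \t became a tab, and the resulting path did not exist. Folder names starting with other letters can appear to work (Python leaves an unrecognized escape such as \m alone), but that is fragile: every backslash in a hand-written Windows path should be escaped (\\), or the string should be a raw string. Use os, and you don't have to know that.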

>>> import os
>>> os.getcwd()
'C:\\Python27'
>>> os.path.abspath(os.sep)
'C:\\'
>>> os.chdir(os.path.join(os.path.abspath(os.sep), "Users", "Jeff"))
>>> os.getcwd()
'C:\\Users\\Jeff'
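To make the escaping point concrete, here is a minimal sketch (not part of the original answer). It compares the hand-written path from the comments with the doubled-backslash, raw-string, and joined versions. It uses ntpath, the Windows implementation of os.path in the standard library, only so that the result is the same on any OS; on Windows itself, os.path.join should behave the same way. The variable names are illustrative.

```python
import ntpath  # Windows flavour of os.path; importable on any OS

# Regular string literal: \n and \t are read as newline and tab.
broken = "myfolder\nfolder\tfolder"

# Three ways to get the intended path.
doubled = "myfolder\\nfolder\\tfolder"   # backslashes escaped by hand
raw = r"myfolder\nfolder\tfolder"        # raw string: no escape processing
joined = ntpath.join("myfolder", "nfolder", "tfolder")

has_newline = "\n" in broken
has_tab = "\t" in broken
all_equal = doubled == raw == joined

print(repr(broken))   # shows the embedded newline and tab
print(all_equal)
```

Printing repr() of a path before handing it to a library is a cheap way to spot this kind of problem.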

Usually, you'll be using relative paths from your project root, not absolute paths. Those are easier; finding the root of the current OS is a little trickier (you can find that answer here).
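As a rough sketch of that idea, the snippet below resolves a file name against a base directory and checks that the file is readable before it is passed to a library such as fasttext. The helper names resolve_data_path and check_readable are hypothetical, and the default base directory (the current working directory) is an assumption; in a real project you might prefer the directory that contains your script.

```python
import os


def resolve_data_path(filename, base_dir=None):
    """Return an absolute path to filename under base_dir (default: cwd)."""
    if base_dir is None:
        base_dir = os.getcwd()
    return os.path.abspath(os.path.join(base_dir, filename))


def check_readable(path):
    """Return True if path is an existing file that we are allowed to read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


train_path = resolve_data_path("data.train.txt")
print(train_path, check_readable(train_path))
```

If check_readable returns False, the problem is the path or the permissions, not the library that reports "cannot load".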

(I'm providing the full answer as we figured it out in the comments.)

Edit: Python 3 apparently has pathlib, which this link says is better than os. I've never used Python 3, so I can't say.
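For readers on Python 3, here is a small pathlib sketch of the same idea (an illustration, not something the answer's author tested). PureWindowsPath is used only so the first result is the same on any OS; on a real machine, Path picks the right flavour automatically. Whether a given library accepts Path objects directly is an assumption you should not rely on, so the sketch converts to str.

```python
from pathlib import Path, PureWindowsPath

# Build a Windows-style path without writing a single backslash by hand.
win_path = PureWindowsPath("myfolder") / "nfolder" / "tfolder" / "data.train.txt"
print(str(win_path))

# The same idea with the native path type of the current OS.
train_path = Path.cwd() / "data.train.txt"
print(train_path.is_file())

# Older libraries may expect a plain string rather than a Path object.
train_arg = str(train_path)
```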

Hooked answered 27/6, 2018 at 23:47 Comment(3)
Seems reasonable. But we don't know the environment of the user. I've created something similar with Cygwin and it works without absolute paths.Fairhaired
@Miguel Your comment makes no sense. I never said he should use absolute paths. My answer says you usually shouldn't. My answer is completely independent of the user environment; that's the point of the os module. The original poster specifically said in a comment that they had a path of "myfolder\nfolder\tfolder". On Windows this fails because the backslashes in a hand-written path need to be escaped (\\); otherwise Python reads \n as a newline and \t as a tab. Use os, and you don't have to know that. I'll add this example to my answer.Hooked
Well, actually I was trying to say that you do not need to define the path if the file is in the root folder; that is OS-independent.Fairhaired
F
0

It took me a little while to set up an environment to test your code. What worked for me on Windows was installing fastText in Cygwin. I hope this answer is useful to someone with a similar issue.

Environment

  • Windows 10

  • CYGWIN_NT-10.0 DESKTOP-RR909JI 2.10.0(0.325/5/3) 2018-02-02 15:16 x86_64

  • gcc-g++: 7.3 | gcc-core 7.3

  • Python 2.7 | Python2-Cython 0.25.2 | python2pip | Python2-devel

  • pip install fastText

Files

user@DESKTOP-RR909JI ~/projects
$ file *
data.txt:         ASCII text
data.train.txt:   Big-endian UTF-16 Unicode text
fasttext_ie.py:   Python script, ASCII text executable
model.bin:        data
wiki.simple.vec:  UTF-8 Unicode text, with very long lines 

fasttext_ie.py

#!/usr/bin/python
import fasttext

fasttext.supervised('data.txt','model', label_prefix='__label__', dim=300, epoch=50, min_count=1, ws=3, minn=4, pretrained_vectors='wiki.simple.vec')

I downloaded the pre-trained word vectors (wiki.simple.vec) from here. I copied your input example into data.txt and made a UTF-16 version of it, data.train.txt.

After executing your code snippet, it took a while, but a model file was generated. This only happened with the ASCII text file:

user@DESKTOP-RR909JI ~/projects
$ ls -ltrh model.bin
-rw-r--r-- 1 user user 129M jun. 28 00:56 model.bin

It contains lots of strings:

qateel
olympiques
lesothosaurus
delillo
satrapi
conferencing
numan
echinodermata
haast
tangerines
duat
vesey
rotaviruses
velox
chepstow
capitale
rock/pop
belasco
sardanapalus
jadis
macintyre

When trying with UTF-16

It didn't generate the file, but the process didn't finish either; it just kept running without terminating.

So we can say it failed.

Note that the fastText documentation mentions only UTF-8 as a supported encoding:

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.

It could be that the version I installed through Cygwin is somehow different.

Also, after reading this question on Stack Overflow, I would like to ask: have you tried converting the file to ASCII and testing what happens?
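If you want to try that, the following is a minimal sketch of one way to normalize a training file; it is my own illustration, not something from the fastText documentation. It assumes the file is either UTF-8 or UTF-16 with a byte-order mark (other encodings are not handled), and it writes BOM-less UTF-8 with LF line endings. The function names sniff_bom and rewrite_as_utf8 are hypothetical.

```python
import codecs
import io


def sniff_bom(path):
    """Guess the encoding from the byte-order mark; assume UTF-8 if there is none."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return "utf-8"


def rewrite_as_utf8(src, dst):
    """Re-encode src as BOM-less UTF-8 with LF line endings; return the guessed encoding."""
    encoding = sniff_bom(src)
    # Reading in text mode with the default newline handling turns CRLF into LF.
    with io.open(src, "r", encoding=encoding) as reader:
        text = reader.read()
    # newline="\n" writes LF as-is, so no CRLF is reintroduced on Windows.
    with io.open(dst, "w", encoding="utf-8", newline="\n") as writer:
        writer.write(text)
    return encoding
```

A call such as rewrite_as_utf8('data.train.txt', 'data.train.utf8.txt') would produce a copy to train on; the output file name here is just an example.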

All my files were in the same root directory.

I don't know fastText, but I wanted to run your code, and it works. I had issues with the gcc libraries; I had to install the same version of gcc-g++ and gcc-core.

Fairhaired answered 28/6, 2018 at 5:32 Comment(0)
