I downloaded a file ('CRS 2013 data.txt') from the OECD (http://stats.oecd.org/Index.aspx?datasetcode=CRS1) by selecting Export -> Related files. I want to work with this file in Ubuntu (14.04 LTS).
When I run:
dos2unix CRS\ 2013\ data.txt
I see:
dos2unix: Binary symbol 0x0004 found at line 1703
dos2unix: Skipping binary file CRS 2013 data.txt
I check the encoding of the file with:
file --mime-encoding CRS\ 2013\ data.txt
and see:
CRS 2013 data.txt: utf-16le
I do:
iconv -l | grep utf-16le
which doesn't return anything so I do:
iconv -l | grep UTF-16LE
which returns:
UTF-16LE//
Then I run:
iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt
and check:
file --mime-encoding crs_2013_data_temp.txt
and see:
crs_2013_data_temp.txt: utf-8
Then I try:
dos2unix crs_2013_data_temp.txt
and get:
dos2unix: Binary symbol 0x04 found at line 1703
dos2unix: Skipping binary file crs_2013_data_temp.txt
I then try to force it:
dos2unix -f crs_2013_data_temp.txt
It works, i.e., dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÄŤa and ÄŚajniÄŤe".
My question is: why? Is it because the BOM is not visible to dos2unix? Because it's missing? Have I not done the conversion right? How do I convert this file correctly so that I can read it?
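For anyone who wants to reproduce the BOM question without the OECD file, here is a minimal sketch. The sample file, its contents and the variable names are made up for illustration; it only assumes that the real export looks similar (UTF-16LE, CRLF line endings, a stray 0x04 byte). It checks the first two bytes for a BOM, then converts with iconv and strips the carriage returns and the control byte with tr instead of dos2unix.

```shell
# Hypothetical stand-in for the OECD export: UTF-16LE without a BOM,
# CRLF line endings, and one stray 0x04 control character.
tmpdir=$(mktemp -d)
printf 'Fo\304\215a\r\nbad\004line\r\n' | iconv -f UTF-8 -t UTF-16LE > "$tmpdir/sample.txt"

# 1. Inspect the first two bytes. "fffe" would be a UTF-16LE BOM;
#    anything else means the file carries no BOM.
bom=$(head -c 2 "$tmpdir/sample.txt" | od -An -tx1 | tr -d ' \n')
echo "first two bytes: $bom"

# 2. Convert to UTF-8 first, then delete carriage returns and the 0x04 byte.
#    This is safe on UTF-8 because multi-byte sequences never contain
#    bytes below 0x80.
converted=$(iconv -f UTF-16LE -t UTF-8 "$tmpdir/sample.txt" | tr -d '\r\004')
echo "$converted"

rm -r "$tmpdir"
```

If the converted text looks right in the terminal but garbled in an editor, it may be worth checking which encoding the editor assumes when it opens the file.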
file --mime-encoding CRS\ 2013\ data.txt returns utf-16le, and dos2unix attempts to convert the file until it finds the first binary symbol, and dos2unix can only detect if a file is in the UTF-16 format if the file has a BOM? – Pearlpearla