dos2unix: Binary symbol 0x04 found at line 1703

I downloaded a file ('CRS 2013 data.txt') from the OECD at http://stats.oecd.org/Index.aspx?datasetcode=CRS1 by selecting Export -> Related files. I want to work with this file on Ubuntu (14.04 LTS).

When I run:

dos2unix CRS\ 2013\ data.txt

I see:

dos2unix: Binary symbol 0x0004 found at line 1703
dos2unix: Skipping binary file CRS 2013 data.txt

I check the encoding of the file with:

file --mime-encoding CRS\ 2013\ data.txt

and see:

CRS 2013 data.txt: utf-16le

I do:

iconv -l | grep utf-16le

which doesn't return anything so I do:

iconv -l | grep UTF-16LE

which returns:

UTF-16LE//

Then I run:

iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt

and check:

file --mime-encoding crs_2013_data_temp.txt

and see:

crs_2013_data_temp.txt: utf-8

Then I try:

dos2unix crs_2013_data_temp.txt

and get:

dos2unix: Binary symbol 0x04 found at line 1703
dos2unix: Skipping binary file crs_2013_data_temp.txt

I then try to force it:

dos2unix -f crs_2013_data_temp.txt

It works, i.e., dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÄŤa and ÄŚajniÄŤe".

My question is: why? Is it because the BOM is not visible to dos2unix? Because it's missing? Have I not done the conversion right? How do I convert this file correctly so that I can read it?

Pearlpearla asked 28/4, 2015 at 15:11 Comment(0)

That 0x0004 character you are seeing in your file has nothing at all to do with the BOM (which is fine, by the way) -- it's an EOT (End of Transmission) character from the C0 control set, and has been at that codepoint since 7-bit ASCII was the new hotness. (It's also the familiar Control-D Unix EOF sequence.)

Unfortunately, the pre-dos2unix approach of using tr to strip the carriage returns won't work directly on this file, because it is UTF-16. Since iconv works for you, though, you can use it to convert the file to UTF-8 (which tr can handle) and then run this tr command:

tr -d '\r' < crs_2013_data_temp.txt > crs_2013_data_unix.txt

in order to get the text file into the Unix line ending convention. You will have to keep an eye on whatever tools you're feeding the file to, though, to make sure that they don't choke on the Ctrl-D/EOT character; if they do, you can use

tr -d '\004' < crs_2013_data_unix.txt > crs_2013_data_clean.txt

to get rid of it.
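
For what it's worth, the iconv step and the two tr passes can be folded into a single pipeline. What follows is only a sketch, assuming an iconv that knows UTF-16LE and a tr that accepts octal ranges (GNU coreutils does); the function name to_clean_utf8 and the sample file are made up for illustration. Note that it deletes every C0 control byte except tab and newline, which is more aggressive than the two commands above. Working byte-wise is safe on UTF-8 because bytes below 0x20 never occur inside a multi-byte sequence.

```shell
# Hypothetical helper (not from the original answer): convert a UTF-16LE file
# to UTF-8 and delete carriage returns plus the other C0 control bytes,
# keeping only tab (\011) and newline (\012).
to_clean_utf8() {
  iconv -f UTF-16LE -t UTF-8 "$1" | LC_ALL=C tr -d '\000-\010\013-\037' > "$2"
}

# Self-contained demo on a made-up sample with CRLF line endings and one EOT.
demo_dir=$(mktemp -d)
printf 'one\r\ntwo\004three\r\n' | iconv -f UTF-8 -t UTF-16LE > "$demo_dir/sample_utf16.txt"
to_clean_utf8 "$demo_dir/sample_utf16.txt" "$demo_dir/sample_clean.txt"
cat "$demo_dir/sample_clean.txt"
```

On the real file this would be called as to_clean_utf8 'CRS 2013 data.txt' crs_2013_data_clean.txt. If your tools are happy with the control characters, stick with the plain tr -d '\r' instead.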

As to how it got there in the first place? I blame the Belgians for letting it sneak into the data they gave the OECD, which they probably keyed in with cat - > file or some other similarly underwhelming means. Also, some text editors try to be a bit too helpful by hiding control characters, even though other tools will bail out when they see them as they think you just stuffed a binary file in that was pretending to be text for a while.

Adeline answered 28/4, 2015 at 16:3 Comment(5)
How do I/you know that the BOM is OK? Is it because: file --mime-encoding CRS\ 2013\ data.txt returns utf-16le and dos2unix attempts to convert the file until it finds the first binary symbol and dos2unix can only detect if a file is in the UTF-16 format if the file has a BOM? -- Pearlpearla
I tried both of these commands and then tried to dos2unix the crs_2013_data_clean.txt file and discovered another 9 binary symbols (0x03, 0x1c, 0x1d, 0x00, 0x01, 0x02, 0x05, 0x19 and 0x13). After I stripped them out using the command that you suggested, dos2unix finally worked. At this stage, should I be using the -m flag with dos2unix to add the BOM? -- Pearlpearla
@user4842454 -- the BOM is OK -- I verified this by manually inspecting the file in vim. You don't need to run dos2unix on it any longer after the first tr command I gave you, by the way -- it's equivalent to dos2unix for a UTF-8, ISO-8859-X, or ASCII file. -- Adeline
I tried :setlocal bomb? in vim and got bomb 1,1 Top, is that it? OK (regarding not needing to run dos2unix after removing the carriage returns with the tr command). Do I need to strip out the remaining binary symbols (END OF TEXT (\003, 0x03), INFORMATION SEPARATOR FOUR (\034, 0x1c))? I am asking because running dos2unix after running the second tr command alerted me to the presence of these additional binary symbols and if I do need to strip them out, how would I find out that they are present otherwise? -- Pearlpearla
@user4842454 -- it depends entirely on whether the tools you are feeding them to are fazed by the occasional control character in the data. (And your results in VIM show that the BOM is just fine.) -- Adeline
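
On the question of how to find stray control characters without waiting for dos2unix to complain: one possible approach is to keep only the control bytes and count them. This is a rough sketch, not something from the thread; the function name list_control_bytes and the sample file are invented for illustration, and it assumes the file has already been converted to UTF-8 (on the UTF-16 original every ASCII character carries a 0x00 byte, so the counts would be meaningless).

```shell
# Hypothetical helper: print a count and hex value for each distinct C0
# control byte in a file, ignoring tab (\011), newline (\012) and CR (\015).
list_control_bytes() {
  LC_ALL=C tr -cd '\000-\010\013\014\016-\037' < "$1" |  # keep only control bytes
    od -An -v -tx1 |                                      # dump them as hex
    tr -s ' ' '\n' |                                      # one hex value per line
    sed '/^$/d' |                                         # drop empty lines
    sort | uniq -c                                        # count each value
}

# Self-contained demo on a made-up sample containing one ETX and two EOTs.
demo_file=$(mktemp)
printf 'a\003b\004c\004\n' > "$demo_file"
list_control_bytes "$demo_file"
```

An empty result means none of those bytes are present. The hex values it prints can be turned into octal escapes for tr -d if you decide to strip them.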

I think this command is OK for your problem:

cat file | tr -d "\r" > new_file
Stymie answered 23/6, 2017 at 6:31 Comment(0)

This is how I solved it:

find . -type f -exec sed -i 's/\r//' {} \;
Ataghan answered 2/7, 2018 at 11:41 Comment(0)
