dos2unix modifies binary files - why
C

2

5

By default it is not supposed to affect binary files.

I tested it in a folder with images and although most images were not affected, a few were. If dos2unix cannot tell a binary file from a text file, must I resort to specifically including and/or excluding certain file extensions for it to work properly?

NOTE: when I run file image.jpg on any of the jpgs, whether it got modified or not, the result is:

JPEG image data, JFIF standard 1.01
Cleodell answered 14/12, 2015 at 0:3 Comment(7)
What makes you think it's not supposed to affect binary files "by default"? It just replaces CR/LF sequences with LFs. - Draff
"Cannot tell a binary file from a text file": there is no difference; every file is binary. - Moribund
https://mcmap.net/q/82865/-convert-dos2unix-line-endings-for-all-files-in-a-directory claims it does, but I suppose I should check the man page. - Cleodell
@Moribund Thanks, that's very useful. Unfortunately, I need to make sure image files are not altered because that will corrupt them, whereas changing newlines in php, js, phtml, and txt files will not corrupt those files. - Cleodell
All right, then I suggest checking for the right file extension before you apply dos2unix. - Moribund
From man dos2unix on CentOS 6.6: "Binary files are automatically skipped, unless conversion is forced." - Medwin
@Medwin Any idea why dos2unix wouldn't work? I'm on CentOS 6.7. - Cleodell
M
6

This is the relevant part of the dos2unix source code:

if ((ipFlag->Force == 0) &&
      (TempChar < 32) &&
      (TempChar != 0x0a) &&  /* Not an LF */
      (TempChar != 0x0d) &&  /* Not a CR */
      (TempChar != 0x09) &&  /* Not a TAB */
      (TempChar != 0x0c)) {  /* Not a form feed */
        RetVal = -1; 
        ipFlag->status |= BINARY_FILE ;
        if (ipFlag->verbose) {
          if ((ipFlag->stdio_mode) && (!ipFlag->error)) ipFlag->error = 1;
          d2u_fprintf(stderr, "%s: ", progname);
          d2u_fprintf(stderr, _("Binary symbol 0x00%02X found at line %u\n"),TempChar, line_nr);
        }
        break;
      } 

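It seems that a file is considered binary, and skipped, only if it contains a control character (a byte below 32) other than LF, CR, TAB, or form feed; otherwise it is processed as a text file. So if a binary file (e.g. an image) happens to contain none of these characters, it is treated as text, and any CR LF byte pairs in it are rewritten, which corrupts it.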
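To see which files this check would let through, the same condition can be approximated from the shell. This is only a sketch of the test quoted above, not dos2unix's full logic: the function name looks_binary_to_dos2unix is made up for this example, and it assumes a tr that copes with NUL bytes and octal ranges (GNU tr does).

```shell
# Hypothetical helper, not part of dos2unix.
# Succeeds (exit status 0) if the file contains at least one byte that the
# excerpt above treats as a "binary symbol": a control character below 0x20
# other than TAB (0x09), LF (0x0a), FF (0x0c), or CR (0x0d).
looks_binary_to_dos2unix() {
    # Delete every byte the check tolerates; count whatever is left over.
    count=$(LC_ALL=C tr -d '\011\012\014\015\040-\377' < "$1" | wc -c)
    # $((count)) strips the padding some wc implementations add.
    [ "$((count))" -gt 0 ]
}
```

For example, looks_binary_to_dos2unix image.jpg && echo "skipped" prints "skipped" only when image.jpg contains at least one such control byte; by this reading, any file for which it prints nothing would be converted.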

Medwin answered 14/12, 2015 at 0:30 Comment(4)
file properly identifies these files when dos2unix does not. Even if I rename a php file to give it a jpg extension, file still identifies it properly as PHP script text. - Cleodell
Neither dos2unix nor file uses the file name to determine the file type. The code snippet shows how dos2unix determines the type of the file. The file program is more complex: first it tries to use the operating system's knowledge of special files (e.g. device files), and after that it checks the contents of the file; see man file for details. Actually, you can omit the file name completely, e.g. head /dev/zero | file -. For your task you might need to specify file names explicitly. - Medwin
I could have used a combination of find, file, and dos2unix to accomplish my task, but instead I've gone with a simpler find + dos2unix limited to certain file extensions, like so: find -type f -regex ".*\.php\|.*\.js\|.*\.xml\|.*\.phtml\|.*\.css" -exec dos2unix "{}" \; Your answer addresses my question very directly, though. Thanks to you and Matteo I understand these two tools (file vs dos2unix) better now. - Cleodell
I've written a line ending conversion utility to work around my dissatisfaction with dos2unix. Besides checking whether a file contains non-text characters, it also checks file extensions and avoids modifying files with 77 of the most common extensions you'd be likely to find inside a source tree. jpg is one of them. You may check it out: github.com/mdolidon/endlines - Incombustible
P
4

In principle, there is no such thing as a "binary" or a "text" file: all files are just sequences of bytes.

Most programs that try to tell the two apart just use some kind of heuristic to rule out files that contain characters unusual for text (typically, characters < 32) or that do not contain characters typically found in text (for example, whitespace, as shown in @Andrey's answer).

This is just a courtesy to help you avoid accidental mistakes, and it comes "without warranty of any kind": it is entirely possible to have "binary" files that use only ASCII characters (it's easy to build, say, PPM and COM files that pass the test above).
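To make this concrete, here is a rough sketch that builds such a file: a 1x1 raw ("P6") PPM image whose only pixel has the values 13, 10, 65, so its raster is the three bytes CR, LF, "A". The helper names make_tricky_ppm and lowest_non_whitespace_byte are invented for this illustration.

```shell
# Hypothetical helper: write a 14-byte raw PPM to the path given as $1.
# Header "P6\n1 1\n255\n" (11 bytes) followed by one RGB pixel: 13, 10, 65.
make_tricky_ppm() {
    printf 'P6\n1 1\n255\n\015\012A' > "$1"
}

# Hypothetical helper: print the smallest byte value in the file, ignoring
# TAB (9), LF (10), FF (12), and CR (13). A result of 32 or more means the
# file contains no control character that a "< 32" check would flag.
lowest_non_whitespace_byte() {
    od -An -v -tu1 "$1" | awk '
        { for (i = 1; i <= NF; i++) {
              v = $i + 0
              if (v != 9 && v != 10 && v != 12 && v != 13 && (min == "" || v < min))
                  min = v
          } }
        END { print min }'
}
```

After make_tricky_ppm pixel.ppm, the call lowest_non_whitespace_byte pixel.ppm reports 32 (the space in the header), so a "control characters below 32" test sees nothing suspicious. Yet the raster contains a CR LF pair, and rewriting it as a lone LF would leave the image one byte short.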

Pontificate answered 14/12, 2015 at 0:39 Comment(5)
Still, it's strange that file detects the difference so much better than dos2unix. I suppose it comes with a performance hit, but I would rather wait a few more seconds than corrupt thousands of "binary" files. - Cleodell
@ButtleButkus: it's not strange at all; file is an entirely different beast. It has a complex system for detecting specific file types (it is actually a frontend for libmagic), matching known file type signatures, and only if this kind of detection fails does it fall back to heuristics (and, if everything fails, it just writes "data"). - Pontificate
It is strange to me that dos2unix does not apply a similarly complex system. But I guess that's just the kind of world we live in. - Cleodell
@ButtleButkus: if you need the file machinery, you are supposed to invoke it from your script before calling dos2unix; there is no need to duplicate functionality. - Pontificate
I suppose you're right, Matteo. I've opted to just apply dos2unix to certain file extensions instead, which is close enough for me. - Cleodell
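The suggestion in the comments above, asking file first and only then calling dos2unix, could look roughly like this. It is a sketch under assumptions rather than a tested recipe: the helper names is_text_file and convert_text_files are invented here, it relies on the --mime-encoding option of file reporting "binary" for non-text data, and the loop does not handle file names that contain newlines.

```shell
# Hypothetical helper: succeeds if file(1) reports a text encoding for $1.
# "file -b --mime-encoding" prints e.g. "us-ascii" or "utf-8" for text and
# "binary" for data it does not recognise as text.
is_text_file() {
    [ "$(file -b --mime-encoding "$1")" != binary ]
}

# Hypothetical helper: run dos2unix only on the files under directory $1
# that file(1) considers text. Paths containing newlines are not supported.
convert_text_files() {
    find "$1" -type f | while IFS= read -r path; do
        if is_text_file "$path"; then
            dos2unix "$path"
        fi
    done
}
```

Calling convert_text_files . would then walk the current directory. Compared with an extension list, this costs one extra file invocation per file, but it does not depend on files being named correctly.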
