How to find a Windows end of line (EOL) character

I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if there are Windows EOL characters in the files. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.

So my question is: in Cygwin, how can I figure out whether these files have Windows CRLF line endings?

I've tried creating some test data and running

sed -r 's/\r\n//' testdata.txt

But that appears to match regardless of whether dos2unix has been run or not.
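
Update: a test that does seem to discriminate (GNU sed, as shipped with Cygwin) is matching a bare \r at the end of the line, since sed strips the trailing newline before matching and so never sees the \n itself:

printf 'dos line\r\n' | sed -n '/\r$/p'    # prints the line -> CRLF present
printf 'unix line\n'  | sed -n '/\r$/p'    # prints nothing  -> plain LF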

Thanks.

Mcmann answered 17/3, 2011 at 23:30 Comment(0)

The file(1) utility knows the difference:

$ file * | grep ASCII
2:                                       ASCII text
3:                                       ASCII English text
a:                                       ASCII C program text
blah:                                    ASCII Java program text
foo.js:                                  ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc:              ASCII text, with very long lines
windows:                                 ASCII text, with CRLF line terminators

file(1) has been optimized to read as little of a file as possible, so with luck it will drastically reduce the amount of disk I/O you need to perform when finding and fixing the CRLF terminators.

Note that some cases of CRLF should stay in place: captures of SMTP traffic, for instance, will use CRLF. But that's up to you. :)
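
For your paste job, then, a minimal sketch (untested) that converts only the files file(1) flags as CRLF and leaves the rest untouched:

#!/bin/bash
# convert only the files that file(1) reports with CRLF terminators
for f in *; do
    file "$f" | grep -q 'CRLF' && dos2unix "$f"
done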

Outgrow answered 17/3, 2011 at 23:42 Comment(2)
What if the result says "ASCII text, with very long lines, with no line terminators"? - Mcmann
Heh, one very long line without any line terminators may be an awkward input to paste(1), but perhaps file(1) is giving up too quickly? Maybe the lines are longer than the area it inspects: a quick glance at file's source (src/file.h) suggests it inspects 256 kilobytes (HOWMANY), so your input would have to go a very long way without a line terminator indeed. - Outgrow

#!/bin/bash
# find -print0 with read -d '' keeps file names containing spaces intact
find . -type f -print0 | while IFS= read -r -d '' i; do
        if file "$i" | grep -q CRLF; then
                file "$i"        # prints the name and type, including "CRLF line terminators"
                #dos2unix "$i"
        fi
done

Uncomment the dos2unix "$i" line when you are ready to convert the files.

Charlie answered 17/11, 2011 at 10:33 Comment(0)

You can find out using file:

file /mnt/c/BOOT.INI 
/mnt/c/BOOT.INI: ASCII text, with CRLF line terminators

CRLF is the significant value here.
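
If you want to double-check the raw bytes yourself, od(1) makes the terminators visible; run something like head -n 1 yourfile | od -c and look for \r \n at the end of the line:

$ printf 'sample\r\n' | od -c
0000000   s   a   m   p   l   e  \r  \n
0000010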

Mundell answered 17/3, 2011 at 23:44 Comment(0)

If you're expecting the exit code from sed to tell you anything, it won't: sed performs a substitution or not depending on the match, and its exit code is success unless there's an actual error.

You can get a usable exit code from grep, however.

#!/bin/bash
for f in *
do
    if head -n 10 "$f" | grep -qs $'\r'
    then
        dos2unix "$f"
    fi
done
Claxton answered 18/3, 2011 at 1:50 Comment(0)

Recursive grep, with a file-pattern filter:

grep -Pnr --include='*file.sh' '\r$' .

This outputs the file name, line number, and the line itself:

./test/file.sh:2:here is windows line break
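
To list just the affected file names instead (handy for piping into dos2unix), the -l flag should work with the same pattern:

grep -Plr --include='*file.sh' '\r$' .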
Donar answered 22/10, 2015 at 13:4 Comment(0)

You can use dos2unix's -i option to get information about DOS, Unix, and Mac line breaks (in that order), BOMs, and text/binary classification without converting the file.

$ dos2unix -i *.txt
    6       0       0  no_bom    text    dos.txt
    0       6       0  no_bom    text    unix.txt
    0       0       6  no_bom    text    mac.txt
    6       6       6  no_bom    text    mixed.txt
   50       0       0  UTF-16LE  text    utf16le.txt
    0      50       0  no_bom    text    utf8unix.txt
   50       0       0  UTF-8     text    utf8dos.txt

With the "c" flag dos2unix will report only the files that would be converted, in other words the files that have DOS line breaks. To report all txt files with DOS line breaks you could do this:

$ dos2unix -ic *.txt
dos.txt
mixed.txt
utf16le.txt
utf8dos.txt

To convert only these files you simply do:

dos2unix -ic *.txt | xargs dos2unix

If you need to recurse over directories, you can do:

find . -name '*.txt' | xargs dos2unix -ic | xargs dos2unix
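
Note that xargs splits on whitespace, so this breaks on file names containing spaces; a rough, untested variant that avoids that by letting find run dos2unix once per file:

find . -name '*.txt' -exec sh -c 'dos2unix -ic "$1" | grep -q . && dos2unix "$1"' _ {} \;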

See also the man page of dos2unix.

Freud answered 23/10, 2015 at 7:6 Comment(0)

As stated above, the 'file' solution works. Perhaps the following code snippet helps.

#!/bin/ksh
EOL_UNKNOWN="Unknown"       # Unknown EOL
EOL_MAC="Mac"               # File EOL Classic Apple Mac  (CR)
EOL_UNIX="Unix"             # File EOL UNIX               (LF)
EOL_WINDOWS="Windows"       # File EOL Windows            (CRLF)
SVN_PROPFILE="name-of-file" # Filename to check.
...

# Finds the EOL used in the requested file
# $1 Name of the file (requested filename)
# Result: EOL_FILE set to one of the enumerated EOL values.
getEolFile() {
    EOL_FILE=$EOL_UNKNOWN

    # Check for Windows EOL
    EOL_CHECK=$(file "$1" | grep "ASCII text, with CRLF line terminators")
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_WINDOWS
       return
    fi

    # Check for Classic Mac EOL
    EOL_CHECK=$(file "$1" | grep "ASCII text, with CR line terminators")
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_MAC
       return
    fi

    # Check for UNIX EOL (plain "ASCII text", reached only when CRLF and CR did not match)
    EOL_CHECK=$(file "$1" | grep "ASCII text")
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_UNIX
       return
    fi

    return
} # getEolFile
...

# Using this snippet
getEolFile "$SVN_PROPFILE"
echo "Found EOL: $EOL_FILE"
exit -1
Ingeringersoll answered 22/1, 2012 at 8:43 Comment(0)

Thanks for the tip to use the file(1) command; however, it does need a bit more refinement. I had a situation where not only plain text files but also some ".sh" scripts had the wrong EOL, and "file" reports them as follows regardless of EOL:

xxx/y/z.sh: application/x-shellscript

So the "file -e soft" option was needed (at least for Linux):

bash$ find xxx -exec file -e soft {} \; | grep CRLF

This finds all the files with DOS EOLs in directory xxx and its subdirectories.
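
To convert the files in the same pass, the same test can drive dos2unix directly; a rough sketch along those lines (untested):

find xxx -type f -exec sh -c 'file -e soft "$1" | grep -q CRLF && dos2unix "$1"' _ {} \;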

Crinose answered 29/5, 2012 at 11:20 Comment(0)
