Why does my tool output overwrite itself and how do I fix it?

The intent of this question is to be a canonical that covers all sorts of questions whose answer boils down to "you have DOS line endings being fed into a Unix tool". Anyone with a related question should find a clear explanation of why they were pointed here as well as tools that can solve their problem, plus pros/cons/caveats of the possible solutions. Some of the existing questions on this topic have accepted answers that only say "run this tool" with little explanation or are just plain dangerous and should never be used.

Now to a typical question that would result in a referral here:


I have a file containing 1 line:

what isgoingon

and when I print it using this awk script to reverse the order of the fields:

awk '{print $2, $1}' file

instead of seeing the output I expect:

isgoingon what

I get the field that should be at the end of the line appearing at the start of the line and overwriting some text:

 whatngon

or I get the output split onto 2 lines:

isgoingon
 what

What could the problem be and how do I fix it?

Barbette answered 19/8, 2017 at 14:8 Comment(2)
Thanks for creating this question. It's the most useful one, as this is the most common mistake! It should be linked by default to all awk and sed questions. – Megrims
This is very similar in spirit to stackoverflow.com/questions/39527571/… - do we need multiple canonicals? – Particularize

The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF, and you are running a UNIX tool on it, so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file, while LF is \n and appears as $ with cat -vE.

So your input file wasn't really just:

what isgoingon

it was actually:

what isgoingon\r\n

as you can see with cat -vE:

$ cat -vE file
what isgoingon^M$

and od -c:

$ od -c file
0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
0000020

so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:

<what> <isgoingon\r>

Note the \r at the end of the second field. \r means carriage return which is literally an instruction to return the cursor to the start of the line. So when you do:

print $2, $1

awk will print it to the terminal: the terminal writes isgoingon, the \r returns the cursor to the start of the line, and then the space and what are written over it, which is why what appears to overwrite the start of isgoingon.
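
You can reproduce the overwriting effect without awk at all by sending the same characters straight to the terminal:

$ printf 'isgoingon\r what\n'
 whatngon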

Solution

To fix the problem, run any one of these:

dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file
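
Note that dos2unix rewrites the file in place, while the sed, awk, and perl one-liners write the converted text to stdout. To update the file itself with those you can, for example, write to a temporary file and move it back, or use an in-place option where one is available:

sed 's/\r$//' file > tmp && mv tmp file    # portable: write a converted copy, then replace the original
perl -i -pe 's/\r$//' file                 # or edit in place (GNU sed has a similar -i option)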

Note that on some UNIX variants (e.g. Ubuntu) dos2unix is also known as fromdos.

Be careful if you decide to use tr -d '\r', as is often suggested: it deletes every \r in your file, not just those at the end of each line. (More details below.)

Notes

Handling DOS line endings with awk

GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:

gawk -v RS='\r\n' '...' file

but other awks will not allow that, as POSIX only requires awks to support a single-character RS, and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs, though, as the underlying C primitives will strip them on some platforms, e.g. Cygwin.
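
For example, assuming gawk (and, where needed, -v BINMODE=3 so the \rs survive the read), the fields no longer carry a trailing \r:

$ printf 'what isgoingon\r\n' | gawk -v RS='\r\n' '{print $2, $1}'
isgoingon what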

CSV data containing newlines

One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:

"field1","field2.1
field2.2","field3"

is really:

"field1","field2.1\nfield2.2","field3"\r\n

so if you just convert \r\ns to \ns then you can no longer tell the linefeeds within fields from the linefeeds that were line endings. If you want to do that conversion anyway, I recommend first converting all of the intra-field linefeeds to something else, e.g. this converts all intra-field LFs to tabs and all line-ending CRLFs to LFs:

gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file

Doing the same without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read; one possible approach is sketched below.
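
Here is a minimal sketch of that idea with a POSIX awk. It assumes, as above, that every line not ending in CR is an intra-field linefeed and joins it to the growing record with a tab:

awk '
    { rec = inrec ? rec "\t" $0 : $0; inrec = 1 }         # accumulate lines, joining intra-field LFs with tabs
    /\r$/ { sub(/\r$/, "", rec); print rec; inrec = 0 }   # a trailing CR marks the real end of the record
' file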

Awk's default FS

Also note that although CR is part of the POSIX [[:space:]] character class, it is not one of the whitespace characters that separate fields when the default FS of " " is used; those are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before the CRLF:

$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$
$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'

$

That's because leading/trailing whitespace is ignored when splitting a line that has LF line endings, but on a line with CRLF line endings the \r becomes the final field if the character before it was whitespace:

$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
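
Stripping the CR before using the fields (or, with gawk, setting RS='\r\n') restores the expected behaviour:

$ printf 'x y \r\n' | awk '{sub(/\r$/,""); print $NF}'
y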
Barbette answered 19/8, 2017 at 14:12 Comment(4)
I understand your remark about being careful with tr -d '\r', but out of professional curiosity: did you ever encounter a Windows CSV file that had an intended payload of a '\r' somewhere? – Jacquline
I wrote File::Edit::Portable to make reading and writing files across platforms seamless. – Wicklow
@Jacquline I have, just yesterday. That csv file was faulty of course, but it had firstname\rlastnames and first\nlasts. – Radmilla
@JamesBrown that was the reason for my question to @EdMorton. I have to process lots of input data, and finding a solitary \r in the data makes my validation routines go "beep". I had one case (not lying!) where somebody used \r as the column separator and \n as the line separator, years ago. :-) – Jacquline

You can use the \R backslash sequence in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. \R implements the Unicode consortium's recommendation for matching all forms of a generic newline.

So if you have an 'extra' line-ending character, the regex s/\R$/\n/ will normalize any combination of line endings at the end of a line into \n. Alternatively, you can use s/\R/\n/g to catch any notion of 'line ending' anywhere and standardize it into a \n character.

Given:

$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \r  \n
0000020

Perl, Ruby, and most flavors of PCRE implement \R, here combined with the end-of-string assertion $ (end of line in multi-line mode):

$ perl -pe 's/\R$/\n/' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017

(Note that the \r between the two words is correctly left alone.)
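
By contrast, the global form s/\R/\n/g converts every newline-like sequence, including that embedded \r, so only use it when that is what you want:

$ perl -pe 's/\R/\n/g' file | od -c
0000000    w   h   a   t  \n   i   s   g   o   i   n   g   o   n  \n
0000017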

If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.
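
For example, with Perl's engine (a sketch; this relies on atomic groups and the regex \v vertical-whitespace class, not the C escape \v) it behaves like the \R$ version above:

$ perl -pe 's/(?>\r\n|\v)$/\n/' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n
0000017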

With straight POSIX tools, your best bet is likely awk like so:

$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017

Things that kinda work (but know your limitations):

tr deletes every \r, even ones used in another context (granted, such uses of \r are rare, and XML processing requires that \r be deleted anyway, so tr is a great solution for those cases):

$ tr -d "\r" < file | od -c
0000000    w   h   a   t   i   s   g   o   i   n   g   o   n  \n        
0000016

GNU sed works, but POSIX sed does not, since the \r and \x0D escapes are not required by POSIX.

GNU sed only:

$ sed 's/\x0D$//' file | od -c   # also sed 's/\r$//'
0000000    w   h   a   t  \r   i   s   g   o   i   n   g   o   n  \n    
0000017

The Unicode Regular Expression Guide is probably the best bet for a definitive treatment of what a "newline" is.

Fleuron answered 19/8, 2017 at 16:44 Comment(6)
Seems to me like using \R would only be useful if you have to operate on input where you don't know what the end-of-line string is but you can guarantee that the other possible end-of-line characters cannot appear in the input. I mean, if I have input files that use \r\n line endings and can contain \v and \n within fields (which I expect I could produce with Excel), then I could have a 1-field record that is "foo\v\nbar"\r\n, so how would I use \R to identify lines? I can identify lines as strings separated by \r\n but not by \R\n since the latter would include \v\n mid-record. – Barbette
Sorry for the multiple comments, I just can't figure out why I'd want to use \R and I definitely don't understand what is going on here: 1) od -c < file outputs " f o o \v \n b a r " \r \n 2) perl -pe 's/\r$/\n/' file | od -c outputs " f o o \v \n b a r " \n \n 3) perl -pe 's/\R$/\n/' file | od -c outputs " f o o \n \n b a r " \n. As I expected, using \R messes up the \v\n mid-record, but why does \r\n become \n\n when using \r$ in the regexp but just \n when using \R$? Where'd the 2nd \n go? – Barbette
@EdMorton: 2 - The single \n is being treated as a line separator/record separator by Perl even if quoted. The \v is being treated as an extra line separator in the regex s/\R$/\n/, so you get \n\n for the replacement of the sequence \v\n. The \n in the sequence \r\n is again being treated as a line separator. The s/\R$/\n/ treats \r\n as a single line separator, so you get a single \n. If you want to treat "foo\v\nbar"\r\n as a single record, you would need either a CSV parser or a more complete regex that describes that. – Fleuron
@EdMorton: 3 - The intent of \R is to be a 'generic newline' useful for UTF-X, XML, or generic text with unknown line endings. You can use verbs to control what is included. Assuming that you have set your tool to properly read lines, the regex \R$ will remove any of the characters contained in \R that were not included in the line processing of the tool. Note that the PCRE \v character class is different from the ANSI C character definition of \v. The character class \v is equivalent to /[\n\cK\f\r\x85\x{2028}\x{2029}]/ – Fleuron
It's all a little too different from BREs and EREs for my tastes, and I feel like it's a bad idea to guess, possibly incorrectly, at what might be line endings that could also appear elsewhere in your input, but I suppose it must be useful in some situations or "they" wouldn't have come up with it. Thanks for the explanations. – Barbette
\R isn't a shorthand character class but an alias for an alternation of different newline sequences inside an atomic group. (That's why you can't write something like [\R].) – Frap

Run dos2unix. While you can manipulate the line endings with code you write yourself, there are utilities in the Linux/Unix world that already do this for you.

On a Fedora system, dnf install dos2unix will put the dos2unix tool in place (should it not already be installed).

There is a similar dos2unix deb package available for Debian based systems.
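
Typical usage: by default dos2unix converts the file in place, and its -n (new file) mode writes the converted copy to a separate file instead:

dos2unix file                  # convert file in place
dos2unix -n infile outfile     # leave infile untouched, write the converted copy to outfile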

From a programming point of view, the conversion is simple. Search all the characters in a file for the sequence \r\n and replace it with \n.

This means there are dozens of ways to convert from DOS to Unix using nearly every tool imaginable. One simple way is to use the tr command to simply replace \r with nothing:

tr -d '\r' < infile > outfile
Goy answered 19/8, 2017 at 14:26 Comment(2)
The form tr -d '\r' < infile > outfile will destroy any \r that is meant to be in the file and is not part of the Windows line ending. It is better to do sed 's/\r$//' since that limits the removal to line endings. – Fleuron
@Fleuron Good point. Hence the improved safety of using dos2unix. – Goy
