unix awk sed dos2unix

Why does my tool output overwrite itself and how do I fix it?


The intent of this question is to be a canonical reference covering the many questions whose answer boils down to "you have DOS line endings being fed into a Unix tool". Anyone with a related question should find here a clear explanation of why they were pointed to this page, tools that can solve their problem, and the pros, cons, and caveats of each possible solution. Some existing questions on this topic have accepted answers that only say "run this tool" with little explanation, or that are plain dangerous and should never be used.

Now to a typical question that would result in a referral here:


I have a file containing 1 line:

what isgoingon

and when I print it using this awk script to reverse the order of the fields:

awk '{print $2, $1}' file

instead of seeing the output I expect:

isgoingon what

I get the field that should be at the end of the line appearing at the start of the line and overwriting some text:

 whatngon

or I get the output split onto 2 lines:

isgoingon
 what

What could the problem be and how do I fix it?


Solution

  • The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF, and you are running a UNIX tool on it, so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file, while LF is \n and appears as $ with cat -vE.

    So your input file wasn't really just:

    what isgoingon
    

    it was actually:

    what isgoingon\r\n
    

    as you can see with cat -vE:

    $ cat -vE file
    what isgoingon^M$
    

    and od -c:

    $ od -c file
    0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
    0000020
    

    so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:

    <what> <isgoingon\r>
    

    Note the \r at the end of the second field. \r means carriage return which is literally an instruction to return the cursor to the start of the line. So when you do:

    print $2, $1
    

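    awk writes isgoingon\r what to the terminal. The terminal prints isgoingon, returns the cursor to the start of the line when it reaches the \r, and then prints a space followed by what, which is why the what appears to overwrite the start of isgoingon. A terminal or viewer that instead renders a lone \r as a line break would explain the output being split onto 2 lines.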
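
To see the effect without depending on how a terminal renders it, the following minimal sketch feeds the same one-line input to awk through a pipe and passes the result through cat -v, so the carriage return shows up as ^M. The variable name out is only there for illustration.

```shell
# Feed awk the one-line input with a DOS (CRLF) line ending and swap the
# fields; cat -v renders the CR as ^M instead of letting the terminal
# act on it.
out=$(printf 'what isgoingon\r\n' | awk '{print $2, $1}' | cat -v)
printf '%s\n' "$out"
# isgoingon^M what
```

The ^M sits between the two fields, exactly where the cursor would jump back to the start of the line.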

    Solution

    To fix the problem, do any one of these (dos2unix converts the file in place; the others write the converted text to standard output):

    dos2unix file
    sed 's/\r$//' file
    awk '{sub(/\r$/,"")}1' file
    perl -pe 's/\r$//' file
    

    Apparently dos2unix is also known as fromdos on some UNIX variants (e.g. Ubuntu).

    Be careful if you decide to use tr -d '\r' as is often suggested as that will delete all \rs in your file, not just those at the end of each line. (More details below.)
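
As a small illustration of that difference, the sketch below runs both approaches on a made-up line that has one CR inside the data and a CRLF at the end. The sample input and the variable names are assumptions chosen for the demonstration.

```shell
# One line with a CR in the middle of the data and a CRLF line ending.
input='a\rb\r\n'

# Anchored removal: only the CR immediately before the LF goes away.
anchored=$(printf "$input" | awk '{sub(/\r$/,"")}1' | cat -v)

# tr -d: every CR is deleted, including the one inside the data.
all_gone=$(printf "$input" | tr -d '\r' | cat -v)

printf '%s\n%s\n' "$anchored" "$all_gone"
# a^Mb
# ab
```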

    Notes

    Handling DOS line endings with awk

    GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:

    gawk -v RS='\r\n' '...' file
    

    but other awks will not allow that: POSIX only requires awk to support a single-character RS, and most other awks will quietly truncate RS='\r\n' to RS='\r'. On some platforms, e.g. Cygwin, you may also need to add -v BINMODE=3 for gawk to even see the \rs, as the underlying C primitives will otherwise strip them.

    CSV data containing newlines

    One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:

    "field1","field2.1
    field2.2","field3"
    

    is really:

    "field1","field2.1\nfield2.2","field3"\r\n
    

    so if you just convert \r\ns to \ns, you can no longer tell linefeeds within fields from linefeeds that end lines. If you want to do that conversion, I recommend first converting all of the intra-field linefeeds to something else. For example, this converts all intra-field LFs to tabs and all line-ending CRLFs to LFs:

    gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
    

    Doing something similar without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read.
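
One possible sketch of that approach, using only features any POSIX awk should have, is shown here. It assumes every complete record ends in CRLF and that a tab is an acceptable replacement for the embedded LFs; the variable names buf and out are illustrative.

```shell
out=$(printf '"field1","field2.1\nfield2.2","field3"\r\n' |
awk '
    # A line without a trailing CR is the start (or middle) of a record
    # that has an LF embedded in a field: buffer it, adding a tab where
    # the LF was.
    !/\r$/ { buf = buf $0 "\t"; next }

    # A line ending in CR completes the record: drop the CR and print
    # the buffered fragments followed by this line.
    { sub(/\r$/, ""); print buf $0; buf = "" }

    # Input that ends without a final CRLF would leave a fragment behind.
    END { if (buf != "") { sub(/\t$/, "", buf); print buf } }
')
printf '%s\n' "$out" | cat -vet
# "field1","field2.1^Ifield2.2","field3"$
```

The cat -vet at the end is only there to make the tab (^I) and the line ending ($) visible.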

    Awk's default FS

    Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters that separate fields when the default FS of " " is used; those are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:

    $ printf 'x y \n'
    x y
    $ printf 'x y \n' | awk '{print $NF}'
    y
    $
    $ printf 'x y \r\n'
    x y
    $ printf 'x y \r\n' | awk '{print $NF}'
    
    $
    

    That's because leading and trailing field-separator whitespace is ignored on a line that has LF line endings, but with CRLF line endings the \r becomes the final field if the character before it was whitespace:

    $ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
    ^M$
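
A minimal sketch of one way around this, assuming you can strip the CR inside the same awk script before using the fields (the variable name out is illustrative):

```shell
# Strip the CR first; modifying $0 makes awk re-split the fields, so the
# trailing blank is again ignored and $NF is the last real field.
out=$(printf 'x y \r\n' | awk '{sub(/\r$/,"")} {print $NF}' | cat -v)
printf '%s\n' "$out"
# y
```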