What's the best way of doing dos2unix on a 500k line file, in Windows? [closed]
C

7

6

The question says it all: I've got a 500,000-line file that gets generated as part of an automated build process on a Windows box, and it's riddled with ^M's. When it goes out the door it needs to be *nix friendly. What's the best approach here? Is there a handy snippet of code that could do this for me, or do I need to write a little C# or Java app?

Couture answered 24/11, 2008 at 0:41 Comment(0)
M
11

Here is a Perl one-liner, taken from http://www.technocage.com/~caskey/dos2unix/

#!/usr/bin/perl -pi
s/\r\n/\n/;

You can run it as follows:

perl dos2unix.pl < file.dos > file.unix

Or, you can run it also in this way (the conversion is done in-place):

perl -pi dos2unix.pl file.dos

And here is my (naive) C version:

#include <stdio.h>

int main(void)
{
   int c;
   while( (c = fgetc(stdin)) != EOF )
      if(c != '\r')
         fputc(c, stdout);
   return 0;
}

You should run it with input and output redirection:

dos2unix.exe < file.dos > file.unix
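One caveat when building this on Windows: the Microsoft C runtime opens stdin and stdout in text mode by default, so CR-LF pairs are collapsed to LF on input and LF is expanded back to CR-LF on output, which can undo the conversion. Below is a minimal sketch of the same filter with both streams switched to binary mode first. The names set_binary and strip_cr are made up for illustration; _setmode, _fileno and _O_BINARY are the Microsoft runtime's own and are only used when _WIN32 is defined.

```c
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>  /* _O_BINARY */
#include <io.h>     /* _setmode */
#endif

/* Hypothetical helper: switch a stream to binary mode on Windows.
   Does nothing on other platforms, where there is no text-mode translation. */
void set_binary(FILE *stream)
{
#ifdef _WIN32
    _setmode(_fileno(stream), _O_BINARY);
#else
    (void)stream;
#endif
}

/* Copy in to out, dropping every CR (same behaviour as the naive version). */
void strip_cr(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF)
        if (c != '\r')
            fputc(c, out);
}
```

In main you would call set_binary(stdin) and set_binary(stdout) before strip_cr(stdin, stdout), then run the program with the same redirection as above.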
Moneyed answered 24/11, 2008 at 1:1 Comment(2)
Don't worry about performance until you must deal with terabytes :D The C version takes ~ 5 seconds to convert a 65 MB file with 500000 lines of text (on an old Pentium4 with a standard EIDE disk)Moneyed
@Federico, that (naive) C version will remove all CR characters, not just those in a CR-LF pair. But I guess that's why you called it naive. :-)Mustache
E
6

If installing a base Cygwin is too heavy, there are a number of standalone, console-based dos2unix and unix2dos programs for Windows on the net, many with C/C++ source available. If I'm understanding the requirement correctly, either of these solutions would fit nicely into an automated build script.

Evered answered 24/11, 2008 at 2:24 Comment(0)
O
5

If you're on Windows and need something run in a batch script, you can compile a simple C program to do the trick.

#include <stdio.h>

int main() {
    while(1) {
        int c = fgetc(stdin);

        if(c == EOF)
            break;

        if(c == '\r')
            continue;

        fputc(c, stdout);
    }

    return 0;
}

Usage:

myprogram.exe < input > output

Editing in-place would be a bit more difficult. Besides, you may want to keep backups of the originals for some reason (in case you accidentally strip a binary file, for example).

That version removes all CR characters; if you only want to remove the ones that are in a CR-LF pair, you can use (this is the classic one-character-back method :-):

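/* Fixed: the original wrote EOF as a character on the first pass
   and on empty input (see comments). */

#include <stdio.h>

int main() {
    int lastc = EOF;
    int c;
    while ((c = fgetc(stdin)) != EOF) {
        /* Emit the previous character unless it is the CR of a CR-LF pair. */
        if ((lastc != EOF) && ((lastc != '\r') || (c != '\n'))) {
            fputc (lastc, stdout);
        }
        lastc = c;
    }
    /* Flush the final character, if there was any input at all. */
    if (lastc != EOF) {
        fputc (lastc, stdout);
    }
    return 0;
}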

You can edit the file in-place using mode "r+". Below is a general myd2u program, which accepts file names as arguments. NOTE: This program uses ftruncate to chop off extra characters at the end. If there's any better (standard) way to do this, please edit or comment. Thanks!

#include <stdio.h>
#include <unistd.h>  /* ftruncate (POSIX, not standard C) */

int main(int argc, char **argv) {
    FILE *file;

    if(argc < 2) {
        fprintf(stderr, "Usage: myd2u <files>\n");
        return 1;
    }

    file = fopen(argv[1], "rb+");

    if(!file) {
        perror("");
        return 2;
    }

    long readPos = 0, writePos = 0;
    int lastC = EOF;

    while(1) {
        fseek(file, readPos, SEEK_SET);
        int c = fgetc(file);
        readPos = ftell(file);  /* For good measure. */

        if(c == EOF)
            break;

        if(c == '\n' && lastC == '\r') {
            /* Move back so we override the \r with the \n. */
            --writePos;
        }

        fseek(file, writePos, SEEK_SET);
        fputc(c, file);
        writePos = ftell(file);

        lastC = c;
    }

    ftruncate(fileno(file), writePos); /* Not in C89/C99/ANSI! */

    fclose(file);

    /* 'cus I'm too lazy to make a loop: recurse on the remaining names.
       argv + 1 makes the file just processed the new argv[0]. */
    if(argc > 2)
        main(argc - 1, argv + 1);

    return 0;
}
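On the question of avoiding ftruncate: one option that stays within standard C is to read the whole file into memory, compact it with the same read-position/write-position idea, and rewrite it with mode "wb", which truncates on open. A rough sketch follows, assuming the file fits in memory (a 65 MB file should) and that fseek to SEEK_END works on a binary stream, which the standard does not strictly guarantee but common platforms provide. The names crlf_to_lf and convert_file are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Compact buf in place, dropping each CR that is directly followed by LF.
   Returns the new length. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t r, w = 0;                 /* read and write positions */
    for (r = 0; r < len; r++) {
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            continue;                /* skip the CR of a CR-LF pair */
        buf[w++] = buf[r];
    }
    return w;
}

/* Hypothetical helper: convert one file in place. Returns 0 on success. */
int convert_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    long size;
    char *buf;
    size_t n;
    int ok;

    if (!f)
        return -1;
    if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0) {
        fclose(f);
        return -1;
    }
    rewind(f);
    buf = malloc(size > 0 ? (size_t)size : 1);
    if (!buf) {
        fclose(f);
        return -1;
    }
    n = fread(buf, 1, (size_t)size, f);
    fclose(f);

    n = crlf_to_lf(buf, n);

    f = fopen(path, "wb");           /* "wb" truncates the file on open */
    if (!f) {
        free(buf);
        return -1;
    }
    ok = fwrite(buf, 1, n, f) == n;
    free(buf);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The trade-off is that a failed write leaves a damaged file, so keeping a backup of the original, as suggested earlier, still applies.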
Obelize answered 24/11, 2008 at 1:7 Comment(3)
@strager, fixed to use ints (required for EOF) and added code to do CRs only in a CR-LF pair - hopefully this'll get you more rep. Oh yes, and upvoted.Mustache
I noticed the correction using int; thanks! I'll leave the second one alone, even if it isn't my style. =]Obelize
The second snippet fails on the empty file, although it's fairly trivial to fix that.Gona
T
4
tr -d '^M' < infile > outfile

Type ^M as Ctrl+V followed by Enter.

Edit: You can use '\r' instead of manually entering a carriage return [thanks to @strager]:

tr -d '\r' < infile > outfile

Edit 2: 'tr' is a Unix utility; you can download a native Windows version from http://unxutils.sourceforge.net [thanks to @Rob Kennedy] or use Cygwin's Unix emulation.

Tacheometer answered 24/11, 2008 at 0:52 Comment(6)
This works nice if you have tr on the dos box. It's fast too.Derryberry
I don't have tr, where can I find it?Couture
Don't want to install Cygwin just for this.Couture
Native, non-Cygwin utilities: unxutils.sourceforge.net Precocious
You can also write: tr -d '\r' < in > outObelize
Rob, you should've put your own answer in!Couture
D
1

FTP it from the DOS box to the Unix box as an ASCII file instead of a binary file. FTP will strip the CR-LF and insert an LF. Transfer it back to the DOS box as a binary file, and the LF will be retained.

Derryberry answered 24/11, 2008 at 0:50 Comment(4)
I'm not such a fan of this one, seems like it would be a PITA as part of an automated build. Plus, if I don't have a local unix box on the network, I've either got to buy one, or transfer the file over the WAN, twice. Must be possible to do this locally, no?Couture
Neither am I. It requires at least one running FTP server, which is a little overkill for a file conversion.Moneyed
Good answer to get a laugh though!Presage
FTP in ascii mode can also translate between tabs and spaces, depending on the implementation, which would be undesirable.Mustache
C
1

Some text editors, such as UltraEdit/UEStudio have this functionality built-in.

File > Conversions > DOS to UNIX

Claqueur answered 24/11, 2008 at 1:24 Comment(4)
gVim can also do this: it loads the file automatically in DOS mode, then you type ":set fileformat=unix" (without the quotes) and save.Mustache
not useful for an automated process though...Couture
ah, true. UEStudio does actually have a rather good scripting and macro system built in, which would actually let you do this via the command line, but you're right, it's not the best tool for an automated process.Claqueur
not useful for an automated process though = incorrect. both ultraedit/uestudio can run macros from command line on files. It has a very powerful scripting engine that is basically javascript with a few more powerful methods available. ultraedit.com/support/tutorials_power_tips/ultraedit/…Popery
Q
-2

If it is just one file, I use Notepad++. Nice because it is free. I have Cygwin installed and use a one-liner script I wrote for multiple files. If you're interested in the script, leave a comment. (I don't have it available to me at this moment.)

Quality answered 24/11, 2008 at 2:42 Comment(0)
