fastest way convert tab-delimited file to csv in linux

11

54

I have a tab-delimited file that has over 200 million lines. What's the fastest way in Linux to convert this to a CSV file? This file does have multiple lines of header information which I'll need to strip out down the road, but the number of header lines is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.

Just to clarify, there are no embedded tabs in this file.

Jordanson answered 29/3, 2010 at 0:55 Comment(1)
For csv to tsv and embedded delimiter issues, see also answers to #13476035 (also relevant : unix.stackexchange.com/questions/48672/…).Hemorrhage
58

If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.

The blank space here is a literal tab:

$ echo "hello   world" | tr "\\t" ","
hello,world

Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.

Caraviello answered 29/3, 2010 at 0:58 Comment(5)
More common are embedded commas in the source, which then require wrapping with quotes. Which is troublesome if there are embedded quotes...Glister
Thanks for the tr suggestion. How does it compare to sed in speed? Suppose you wanted to skip the header, starting at line number x, and continue to the rest of the file. Is there a way to implement this with tr? (I should also clarify that there are no embedded commas in the file.)Jordanson
@andrewj: tr should be much faster, as it's just doing character-by-character replacement instead of regex matching. As for skipping header, the easiest thing is to just process in two chunks - if you know the length, head -n <length> input > output; tail -n +<length+1> input | tr ... >> output; if you don't know the length, probably something with grep -n...Arrestment
@andrew, sed has support for transliteration; also, you can use an address range.Catamount
This is an incomplete answer; " must be converted into "" for CSV, fields must be wrapped in " quotes if they contain quotes or commas or newlines (though newlines will not exist in data presented as TSV).Dessiatine
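The two-chunk header handling described in the comments can be sketched as follows; the file names and sample data are made up for illustration:

```shell
# Build a small sample TSV with 2 known header lines (hypothetical data)
printf 'h1\th2\nmeta1\tmeta2\na\tb\nc\td\n' > sample.tsv

# Copy the known number of header lines through unchanged,
# then translate tabs on the remainder
head -n 2 sample.tsv > sample.csv
tail -n +3 sample.tsv | tr '\t' ',' >> sample.csv

cat sample.csv
```

The header lines keep their tabs (they get stripped later anyway, per the question), while every data line is converted.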
86

If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:

import sys
import csv

tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
  commaout.writerow(row)

Run it from a shell as follows:

python script.py < input.tsv > output.csv
Boggess answered 29/3, 2010 at 1:21 Comment(8)
Unless you know for sure that there are no embedded commas and no embedded tabs, this is a very reliable way to do it. Even though it probably doesn't meet the criteria for being 'the fastest'.Armanda
It may not be "the fastest", but it does handle embedded tabs and commas for me.Kella
This rocks. I was writing a cryptic sed script to take care of this in bash, but that can't compete with the completeness of the Python csv library. Thank you Ignacio for this offering. Regarding speed: ease of use and reliability far outweigh speed, and this is certainly fast enough. :-)Durante
Ignacio, with your permission I'd like to add your script, with attribution, as tab2csv in my vbin library.Durante
@IgnacioVazquez-Abrams: Thanks, Ignacio. vbin/t2cDurante
Excel 2010 didn't like this tsv->csv conversion for at least one of my .tsv files until specifically quoting all the fields: commaout = csv.writer(sys.stdout, dialect=csv.excel, quotechar='"', quoting=csv.QUOTE_ALL). Possibly due to my tsv not being "properly" quoted for Excel?Audriaaudrie
Indeed, to me that was the best choice! Other options might fail whenever there were tabs or commas inside the file. This Python library is a flexible yet seamless solution!Ayurveda
If you're doing this in windows and getting extra lines between each row try adding lineterminator='\n' to the commaout line: commaout = csv.writer(sys.stdout, lineterminator='\n', dialect=csv.excel)Remediable
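The two workarounds from the comments (QUOTE_ALL for picky Excel imports, lineterminator for Windows) can be folded into one self-contained sketch; the function name and sample data here are illustrative, not part of the original answer:

```python
import csv
import io

def tsv_to_csv(text, quote_all=False):
    """Convert TSV text to CSV text (hypothetical helper for illustration)."""
    reader = csv.reader(io.StringIO(text), dialect=csv.excel_tab)
    out = io.StringIO()
    writer = csv.writer(
        out,
        dialect=csv.excel,
        lineterminator="\n",  # avoids the doubled line endings seen on Windows
        quoting=csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL,
    )
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Embedded commas get quoted, embedded quotes get doubled:
print(tsv_to_csv('a\tb,c\td"e\n'))  # → a,"b,c","d""e"
```

Keyword arguments passed to csv.writer override the corresponding dialect attributes, so the excel dialect still supplies the delimiter and quote character.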
22
perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv

Perl is generally faster at this sort of thing than sed, awk, or Python.

Beret answered 30/3, 2010 at 6:28 Comment(4)
best answer for me, with only one slight change, escaping double quotes: perl -lpe 's/"/\\"/g; s/^|$/"/g; s/\t/","/g' < ...Condottiere
long live Perl! Exactly the one-liner that I needed.Adjectival
You're a saviourWassail
dope, have not used this language in years, perfect and fast!Interdependent
13
  • If you want to convert the whole tsv file into a csv file:

    $ cat data.tsv | tr "\\t" "," > data.csv
    

  • If you want to omit some fields:

    $ cat data.tsv | cut -f1,2,3 | tr "\\t" "," > data.csv
    

The above command will convert the data.tsv file to a data.csv file containing only the first three fields.

Humorist answered 31/1, 2017 at 9:53 Comment(0)
8
sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile

Damn the critics, quote everything, CSV doesn't care.

<tab> is the actual tab character. \t didn't work for me. In bash, press Ctrl-V then Tab to enter it.

Scarron answered 29/3, 2010 at 1:44 Comment(1)
For tab, you can do sed -e 's/"/\\"/g' -e "s/\t/\",\"/g" -e 's/^/"/' -e 's/$/"/' infile > outfile.Vallecula
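In a shell that supports $'…' quoting (bash, zsh), the tab can also be written without Ctrl-V. A sketch with invented sample data, keeping this answer's backslash-escaping convention (note that standard CSV doubles quotes instead):

```shell
# One row with a tab and an embedded quote (hypothetical data)
printf 'a\tb"c\n' > in.tsv

# $'s/\t/","/g' lets bash hand sed a literal tab
sed -e 's/"/\\"/g' -e $'s/\t/","/g' -e 's/^/"/' -e 's/$/"/' in.tsv > out.csv

cat out.csv
# → "a","b\"c"
```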
7

@ignacio-vazquez-abrams's Python solution is great! For people looking to parse delimiters other than tab, the library actually allows you to set an arbitrary delimiter. Here is my modified version to handle pipe-delimited files:

import sys
import csv

pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in pipein:
  commaout.writerow(row)
Companionship answered 22/1, 2015 at 20:58 Comment(0)
4

assuming you don't want to change the header, and assuming you don't have embedded tabs

# cat file
header  header  header
one     two     three

$ awk 'NR>1{$1=$1}1' OFS="," file
header  header  header
one,two,three

NR>1 skips the first header line. You mentioned you know how many lines of header there are, so use the correct number for your own case. With this, you also do not need to call any other external command; a single awk invocation does the job.

another way, if you have blank columns and you care about preserving them:

awk 'NR>1{gsub("\t",",")}1' file

using sed

sed '2,$y/\t/,/' file #skip the one-line header and transliterate (same idea as tr)
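A quick way to try the awk version on throwaway input (the sample data is made up; the sed variant relies on GNU sed's \t extension):

```shell
# one header line plus two data rows (hypothetical data)
printf 'header1\theader2\none\ttwo\nthree\tfour\n' > demo.tsv

# header passes through untouched, data lines get commas
awk 'NR>1{gsub("\t",",")}1' demo.tsv
```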
Catamount answered 29/3, 2010 at 1:22 Comment(0)
3

You can also use xsv for this

xsv input -d '\t' input.tsv > output.csv

In my test on a 300MB tsv file, it was roughly 5x faster than the python solution (2.5s vs. 14s).

Biafra answered 9/5, 2019 at 9:29 Comment(1)
Thanks for suggesting xsv. I've been looking at a fast CSV toolkit for a while.Oleary
0

the following awk one-liner supports quoting + quote-escaping (doubling embedded quotes, per the CSV convention)

printf "flop\tflap\"" | awk -F '\t' '{ for(i = 1; i <= NF; i++) { gsub(/"/,"\"\"",$i); printf "\"%s\"",$i; if( i < NF ) printf "," }; printf "\n" }'

gives

"flop","flap"""
Weekender answered 27/8, 2014 at 10:58 Comment(0)
0

Right-click the file, click rename, delete the 't' and put a 'c'. I'm actually not joking: most CSV parsers can handle tab delimiters. I had this issue just now, and for my purposes renaming worked just fine.

Grumpy answered 30/4, 2020 at 22:16 Comment(0)
-1

I think it is better not to cat the file, because it may create problems in the case of a large file. The better way may be

$ tr '\t' ',' < data.tsv > data.csv

The command takes input from data.tsv and stores the result, comma-separated, in data.csv

Acquittance answered 28/4, 2018 at 10:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.