How to read tab separated file into data.table using fread?

Asked 12/8, 2015 at 8:15 Answered 16/9, 2015 at 23:38

Sample data (emp.data)

Beth  4.00  0
Dan   3.75  0
Kathy 4.00  10
Mark  5.00  20
Mary  5.50  22
Susie 4.25  18

I can read it into a data.frame using read.table, then convert it to data.table:

library(data.table)
df <- read.table("emp.data", col.names = c("Name", "PayRate", "HoursWorked"))
DT <- as.data.table(df, key = HoursWorked)

Calculate the pay (filter out zero hours):

DT[HoursWorked > 0, .(Name, Pay = PayRate * HoursWorked),]

    Name   Pay
1: Kathy  40.0
2:  Mark 100.0
3:  Mary 121.0
4: Susie  76.5

That works fine; however, I consider there's an extra step in converting. Since there's fread() in data.table, why not use it directly?

readDT <- fread("emp.data", header=FALSE, sep="\t")

               V1
1:  Beth  4.00  0
2:  Dan   3.75  0
3: Kathy 4.00  10
4: Mark  5.00  20
5: Mary  5.50  22
6: Susie 4.25  18

 str(readDT)
Classes 'data.table' and 'data.frame':  6 obs. of  1 variable:
 $ V1: chr  "Beth  4.00  0" "Dan   3.75  0" "Kathy 4.00  10" "Mark  5.00  20" ...
 - attr(*, ".internal.selfref")=<externalptr>

The data is recognized as one column; obviously this doesn't work.

Question

How to read this data using fread() properly? (If possible, set the column names as well.)

Nesline answered 12/8, 2015 at 8:15 Comment(12)

Try not to specify the sep and leave it "auto" (let fread decide). In other words just do fread("emp.data", header=FALSE) – Palladic 12/8, 2015 at 8:45

@DavidArenburg Thank you. I got this error:

> readDT <- fread("emp.data", header=FALSE) Error in fread("emp.data", header = FALSE) :    Not positioned correctly after testing format of header row. ch=' '

– Nesline 12/8, 2015 at 8:59

It's hard to tell, can you provide a dput of your data set? Maybe also trying without specifying header – Palladic 12/8, 2015 at 9:5

Is it possible for you to upload a small example data in dropbox or so? – Galliard 12/8, 2015 at 9:5

@Galliard Example data is at the very beginning of the question. – Nesline 12/8, 2015 at 9:6

@Nesline I had seen that, but I don't know whether that example have all the spaces in the correct way as the formatting here might be different than in your original dataset – Galliard 12/8, 2015 at 9:6

@Galliard I just copy the content and paste here. Did I break something in copy and paste? Will it help to get the file from gist: gist.github.com/nickleeh/07c1d0723aab557059a6#file-emp-data – Nesline 12/8, 2015 at 9:9

I haven't tried with your example. Let me try – Galliard 12/8, 2015 at 9:10

I tried fread("awk '{$1=$1}1' emp.data") and it worked for me – Galliard 12/8, 2015 at 9:13

@DavidArenburg I got the same error without specifying header. And the dput of the data:

dput(readDT) structure(list(V1 = c("Beth  4.00  0", "Dan   3.75  0", "Kathy 4.00  10",  "Mark  5.00  20", "Mary  5.50  22", "Susie 4.25  18")), .Names = "V1", row.names = c(NA,  -6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)

– Nesline 12/8, 2015 at 9:16

@Galliard May I ask what's the purpose of the awk command? – Nesline 12/8, 2015 at 9:16

It removes the extra white spaces. – Galliard 12/8, 2015 at 9:18

This has been fixed recently in the devel version, v1.9.5 (will be soon available on CRAN as v1.9.6):

require(data.table) # v1.9.5+
fread("~/Downloads/tmp.txt")
#       V1   V2 V3
# 1:  Beth 4.00  0
# 2:   Dan 3.75  0
# 3: Kathy 4.00 10
# 4:  Mark 5.00 20
# 5:  Mary 5.50 22
# 6: Susie 4.25 18

See README.md in the project page for more info. fread gained strip.white argument (amidst other functionalities / bug fixes) which is by default TRUE.

Update: it also has col.names argument now:

fread("~/Downloads/tmp.txt", col.names = c("Name", "PayRate", "HoursWorked"))
#     Name PayRate HoursWorked
# 1:  Beth    4.00           0
# 2:   Dan    3.75           0
# 3: Kathy    4.00          10
# 4:  Mark    5.00          20
# 5:  Mary    5.50          22
# 6: Susie    4.25          18

Churning answered 16/9, 2015 at 23:38 Comment(2)

Thank you very much for your answer and for making data.table such a great tool. col.names is really handy too! – Nesline 17/9, 2015 at 8:20

if for some reason you need to be explicit just do fread('~/Download/tmp.txt', sep = '\t') – Teletypesetter 7/5, 2018 at 12:38

Using awk to remove the white spaces and then reading with fread worked for me.

 DT <- fread("awk '{$1=$1}1' emp.data")
 DT 
 #      V1   V2 V3
 #1:  Beth 4.00  0
 #2:   Dan 3.75  0
 #3: Kathy 4.00 10
 #4:  Mark 5.00 20
 #5:  Mary 5.50 22
 #6: Susie 4.25 18

 str(DT)
 #Classes ‘data.table’ and 'data.frame':    6 obs. of  3 variables:
 #$ V1: chr  "Beth" "Dan" "Kathy" "Mark" ...
 #$ V2: num  4 3.75 4 5 5.5 4.25
 #$ V3: int  0 0 10 20 22 18
 # - attr(*, ".internal.selfref")=<externalptr>

I was able to replicate the same problem with the OP's code

 fread("emp.data", header=FALSE, sep="\t")
 #               V1
 #1:  Beth  4.00  0
 #2:  Dan   3.75  0
 #3: Kathy 4.00  10
 #4: Mark  5.00  20
 #5: Mary  5.50  22
 #6: Susie 4.25  18

Galliard answered 12/8, 2015 at 9:15 Comment(5)

Thank you. I just read your answer. To my surprise that the read.table can handle this without problem. Maybe read.table is more smart. – Nesline 12/8, 2015 at 9:19

@Nesline read.table/read.csv should handle this. I am not sure why fread has this type of problem. – Galliard 12/8, 2015 at 9:20

I think your approach with awk is fine; however, I'm currently on Windows, and the version of awk on Windows doesn't work. So I tried to clean up with perl: perl -pe 's/ {1,}/,/g' emp.data > emp.data.clean.perl, then read it with fread. I'll check your awk approach when I'm back to linux. – Nesline 12/8, 2015 at 15:48

Wrapping my head around that awk command: So it leverages that an if isn't necessary for a then (defaults to true) and a then isn't necessary for an if (defaults to printing the whole line) and assigning a value to itself actually strips out leading and trailing spaces (still can't believe that one). – Desexualize 15/8, 2015 at 16:58

@ClaytonStanley Sorry, I didn't understand your comment. – Galliard 15/8, 2015 at 19:24

Recommended topics

Hot tags