Why are write.csv and read.csv not consistent? [closed]

The problem is simple. Consider the following example:

m <- head(iris)
write.csv(m, file = 'm.csv')
m1 <- read.csv('m.csv')

The result is that m1 differs from the original object m: it has a new first column named "X". If I really want the two to be equal, I have to use additional arguments, as in these two examples:

write.csv(m, file = 'm.csv', row.names = FALSE)
# and then
m1 <- read.csv('m.csv')

or

write.csv(m, file = 'm.csv')
m1 <- read.csv('m.csv', row.names = 1)
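
To see what each variant gives back, here is a minimal sketch. The temporary file and the names m_plain, m_a and m_b are only for illustration and are not part of the original example.

```r
m <- head(iris)
f <- tempfile(fileext = ".csv")  # temporary file, so nothing is left in the working directory

# Default round trip: the row names come back as an extra first column
write.csv(m, file = f)
m_plain <- read.csv(f)
names(m_plain)[1]          # name given to the extra column
ncol(m_plain) - ncol(m)    # number of additional columns

# Variant 1: do not write the row names at all
write.csv(m, file = f, row.names = FALSE)
m_a <- read.csv(f)

# Variant 2: write them, then read the first column back as row names
write.csv(m, file = f)
m_b <- read.csv(f, row.names = 1)

# Both variants restore the original column names
identical(names(m_a), names(m))
identical(names(m_b), names(m))
```

Even with either fix, identical(m, m_a) is not expected to be TRUE: the Species column does not come back as the same factor, because the file stores only the values and not the unused factor levels (and since R 4.0.0 read.csv returns character columns unless stringsAsFactors = TRUE is passed).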

The question is: what is the reason for this difference? In particular, if write.csv and read.csv are supposedly intended to stick to the Excel convention, why don't they import the same object that was exported in the first place? To me this is very counterintuitive and highly undesirable behavior.

(Exactly the same thing happens if I use the csv2 variants of these functions.)

Thanks in advance!


These are the data.frames m and m1 if you prefer not to use R to see the example:

> m
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> m1
  X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9         3.0          1.4         0.2  setosa
3 3          4.7         3.2          1.3         0.2  setosa
4 4          4.6         3.1          1.5         0.2  setosa
5 5          5.0         3.6          1.4         0.2  setosa
6 6          5.4         3.9          1.7         0.4  setosa
Outhaul answered 20/9, 2012 at 11:54 Comment(15)
Why does it matter why they're inconsistent? There's no way the defaults will be changed now. Out of curiosity, where does it say that read.csv and write.csv are supposed to use some Excel convention?Kazan
As I said before, I think that it is counter intuitive, but this is just my opinion. In particular, if write.csv and read.csv are a "fast" way to forget about the specifics and "just do what you need", this is very annoying. In my case I always forget about this detail. You can read about this Excel convention with ?write.table.Outhaul
@Outhaul so write yourself your own wrappers that set your preferred defaults. This is after all a programming language.Scrutable
I'm with @Outhaul on this one. Totally undesirable. Both functions should have the same concept of a standard file (read.csv uses the most common format) so we don't have to remember which function uses what, or have to go through the doc each time we use them. It was a bad design in the first place.Retraction
Serendipity? (Don't think I should offer that as an official answer tho...)Fennell
When ?write.table provides an example of writing a CSV to input into Excel (I assume this is the "convention" you mention), it specifically says you need the equivalent of read.csv('m.csv', row.names=1) to read it back into R. Even if lots of people find this counter-intuitive, it's not going to change now (these defaults are probably 10+ years old). Hence, why these defaults were chosen is a moot point, and your question doesn't really have an answer.Kazan
@Retraction right, but you aren't going to get R changed now I would venture. Those functions have been like that for aeons.Fennell
As good as the Q may be, I'm voting to close because unless the usually silent R Core chime in with an official statement, any answers (if an answer can even be supplied) will be opinion & that is OT for SO.Fennell
@JoshuaUlrich I guess so. The help file uses the word "convention" explicitly, so I think there is no need for quotation marks, unless the help file itself uses the term wrongly. My intention is not to ask for the functions to be changed, of course; I just wanted to know so I can tell some students.Outhaul
From svn log src/library/utils/R/write.table.R "r32344 | ripley | 2004-12-27 08:25:32 -0500 (Mon, 27 Dec 2004) | 4 lines; add write.csv[2]" (and in r34879, "allow write.csv(row.names=FALSE)")Shela
@Juan: Sorry, I meant "convention" as an offense to Excel, not to you.Kazan
@flodel: it would be nice if there were a way to do this on-line, but I'm not aware of one. Juan, if you wanted you could post this issue as an answer at #1535521 ...Shela
@BenBolker. That's it! So the reason is that it was December 27th and Ripley was still nursing a big hangover. That, or he was not happy about the gifts he got for Christmas. Payback.Retraction
Because read.csv was written by the Lilliputians and write.table was written by the Blefuscudians. ;)Checkmate
@BenBolker thanks for the cryptic information! I also posted the issue in the Q you suggested (great Q btw).Outhaul

Here's my guess...

write.table writes a data.frame to a file and data.frames always have row names, so not writing row names by default would be throwing away information. (Yes, write.table will also write a matrix and matrices don't have to have row names, but data.frames are probably used much more often than matrices.)

read.table returns a data.frame but CSV files don't have any concept of row names, so someone may argue that it's counter-intuitive to assume, by default, that the first column of a CSV is a row name.
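
One way to look at this is to inspect the header line each writer produces. The sketch below is only illustrative (the file and variable names are made up). It relies on the behavior described in ?read.table: when the header has one fewer field than the data rows, the first column is used as row names.

```r
m <- head(iris)
f_csv <- tempfile(fileext = ".csv")
f_txt <- tempfile(fileext = ".txt")

# write.csv adds an empty header field above the row-name column,
# so the file looks like an ordinary table with one extra column
write.csv(m, file = f_csv)
header_csv <- readLines(f_csv, n = 1)
header_csv

# write.table writes no header field for the row names by default,
# so the header is one field shorter than the data rows
write.table(m, file = f_txt)
header_txt <- readLines(f_txt, n = 1)
header_txt

# read.table detects the shorter header and uses the first column as row names
m_txt <- read.table(f_txt)
names(m_txt)
```

So the plain write.table/read.table pair does round-trip the row names with default arguments; the extra "X" column appears because the CSV header contains an empty name for that column, which read.csv then treats as a regular data column.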

Now there may be a way to make these two functions consistent, but I would argue that writing to a text file isn't the best way to output/input data from one R session to another. It's much safer/faster to use save, load, saveRDS, readRDS, etc.
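
As a minimal sketch of that alternative (the temporary file name is just for illustration), saveRDS and readRDS store the object itself rather than a text rendering of it:

```r
m <- head(iris)
f <- tempfile(fileext = ".rds")

saveRDS(m, file = f)   # serializes the data.frame, including row names and factor levels
m2 <- readRDS(f)

identical(m, m2)       # compares the restored object with the original
levels(m2$Species)     # factor levels are kept, including the unused ones
```

No extra arguments are needed here, because nothing about the object has to be inferred from text when it is read back.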

Kazan answered 20/9, 2012 at 12:51 Comment(2)
This is probably the best answer we are going to get, unless Brian Ripley himself comes here and sheds some light! Thanks Joshua.Outhaul
Since save, load and related functions are the best options for keeping all the information, I think write.csv and read.csv should give priority to ease of use (which would be achieved by not saving row names by default), while keeping the option of using row.names = TRUE when exporting.Outhaul
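
The wrapper idea suggested in the comments could be sketched as follows. The function name write_csv_no_rn is hypothetical (it is not part of base R or any package); it only changes the default of row.names and passes everything else through to write.csv.

```r
# Hypothetical convenience wrapper: same as write.csv, but row names
# are omitted unless explicitly requested with row.names = TRUE
write_csv_no_rn <- function(x, file, ..., row.names = FALSE) {
  utils::write.csv(x, file = file, row.names = row.names, ...)
}

m <- head(iris)
f <- tempfile(fileext = ".csv")

write_csv_no_rn(m, f)
m1 <- read.csv(f)
names(m1)   # no extra first column
```

Defining such a helper once, for example in a course setup script, avoids having to remember the extra argument each time.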
