I got a weird result today.
To replicate it, consider the following data frames:
x <- data.frame(x=1:3, y=11:13)
y <- x[1:3, 1:2]
They are supposed to be identical, and indeed they are:
identical(x,y)
# [1] TRUE
Applying t() to identical objects should produce the same result, but:
identical(t(x),t(y))
# [1] FALSE
The difference is in the column names:
colnames(t(x))
# NULL
colnames(t(y))
# [1] "1" "2" "3"
Given this, if you want to stack y by columns, you get what you'd expect:
stack(as.data.frame(t(y)))
#   values ind
# 1      1   1
# 2     11   1
# 3      2   2
# 4     12   2
# 5      3   3
# 6     13   3
while:
stack(as.data.frame(t(x)))
#   values ind
# 1      1  V1
# 2     11  V1
# 3      2  V2
# 4     12  V2
# 5      3  V3
# 6     13  V3
In the latter case, as.data.frame() does not find the original column names and automatically generates them.
The culprit is in as.matrix(), called by t():
rownames(as.matrix(x))
# NULL
rownames(as.matrix(y))
# [1] "1" "2" "3"
A workaround is to set rownames.force:
rownames(as.matrix(x, rownames.force=TRUE))
# [1] "1" "2" "3"
rownames(as.matrix(y, rownames.force=TRUE))
# [1] "1" "2" "3"
identical(t(as.matrix(x, rownames.force=TRUE)),
t(as.matrix(y, rownames.force=TRUE)))
# [1] TRUE
(and rewrite the stack(...) call accordingly.)
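For what it's worth, the rewritten stack(...) call could be wrapped up roughly like this. This is only a sketch: stack_rows is a name made up for illustration, not something from base R, and it assumes a data frame whose columns share one type, as in the example above.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# Hypothetical helper (not part of base R): transpose with the row names
# always kept, then stack the resulting columns.
stack_rows <- function(df) {
  # rownames.force = TRUE keeps "1", "2", "3" even for automatic row names
  m <- as.matrix(df, rownames.force = TRUE)
  stack(as.data.frame(t(m)))
}

sx <- stack_rows(x)
sy <- stack_rows(y)
identical(sx, sy)
# [1] TRUE
```

With the row names forced, both calls label the ind column "1", "2", "3" instead of falling back to V1, V2, V3 for x.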
My questions are:
Why does as.matrix() treat x and y differently, and how can you tell the difference between them?
Note that other inspection functions do not reveal any difference between x and y:
identical(attributes(x), attributes(y))
# [1] TRUE
identical(str(x), str(y))
# ...
# [1] TRUE
# (str() returns NULL invisibly, so this comparison is TRUE for any two objects.)
Comments to solutions
Konrad Rudolph gives a concise but effective explanation of the behaviour outlined above (see also mt1022 for more details).
In short, Konrad shows that:
a) x and y are internally different;
b) "identical is simply too lax by default" to catch this internal difference.
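To make the internal difference visible, something along these lines should work. It is a sketch based on the hints in the comments (.row_names_info, dput and the attrib.as.set argument of identical); the exact values printed may depend on the R version, so no output is shown.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# .row_names_info() reports the number of rows, negated when the row names
# are "automatic" (stored in compact form rather than as a real vector).
info_x <- .row_names_info(x)
info_y <- .row_names_info(y)
print(c(x = info_x, y = info_y))

# dput() shows the row.names attribute as it is actually stored.
dput(x)
dput(y)

# A stricter comparison that does not treat attributes as an unordered set.
strict_same <- identical(x, y, attrib.as.set = FALSE)
print(strict_same)
```

The sign of .row_names_info() is what matters here: as.matrix() only keeps row names by default when they are not automatic.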
Now, if you take a subset T of the set S, which has all the elements of S, then S and T are exactly the same objects. So, if you take a data frame y, which has all the rows and columns of x, then x and y should be exactly the same objects. Unfortunately x ≠ y!
This behaviour is not only counterintuitive but also obfuscated: the difference is not self-evident but purely internal, and even the default identical function can't see it.
Another natural principle is that transposing two identical (matrix-like) objects produces identical objects. Again, this is broken by the fact that, before transposing, identical is "too lax"; after transposing, the default identical is enough to see the difference.
IMHO this behaviour (even if it is not a bug) is a misbehaviour for a scientific language like R.
Hopefully this post will draw some attention and the R team will consider revising it.
[...] row.names are defined, as they are different in dput(x) and dput(y). Maybe they are explicitly added when using [.data.frame – Dynamoelectric
identical(x, y, attrib.as.set=FALSE) seems to pick up on differences (noting the line in ?identical: "Note that identical(x, y, FALSE, FALSE, FALSE, FALSE) pickily tests for exact equality.") – Dynamoelectric
[...] .row_names_info and, as @Generalization pointed out, is because of the automatic row names in x. – Waterscape
as.matrix removes automatic row names so that they don't end up as row names in the matrix. – Waterscape
[...] length == nrow(x) vector or as a compact form of type c(NA, -nrow(x)) to avoid creating and carrying an as.character(1:nrow(x)) vector around. When subsetting "x", "[.data.frame" has to create some form of row.names for the subsetted "x". Even if x[c(1, 2, 3), ] seems to not need "row.names", something like x[c(2, 3, 1), ] does, and "[.data.frame" needs to be consistent regarding its output... – Vedavedalia
[...] c(NA, nrow(x)) (the object has "row.names" but there's no need to create 1:nrow(x)). In the second case a "row.names" attribute as c(2, 3, 1) has to be created. – Vedavedalia
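Following up on the comments above, here is a small sketch of the two cases, plus one possible way to put y back into the same state as x. Resetting the row names with rownames(y) <- NULL is my own suggestion rather than something proposed in the thread.

```r
x <- data.frame(x = 1:3, y = 11:13)
y <- x[1:3, 1:2]

# Reordering the rows means real row names have to be stored.
z <- x[c(2, 3, 1), ]
rownames(z)
# [1] "2" "3" "1"

# Assigning NULL resets the row names to the automatic (compact) form,
# after which as.matrix() and t() treat x and y alike.
rownames(y) <- NULL
.row_names_info(y)
# [1] -3
identical(t(x), t(y))
# [1] TRUE
```

This changes y itself, so it only suits cases where the original row labels carry no information.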