Can we get factor matrices in R?

It seems it is not possible to get a matrix of factors in R. Is that true? If so, why? If not, how should I do it?

f <- factor(sample(letters[1:5], 20, replace = TRUE), levels = letters[1:5])
m <- matrix(f,4,5)
is.factor(m) # fail.

m <- factor(m,letters[1:5])
is.factor(m) # oh, yes? 
is.matrix(m) # nope. fail. 

dim(f) <- c(4,5) # aha?
is.factor(f) # yes.. 
is.matrix(f) # yes!

# but then I get a strange behavior
cbind(f,f) # is not a factor anymore
head(f,2) # doesn't give the first 2 rows but the first 2 elements of f
# should I worry about it?
Biebel answered 25/2, 2015 at 15:35 Comment(2)
Why do you need a matrix of factors? Did you think about using data.frame (which supports factors much better) instead of a matrix?Buffalo
@Buffalo I did think about it, yeah. But a data.frame really is too much for what I need. Is it not natural to need a matrix of comparable, homogeneous elements?Biebel

In this case, it may walk like a duck and even quack like a duck, but f from:

f <- factor(sample(letters[1:5], 20, replace = TRUE), levels = letters[1:5])
dim(f) <- c(4,5)

really isn't a matrix, even though is.matrix() claims that it is one. To be a matrix as far as is.matrix() is concerned, f only needs to be a vector and have a dim attribute of length 2. By adding that attribute to f you pass the test. As you have seen, however, once you start using f as a matrix, it quickly loses the features that make it a factor (you end up working with the internal codes instead of the levels, or the dimensions get lost).
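
To make the mismatch concrete, here is a small check comparing what is.matrix() reports with what the S3 class system sees. It is only a sketch; the seed is an arbitrary assumption added so the run is reproducible.

```r
set.seed(1)  # arbitrary seed (assumption), only for reproducibility
f <- factor(sample(letters[1:5], 20, replace = TRUE), levels = letters[1:5])
dim(f) <- c(4, 5)

is.matrix(f)           # TRUE: f has a dim attribute of length 2
is.factor(f)           # TRUE: the class attribute is still "factor"
inherits(f, "matrix")  # FALSE: the explicit class hides the implicit "matrix" class
class(f)               # "factor"
```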

There are really only matrices and arrays for the atomic vector types:

  1. logical,
  2. integer,
  3. double (real),
  4. complex,
  5. string (or character), and
  6. raw

plus, as @hadley reminds me, you can also have list matrices and arrays (by setting the dim attribute on a list object). See, for example, the Matrices & Arrays section of Hadley's book, Advanced R.

Anything outside those types is coerced to one of them via as.vector(). This happens in matrix(f, nrow = 4) not because f fails is.atomic() (it returns TRUE for f, because a factor is stored internally as an integer vector and typeof(f) returns "integer"), but because f has a class attribute. The class attribute sets the OBJECT bit on the internal representation of f, and anything that has a class is supposed to be coerced to one of the atomic types via as.vector():

matrix <- function(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
                   dimnames = NULL) {
    if (is.object(data) || !is.atomic(data)) 
        data <- as.vector(data)
....

Adding dimensions via dim<-() is a quick way to create an array without duplicating the object, but this bypasses some of the checks and balances that R would do if you coerced f to a matrix via the other methods:

matrix(f, nrow = 4) # or
as.matrix(f)
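
As a quick illustration of that coercion path, the sketch below uses a small factor made up for the example and checks the two conditions tested in matrix(), then inspects the result:

```r
# A small factor invented for this example
f <- factor(c("a", "b", "c", "a", "b", "c"), levels = c("a", "b", "c"))

is.object(f)  # TRUE: f carries a class attribute
is.atomic(f)  # TRUE: the underlying storage is an integer vector

# matrix() sees is.object(f) and calls as.vector(), which yields the labels
m <- matrix(f, nrow = 2)
typeof(m)     # "character"
is.factor(m)  # FALSE: the levels and the class are gone
```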

This gets found out when you try to use basic functions that work on matrices or use method dispatch. Note that after assigning dimensions to f, f still is of class "factor":

> class(f)
[1] "factor"

which explains the head() behaviour; you are not getting the head.matrix behaviour because f is not a matrix, at least as far as the S3 mechanism is concerned:

> debug(head.matrix)
> head(f) # we don't enter the debugger
[1] d c a d b d
Levels: a b c d e
> undebug(head.matrix)

and the head.default method calls [ for which there is a factor method, and hence the observed behaviour:

> debugonce(`[.factor`)
> head(f)
debugging in: `[.factor`(x, seq_len(n))
debug: {
    y <- NextMethod("[")
    attr(y, "contrasts") <- attr(x, "contrasts")
    attr(y, "levels") <- attr(x, "levels")
    class(y) <- oldClass(x)
    lev <- levels(x)
    if (drop) 
        factor(y, exclude = if (anyNA(levels(x))) 
            NULL
        else NA)
    else y
}
....

The cbind() behaviour can be explained from the documented behaviour (from ?cbind, emphasis mine):

The functions cbind and rbind are S3 generic, ...

....

In the default method, all the vectors/matrices must be atomic (see vector) or lists. Expressions are not allowed. Language objects (such as formulae and calls) and pairlists will be coerced to lists: other objects (such as names and external pointers) will be included as elements in a list result. Any classes the inputs might have are discarded (in particular, factors are replaced by their internal codes).

Again, the fact that f is of class "factor" is defeating you because the default cbind method will get called and it will strip the levels information and return the internal integer codes as you observed.
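
A short sketch of this behaviour, using a tiny factor made up for the example. The last step is the workaround suggested in the comments (restoring the attributes by hand); it is only reasonable when all inputs share the same levels.

```r
# A 2 x 2 "factor matrix" built the same way as in the question
f <- factor(c("a", "b", "c", "d"), levels = c("a", "b", "c", "d"))
dim(f) <- c(2, 2)

codes <- cbind(f, f)
typeof(codes)     # "integer": only the internal codes survive
is.factor(codes)  # FALSE

# Workaround from the comments: put levels and class back manually.
# Assumes every input has identical levels.
restored <- structure(codes, levels = levels(f), class = class(f))
is.factor(restored)  # TRUE
dim(restored)        # 2 4
```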

In many respects, you have to ignore, or at least not fully trust, what the is.foo functions tell you, because they use simple tests to decide whether something is or is not a foo object. From a particular point of view, is.matrix() and is.atomic() are clearly wrong when it comes to f (with dimensions). They are also right in terms of their implementation, or at least their behaviour can be understood from the implementation. I think is.atomic(f) is not correct, but if by "is of an atomic type" R Core mean "type" to be the thing returned by typeof(f), then is.atomic() is right. A stricter test is is.vector(), which f fails:

> is.vector(f)
[1] FALSE

because it has attributes beyond a names attribute:

> attributes(f)
$levels
[1] "a" "b" "c" "d" "e"

$class
[1] "factor"

$dim
[1] 4 5

As to how you should get a factor matrix: well, you can't, at least if you want it to retain the factor information (the labels for the levels). One solution would be to use a character matrix, which would retain the labels:

> fl <- levels(f)
> fm <- matrix(f, ncol = 5)
> fm
     [,1] [,2] [,3] [,4] [,5]
[1,] "c"  "a"  "a"  "c"  "b" 
[2,] "d"  "b"  "d"  "b"  "a" 
[3,] "e"  "e"  "e"  "c"  "e" 
[4,] "a"  "b"  "b"  "a"  "e"

and we store the levels of f in fl for future use in case we lose some elements of the matrix along the way.

Or work with the internal integer representation:

> (fm2 <- matrix(unclass(f), ncol = 5))
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    1    1    3    2
[2,]    4    2    4    2    1
[3,]    5    5    5    3    5
[4,]    1    2    2    1    5

and you can always get back to the levels/labels again via:

> fm2[] <- fl[fm2]
> fm2
     [,1] [,2] [,3] [,4] [,5]
[1,] "c"  "a"  "a"  "c"  "b" 
[2,] "d"  "b"  "d"  "b"  "a" 
[3,] "e"  "e"  "e"  "c"  "e" 
[4,] "a"  "b"  "b"  "a"  "e"

Using a data frame would seem to be not ideal for this as each component of the data frame would be treated as a separate factor whereas you seem to want to treat the array as a single factor with one set of levels.

If you really wanted a factor matrix, you would most likely need to create your own S3 class, plus all the methods to go with it. For example, you might store the factor matrix as a character matrix with class "factorMatrix", and store the levels alongside it as an extra attribute. Then you would need to write [.factorMatrix, which would grab the levels, use the default [ method on the matrix, and then add the levels attribute back on again. You could write cbind and head methods as well. The list of required methods would grow quickly, but a simple implementation may suit. If you make your objects have class c("factorMatrix", "matrix") (i.e. inherit from the "matrix" class), you'll pick up all the properties/methods of the "matrix" class (which will drop the levels and other attributes), so you can at least work with the objects and see where you need to add new methods to fill out the behaviour of the class.
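
A minimal sketch of that idea follows. The constructor name factorMatrix(), its arguments, and the method bodies are all hypothetical and only illustrate the approach; this `[` method handles matrix-style x[i, j] indexing only, and missing values are not considered.

```r
# Hypothetical constructor: a character matrix plus a "levels" attribute
factorMatrix <- function(x, nrow, ncol, levels = sort(unique(as.character(x)))) {
  x <- as.character(x)
  stopifnot(all(x %in% levels))  # every value must be a declared level
  structure(matrix(x, nrow = nrow, ncol = ncol),
            levels = levels,
            class = c("factorMatrix", "matrix"))
}

# Subsetting: grab the levels, subset the plain matrix, restore the levels
`[.factorMatrix` <- function(x, i, j, drop = FALSE) {
  lev <- attr(x, "levels")
  m <- unclass(x)
  if (missing(i)) i <- seq_len(nrow(m))
  if (missing(j)) j <- seq_len(ncol(m))
  out <- m[i, j, drop = drop]
  if (is.matrix(out)) {
    structure(out, levels = lev, class = c("factorMatrix", "matrix"))
  } else {
    factor(out, levels = lev)  # dimensions were dropped: return a plain factor
  }
}

# Column binding: bind the character data and keep the union of the levels
cbind.factorMatrix <- function(..., deparse.level = 1) {
  args <- list(...)
  lev <- unique(unlist(lapply(args, attr, "levels")))
  out <- do.call(cbind, lapply(args, unclass))
  structure(out, levels = lev, class = c("factorMatrix", "matrix"))
}

fm <- factorMatrix(c("a", "b", "c", "a", "b", "c"), nrow = 2, ncol = 3,
                   levels = c("a", "b", "c", "d"))
sub <- fm[1, , drop = FALSE]     # still a factorMatrix, with all four levels
dropped <- fm[1, , drop = TRUE]  # an ordinary factor
both <- cbind(fm, fm)            # 2 x 6 factorMatrix
```

Because the levels live in an ordinary "levels" attribute, levels(fm) works without a dedicated method. Other generics (rbind, t, print, and so on) would still need their own methods.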

Myopic answered 25/2, 2015 at 16:48 Comment(13)
Great explanation +1, but disagree that the "factor" matrix is not a matrix. Rather, the failures for OP are an artifact of the inconsistent treatment of implicit classes in S3 dispatch. If R treated our factor matrix as class c("factor", "matrix", "integer") then most functions would work fine. I think failure to do so is a (small) design flaw in R.Rodriguez
Although your point about attribute discarding functions (rbind, etc.) still stands.Rodriguez
You can always mash the result of a function into a matrixy factor thing: structure(cbind(f,f),levels=levels(f),class=class(f)) - basically restoring the attributes that the cbind method took off. #hackhackAgglutinogen
@Rodriguez I would add that "it depends what you mean by a matrix" when it comes to whether the f-matrix is a matrix or not. From the viewpoint of is.matrix it clearly is. From the point of view of the S3 class system, it isn't (it is not of class "matrix" nor does it inherit from it). From the point of view of what the docs define a matrix to be, it also isn't a matrix, as f is not an atomic vector. That last point is perhaps pertinent; that you can end up with an object that looks like a matrix is a quirk, which soon gets rectified. And yes, R could be improved if R Core wanted to.Myopic
@Spacedman, re: your comment in R chat, you can fop=function(f) function(x, ...) structure(f(unclass(x), ...),levels=levels(x),class=class(x)). Still need to account for factors with mixed levels though.Rodriguez
This just amazes me. Feels like the learning curve for R never stops climbing to the skies! Many thanks for this pretty complete explanation. I'll go for an integer matrix then, with a character vector storing my levels (this is what I actually did), since it'd be too much work creating a new class and supporting it for my limited knowledge of R. I hope this thread'll be updated if ever R started featuring a broader factor support. Thank you guys anyway ;)Biebel
Lists can also be matricesAnnapolis
I think it would be better to work with a character matrix since there's less chance of accidentally treating the values like numbers. There's little memory overhead to using strings instead of integers because of the global string pool.Annapolis
@Rodriguez for there to be such a thing as a factor-matrix, S3 would need to support multiple inheritance, which it does not. (Not that "matrix" is really well defined in the S3 system, being mostly an implicit class)Annapolis
@hadley, I'm not sure multiple inheritance is needed here. Consider this example from ts: z <- ts(matrix(rnorm(300), 100, 3), start = c(1961, 1), frequency = 12). Both a time series and a matrix (though the matrix is defined explicitly). Simple linear inheritance is fine. In fact, we can do stuff like class(f)<-c("factor", "matrix") and all of a sudden head(f) actually works.Rodriguez
@Rodriguez I think you might be able to make it work, but it's going to fundamentally be a kludge. Matrix isn't even really an S3 classAnnapolis
@hadley, agree, though I do think it should be since it is treated as such (i.e. there are a lot of .matrix methods). There is this unfortunate discrepancy that when there is no class, then S3 dispatches on Matrix, but when there is one, the implicit class is ignored. There is really no reason why the implicit classes shouldn't be dispatched on when there are explicit classes (other than that isn't how R is coded).Rodriguez
@Annapolis Thanks; knew I would miss one. I've added a note to this effect with a link to the relevant section of Advanced R where this is shown.Myopic

Unfortunately factor support is not completely universal in R, so many R functions default to treating factors as their internal storage type, which is integer:

> typeof(factor(letters[1:3]))
[1] "integer"

This is what happens with matrix and cbind. They don't know how to handle factors, but they do know what to do with integers, so they treat your factor like an integer. head is actually the opposite. It does know how to handle a factor, but it never bothers to check that your factor is also a matrix, so it just treats it like a normal dimensionless factor vector.

Your best bet to operate as if you had factors with your matrix is to coerce it to character. Once you are done with your operations, you can restore it back to factor form. You could also do this with the integer form, but then you risk weird stuff (you could for example do matrix multiplication on an integer matrix, but that makes no sense for factors).
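
A rough sketch of that round trip, with a small factor made up for the example. Note that factor() returns a dimensionless vector, so the dimensions have to be tracked separately if they matter afterwards.

```r
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
lev <- levels(f)  # keep the levels so they can be restored later

# Work on a character matrix so the labels survive matrix operations
m <- matrix(as.character(f), nrow = 2)
wide <- cbind(m, m)  # still a character matrix, now 2 x 4

# Restore a (dimensionless) factor once the matrix work is done
back <- factor(wide, levels = lev)
```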

Note that if you add class "matrix" to your factor some (but not all) things start working:

f <- factor(letters[1:9])
dim(f) <- c(3, 3)
class(f) <- c("factor", "matrix")
head(f, 2)

Produces:

     [,1] [,2] [,3]
[1,] a    d    g   
[2,] b    e    h   
Levels: a b c d e f g h i

This doesn't fix rbind, etc.

Rodriguez answered 25/2, 2015 at 16:52 Comment(5)
Let's go for the integer matrix then, since I feel it's heavy to work with characters when all I need is 5 or 10 levels.. and try not to make dummy operations with my matrix ;)Biebel
@lago-lito, as per Hadley's comment, characters don't actually take up much room at all since the actual character strings are only stored once in memory. Also, note update to my answerRodriguez
Oh really? Where should I read further about the way matrices are stored in memory? I must admit that it does feel weird thinking of a matrix of integer as light as a matrix of strings..Biebel
@lago-lito, see the note in ?factor for the cost of character vectors. Matrices are just stored as vectors with a dim attribute, so the only thing that really matters is how the underlying vector is stored.Rodriguez
Great, thanks! I've also just found the function utils::object.size that can help me measure the actual weight of my matrices, lot of fun ;)Biebel
