How can I ensure that a partition has representative observations from each level of a factor?
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?

set.seed(123)  # set the seed before any sampling so test.df is reproducible
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)),
                      c = factor(sample(c("apple", "orange"), 100, replace = TRUE)))

partition <- function(data, train.size = .7){
  # Sample row positions rather than relying on row names being 1..n
  train.index <- sample(seq_len(nrow(data)), round(train.size * nrow(data)))
  train <- data[train.index, ]
  test <- data[-train.index, ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}

part.data <- partition(test.df)
table(part.data$train[,'b'])
table(part.data$test[,'b'])
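
One way to see the problem directly is to list the levels that never occur in the training split. The following is a minimal base-R sketch; `missing_levels` is a hypothetical helper name introduced here for illustration, not part of the original code.

```r
# Hypothetical helper: levels of factor column `col` that have zero rows in `train`.
missing_levels <- function(train, col) {
  counts <- table(train[[col]])  # table() keeps zero-count levels of a factor
  names(counts)[counts == 0]
}

set.seed(123)
df <- data.frame(b = factor(sample(letters, 100, replace = TRUE)))
idx <- sample(seq_len(nrow(df)), 70)

# Subsetting rows keeps the full set of levels, so absent ones show up here
missing_levels(df[idx, , drop = FALSE], "b")
```

An empty result means every level is represented; anything else names the levels that would cause trouble when predicting on the test set.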

EDIT - New function using 'caret' package and createDataPartition():

partition <- function(data, factor = NULL, train.size = .7){
  if (!("package:caret" %in% search())){
    stop("Install and load the 'caret' package")
  }
  # With no factor supplied, fall back to partitioning on the row names
  y <- if (is.null(factor)) as.numeric(row.names(data)) else factor
  train.index <- createDataPartition(y, times = 1, p = train.size, list = FALSE)
  train <- data[train.index, ]
  test <- data[-train.index, ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
Yellowknife answered 11/5, 2013 at 5:1 Comment(3)
I know this doesn't answer your question, but is it even a good idea to condition on a factor variable with such a small number of observations? The coefficients are bound to be very imprecisely estimated and might make your out-of-sample forecasts worse rather than better. - Uranian
You are correct that this would be a bad idea. However, I would never use this function on such a small dataset in practice. I made it small so that the partitioned test.df is pretty much guaranteed to have some factor levels with 0 observations. - Yellowknife
I have the same problem, but the second partition function seems to handle only one factor at a time. As I understand it, your question is about getting a training set that contains all levels of the factors in input columns b and c, but createDataPartition only works on one column. For example, partition(test.df, factor = test.df[, c("b", "c")]) does not work. - Percuss

Try the caret package, particularly the function createDataPartition(). It should do exactly what you need. The package is available on CRAN, and its homepage is here:

caret - data splitting

The function below is based on code I found online a while back, which I modified slightly to better handle edge cases (such as asking for a sample size larger than the set, or a subset).

stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  # number) as the first two arguments.
  # * Decide on your sample size. For a sample proportional to the
  # population, enter "size" as a decimal. For an equal number
  # of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  # where the grouping variable is the fourth variable, use:
  #
  # > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  # named "z"; grouping variable is the third variable:
  #
  # > stratified(z, 3, 5)
  #
  require(sampling)
  temp <- df[order(df[group]), ]
  colsToReturn <- ncol(df)

  # Don't attempt to sample more rows than the smallest group has
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }

  if (size < 1) {
    # Proportional sampling: round each group's share up to a whole row
    size <- ceiling(table(temp[group]) * size)
  } else if (size >= 1) {
    # Fixed-size sampling: the same number of rows from every group
    size <- rep(size, times = length(table(temp[group])))
  }
  strat <- strata(temp, stratanames = names(temp[group]),
                  size = size, method = "srswor")
  dsample <- getdata(temp, strat)

  # Sort by the first column, keep the first colsToReturn columns, reset row names
  dsample <- dsample[order(dsample[1]), ]
  dsample <- data.frame(dsample[, 1:colsToReturn], row.names = NULL)
  return(dsample)

}
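
For the original goal (at least one training row for every level, across one or more factor columns), a base-R alternative can be sketched as follows. This is only an illustration under stated assumptions: `partition_by_levels` and its `factors` argument are hypothetical names, and the approach (force one randomly chosen row per observed level into the training set, then fill the remainder at random) is not how createDataPartition() or stratified() work.

```r
# Hypothetical helper: split `data` so that every observed level of each
# column named in `factors` appears at least once in the training set.
partition_by_levels <- function(data, factors, train.size = 0.7) {
  n <- nrow(data)
  pick_one <- function(rows) rows[sample.int(length(rows), 1)]

  # One randomly chosen row per observed level, for each factor column
  forced <- unique(unlist(lapply(factors, function(f) {
    rows_by_level <- split(seq_len(n), data[[f]], drop = TRUE)
    vapply(rows_by_level, pick_one, integer(1))
  })))

  # Fill the rest of the training set at random from the remaining rows
  n_extra <- max(0, round(train.size * n) - length(forced))
  rest <- setdiff(seq_len(n), forced)
  extra <- rest[sample.int(length(rest), min(n_extra, length(rest)))]

  train.index <- sort(c(forced, extra))
  list(train = data[train.index, , drop = FALSE],
       test  = data[-train.index, , drop = FALSE])
}

set.seed(123)
test.df <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                      b = factor(sample(letters, 100, replace = TRUE)),
                      c = factor(sample(c("apple", "orange"), 100, replace = TRUE)))
part.data <- partition_by_levels(test.df, c("b", "c"))
table(part.data$train$b)
```

Note the trade-off: rare levels end up over-represented in the training set relative to a purely random split, and if the number of forced rows exceeds round(train.size * n), the training set will be larger than requested.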
Angulo answered 11/5, 2013 at 6:53 Comment(8)
I'll check it out. I've heard of it before but never used it. - Yellowknife
Let me know. I have another function I could give you the code for as well. - Angulo
It might be nice to see your function... I'm still having some issues. I isolated the part of createDataPartition() that is messing me up, but I am unsure how to fix it. This is kind of getting into a question of "is it worth building a model using many factors with only one observation?" - Yellowknife
Sure, should I make it a separate answer? - Angulo
It's nice to try to mention where, or from whom, you got that code, rather than just "somewhere on the net". - Riyal
Yeah, I got this code more than a year before this question was posted, and then I modified the original to what you see above. I figured the answer was of more use than the citation, which I've since lost. - Angulo
I'm the original author, and there's a good chance you got it from an answer here on SO, since that's where I would have posted the version that uses the "sampling" package. Since then, there are two improved versions: a data.frame one and a data.table one (the latter requires the most recent development version of "data.table", but is very fast). - Riyal
Hey Ananda, I didn't in any way mean to take anything from you. It's really nice code and I took pains (I thought) to say I had gotten it from someone else. I just honestly couldn't recall where I'd seen it. I hope you can accept the apology. - Angulo
