Dataframes in a list; adding a new variable with name of dataframe
Asked Answered
I

5

15

I have a list of dataframes which I eventually want to merge while maintaining a record of their original dataframe name or list index. This will allow me to subset etc across all the rows. To accomplish this I would like to add a new variable 'id' to every dataframe, which contains the name/index of the dataframe it belongs to.

Edit: "In my real code the dataframe variables are created from reading multiple files using the following code, so I don't have actual names only those in the 'files.to.read' list which I'm unsure if they will align with the dataframe order:

mylist <- llply(files.to.read, read.csv)

A few methods have been highlighted in several posts: Working-with-dataframes-in-a-list-drop-variables-add-new-ones and Using-lapply-with-changing-arguments

I have tried two similar methods, the first using the index list:

df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1,df2)

# Adds a new coloumn 'id' with a value of 5 to every row in every dataframe.
# I WANT to change the value based on the list index.
mylist1 <- lapply(mylist, 
    function(x){
        x$id <- 5
        return (x)
    }
)
#Example of what I WANT, instead of '5'.
#> mylist1
#[[1]]
  #x  y id
#1 1 11  1
#2 2 12  1
#3 3 13  1
#4 4 14  1
#5 5 15  1
#
#[[2]]
  #x  y id
#1 1 11  2
#2 2 12  2
#3 3 13  2
#4 4 14  2
#5 5 15  2

The second attempts to pass the names() of the list.

# I WANT it to add a new coloumn 'id' with the name of the respective dataframe
# to every row in every dataframe.
mylist2 <- lapply(names(mylist), 
    function(x){
        portfolio.results[[x]]$id <- "dataframe name here"
        return (portfolio.results[[x]])
    }
)
#Example of what I WANT, instead of 'dataframe name here'.
# mylist2
#[[1]]
  #x  y id
#1 1 11  df1
#2 2 12  df1
#3 3 13  df1
#4 4 14  df1
#5 5 15  df1
#
#[[2]]
  #x  y id
#1 1 11  df2
#2 2 12  df2
#3 3 13  df2
#4 4 14  df2
#5 5 15  df2

But the names() function doesn't work on a list of dataframes; it returns NULL. Could I use seq_along(mylist) in the first example.

Any ideas or better way to handle the whole "merge with source id"

Edit - Added Solution below: I've implemented a solution using Hadleys suggestion and Tommy’s nudge which looks something like this.

files.to.read <- list.files(datafolder, pattern="\\_D.csv$", full.names=FALSE)
mylist <- llply(files.to.read, read.csv)
all <- do.call("rbind", mylist)
all$id <- rep(files.to.read, sapply(mylist, nrow))

I used the files.to.read vector as the id for each dataframe

I also changed from using merge_recurse() as it was very slow for some reason.

 all <- merge_recurse(mylist)

Thanks everyone.

Interdenominational answered 16/8, 2011 at 5:1 Comment(0)
L
19

Personally, I think it's easier to add the names after collapse:

df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)

all <- do.call("rbind", mylist)
all$id <- rep(names(mylist), sapply(mylist, nrow))
Laura answered 16/8, 2011 at 18:18 Comment(2)
Thanks I was unsure if collapsing would retain the order. I'm collapsing the list in order to use qplot() and specify the subsetting by $id. However, it is very slow to collapse so I'm open to other options.Interdenominational
Wow, I used rbind() instead of merge_recurse() in my actual code and the total time reduced from ~10mins to 10 seconds.Interdenominational
S
9

Your first attempt was very close. By using indices instead of values it will work. Your second attempt failed because you didn't name the elements in your list.

Both solutions below use the fact that lapply can pass extra parameters (mylist) to the function.

df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1=df1,df2=df2) # Name each data.frame!
# names(mylist) <- c("df1", "df2") # Alternative way of naming...

# Use indices - and pass in mylist
mylist1 <- lapply(seq_along(mylist), 
        function(i, x){
            x[[i]]$id <- i
            return (x[[i]])
        }, mylist
)

# Now the names work - but I pass in mylist instead of using portfolio.results.
mylist2 <- lapply(names(mylist), 
    function(n, x){
        x[[n]]$id <- n
        return (x[[n]])
    }, mylist
)
Stichter answered 16/8, 2011 at 5:46 Comment(1)
Thanks for answering both items. I edited my question to state that I didn't have the names but I actually do. Converting the files.to.read to a list allows me to assign each list element the corresponding name, providing they align correctly...I didn't see the woods for the trees!Interdenominational
H
2

names() could work it it had names, but you didn't give it any. It's an unnamed list. You will need ti use numeric indices:

> for(i in 1:length(mylist) ){ mylist[[i]] <- cbind(mylist[[i]], id=rep(i, nrow(mylist[[i]]) ) ) }
> mylist
[[1]]
  x  y id
1 1 11  1
2 2 12  1
3 3 13  1
4 4 14  1
5 5 15  1

[[2]]
  x  y id
1 1 11  2
2 2 12  2
3 3 13  2
4 4 14  2
5 5 15  2
Homework answered 16/8, 2011 at 5:37 Comment(1)
Thanks. Between Tommy, Hadley and yourself I've created a suitable solution. I ultimately wanted the names for recognition further down the process.Interdenominational
M
1

dlply function form plyr package could be an answer:

library('plyr')
df1 <- data.frame(x=c(1:5),y=c(11:15))
df2 <- data.frame(x=c(1:5),y=c(11:15))
mylist <- list(df1 = df1, df2 = df2)

all <- ldply(mylist)
Monadism answered 9/11, 2014 at 13:34 Comment(0)
J
0

You could also use tidyverse, using lst instead of list which automatically names the list for you and then use imap:

library(tidyverse)
mylist <- dplyr::lst(df1, df2)
purrr::imap(mylist, ~mutate(.x, id = .y))
# $df1
#   x  y  id
# 1 1 11 df1
# 2 2 12 df1
# 3 3 13 df1
# 4 4 14 df1
# 5 5 15 df1

# $df2
#   x  y  id
# 1 1 11 df2
# 2 2 12 df2
# 3 3 13 df2
# 4 4 14 df2
# 5 5 15 df2
Juvenile answered 4/5, 2022 at 9:32 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.