How do I make a list of data frames?
P

10

271

How do I make a list of data frames and how do I access each of those data frames from the list?

For example, how can I put these data frames in a list?

d1 <- data.frame(y1 = c(1, 2, 3),
                 y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1),
                 y2 = c(6, 5, 4))
Pratt answered 6/7, 2013 at 2:16 Comment(3)
This is in a couple answers, but it's worth having a visible comment here too: use = not <- inside data.frame(). By using <- you create y1 and y2 in your global environment and your data frame isn't what you want it to be.Tragedienne
Look at that mess of code with no spaces and <-s inside data.frame(). What a newb I was.Pratt
Not anymore. I just edited your question to fix the code formatting. Feel free to revert if you feel nostalgic.Giuseppe
S
156

This isn't related to your question, but you want to use = and not <- within the function call. If you use <-, you'll end up creating variables y1 and y2 in whatever environment you're working in:

d1 <- data.frame(y1 <- c(1, 2, 3), y2 <- c(4, 5, 6))
y1
# [1] 1 2 3
y2
# [1] 4 5 6

This won't have the seemingly desired effect of creating column names in the data frame:

d1
#   y1....c.1..2..3. y2....c.4..5..6.
# 1                1                4
# 2                2                5
# 3                3                6

The = operator, on the other hand, will associate your vectors with arguments to data.frame.

As for your question, making a list of data frames is easy:

d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

You access the data frames just like you would access any other list element:

my.list[[1]]
#   y1 y2
# 1  1  4
# 2  2  5
# 3  3  6
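
If you would rather access the data frames by name, one option is to name the list elements when you build the list. This is only a small sketch; the element names first and second are arbitrary choices for illustration.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))

# Name the elements when building the list (names are illustrative)
my.named.list <- list(first = d1, second = d2)

# Access by name with [[ or $, or still by position
by_name     <- my.named.list[["second"]]
by_dollar   <- my.named.list$second
by_position <- my.named.list[[2]]
```

All three forms return the same data frame, so you can mix them freely.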
Spence answered 6/7, 2013 at 2:36 Comment(0)
T
475

The other answers show you how to make a list of data.frames when you already have a bunch of data.frames, e.g., d1, d2, .... Having sequentially named data frames is a problem, and putting them in a list is a good fix, but best practice is to avoid having a bunch of data.frames outside a list in the first place.

The other answers give plenty of detail on how to assign data frames to list elements, access them, etc. We'll cover that a little here too, but the main point is this: don't wait until you have a bunch of data.frames to add them to a list. Start with the list.

The rest of this answer will cover some common cases where you might be tempted to create sequential variables, and show you how to go straight to lists. If you're new to lists in R, you might want to also read What's the difference between [[ and [ in accessing elements of a list?.
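
Since everything that follows leans on that distinction, here is a minimal sketch of [ versus [[ on a list of data frames. The list dfs and its contents are made up for illustration.

```r
# Hypothetical two-element list of data frames
dfs <- list(a = data.frame(x = 1:3), b = data.frame(x = 4:6))

# Single brackets keep the list wrapper: the result is a list of length 1
sub_list <- dfs[1]
class(sub_list)

# Double brackets extract the element itself: the result is a data frame
one_df <- dfs[[1]]
class(one_df)
```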


Lists from the start

Don't ever create d1, d2, d3, ..., dn in the first place. Create a list d with n elements.

Reading multiple files into a list of data frames

This is done pretty easily when reading in files. Maybe you've got files data1.csv, data2.csv, ... in a directory. Your goal is a list of data.frames called my_data. The first thing you need is a vector with all the file names. You can construct this with paste (e.g., my_files = paste0("data", 1:5, ".csv")), but it's probably easier to use list.files to grab all the appropriate files: my_files <- list.files(pattern = "\\.csv$"). You can use regular expressions to match the files; read more about regular expressions in other questions if you need help there. This way you can grab all CSV files even if they don't follow a nice naming scheme. Or you can use a fancier regex pattern if you need to pick certain CSV files out from a bunch of them.

At this point, most R beginners will use a for loop, and there's nothing wrong with that; it works just fine.

my_data <- list()
for (i in seq_along(my_files)) {
    my_data[[i]] <- read.csv(file = my_files[i])
}

A more R-like way to do it is with lapply, which is a shortcut for the above:

my_data <- lapply(my_files, read.csv)

Of course, substitute another data import function for read.csv as appropriate. readr::read_csv or data.table::fread will be faster, and you may need a different function for a different file type.

Either way, it's handy to name the list elements to match the files:

names(my_data) <- gsub("\\.csv$", "", my_files)
# or, if you prefer the consistent syntax of stringr
# (escape the dot and anchor the pattern, as in the gsub call above)
names(my_data) <- stringr::str_replace(my_files, pattern = "\\.csv$", replacement = "")
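
To see the whole workflow end to end, here is a self-contained sketch that writes a few small CSV files to a temporary directory and reads them back into a named list. The directory name csv_demo and the file contents are invented for the example, and tools::file_path_sans_ext is used as one more way to strip the extension.

```r
# Work in a throwaway directory so the example is self-contained
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# Write three small CSV files (made-up data, for illustration only)
for (i in 1:3) {
  write.csv(data.frame(id = i, value = i * 10),
            file = file.path(tmp_dir, paste0("data", i, ".csv")),
            row.names = FALSE)
}

# full.names = TRUE returns paths that read.csv can open
# regardless of the current working directory
my_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
my_data <- lapply(my_files, read.csv)

# Name each element after its file, minus directory and extension
names(my_data) <- tools::file_path_sans_ext(basename(my_files))
names(my_data)
```

Using full.names = TRUE matters as soon as the files are not in the working directory, which is why basename() is needed before building the names.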

Splitting a data frame into a list of data frames

This is super easy: the base function split() does it for you. You can split by a column (or columns) of the data, or by anything else you want:

mt_list = split(mtcars, f = mtcars$cyl)
# This gives a list of three data frames, one for each value of cyl

This is also a nice way to break a data frame into pieces for cross-validation. Maybe you want to split mtcars into training, test, and validation pieces.

groups = sample(c("train", "test", "validate"),
                size = nrow(mtcars), replace = TRUE)
mt_split = split(mtcars, f = groups)
# and mt_split has appropriate names already!
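
Once the data frame is split, the usual next step is to apply the same function to every piece. A short sketch with the built-in mtcars data; the summaries chosen here (row counts and mean mpg) are just examples.

```r
mt_list <- split(mtcars, f = mtcars$cyl)

# One element per distinct value of cyl, named after that value
names(mt_list)

# Apply the same function to every piece
rows_per_group <- sapply(mt_list, nrow)
mean_mpg <- sapply(mt_list, function(df) mean(df$mpg))
```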

Simulating a list of data frames

Maybe you're simulating data, something like this:

my_sim_data = data.frame(x = rnorm(50), y = rnorm(50))

But who does only one simulation? You want to do this 100 times, 1000 times, more! But you don't want 10,000 data frames in your workspace. Use replicate and put them in a list:

sim_list = replicate(n = 10,
                     expr = {data.frame(x = rnorm(50), y = rnorm(50))},
                     simplify = FALSE)

In this case especially, you should also consider whether you really need separate data frames, or whether a single data frame with a "group" column would work just as well. Using data.table or dplyr, it's quite easy to do things "by group" to a data frame.
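
A minimal sketch of the single-data-frame alternative, using only base R. The column name sim, the sizes, and the seed are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

n_sims <- 10
n_obs  <- 50

# One long data frame; sim records which simulation each row belongs to
sim_df <- data.frame(sim = rep(seq_len(n_sims), each = n_obs),
                     x   = rnorm(n_sims * n_obs),
                     y   = rnorm(n_sims * n_obs))

# A by-group summary in base R: mean of x and y within each simulation
sim_means <- aggregate(cbind(x, y) ~ sim, data = sim_df, FUN = mean)
```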

I didn't put my data in a list :( I will next time, but what can I do now?

If they're an odd assortment (which is unusual), you can simply assign them:

mylist <- list()
mylist[[1]] <- mtcars
mylist[[2]] <- data.frame(a = rnorm(50), b = runif(50))
...

If you have data frames named in a pattern, e.g., df1, df2, df3, and you want them in a list, you can get them if you can write a regular expression to match the names. Something like

df_list = mget(ls(pattern = "df[0-9]"))
# this would match any object with "df" followed by a digit in its name
# you can test what objects will be got by just running the
ls(pattern = "df[0-9]")
# part and adjusting the pattern until it gets the right objects.

Generally, mget is used to get multiple objects and return them in a named list. Its counterpart get is used to get a single object and return it (not in a list).

Combining a list of data frames into a single data frame

A common task is combining a list of data frames into one big data frame. If you want to stack them on top of each other, you would use rbind for a pair of them, but for a list of data frames here are three good choices:

# base option - slower but no extra dependencies
big_data = do.call(what = rbind, args = df_list)

# data.table and dplyr have nice functions for this that
#  - are much faster
#  - can add an id column to identify the source
#  - can fill in missing values if some data frames have more columns than others
# see their help pages for details
big_data = data.table::rbindlist(df_list)
big_data = dplyr::bind_rows(df_list)

(Similarly using cbind or dplyr::bind_cols for columns.)
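
If you want a source id column without extra packages, one possible base R approach is to tag each data frame with its list name before stacking. This is a sketch under the assumption that the list is named and all data frames share the same columns; df_list and the column name source are hypothetical.

```r
# Hypothetical named list of data frames with identical columns
df_list <- list(a = data.frame(y1 = 1:2, y2 = 3:4),
                b = data.frame(y1 = 5:6, y2 = 7:8))

# Tag each data frame with its list name, then stack them
tagged <- Map(function(df, nm) cbind(source = nm, df), df_list, names(df_list))
big_data <- do.call(rbind, tagged)
rownames(big_data) <- NULL  # drop the "a.1", "a.2", ... row names rbind creates
```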

To merge (join) a list of data frames, you can see these answers. Often, the idea is to use Reduce with merge (or some other joining function) to get them together.
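
The Reduce-with-merge idea can be sketched as follows. The three data frames and the key column id are invented for the example; merge's default is an inner join, so only ids present in every data frame survive.

```r
# Hypothetical data frames sharing a key column id
df_list <- list(data.frame(id = 1:3, a = c(10, 20, 30)),
                data.frame(id = 1:3, b = c("x", "y", "z")),
                data.frame(id = 2:3, c = c(TRUE, FALSE)))

# Reduce applies merge pairwise: merge(merge(df1, df2), df3)
merged <- Reduce(function(left, right) merge(left, right, by = "id"), df_list)
```

Pass all = TRUE inside the merge call if you want a full outer join instead.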

But I really need sequentially named variables

They can be a pain to work with, and you almost never actually need them. If you do, do everything you can in a list for ease, and then use list2env() to put all the items of the (named) list into an environment, such as your .GlobalEnv.
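
A small sketch of list2env(). The list contents are made up, and a fresh environment is used here only to keep the example from touching your workspace.

```r
df_list <- list(d1 = data.frame(y1 = 1:3), d2 = data.frame(y1 = 4:6))

# Copy each named element into an environment as a separate variable.
# A fresh environment is used here to keep the sketch tidy;
# pass envir = .GlobalEnv to create d1 and d2 in your workspace instead.
target_env <- new.env()
list2env(df_list, envir = target_env)

sort(ls(target_env))
```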

Why put the data in a list?

Put similar data in lists because you want to do similar things to each data frame, and functions like lapply, sapply, do.call, the purrr package, and the old plyr l*ply functions make it easy to do that. Examples of people easily doing things with lists are all over SO.

Even if you use a lowly for loop, it's much easier to loop over the elements of a list than it is to construct variable names with paste and access the objects with get. Easier to debug, too.
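
As a rough illustration of that point, here is a loop over list elements next to the loop-free equivalent. The data and the summary (a column sum) are arbitrary.

```r
df_list <- list(d1 = data.frame(y1 = c(1, 2, 3)),
                d2 = data.frame(y1 = c(3, 2, 1)))

# Looping over a list: no name construction, no get()
col_sums <- numeric(length(df_list))
for (i in seq_along(df_list)) {
  col_sums[i] <- sum(df_list[[i]]$y1)
}

# The same result without an explicit loop
col_sums_apply <- vapply(df_list, function(df) sum(df$y1), numeric(1))
```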

Think of scalability. If you really only need three variables, it's fine to use d1, d2, d3. But then if it turns out you really need 6, that's a lot more typing. And next time, when you need 10 or 20, you find yourself copying and pasting lines of code, maybe using find/replace to change d14 to d15, and you're thinking this isn't how programming should be. If you use a list, the difference between 3 cases, 30 cases, and 300 cases is at most one line of code, and no change at all if your number of cases is automatically detected by, e.g., how many .csv files are in your directory.

You can name the elements of a list, in case you want to use something other than numeric indices to access your data frames (and you can use both, this isn't an XOR choice).

Overall, using lists will lead you to write cleaner, easier-to-read code, which will result in fewer bugs and less confusion.

Tragedienne answered 23/6, 2014 at 23:34 Comment(4)
Which book do you recommend that covers working with lists?Weathersby
I recommend reading questions and answers on Stack Overflow that are tagged with both r and list.Tragedienne
I am performing dfs <- list.files(pattern = "^[0-9]") |> lapply(read.csv) but the elements of dfs are not data frames but lists, i.e. dfs[i] |> class() gives > [1] "list". What is going on and how do I get to dfs[i] |> class() giving > [1] "data.frame"? Note that dfs[i] <- dfs[i] |> as.data.frame() is not helping.Omnipotence
@Omnipotence If dfs is a list of data frames, then dfs[i] will be a list of one data frame. You need to use [[ to extract an individual list element: dfs[[i]] will be a data frame. See this FAQ for more context and explanation.Tragedienne
B
27

You can also access specific columns and values in each list element with [ and [[. Here are a couple of examples. First, we can access only the first column of each data frame in the list with lapply(ldf, "[", 1), where 1 signifies the column number.

ldf <- list(d1 = d1, d2 = d2)  ## create a named list of your data frames
lapply(ldf, "[", 1)
# $d1
#   y1
# 1  1
# 2  2
# 3  3
#
# $d2
#   y1
# 1  3
# 2  2
# 3  1

Similarly, we can access the first value in the second column with

lapply(ldf, "[", 1, 2)
# $d1
# [1] 4
# 
# $d2
# [1] 6

Then we can also access the column values directly, as a vector, with [[

lapply(ldf, "[[", 1)
# $d1
# [1] 1 2 3
#
# $d2
# [1] 3 2 1
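
If passing "[" as a function looks cryptic, the same calls can be written with an anonymous function. This sketch rebuilds ldf from the question's data so it runs on its own.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
ldf <- list(d1 = d1, d2 = d2)

# Passing "[" as the function, with row 1 and column 2 as extra arguments ...
first_vals <- lapply(ldf, "[", 1, 2)
# ... is equivalent to writing the indexing out in an anonymous function
first_vals_explicit <- lapply(ldf, function(df) df[1, 2])

# Columns can also be selected by name rather than position
y2_cols <- lapply(ldf, "[[", "y2")
```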
Bourbon answered 23/6, 2014 at 22:3 Comment(1)
It's stuff like this that makes me not want to use R ever again. It just seems super kludgy.Fanchet
S
14

If you have a large number of sequentially named data frames you can create a list of the desired subset of data frames like this:

d1 <- data.frame(y1=c(1,2,3), y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1), y2=c(6,5,4))
d3 <- data.frame(y1=c(6,5,4), y2=c(3,2,1))
d4 <- data.frame(y1=c(9,9,9), y2=c(8,8,8))

my.list <- list(d1, d2, d3, d4)
my.list

my.list2 <- lapply(paste('d', seq(2,4,1), sep=''), get)
my.list2

where my.list2 is a list containing the 2nd, 3rd and 4th data frames:

[[1]]
  y1 y2
1  3  6
2  2  5
3  1  4

[[2]]
  y1 y2
1  6  3
2  5  2
3  4  1

[[3]]
  y1 y2
1  9  8
2  9  8
3  9  8

Note, however, that the data frames in the above list are no longer named. If you want to create a list containing a subset of data frames and want to preserve their names you can try this:

list.function <-  function() { 

     d1 <- data.frame(y1=c(1,2,3), y2=c(4,5,6))
     d2 <- data.frame(y1=c(3,2,1), y2=c(6,5,4))
     d3 <- data.frame(y1=c(6,5,4), y2=c(3,2,1))
     d4 <- data.frame(y1=c(9,9,9), y2=c(8,8,8))

     sapply(paste('d', seq(2,4,1), sep=''), get, environment(), simplify = FALSE) 
} 

my.list3 <- list.function()
my.list3

which returns:

> my.list3
$d2
  y1 y2
1  3  6
2  2  5
3  1  4

$d3
  y1 y2
1  6  3
2  5  2
3  4  1

$d4
  y1 y2
1  9  8
2  9  8
3  9  8

> str(my.list3)
List of 3
 $ d2:'data.frame':     3 obs. of  2 variables:
  ..$ y1: num [1:3] 3 2 1
  ..$ y2: num [1:3] 6 5 4
 $ d3:'data.frame':     3 obs. of  2 variables:
  ..$ y1: num [1:3] 6 5 4
  ..$ y2: num [1:3] 3 2 1
 $ d4:'data.frame':     3 obs. of  2 variables:
  ..$ y1: num [1:3] 9 9 9
  ..$ y2: num [1:3] 8 8 8

> my.list3[[1]]
  y1 y2
1  3  6
2  2  5
3  1  4

> my.list3$d4
  y1 y2
1  9  8
2  9  8
3  9  8
Somatist answered 6/7, 2013 at 3:43 Comment(1)
Instead of lapply(foo, get), just use mget(foo)Tragedienne
B
12

Taking as a given you have a "large" number of data.frames with similar names (here d# where # is some positive integer), the following is a slight improvement of @mark-miller's method. It is more terse and returns a named list of data.frames, where each name in the list is the name of the corresponding original data.frame.

The key is using mget together with ls. If the data frames d1 and d2 provided in the question were the only objects with names d# in the environment, then

my.list <- mget(ls(pattern="^d[0-9]+$"))

which would return

my.list
$d1
  y1 y2
1  1  4
2  2  5
3  3  6

$d2
  y1 y2
1  3  6
2  2  5
3  1  4

This method takes advantage of the pattern argument in ls, which allows us to use regular expressions to do a finer parsing of the names of objects in the environment. An alternative to the regex "^d[0-9]+$" is "^d\\d+$".
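
To show why anchoring the pattern matters, here is a sketch that runs in its own environment so it does not depend on what is in your workspace. The extra object dates is a made-up decoy that an unanchored pattern such as "d[0-9]" would not match either, but a looser pattern like "^d" would.

```r
# Use a separate environment so the sketch does not depend on your workspace
env <- new.env()
assign("d1", data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6)), envir = env)
assign("d2", data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4)), envir = env)
assign("dates", as.Date("2016-06-04"), envir = env)  # decoy, should not be matched

# Anchoring with ^ and $ matches "d" followed only by digits
matched_names <- ls(envir = env, pattern = "^d[0-9]+$")
my.list <- mget(matched_names, envir = env)
names(my.list)
```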

As @gregor points out, it is better overall to set up your data construction process so that the data.frames are put into named lists at the start.

data

d1 <- data.frame(y1 = c(1,2,3),y2 = c(4,5,6))
d2 <- data.frame(y1 = c(3,2,1),y2 = c(6,5,4))
Benzol answered 4/6, 2016 at 23:24 Comment(0)
B
8

I consider myself a complete newbie, but I think I have an extremely simple answer to one of the original subquestions that has not been stated here: accessing the data frames, or parts of them.

Let's start by creating the list with data frames as was stated above:

d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))

d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))

my.list <- list(d1, d2)

Then, if you want to access a specific value in one of the data frames, you can do so by using double brackets followed by single brackets. The double brackets get you into the data frame, and the single brackets get you to the specific coordinates (row, then column):

my.list[[1]][3, 2]

[1] 6
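
The same cell can be reached in a few equivalent ways. A short sketch using the question's data:

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

# Row 3, column 2 of the first data frame
by_position <- my.list[[1]][3, 2]

# The same cell, selecting the column by name
by_name   <- my.list[[1]][["y2"]][3]
by_dollar <- my.list[[1]]$y2[3]
```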
Bucella answered 7/1, 2020 at 12:4 Comment(0)
A
6

for loop simulations

If I have a for loop generating dataframes I start with an empty list() and append the dataframes as they're generated.

# Empty list
dat_list <- list()

for(i in 1:5){
    # Generate dataframe
    dat <- data.frame(x=rnorm(10), y=rnorm(10))
    # Add to list
    dat_list <- append(dat_list, list(dat))
}

Note that it's list(dat) inside our append() call.
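
When the number of iterations is known in advance, an alternative to append() is to preallocate the list and assign by index. This is a sketch of that variant, not a claim that the append() version is wrong; n is an assumed iteration count.

```r
n <- 5  # assumed number of iterations

# Preallocate a list of n NULL placeholders
dat_list <- vector("list", n)

for (i in seq_len(n)) {
  # Generate a dataframe and store it in slot i
  dat_list[[i]] <- data.frame(x = rnorm(10), y = rnorm(10))
}
```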

Accessing the data

Then to get the nth dataframe from the list we use dat_list[[n]]. You can access the data within this dataframe in the normal way, e.g. dat_list[[2]]$x.

Or, if you want the same part from all your dataframes, use sapply(dat_list, "[", "x").

See @Gregor Thomas's answer for doing this without for loops.

Animatism answered 29/7, 2021 at 9:42 Comment(0)
O
4

This may be a little late but going back to your example I thought I would extend the answer just a tad.

 D1 <- data.frame(Y1=c(1,2,3), Y2=c(4,5,6))
 D2 <- data.frame(Y1=c(3,2,1), Y2=c(6,5,4))
 D3 <- data.frame(Y1=c(6,5,4), Y2=c(3,2,1))
 D4 <- data.frame(Y1=c(9,9,9), Y2=c(8,8,8))

Then you make your list easily:

mylist <- list(D1,D2,D3,D4)

Now you have a list but instead of accessing the list the old way such as

mylist[[1]] # to access 'D1'

you can use this function to obtain & assign the dataframe of your choice.

GETDF_FROMLIST <- function(DF_LIST, ITEM_LOC){
   DF_SELECTED <- DF_LIST[[ITEM_LOC]]
   return(DF_SELECTED)
}

Now get the one you want.

D1 <- GETDF_FROMLIST(mylist, 1)
D2 <- GETDF_FROMLIST(mylist, 2)
D3 <- GETDF_FROMLIST(mylist, 3)
D4 <- GETDF_FROMLIST(mylist, 4)

Hope that extra bit helps.

Cheers!

Overman answered 22/6, 2014 at 15:12 Comment(3)
Yes I know but for some reason when I copied and pasted, everything went to caps. :( In any event the code in lower case works.Overman
I'm curious why you would prefer GETDF_FROMLIST(mylist, 1) to mylist[[1]]? If you prefer function syntax you could even do "[["(mylist, 1) without defining a custom function.Tragedienne
You could also simplify your function definition, the entire body of the function could just be return(DF_LIST[[ITEM_LOC]]), no need to assign an intermediate variable.Tragedienne
A
3

Very simple! Here is my suggestion:

If you want to select the data frames in your workspace, try this:

Filter(function(x) is.data.frame(get(x)) , ls())

or

ls()[sapply(ls(), function(x) is.data.frame(get(x)))]

Both give the same result: a character vector with the names of the data frames in your workspace.

You can change is.data.frame to check other types of variables like is.function
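
To go from names to an actual list of data frames, one possibility is to combine mget() with Filter(). The sketch below uses a separate environment with made-up objects so that it runs the same way regardless of your workspace; in practice you would drop the envir arguments to work on the global environment.

```r
# Made-up objects in a separate environment, for illustration only
env <- new.env()
assign("d1", data.frame(y1 = 1:3), envir = env)
assign("v", 1:10, envir = env)               # not a data frame
assign("d2", data.frame(y1 = 4:6), envir = env)

# mget() fetches every object; Filter() keeps only the data frames
all_objects <- mget(ls(envir = env), envir = env)
df_list <- Filter(is.data.frame, all_objects)
names(df_list)
```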

Abloom answered 22/5, 2018 at 11:16 Comment(0)
O
2

In the tidyverse, you can use the function lst() to automatically name the list elements after the objects.

library(tibble)

d1 <- data.frame(y1 = c(1, 2, 3),
                 y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1),
                 y2 = c(6, 5, 4))

lst(d1, d2)
# $d1
#   y1 y2
# 1  1  4
# 2  2  5
# 3  3  6
# 
# $d2
#   y1 y2
# 1  3  6
# 2  2  5
# 3  1  4

This can be helpful when compiling lists that you later want to reference by name.

Ostrich answered 10/3, 2022 at 14:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.