Trying to merge multiple csv files in R
W

10

18

I'm attempting to merge multiple CSV files using R. All of the CSV files have the same fields and sit in a shared folder that contains only these CSV files. I've attempted to do it using the following code:

multmerge=function(mypath) {
    filenames=list.files(path=mypath, full.names=TRUE)
    datalist= lapply(filenames, function (x) read.csv(file=x, header=True))
    Reduce(function(x,y) merge(x,y), datalist)}

I am entering my path as something like "Y:/R Practice/specdata". I do get an output when I apply the function to my 300 or so CSV files, but the result gives me my column names, and beneath them it has <0 rows> (or 0-length row.names). Please let me know if you have any suggestions on why this isn't working and how I can fix it.

Wicked answered 14/5, 2015 at 16:10 Comment(2)
Are there any warnings when you output the files? Are you using the proper encoding for them?Hatley
No there are no warnings. Everything processes fine. It's just that the output is nothing but my column namesWicked
D
45

For a shorter, faster solution:

library(dplyr)
library(readr)
df <- list.files(path = "yourpath", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
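
If read_csv throws parsing warnings, one hedged variant (a sketch, relying on readr's cols() and col_character() helpers) is to read every column as character and convert the types afterwards:

df <- list.files(path = "yourpath", full.names = TRUE) %>%
  lapply(read_csv, col_types = cols(.default = col_character())) %>%  # extra arguments are passed through to read_csv
  bind_rows()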
Dasya answered 14/5, 2015 at 16:50 Comment(6)
Thanks for the reply. I attempted your solution. Maybe I'm not using it correctly, but after copying and pasting your code into my console I come up with "There were 50 or more warnings (use warnings() to see the first 50)". All of the problems are with parsing if that means anything to you. Thanks.Wicked
Note that df is a default function in R. Maybe rename to dVerticillaster
The parsing problem can be solved by changing the function parameters. I could not find how to do it inside the dplyr code, so I created a different function that I used inside the dplyr block: readmeta <- function(path){ read_tsv(path, col_types = cols(.default = col_character())) }Quita
Succinct version of the other stuff out there.Viol
May be obvious, but if you are wondering where the final merged CSV is, it is actually inside the dataframe df. You need to write it out to a fresh CSV with write.csv(df, "newmergedfile.csv")Theoretical
If you want to merge multiple CSV column-wise, instead of row-wise, you can adapt it this way: list.files() %>% lapply(read_csv) %>% data.frameJennifer
F
3

Your code worked for me, but you need to change header = True to header = TRUE (R is case-sensitive, and the logical constant is TRUE, not True).

Fortnightly answered 14/5, 2015 at 16:24 Comment(2)
So do you have any idea what the output that I got means? Also, thanks for replying.Wicked
I created a folder full of .csv files with the same column names as you described, then ran your code just with the change noted above, and it worked to combine all the .csv files. So I couldn't recreate your error. @Maiasaura's code is a much nicer solution though.Fortnightly
A
3

Another option that has proved to work for my setup:

library(data.table)

multmerge = function(path){
  filenames = list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, fread))
}


path <- "Dropbox/rstudio-share/dataset/MB"
DF <- multmerge(path)

If you need more granular control over your CSV files during the loading process, you can replace fread with a function like so:

multmerge = function(path){
  filenames = list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, function(x){ read.csv(x, stringsAsFactors = FALSE, sep = ';') }))
}
Age answered 23/10, 2017 at 8:35 Comment(0)
O
2

If all your csv files have exactly the same fields (column names) and you want simply to combine them vertically, you should use rbind instead of merge:

> a
             A         B
[1,]  2.471202 38.949232
[2,] 16.935362  6.343694
> b
            A          B
[1,] 0.704630  0.1132538
[2,] 4.477572 11.8869057
> rbind(a, b)
             A          B
[1,]  2.471202 38.9492316
[2,] 16.935362  6.3436939
[3,]  0.704630  0.1132538
[4,]  4.477572 11.8869057
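
The same idea works for data frames read from disk; a minimal sketch, with hypothetical file names:

a <- read.csv("file1.csv")  # hypothetical file names
b <- read.csv("file2.csv")
combined <- rbind(a, b)     # stacks rows; column names must match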
Oas answered 28/12, 2016 at 4:43 Comment(0)
P
2

I tried working with the same function but included all = TRUE in the merge function, and it worked just fine.

The code I used is as follows:

multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  datalist = lapply(filenames, function(x){ read.csv(file = x, header = TRUE) })
  Reduce(function(x, y) { merge(x, y, all = TRUE) }, datalist)
}

full_data = multmerge("path_name for your csv folder")
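
For context, a minimal sketch (with made-up data frames) of why the default merge() can collapse to <0 rows>: merge() joins on every shared column, so only rows that match exactly across files survive, while all = TRUE performs a full outer join and keeps everything.

x <- data.frame(id = 1:2, value = c("a", "b"))
y <- data.frame(id = 3:4, value = c("c", "d"))
merge(x, y)              # 0 rows: no identical (id, value) pair in both
merge(x, y, all = TRUE)  # 4 rows: the full outer join keeps every row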

Hope this helps. Cheers!

Precious answered 24/3, 2019 at 9:56 Comment(0)
R
2

For anyone who has many CSVs with the same header (I had ~1000), here is a quicker approach, which avoids parsing the CSVs individually. It assumes every file shares the exact same header line.

filenames <- list.files(path=mypath, full.names=TRUE)

#read the files in as plain text
csv_list <- lapply(filenames, readLines)

#remove the header line from all but the first file
csv_list[-1] <- lapply(csv_list[-1], function(x) x[-1])

#unlist to create a single character vector
csv_list <- unlist(csv_list)

#write the combined lines as one single csv
writeLines(text=csv_list,
           con="all_my_csvs_combined.csv")

#read the csv back in as one single file
all_my_csvs_combined <- read.csv("all_my_csvs_combined.csv")

Ramer answered 22/4, 2021 at 15:44 Comment(0)
G
1

Let me give you the best solution I have found so far:

library(pacman)
p_load(doParallel, data.table, magrittr, stringr)  # magrittr supplies %>%

# get the csv file names in the working directory
dir() %>% str_subset("\\.csv$") -> fn

# set up a parallel backend with one worker per core
cl <- detectCores() %>% makeCluster()
registerDoParallel(cl)

# read and bind
system.time({
  big_df = foreach(i = fn,
                   .packages = "data.table") %dopar% {
                     fread(i, colClasses = "character")
                   } %>%
    rbindlist(fill = TRUE)
})

# end of parallel work
stopCluster(cl)

This should be faster as long as you have more cores in your computer. If you are dealing with big data, it is the preferred approach.

Gest answered 8/1, 2020 at 5:22 Comment(4)
I get Error in stopImplicitCluster(cl) : unused argument (cl)Ditch
Also, why so many packages in the p_load?Ditch
Thank you for the comments, I've edited to revise (lots of packages in p_load because this is extracted from my own working code and I need them for other tasks). About the error, do you have reproducible code so I can reproduce your results?Gest
I used parallel::stopCluster(cl) instead which removed the error. I was combining the csvs which starts with CCG in the following zipfile https://files.digital.nhs.uk/32/3986A6/Daily.zip, downloaded onto desktop. Then was running your code against that unzipped (i as the file list)Ditch
F
0

To combine many CSV files:

  1. Set your working directory to the folder that contains the CSV files.
  2. Run:
    library(plyr)
    veri <- ldply(list.files(), read.csv, header=TRUE, sep='\t')  # sep='\t' assumes tab-delimited files; use sep=',' for comma-separated ones
    View(veri)
    
Firewater answered 22/4, 2021 at 5:13 Comment(0)
C
0

I used the rbind() function. Just separate the data frames you want to merge by commas.

After reading each of the CSV files into its own data frame (read_csv() comes from the readr package):

df1 <- read_csv("name_of_file.csv")
df2 <- read_csv("name_of_file2.csv")
df4 <- read_csv("name_of_file4.csv")

you can then use rbind() to concatenate them into a single data frame:

all_of_the_marbles <- rbind(df1, df2, df4)
Chorography answered 1/9, 2021 at 15:38 Comment(1)
Please add further details to expand on your answer, such as working code or documentation citations.Darkle
V
0

The following code is similar to the answer by @Michael but uses only base R functions; rbindlist is replaced with do.call("rbind", ...). Note that full.names=TRUE is needed so read.csv can find the files when the folder is not the working directory.

multmerge <- function(path) {
  files <- list.files(path, pattern="\\.csv$", full.names=TRUE)
  lst <- lapply(files, function(x){ read.csv(x, header=TRUE, stringsAsFactors=FALSE) })
  do.call("rbind", lst)
}
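
A hypothetical call, reusing the folder from the question:

full_data <- multmerge("Y:/R Practice/specdata")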
Visionary answered 19/9, 2022 at 15:59 Comment(0)
