I'm very new to sparklyr and Spark, so please let me know if this is not the "Spark" way to do this.
My problem
I have 50+ .txt files, around 300 MB each, all in the same folder (call it x), that I need to import into sparklyr, preferably as one table.
I can read them individually like
spark_read_csv(path = x, sc = sc, name = "mydata", delimiter = "|", header = FALSE)
If I were importing them outside of sparklyr, I would probably create a vector of the file names, call it filelist, and then read them all into a list with lapply:
filelist <- list.files(pattern = "\\.txt$")
datalist <- lapply(filelist, function(x) read.table(file = x, sep = "|", header = FALSE))
This gives me a list where element k is the k-th .txt file in filelist. So my question: is there an equivalent way to do this in sparklyr?
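For reference, outside Spark I would then collapse that list into one data frame with something like
# plain R, not Spark: stack the list of data frames into a single table ("alldata" is just an illustrative name)
alldata <- do.call(rbind, datalist)
and what I'm after is the sparklyr equivalent of that end result.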
What I've tried
I've tried using lapply() with spark_read_csv, just as I did outside sparklyr above, only swapping read.table for spark_read_csv and adjusting the arguments:
datalist <- lapply(filelist, function(x) spark_read_csv(path = x, sc = sc, name = "name", delimiter = "|", header = FALSE))
This gives me a list with as many elements as there are .txt files, but every element is identical to the last .txt file in the file list:
> identical(datalist[[1]],datalist[[2]])
[1] TRUE
I obviously want each element to be one of the datasets. My idea is that after this, I can just rbind
them together.
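If I understand the docs correctly, once each element really is its own dataset, something like sparklyr's sdf_bind_rows() should stack the Spark tables into one (untested on my data):
# assumption: datalist holds one valid tbl_spark per file; sdf_bind_rows() stacks them row-wise
alldata <- do.call(sdf_bind_rows, datalist)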
Edit:
Found a way. The problem was that the argument "name" in spark_read_csv needs to be updated each time a new file is read, otherwise the previous table gets overwritten. So I used a for loop instead of lapply, changing the name in each iteration. Are there better ways?
datalist <- list()
for (i in seq_along(filelist)) {
  name <- paste("dataset", i, sep = "_")
  datalist[[i]] <- spark_read_csv(path = filelist[i], sc = sc,
                                  name = name, delimiter = "|", header = FALSE)
}
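One idea I haven't tested yet: Spark's CSV reader is supposed to accept a directory or a wildcard path, so a single call pointed at the folder might load everything as one table and skip the loop entirely, e.g.
# untested: "x" is the folder from above; the glob should pick up every .txt file in it
alldata <- spark_read_csv(sc = sc, name = "alldata", path = "x/*.txt",
                          delimiter = "|", header = FALSE)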