Importing multiple files in sparklyr

I'm very new to sparklyr and spark, so please let me know if this is not the "spark" way to do this.

My problem

I have 50+ .txt files at around 300 MB each, all in the same folder (call it x), that I need to import into sparklyr, preferably as one table.

I can read them individually like

spark_read_csv(path = x, sc = sc, name = "mydata", delimiter = "|", header = FALSE)

If I were to import them all outside of sparklyr, I would probably create a list of the file names (call it filelist) and then import them all into a list with lapply:

filelist = list.files(pattern = "\\.txt$")
datalist = lapply(filelist, function(x) read.table(file = x, sep = "|", header = FALSE))

This gives me a list where element k is the k-th .txt file in filelist. So my question is: is there an equivalent way to do this in sparklyr?

What I've tried

I've tried using lapply() and spark_read_csv, like I did above outside sparklyr, just changing read.table to spark_read_csv and its arguments:

datalist = lapply(filelist, function(x) spark_read_csv(path = x, sc = sc, name = "name", delimiter = "|", header = FALSE))

which gives me a list with the same number of elements as there are .txt files, but every element is identical to the last .txt file in the file list.

> identical(datalist[[1]],datalist[[2]])
[1] TRUE

I obviously want each element to be one of the datasets. My idea is that after this, I can just rbind them together.

Edit:

Found a way. The problem was that the "name" argument in spark_read_csv needs to be updated each time a new file is read; otherwise the previous table is overwritten. So I used a for loop instead of lapply, and in each iteration I change the name. Are there better ways?

datalist <- list()
for(i in seq_along(filelist)){
  name <- paste("dataset", i, sep = "_")
  datalist[[i]] <- spark_read_csv(path = filelist[i], sc = sc,
                                  name = name, delimiter = "|", header = FALSE)
}
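
For reference, a sketch of the same idea without the explicit loop, assuming the end goal is one combined table (sdf_bind_rows is sparklyr's helper for row-binding Spark tables; the names here are just illustrative):

datalist <- lapply(seq_along(filelist), function(i) {
  spark_read_csv(path = filelist[i], sc = sc,
                 name = paste("dataset", i, sep = "_"),  # unique name per file
                 delimiter = "|", header = FALSE)
})

# row-bind all the per-file Spark tables into a single table
mydata <- do.call(sdf_bind_rows, datalist)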
Shayna answered 31/3, 2018 at 10:23

Since you (emphasis mine)

have 50+ .txt files at around 300 MB each, all in the same folder

you can just use a wildcard in the path:

spark_read_csv(
  path = "/path/to/folder/*.txt",
  sc = sc, name = "mydata", delimiter = "|", header=FALSE) 

If the directory contains only the data, you can simplify this even further:

spark_read_csv(
  path = "/path/to/folder/",
  sc = sc, name = "mydata", delimiter = "|", header = FALSE)
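
Either way the files end up registered as a single Spark table called "mydata". As a small usage sketch (assuming a dplyr workflow), you can reference it and sanity-check the combined row count:

library(dplyr)

mydata <- tbl(sc, "mydata")   # reference the registered Spark table
mydata %>% count()            # total number of rows across all files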

Native Spark readers also support reading multiple paths at once (Scala code):

spark.read.csv("/some/path", "/other/path")

but as of 0.7.0-9014 it is not properly implemented in sparklyr (the current implementation of spark_normalize_path doesn't support vectors of size larger than one).

Lynellelynett answered 31/3, 2018 at 20:54
