Is it possible to use data.table in SparkR with Spark DataFrames?
Situation

I used to work in RStudio with data.table instead of plyr or sqldf because it's really fast. Now I'm working with SparkR on an Azure cluster, and I'd like to know whether I can use data.table on my Spark DataFrames, and whether it's faster than SQL.

Externalization asked 9/11, 2017 at 12:35 Comment(3)
There is a sparklyr package by RStudio which allows you to use a Spark DataFrame with dplyr (see the sketch after these comments).Rydder
Yes, @DavidArenburg, but can one use the data.table package and its idioms to analyze Spark DataFrames, or must one use dplyr?Misshape
@Misshape data.table's author works at h2o.ai. It is a distributed system (based on Spark IIRC) that understands R syntax and has a lot of data.table features built in (thanks to Matt), such as distributed binary search (see this). Other than that, I'm not sure how you would work with data.table on a Spark data.frame unless you collect it to one node.Rydder
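For the sparklyr route mentioned in the first comment, here is a minimal sketch. It assumes a local Spark installation; the mtcars data, the table name, and the grouping column are purely illustrative:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # copy_to() ships a local data.frame to Spark and returns a tbl
    # reference; dplyr verbs on it are translated into Spark SQL.
    cars <- copy_to(sc, mtcars, "mtcars_spark")

    cars %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE))

    spark_disconnect(sc)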
It is not possible. SparkDataFrames are Java objects with a thin R interface. While it is possible to run worker-side R in some limited cases (dapply, gapply), there is no use for data.table there.
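To make that concrete, here is a minimal sketch of the two workarounds implied above: collecting to the driver, and using data.table inside a dapply UDF. It assumes SparkR >= 2.0 and data.table installed on the driver (and on every worker for the dapply case); the built-in faithful dataset, the schema, and the computed columns are purely illustrative:

    library(SparkR)
    library(data.table)

    sparkR.session()

    # A SparkDataFrame is only a handle to a JVM object, so data.table
    # cannot operate on it directly.
    sdf <- createDataFrame(faithful)

    # Workaround 1: collect() the data to the driver as a plain R
    # data.frame, then convert. Only viable if it fits in driver memory.
    dt <- as.data.table(collect(sdf))
    dt[, .(mean_wait = mean(waiting)), by = .(long = eruptions > 3)]

    # Workaround 2: dapply() hands each partition to the function as an
    # ordinary R data.frame, so data.table syntax works inside the UDF.
    out <- dapply(sdf,
                  function(p) {
                    p <- data.table::as.data.table(p)
                    p[, ratio := waiting / eruptions]
                    as.data.frame(p)
                  },
                  schema = structType(structField("eruptions", "double"),
                                      structField("waiting", "double"),
                                      structField("ratio", "double")))
    head(collect(out))

Note that even in the dapply case it is Spark, not data.table, that handles the distribution; data.table only ever sees one partition at a time as a local data.frame.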

Excrement answered 9/11, 2017 at 12:53 Comment(1)
Thank you, but is it faster to keep local data frames and work with data.table, or to use SparkDataFrames and work with sparklyr or Spark SQL?Externalization
