Is it possible to use data.table in SparkR with Spark DataFrames?
Situation

I used to work in RStudio with data.table instead of plyr or sqldf because it's really fast. Now I'm working with SparkR on an Azure cluster, and I'd like to know whether I can use data.table on my Spark DataFrames, and whether it's faster than SQL.

Externalization asked 9/11, 2017 at 12:35 Comment(3)
There is a sparklyr package by RStudio which allows you to use a Spark DataFrame with dplyr (see the sketch after these comments).Rydder
Yes, @DavidArenburg, but can one use the data.table package and its idioms to analyze Spark DataFrames, or must one use dplyr?Misshape
@Misshape data.table's author works at h2o.ai. It is a distributed system (based on Spark IIRC) that understands R syntax and has a lot of data.table features built in (thanks to Matt), such as distributed binary search (see this). Other than that, I'm not sure how you would work with data.table on a Spark data.frame unless you collect it to one node.Rydder
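For the sparklyr route mentioned in the first comment, here is a minimal sketch. It assumes a local Spark installation; the mtcars data, the table name, and the grouping column are purely illustrative:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # copy_to() ships a local data.frame to Spark and returns a tbl
    # reference; dplyr verbs on it are translated into Spark SQL.
    cars <- copy_to(sc, mtcars, "mtcars_spark")

    cars %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE))

    spark_disconnect(sc)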
It is not possible. SparkDataFrames are Java objects with a thin R interface. While it is possible to run worker-side R in some limited cases (dapply, gapply), there is no use for data.table there.
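To make that concrete, here is a minimal sketch of the two workarounds implied above: collecting to the driver, and using data.table inside a dapply UDF. It assumes SparkR >= 2.0 and data.table installed on the driver (and on every worker for the dapply case); the built-in faithful dataset, the schema, and the computed columns are purely illustrative:

    library(SparkR)
    library(data.table)

    sparkR.session()

    # A SparkDataFrame is only a handle to a JVM object, so data.table
    # cannot operate on it directly.
    sdf <- createDataFrame(faithful)

    # Workaround 1: collect() the data to the driver as a plain R
    # data.frame, then convert. Only viable if it fits in driver memory.
    dt <- as.data.table(collect(sdf))
    dt[, .(mean_wait = mean(waiting)), by = .(long = eruptions > 3)]

    # Workaround 2: dapply() hands each partition to the function as an
    # ordinary R data.frame, so data.table syntax works inside the UDF.
    out <- dapply(sdf,
                  function(p) {
                    p <- data.table::as.data.table(p)
                    p[, ratio := waiting / eruptions]
                    as.data.frame(p)
                  },
                  schema = structType(structField("eruptions", "double"),
                                      structField("waiting", "double"),
                                      structField("ratio", "double")))
    head(collect(out))

Note that even in the dapply case it is Spark, not data.table, that handles the distribution; data.table only ever sees one partition at a time as a local data.frame.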

Excrement answered 9/11, 2017 at 12:53 Comment(1)
Thank you, but is it faster to keep local data frames and work with data.table, or to use SparkDataFrames and work with sparklyr or Spark SQL?Externalization
