What is the difference between dplyr::copy_to and sparklyr::sdf_copy_to?

I am using the sparklyr library to interact with Spark. There are two functions for putting a data frame into a Spark context: dplyr::copy_to and sparklyr::sdf_copy_to. What is the difference, and when is it recommended to use one instead of the other?

Acetophenetidin asked 15/5, 2019 at 11:57. Comments (8):
Jessalyn: The sparklyr one is implemented for Spark data frames (following the RDD concept in a distributed environment), whereas dplyr works for R data frames, tibbles, etc. Is this what you are asking? I am not really sure.
Acetophenetidin: This answers the first part of my question. The second part is: do they perform the same? If yes, in what situation is it better to use one instead of the other?
Jessalyn: You can't use them interchangeably. You cannot use dplyr::copy_to inside a Spark environment, UNLESS you collect your data frames from RDDs to R data frames. Vice versa for sparklyr.
Acetophenetidin: So if I have two data frames and I want to copy them to the Spark environment, there is absolutely no difference between them? I expected something like: the sparklyr version is more efficient, or something along those lines.
Jessalyn: If your data frame is small enough to be handled locally (i.e. not distributed), then dplyr will be more efficient. The thing about Spark is that it is more efficient IF your data set is big enough to be analysed in a distributed environment. So if you try any type of analysis on a small data set, it will be more efficient to do it locally with dplyr or any other R tooling, as usual.
Acetophenetidin: So for big data frames, the sparklyr version is better? I actually ran into many problems trying to upload a data frame with 2 million observations and just 3 columns to Spark with the dplyr version. My solution was to split the data frame into 4 pieces, upload them separately, and later bind them into one data frame in Spark. Do you think I could avoid this problem using the sparklyr version?
Jessalyn: Of course. Just load the entire thing into Spark and do the aggregations there (see the sketch after this thread). For me, I do all my aggregations in Spark (though I use pyspark instead of R), and then I collect locally and continue in R (or Python).
Acetophenetidin: Let us continue this discussion in chat.
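
For the upload problem above: sdf_copy_to exposes a repartition argument that tells Spark how many partitions to spread the rows over, so the data frame does not have to be split by hand. A minimal sketch, assuming a local Spark installation; the connection sc, the data frame big_df, and the table name are illustrative:

library(sparklyr)

sc <- spark_connect(master = "local")

# Copy the whole data frame in one call; repartition = 4L asks Spark
# to spread the rows over 4 partitions instead of manual chunking
big_tbl <- sdf_copy_to(sc, big_df, name = "big_df",
                       repartition = 4L, overwrite = TRUE)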

They're the same. I would use copy_to rather than the more specialized sdf_copy_to because it is more consistent with other data sources, but that's a stylistic choice.

The function copy_to is a generic from dplyr and works with any data source which implements a dplyr backend.
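
For instance, the same copy_to call works against a Spark connection because sparklyr registers the backend. A minimal sketch, assuming a local Spark installation; the table name is illustrative:

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# The dplyr generic dispatches to sparklyr's method for this connection
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)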

You can use it with a Spark connection because sparklyr implements copy_to.src_spark and copy_to.spark_connection. They are not exposed to the user, since you're supposed to call copy_to and let it dispatch to the correct method.
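
You can list the S3 methods registered for the generic once sparklyr is loaded (the exact output depends on which packages are attached); non-exported methods are marked with an asterisk:

library(dplyr)
library(sparklyr)

# Show every method registered for the copy_to generic
methods("copy_to")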

copy_to.src_spark just calls copy_to.spark_connection:

#> sparklyr:::copy_to.src_spark
function (dest, df, name, overwrite, ...) 
{
    copy_to(spark_connection(dest), df, name, ...)
}
<bytecode: 0x5646b227a9d0>
<environment: namespace:sparklyr>

copy_to.spark_connection just calls sdf_copy_to:

#> sparklyr:::copy_to.spark_connection
function (dest, df, name = spark_table_name(substitute(df)), 
    overwrite = FALSE, memory = TRUE, repartition = 0L, ...) 
{
    sdf_copy_to(dest, df, name, memory, repartition, overwrite, 
        ...)
}
<bytecode: 0x5646b21ef120>
<environment: namespace:sparklyr>

sdf_copy_to follows the package-wide convention of prefixing functions related to Spark DataFrames with "sdf_". copy_to, on the other hand, is from dplyr, and sparklyr provides compatible methods for the convenience of dplyr users.
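
In practice, the two calls below produce equivalent Spark DataFrames. A minimal sketch, assuming an open connection sc as above; the table names are illustrative:

# Same effect, different entry points
tbl_a <- dplyr::copy_to(sc, iris, name = "iris_a", overwrite = TRUE)
tbl_b <- sparklyr::sdf_copy_to(sc, iris, name = "iris_b", overwrite = TRUE)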

Hughie answered 19/10, 2020 at 15:24
