What is the difference between dplyr::copy_to and sparklyr::sdf_copy_to?

I am using the sparklyr library to interact with Spark. There are two functions for putting a data frame into a Spark context: dplyr::copy_to and sparklyr::sdf_copy_to. What is the difference, and when is it recommended to use one instead of the other?

Acetophenetidin asked 15/5, 2019 at 11:57. Comments (8):
Jessalyn: The sparklyr one is implemented for Spark data frames (following the RDD concept in a distributed environment), whereas dplyr works for R data frames, tibbles, etc. Is this what you are asking? I am not really sure.
Acetophenetidin: This answers the first part of my question. The second part is: do they perform the same? If yes, in what situation is it better to use one instead of the other?
Jessalyn: You can't use them interchangeably. You cannot use dplyr::copy_to inside a Spark environment, UNLESS you collect your data frames from RDDs to R data frames. Vice versa for sparklyr.
Acetophenetidin: So if I have two data frames and I want to copy them to the Spark environment, there is absolutely no difference between them? I expected something like: the sparklyr version is more efficient, or something along those lines.
Jessalyn: If your data frame is small enough to be handled locally (i.e. not distributed), then dplyr will be more efficient. The thing about Spark is that it is more efficient IF your data set is big enough to be analysed in a distributed environment. So if you try any type of analysis on a small data set, it will be more efficient to do it locally with dplyr or any other R tooling, as usual.
Acetophenetidin: So for big data frames, the sparklyr version is better? I actually ran into many problems trying to upload a data frame with 2 million observations and just 3 columns to Spark with the dplyr version. My solution was to split the data frame into 4 pieces, upload them separately, and later bind them into one data frame in Spark. Do you think I could avoid this problem using the sparklyr version?
Jessalyn: Of course. Just load the entire thing into Spark and do the aggregations there (see the sketch after this thread). For me, I do all my aggregations in Spark (though I use pyspark instead of R), and then I collect locally and continue in R (or Python).
Acetophenetidin: Let us continue this discussion in chat.
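
For the upload problem above: sdf_copy_to exposes a repartition argument that tells Spark how many partitions to spread the rows over, so the data frame does not have to be split by hand. A minimal sketch, assuming a local Spark installation; the connection sc, the data frame big_df, and the table name are illustrative:

library(sparklyr)

sc <- spark_connect(master = "local")

# Copy the whole data frame in one call; repartition = 4L asks Spark
# to spread the rows over 4 partitions instead of manual chunking
big_tbl <- sdf_copy_to(sc, big_df, name = "big_df",
                       repartition = 4L, overwrite = TRUE)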

They're the same. I would use copy_to rather than the more specialized sdf_copy_to because it is more consistent with other data sources, but that's a stylistic choice.

The function copy_to is a generic from dplyr and works with any data source which implements a dplyr backend.
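
For instance, the same copy_to call works against a Spark connection because sparklyr registers the backend. A minimal sketch, assuming a local Spark installation; the table name is illustrative:

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# The dplyr generic dispatches to sparklyr's method for this connection
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)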

You can use it with a Spark connection because sparklyr implements copy_to.src_spark and copy_to.spark_connection. They are not exposed to the user, since you're supposed to call copy_to and let it dispatch to the correct method.
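
You can list the S3 methods registered for the generic once sparklyr is loaded (the exact output depends on which packages are attached); non-exported methods are marked with an asterisk:

library(dplyr)
library(sparklyr)

# Show every method registered for the copy_to generic
methods("copy_to")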

copy_to.src_spark just calls copy_to.spark_connection:

#> sparklyr:::copy_to.src_spark
function (dest, df, name, overwrite, ...) 
{
    copy_to(spark_connection(dest), df, name, ...)
}
<bytecode: 0x5646b227a9d0>
<environment: namespace:sparklyr>

copy_to.spark_connection just calls sdf_copy_to:

#> sparklyr:::copy_to.spark_connection
function (dest, df, name = spark_table_name(substitute(df)), 
    overwrite = FALSE, memory = TRUE, repartition = 0L, ...) 
{
    sdf_copy_to(dest, df, name, memory, repartition, overwrite, 
        ...)
}
<bytecode: 0x5646b21ef120>
<environment: namespace:sparklyr>

sdf_copy_to follows the package-wide convention of prefixing functions related to Spark DataFrames with "sdf_". copy_to, on the other hand, is from dplyr, and sparklyr provides compatible methods for the convenience of dplyr users.
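
In practice, the two calls below produce equivalent Spark DataFrames. A minimal sketch, assuming an open connection sc as above; the table names are illustrative:

# Same effect, different entry points
tbl_a <- dplyr::copy_to(sc, iris, name = "iris_a", overwrite = TRUE)
tbl_b <- sparklyr::sdf_copy_to(sc, iris, name = "iris_b", overwrite = TRUE)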

Hughie answered 19/10, 2020 at 15:24
