I am using the sparklyr library to interact with Spark. There are two functions for putting a data frame into a Spark context: 'dplyr::copy_to' and 'sparklyr::sdf_copy_to'. What is the difference, and when is it recommended to use one instead of the other?
They're the same. I would use copy_to rather than the specialist sdf_copy_to because it is more consistent with other data sources, but that's stylistic.
The function copy_to is a generic from dplyr and works with any data source that implements a dplyr backend.
You can use it with a Spark connection because sparklyr implements copy_to.src_spark and copy_to.spark_connection. They are not exposed to the user since you're supposed to use copy_to and let it dispatch to the correct method.
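The dispatch chain described here can be illustrated without Spark. The following is a minimal base R sketch, not sparklyr's actual code: the generic copy_to_demo, the classes toy_src and toy_connection, and the helper toy_sdf_copy_to are all hypothetical names invented for the illustration.

```r
# Hypothetical toy generic standing in for dplyr::copy_to().
copy_to_demo <- function(dest, df, ...) UseMethod("copy_to_demo")

# Method for a toy "source" wrapper: unwrap the connection and re-dispatch.
copy_to_demo.toy_src <- function(dest, df, ...) {
  copy_to_demo(dest$con, df, ...)
}

# Method for a toy "connection": delegate to the specialist function.
copy_to_demo.toy_connection <- function(dest, df, ...) {
  toy_sdf_copy_to(dest, df, ...)
}

# Hypothetical stand-in for the specialist function that does the real work.
toy_sdf_copy_to <- function(con, df, ...) {
  paste("copied", nrow(df), "rows via", class(con)[1])
}

con <- structure(list(), class = "toy_connection")
src <- structure(list(con = con), class = "toy_src")

# Both entry points end up in the same specialist function.
via_connection <- copy_to_demo(con, mtcars)
via_source <- copy_to_demo(src, mtcars)
identical(via_connection, via_source)
```

The caller only ever uses the generic; which method runs depends on the class of the first argument.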
copy_to.src_spark just calls copy_to.spark_connection:
#> sparklyr:::copy_to.src_spark
function (dest, df, name, overwrite, ...)
{
    copy_to(spark_connection(dest), df, name, ...)
}
<bytecode: 0x5646b227a9d0>
<environment: namespace:sparklyr>
copy_to.spark_connection just calls sdf_copy_to:
#> sparklyr:::copy_to.spark_connection
function (dest, df, name = spark_table_name(substitute(df)),
    overwrite = FALSE, memory = TRUE, repartition = 0L, ...)
{
    sdf_copy_to(dest, df, name, memory, repartition, overwrite,
        ...)
}
<bytecode: 0x5646b21ef120>
<environment: namespace:sparklyr>
sdf_copy_to follows the package-wide convention of prefixing functions related to Spark DataFrames with "sdf_". copy_to, on the other hand, comes from dplyr, and sparklyr provides compatible methods for the convenience of dplyr users.
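For a concrete comparison, here is a short sketch that copies the same data frame with both functions. It assumes a local Spark installation reachable with master = "local"; the table names cars_a and cars_b are arbitrary choices for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# dplyr generic: dispatches to sparklyr's method for spark connections.
cars_a <- copy_to(sc, mtcars, name = "cars_a", overwrite = TRUE)

# sparklyr's own function, called directly.
cars_b <- sdf_copy_to(sc, mtcars, name = "cars_b", overwrite = TRUE)

# Both results are Spark DataFrames holding the same rows.
n_a <- sdf_nrow(cars_a)
n_b <- sdf_nrow(cars_b)

spark_disconnect(sc)
```

Either call should leave you with an equivalent remote table, so the choice comes down to which naming convention you prefer in your code.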
sparklyr one is implemented for Spark data frames (following the RDD concept in a distributed environment), whereas dplyr works for R data frames, tibbles, etc. Is this what you are asking? I am not really sure. – Jessalyn
dplyr::copy_to inside Spark environment, UNLESS you collect your data frames from RDDs to R data frames. Vice versa for sparklyr. – Jessalyn
dplyr will be more efficient. The thing about Spark is that it is more efficient IF your data set is big enough to be analysed in a distributed environment. So if you try any type of analysis on a small data set, it will be more efficient to do it locally using dplyr or any other R as per usual. – Jessalyn
pyspark instead of R), and then I collect locally and continue in R (or Python). – Jessalyn