sparklyr: can I pass format and path options into spark_write_table? Or use saveAsTable with spark_write_orc?

Spark 2.0 with Hive

Let's say I am trying to write a Spark DataFrame, irisDf, to ORC and save it to the Hive metastore.

In Spark I would do that like this:

irisDf.write.format("orc")
    .mode("overwrite")
    .option("path", "s3://my_bucket/iris/")
    .saveAsTable("my_database.iris")

In sparklyr I can use the spark_write_table function:

data("iris")
iris_spark <- copy_to(sc, iris, name = "iris")
output <- spark_write_table(
    iris_spark
  , name = 'my_database.iris'
  , mode = 'overwrite'
)

But this doesn't let me set the path or the format.

I can also use spark_write_orc:

spark_write_orc(
    iris_spark
  , path = "s3://my_bucket/iris/"
  , mode = "overwrite"
)

but it doesn't have a saveAsTable option.

Now, I CAN use invoke statements to replicate the Spark code:

sdf <- spark_dataframe(iris_spark)
writer <- invoke(sdf, "write")
writer %>%
  invoke('format', 'orc') %>%
  invoke('mode', 'overwrite') %>%
  invoke('option', 'path', "s3://my_bucket/iris/") %>%
  invoke('saveAsTable', "my_database.iris")

But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?

Phlegethon answered 16/8/2018 at 22:42

path can be set using the options argument, which is equivalent to the options call on the native DataFrameWriter:

spark_write_table(
  iris_spark, name = 'my_database.iris', mode = 'overwrite', 
  options = list(path = "s3a://my_bucket/iris/")
)

By default in Spark, this will create a table stored as Parquet at the given path (partition subdirectories can be specified with the partition_by argument, as sketched below).
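For instance, a partitioned variant of the call above might look like the following; it assumes the same iris_spark table, database, and bucket as in the question:

spark_write_table(
  iris_spark, name = 'my_database.iris', mode = 'overwrite',
  partition_by = 'Species',                      # one subdirectory per Species value
  options = list(path = "s3a://my_bucket/iris/")
)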

As of today there is no such option for format, but an easy workaround is to set the spark.sessionState.conf.defaultDataSourceName property, either at runtime

spark_session_config(
  sc, "spark.sessionState.conf.defaultDataSourceName", "orc"
)

or when you create a session.
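For illustration, a minimal sketch of the session-creation variant; the master value is a placeholder, and the property name to use may differ on older Spark versions (see the answer below):

config <- spark_config()
# Make ORC the default data source for the whole session (property name as above).
config[["spark.sessionState.conf.defaultDataSourceName"]] <- "orc"
sc <- spark_connect(master = "local", config = config)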

Sydel answered 16/8/2018 at 23:14
Comment: Thanks for the edit @MichaelChirico. Parquet has been the default source since the beginning of the data source API, so unless it is configured explicitly, you can expect it to be consistent across versions. – Sydel

The spark.sessionState.conf.defaultDataSourceName property was introduced in Spark 2.2 (see the Spark source).

In Spark 2.1.1, setting this (either in the configuration before connecting or at runtime) worked for me:

spark_session_config(
  sc, "spark.sql.sources.default", "orc"
)
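Putting the two answers together, a hedged end-to-end sketch (reusing the table, database, and bucket names from the question) could look like:

# Make ORC the default source at runtime, then write a table whose data
# lives at the external S3 path and is registered in the Hive metastore.
spark_session_config(sc, "spark.sql.sources.default", "orc")
spark_write_table(
  iris_spark, name = "my_database.iris", mode = "overwrite",
  options = list(path = "s3a://my_bucket/iris/")
)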
Triable answered 17/4/2020 at 5:03
