How to specify the path where saveAsTable saves files to?

I am trying to save a DataFrame to S3 from PySpark on Spark 1.4, using DataFrameWriter:

# Load JSON from S3, then write it back out as partitioned Parquet
df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)  # equivalent to df.write
df_writer.partitionBy('col1')\
         .saveAsTable('test_table', format='parquet', mode='overwrite')

The Parquet files went to "/tmp/hive/warehouse/....", which is a local temp directory on my driver.

I did set hive.metastore.warehouse.dir in hive-site.xml to an "s3a://...." location, but Spark doesn't seem to respect my Hive warehouse setting.
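
For reference, the same setting can also be applied programmatically instead of via hive-site.xml. A minimal sketch, assuming a HiveContext and an existing SparkContext sc; the conf generally needs to be in place before the metastore is first touched, and the bucket/prefix here is a placeholder:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
# "s3a://bucket/warehouse" is a hypothetical location; use your own
sqlContext.setConf("hive.metastore.warehouse.dir", "s3a://bucket/warehouse")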

Breezeway answered 16/6, 2015 at 18:4 Comment(1)
It saves the file path with the column name, like s3a://bucket/foo/col1=1/, s3a://bucket/foo/col1=2/, s3a://bucket/foo/col1=3/, ..... Is there any way to avoid appending the column name, i.e. get s3a://bucket/foo/1/, s3a://bucket/foo/2/? – Inalienable
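
One workaround (a sketch, not from this thread, assuming you write each slice yourself instead of using partitionBy; note this loses automatic partition discovery):

# One Parquet directory per distinct col1 value, without the
# "col1=" prefix that partitionBy adds to each directory name
for row in df.select('col1').distinct().collect():
    v = row['col1']
    df.filter(df['col1'] == v)\
      .write.mode('overwrite')\
      .parquet('s3a://bucket/foo/%s' % v)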

Use the path keyword argument:

# path writes the data to the given location and registers
# test_table in the metastore as a table over that location
df_writer.partitionBy('col1')\
         .saveAsTable('test_table', format='parquet', mode='overwrite',
                      path='s3a://bucket/foo')
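
Note that with an explicit path the table is created as an external table: dropping it later removes the metastore entry but leaves the files in S3.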
Breezeway answered 3/8, 2015 at 3:5 Comment(1)
As of May 2024, the saveAsTable function takes exactly one parameter: github.com/apache/spark/blob/master/sql/core/src/main/scala/org/… – Misread

You can use insertInto(tableName) to overwrite an existing table; it has been available since Spark 1.4.
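
A minimal sketch, assuming test_table already exists with a matching schema (in the 1.4-era API, overwrite is a flag on insertInto itself):

# Rows go to wherever the existing table's location points;
# overwrite=True replaces the table's current contents
df.write.insertInto('test_table', overwrite=True)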

Allista answered 13/4, 2016 at 2:46 Comment(0)

path is an option passed to the writer when using either save or saveAsTable. The following should work:

df_writer.partitionBy('col1')\
         .option("path", "s3a://bucket/foo")\
         .mode('overwrite')\
         .format('parquet')\
         .saveAsTable('test_table')
Guillaume answered 15/7 at 20:34 Comment(0)
