Spark 2: Can't write dataframe to parquet Hive table: `HiveFileFormat` doesn't match the specified format `ParquetFileFormat`

I'm trying to save a dataframe to a Hive table.

In Spark 1.6 it works, but after migrating to 2.2.0 it doesn't work anymore.

Here's the code:

blocs
      .toDF()
      .repartition($"col1", $"col2", $"col3", $"col4")
      .write
      .format("parquet")
      .mode(saveMode)
      .partitionBy("col1", "col2", "col3", "col4")
      .saveAsTable("db.tbl")

It fails with:

org.apache.spark.sql.AnalysisException: The format of the existing table project_bsc_dhr.bloc_views is HiveFileFormat. It doesn't match the specified format ParquetFileFormat.;

Iceland answered 9/1, 2019 at 14:42 Comment(3)
Have you got any solution? I am facing the same issue. Can you please let me know what the workaround is? – Stefanstefanac
Yes, I used insertInto instead of saveAsTable and removed partitionBy. The code: blocs.toDF().repartition($"col1", $"col2", $"col3", $"col4").write.format("parquet").insertInto("db.tbl") – Iceland
I am using Spark 2.3.0. Does repartition work on the latest Spark? – Stefanstefanac

I just tried using .format("hive") with saveAsTable after getting the error, and it worked.
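
For reference, a minimal sketch of that change applied to the code from the question (blocs, saveMode and db.tbl are the asker's own placeholders):

blocs
      .toDF()
      .repartition($"col1", $"col2", $"col3", $"col4")
      .write
      .format("hive")
      .mode(saveMode)
      .partitionBy("col1", "col2", "col3", "col4")
      .saveAsTable("db.tbl")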

I also would not recommend using insertInto as suggested by the author, because it is not type-safe (as much as that term can be applied to the SQL API) and is error-prone in the way it ignores column names and uses position-based resolution.
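
To make that concrete: insertInto matches DataFrame columns to table columns by position, so the DataFrame should be selected into the table's declared column order (partition columns last) before writing. A hedged sketch, where df, colA and colB are illustrative names rather than anything from the original post:

// Align the DataFrame with the target table's column order;
// insertInto ignores column names and resolves purely by position.
val ordered = df.select("colA", "colB", "col1", "col2", "col3", "col4")
ordered.write.format("parquet").insertInto("db.tbl")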

Counterweight answered 15/3, 2019 at 10:21 Comment(2)
How do I insert only specific columns from the DataFrame into the Hive table? Say I have 50 columns in my table, but only 20 columns in my DF that I want to update/insert into the table; consider those 20 as required while the others are not mandatory. With the above, it gives a position/column mismatch kind of error. – Liberec
Your solution .format('hive') works when the table is not partitioned. If it's partitioned, after switching from .format('parquet') I get a different error: org.apache.spark.SparkException: Requested partitioning does not match the ... – Terryterrye

Setting format("hive") did not work for me. Here is how I solved the problem.

Insert the current dataframe into a new temp table. Make sure the temp table has the same schema, column types, and column order as your actual target table.

df.write.format("parquet").partitionBy('date').mode("append") \
.saveAsTable('HiveDB.temp_table', path="s3://some_path/temp_table" )

Now physically copy the files from "s3://some_path/temp_table/" to your actual target table's path (in my case "s3://some_path/actual_table/").

Now run the command below through spark.sql or from Hue/Athena:

MSCK REPAIR TABLE `actual_table`;
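
From a Spark session this can be issued directly, for example (Scala shown; actual_table is the same placeholder as above):

// Register the newly copied partition directories with the metastore
spark.sql("MSCK REPAIR TABLE actual_table")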

I was facing this issue while writing a PySpark dataframe into a Glue Catalog table that had been created earlier via the AWS Wrangler API.

Bohn answered 10/11, 2023 at 8:14 Comment(0)

I set TBLPROPERTIES on the table and saveAsTable worked:

ALTER TABLE db.tbl
SET TBLPROPERTIES ('spark.sql.partitionProvider' = 'catalog',
                   'spark.sql.sources.provider' = 'parquet')

If needed, also set SERDEPROPERTIES:

ALTER TABLE db.tbl
SET SERDEPROPERTIES ('path'='hdfs:..')
Tyronetyrosinase answered 29/1, 2024 at 9:47 Comment(0)
