pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type
I am trying to add an empty column to my DataFrame df1 in PySpark.

The code I tried:

import pyspark.sql.functions as F
df1 = df1.withColumn("empty_column", F.lit(None))

But I get this error:

pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type.

Can anyone help me with this?

Bloxberg answered 18/10, 2022 at 18:36 Comment(0)
Instead of a bare F.lit(None), cast the literal to a concrete data type, e.g.:

F.lit(None).cast('string')
F.lit(None).cast('double')

When we add a literal null column without a cast, its data type is void:

from pyspark.sql import functions as F
spark.range(1).withColumn("empty_column", F.lit(None)).printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- empty_column: void (nullable = true)

But the Parquet format does not support the void data type, so such columns must be cast to another data type before saving.

Helterskelter answered 18/10, 2022 at 20:24 Comment(0)
