Create new Dataframe with empty/null field values

I am creating a new DataFrame from an existing DataFrame, but need to add a new column ("field1" in the code below) to the new DataFrame. How do I do this? A working code sample would be appreciated.

val edwDf = omniDataFrame 
  .withColumn("field1", callUDF((value: String) => None)) 
  .withColumn("field2",
    callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df")))) 

edwDf
  .select("field1", "field2")
  .save("odsoutdatafldr", "com.databricks.spark.csv"); 
Grueling answered 18/8, 2015 at 8:36 Comment(0)

It is possible to use lit(null):

import org.apache.spark.sql.functions.{lit, udf}

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))

One problem here is that the column type is null (NullType):

scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

and it is not retained by the CSV writer. If a concrete type is a hard requirement, you can cast the column to a specific type (let's say String), either with a DataType

import org.apache.spark.sql.types.StringType

df.withColumn("foobar", lit(null).cast(StringType))

or with a string description of the type

df.withColumn("foobar", lit(null).cast("string"))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)
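
The two cast variants above end up with the same string-typed column; a quick check (just a sketch, run against the same df):

scala> df.withColumn("foobar", lit(null).cast("string")).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)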

A Python equivalent can be found here: Add an empty column to spark DataFrame
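
Applied back to the code from the question, a sketch could look like this (assuming Spark 1.5+, the spark-csv package, and a UDF already registered under the name devicetypeUDF; the column and path names are taken from the question):

import org.apache.spark.sql.functions.{callUDF, lit}

// "field1" is a typed null column; "field2" calls the UDF registered as "devicetypeUDF".
val edwDf = omniDataFrame
  .withColumn("field1", lit(null).cast("string"))
  .withColumn("field2", callUDF("devicetypeUDF", omniDataFrame.col("some_field_in_old_df")))

edwDf
  .select("field1", "field2")
  .write
  .format("com.databricks.spark.csv")
  .save("odsoutdatafldr")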

Disjunction answered 18/8, 2015 at 11:39 Comment(3)
@zero323, thanks for sharing this, very helpful. See my edits for supporting other types. – Wilfredowilfrid
@DmitriySelivanov Thank you for your helpful edit. I gave up on the idea of using Option after some failed experiments with literals a while ago :) – Disjunction
Is it possible to output a null struct, so it can be saved to Parquet with a null value but a struct type? val getNull = udf(() => None: Option[StructType]) didn't work for me in 2.4. – Bantustan

Just to extend the perfect answer provided by @zero323, here's a solution which can be used starting from Spark 2.2.0.

import org.apache.spark.sql.functions.typedLit

df.withColumn("foobar", typedLit[Option[String]](None)).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

It achieves the same result as the third (UDF-based) solution above, but without defining any UDF.
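
Since typedLit resolves the column type from its Scala type parameter, the same pattern covers non-string columns as well; a minimal sketch (assuming the df from the accepted answer):

import org.apache.spark.sql.functions.typedLit

// Null columns with concrete integer and array-of-string types, no UDF involved.
df.withColumn("int_col", typedLit[Option[Int]](None))
  .withColumn("arr_col", typedLit[Option[Seq[String]]](None))
  .printSchema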

Edwinaedwine answered 17/4, 2020 at 13:13 Comment(0)
