Can I change the nullability of a column in my Spark dataframe?
Asked Answered
K

5

12

I have a StructField in a dataframe that is not nullable. Simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

which returns:

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. Problem is that (for reasons I won't go into) I want it to be nullable. I found this post Change nullable property of column in spark dataframe which suggested a way of doing it so I adapted the code therein to this:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

which failed with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?

Klatt answered 6/9, 2017 at 10:6 Comment(0)
L
5

Seems you missed the StructType(newSchema).

l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()
Lorenza answered 6/9, 2017 at 10:53 Comment(0)
L
12

For the general case, one can change the nullability of a column via the nullable property of the StructField of that specific column. Here's an example:

df.schema['col_1']
# StructField(col_1,DoubleType,false)

df.schema['col_1'].nullable = True

df.schema['col_1']
# StructField(col_1,DoubleType,true)
Lehet answered 25/11, 2020 at 10:50 Comment(2)
this seems to only work if i re-create the dataframe with the schema change. is there no way to do this "in-line"?Perpetuity
This SO answer shows how to change the nullable property in-place.Vraisemblance
C
11

I know this question is already answered, but I was looking for a more generic solution when I came up with this:

def set_df_columns_nullable(spark, df, column_list, nullable=True):
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod

You can then call it like this:

set_df_columns_nullable(spark,df,['name','age'])
Cistaceous answered 13/8, 2018 at 11:35 Comment(3)
Great answer. Are there any performance implications in doing this? What exactly happens when you "create a new dataframe" based on the existing RDD?Sixteenmo
Is there any way by which this can be done in scala?Reliquiae
@Sixteenmo and others come here, seems this is super costly, i tested it with a 10X1 df only, and the one without updating the nullable took 184.47592 ms while with this update nullable in place took 1187.75746 msSkerry
L
5

Seems you missed the StructType(newSchema).

l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()
Lorenza answered 6/9, 2017 at 10:53 Comment(0)
N
-1

One can use the SPARK SQL that is ANSI compliant. In this case the ALTER statement is necessary.

Please see examples:

to unset the nullability:

ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;

to set the nullability:

ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;
Novokuznetsk answered 22/1 at 15:2 Comment(2)
What you've described here change the nullability of a table in a column, the question was asking about a column in a dataframe.Klatt
typo. I meant ... "change the nullability of a column in a table"Klatt
I
-6
df1 = df.rdd.toDF()
df1.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- foo: boolean (nullable = true)
Irs answered 6/9, 2017 at 10:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.