How do I replace a string value with a NULL in PySpark?
I want to do something like this:

df.replace('empty-value', None, 'NAME')

Basically, I want to replace some value with NULL, but it does not accept None as an argument. How can I do this?

Hale answered 27/4, 2016 at 18:8 Comment(0)
10

This will replace empty-value with None in your name column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')]

By using the old name as the first parameter in withColumn, it actually replaces the old name column with the new one generated by the UDF output.

Tetrapod answered 27/4, 2016 at 20:10 Comment(2)
I didn't think of trying UDFs, that seems to be the way to goHale
Your code will run faster if you use native Spark functions rather than UDFs - see the other answers. (That's why they have more upvotes than the accepted answer)Bouncy
81

You can combine a when clause with a NULL literal and type casting as follows:

from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+

It doesn't introduce a BatchPythonEvaluation step, and because of that it should be significantly more efficient than using a UDF.

Goodygoody answered 27/4, 2016 at 18:37 Comment(6)
Excellent, much more efficient for large datasets.Aliment
@Aliment It is, Python UDFs have relatively high serde overhead.Goodygoody
@Goodygoody you don't actually need the otherwise() part - when() will return null by default.Optimize
Could you please show how to do that with nested fields?Egarton
@Goodygoody in otherwise, I tried returning None instead of lit(None) and the result is same. Can you please explain why you used lit?Cressler
@Goodygoody when can directly mention .otherwise(lit(null )) right ?Middendorf
12

One-line solution in native Spark code

You can simply use a dict as the first argument of replace: it accepts None as a replacement value, which results in NULL.

Here is an example:

df = df.replace({'empty-value': None}, subset=['NAME'])

Just replace 'empty-value' with whatever value you want to overwrite with NULL. Note that your 'empty-value' needs to be hashable.

An additional advantage is that you can use this on multiple columns at the same time. Just add the column names to the list under subset.

Pacien answered 8/7, 2020 at 14:46 Comment(1)
You can also use plain lists: df = df.replace(['null','', 'NULL'], None, ['Col1', 'Col2','Col3']) I find it easier/simpler to use than a dictionary.Goren
7

The best alternative is to use when combined with a NULL. Example:

from pyspark.sql.functions import when, lit, col

df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))

If you want to replace several values with null, you can either use | inside the when condition or the powerful create_map function.

Importantly, the worst way to solve this is with a UDF: UDFs give your code great versatility, but they come with a large performance penalty.

Leftist answered 22/11, 2018 at 19:58 Comment(0)
1

This will also work:

df5 = df.replace('empty-value', None)

Corsetti answered 5/7, 2023 at 11:57 Comment(1)
