How to replace null values with a specific value in Dataframe using spark in Java?
I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I'm trying to replace the null or invalid values in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value of a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?
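
Outside Spark, the two steps (find each column's most frequent value, then substitute it for nulls) can be sketched in plain Java. The class and method names here are hypothetical, and the in-memory rows stand in for the DataFrame; in Spark you would compute the mode with groupBy/count and then apply na().fill as in the answers below.

```java
import java.util.*;

public class ModeFill {
    // Most frequent non-null, non-empty value of one column (index col).
    static String mode(List<String[]> rows, int col) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) {
            String v = r[col];
            if (v != null && !v.isEmpty()) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    // Replace null/empty cells in each column with that column's mode.
    static void fillWithMode(List<String[]> rows, int numCols) {
        for (int c = 0; c < numCols; c++) {
            String m = mode(rows, c);
            for (String[] r : rows) {
                if (r[c] == null || r[c].isEmpty()) {
                    r[c] = m;
                }
            }
        }
    }

    public static void main(String[] args) {
        // The example table from the question, nulls where cells are empty.
        List<String[]> rows = new ArrayList<>(Arrays.asList(
            new String[]{"a", "a1"}, new String[]{"a", "a2"},
            new String[]{"a", "a2"}, new String[]{null, "d1"},
            new String[]{"b", "a2"}, new String[]{"c", "a2"},
            new String[]{"c", null}, new String[]{null, null},
            new String[]{"d", "c1"}));
        fillWithMode(rows, 2);
        System.out.println(rows.get(3)[0] + " " + rows.get(6)[1]); // a a2
    }
}
```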

Jaala answered 21/6, 2017 at 9:7 Comment(0)
You can use the .na.fill function (a method of org.apache.spark.sql.DataFrameNaFunctions).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You choose the columns, and you choose the value that should replace the null or NaN entries.

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))
Oliana answered 21/6, 2017 at 9:21 Comment(4)
Is it available in Java? I couldn't find a similar fill function.Jaala
Sorry, I didn't use it in Java, but you can find the latest Spark documentation, including DataFrameNaFunctions, here: spark.apache.org/docs/latest/api/java/index.html. Probably try fill without .na.Oliana
@Jaala can you please accept the answer if it solved your problem?Oliana
Have you tried using it with null? It says it cannot be applied to (Null, Int). It hasn't solved the purpose for me, so I was wondering whether there might be some solution now, after 2 years :)Annulation
You'll want to use the fill(String value, String[] columns) method of your dataframe, which replaces null values in the given list of columns with the value you specify.

So if you already know the value that you want to replace Null with...:

String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.

Luggage answered 21/6, 2017 at 9:22 Comment(3)
My dataframes are of type Dataset<Row>. It says it's not defined for type Dataset<Row>Jaala
I have updated my answer to include the .na part. You could also try: df.na().fill(ImmutableMap.of("ColumnName", "replacementValue", "egName", "egA"));Luggage
Thanks a lot for help. I was able to implement it using the scala Sequence libraries. I'll update the same in my answer.Jaala
You can use DataFrame.na.fill() to replace the nulls with some value. To update several columns at once you can do:

val map = Map("Name" -> "a", "Place" -> "a2")

df.na.fill(map).show()
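
The per-column semantics of that map can be illustrated outside Spark with a plain-Java sketch (the class and method names here are hypothetical): for each row, a column's default is applied only where the cell is null.

```java
import java.util.*;

public class MapFill {
    // Apply per-column defaults (column name -> replacement) to one row,
    // mirroring what df.na.fill(map) does for every row of a DataFrame.
    static Map<String, String> fill(Map<String, String> row,
                                    Map<String, String> defaults) {
        Map<String, String> out = new LinkedHashMap<>(row);
        defaults.forEach((col, def) -> {
            if (out.get(col) == null) {
                out.put(col, def);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("Name", null);   // missing Name
        row.put("Place", "d1");  // Place already present, left untouched
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("Name", "a");
        defaults.put("Place", "a2");
        System.out.println(fill(row, defaults)); // {Name=a, Place=d1}
    }
}
```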

But if you want to replace bad records too, then you need to validate the bad records first. You can do this by using a regular expression with the like function.
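
As a rough illustration of that validation step (the pattern here is hypothetical; the real rule depends on your data), a plain-Java regex check could flag invalid values. The same pattern string could also be applied to a Spark column with Column.rlike.

```java
import java.util.regex.Pattern;

public class BadRecordCheck {
    // Hypothetical validity rule: a place code is one lowercase letter
    // followed by one digit, e.g. "a2" or "d1".
    static final Pattern PLACE = Pattern.compile("^[a-z]\\d$");

    static boolean isValidPlace(String v) {
        return v != null && PLACE.matcher(v).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPlace("a2"));  // true
        System.out.println(isValidPlace(""));    // false
        System.out.println(isValidPlace("xyz")); // false
    }
}
```

Rows failing the check can then be treated like nulls and given the replacement value.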

Desiccated answered 21/6, 2017 at 9:21 Comment(2)
I need to do this for each column separately instead of for the whole dataframe at once. Can you please share an example of how I would replace any value? Also, I'll create a regular expression for the bad records. Please share a Java example if you have one. Thank you.Jaala
can we do this based on condition like -> fill column2 "only if col1 is not null"?Proximal
In order to replace the NULL values with a given string I used the fill function present in Spark for Java. It accepts the replacement word and a sequence of column names. Here is how I implemented that:

List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data=data.na().fill(word, colSeq);
Jaala answered 21/6, 2017 at 10:38 Comment(1)
My Dataset in Spark has null values, which I'm trying to save in Redshift; it stores them as the string "null", which is not what we want. We want null as null in Redshift too. Any idea how to implement that?Hoagy

© 2022 - 2024 — McMap. All rights reserved.