Spark DataFrame handing empty String in OneHotEncoder
Asked Answered
S

3

8

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When applied the OneHotEncoder, the application crashes with error requirement failed: Cannot have an empty string for name.. Is there a way I can get around this?

I could reproduce the error in the example provided on Spark ml page:

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),         //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

encoded.show()

It is annoying since missing/empty values is a highly generic case.

Thanks in advance, Nikhil

Skeptic answered 12/10, 2015 at 20:36 Comment(0)
M
9

Since the OneHotEncoder/OneHotEncoderEstimator does not accept empty string for name, or you'll get the following error :

java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33) at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32) [...]

This is how I will do it : (There is other way to do it, rf. @Anthony 's answer)

I'll create an UDF to process the empty category :

import org.apache.spark.sql.functions._

def processMissingCategory = udf[String, String] { s => if (s == "") "NA"  else s }

Then, I'll apply the UDF on the column :

val df = sqlContext.createDataFrame(Seq(
   (0, "a"),
   (1, "b"),
   (2, "c"),
   (3, ""),         //<- original example has "a" here
   (4, "a"),
   (5, "c")
)).toDF("id", "category")
  .withColumn("category",processMissingCategory('category))

df.show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

Now, you can go back to your transformations

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// |  0|       a|          0.0|
// |  1|       b|          2.0|
// |  2|       c|          1.0|
// |  3|      NA|          3.0|
// |  4|       a|          0.0|
// |  5|       c|          1.0|
// +---+--------+-------------+

// Spark <2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark +2.3
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("category2Vec"))
val encoded = encoder.transform(indexed)

encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+

EDIT:

@Anthony 's solution in Scala :

df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// |  0|       a|
// |  1|       b|
// |  2|       c|
// |  3|      NA|
// |  4|       a|
// |  5|       c|
// +---+--------+

I hope this helps!

Marquita answered 13/1, 2016 at 10:53 Comment(0)
A
6

Yep, it's a little thorny but maybe you can just replace the empty string with something sure to be different than other values. NOTE that I am using pyspark DataFrameNaFunctions API but Scala's should be similar.

df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  3|   EMPTY|
|  4|       a|
|  5|       c|
+---+--------+
Au answered 19/10, 2015 at 18:32 Comment(3)
Is .na really needed here? Wouldn't just df.replace('', 'EMPTY', 'category') work?Janinajanine
Thanks David for your question. DataFrameNaFunctions.replace and DataFrame.replace are aliases of the sameAu
Here, we know the exact column name. What if we dont know the column name? That is if I'm trying to load a list of tab files into table, how to replace the empty value to something?Claver
V
0

if the column contains null the OneHotEncoder fails with a NullPointerException. therefore i extended the udf to tanslate null values as well

object OneHotEncoderExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // $example on$
    val df1 = sqlContext.createDataFrame(Seq(
      (0.0, "a"),
      (1.0, "b"),
      (2.0, "c"),
      (3.0, ""),
      (4.0, null),
      (5.0, "c")
    )).toDF("id", "category")


    import org.apache.spark.sql.functions.udf
    def emptyValueSubstitution = udf[String, String] {
      case "" => "NA"
      case null => "null"
      case value => value
    }
    val df = df1.withColumn("category", emptyValueSubstitution( df1("category")) )


    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)
    indexed.show()

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
    val encoded = encoder.transform(indexed)
    encoded.show()
    // $example off$
    sc.stop()
  }
}
Vergara answered 21/4, 2016 at 13:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.