Scala DataFrame: Explode an array
I am using the Spark libraries in Scala. I have created a DataFrame using

val searchArr = Array(
  StructField("log",IntegerType,true),
  StructField("user", StructType(Array(
    StructField("date",StringType,true),
    StructField("ua",StringType,true),
    StructField("ui",LongType,true))),true),
  StructField("what",StructType(Array(
    StructField("q1",ArrayType(IntegerType, true),true),
    StructField("q2",ArrayType(IntegerType, true),true),
    StructField("sid",StringType,true),
    StructField("url",StringType,true))),true),
  StructField("where",StructType(Array(
    StructField("o1",IntegerType,true),
    StructField("o2",IntegerType,true))),true)
)

val searchSt = new StructType(searchArr)    

val searchData = sqlContext.jsonFile(searchPath, searchSt)

I now want to explode the field what.q1, which should contain an array of integers, but the documentation is limited: http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)

So far I have tried a few things without much luck:

val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())

Any ideas/examples of how to use explode on an array?

Hour answered 30/6, 2015 at 13:42

Comment(1): The documentation you're looking at is for 1.4.0. Is that the version of Spark you are using? – Briannabrianne

Did you try a UDF on the field "what"? Something like this could be useful:

import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions.{col, udf}

val explode = udf { (aStr: GenericRowWithSchema) =>
  aStr match {
    case null => ""
    case _    => aStr.getList(0).get(0).toString
  }
}

val newDF = df.withColumn("newColumn", explode(col("what")))

where:

  • getList(0) returns "q1" field
  • get(0) returns the first element of "q1"

I'm not sure, but you could also try getAs[T](fieldName: String) instead of getList(index: Int).
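Alternatively, the built-in explode function in org.apache.spark.sql.functions (available since Spark 1.4) works directly on the nested array column and avoids the UDF. A minimal sketch, assuming the searchData DataFrame from the question and reusing rb as the output column name:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Produces one output row per element of the what.q1 array,
// placing that element in a new column "rb".
val searchSplit = searchData.withColumn("rb", explode(col("what.q1")))
```

Note that explode drops rows where what.q1 is null or empty; in Spark 2.2+ explode_outer keeps them.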

Convoke answered 9/11, 2016 at 13:59

I'm not used to Scala, but in Python/PySpark an array-type column nested within a struct-type field can be exploded as follows. If it works for you, you can convert it to the corresponding Scala representation.

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType

schema = StructType([
  StructField("log", IntegerType()),
  StructField("user", StructType([
    StructField("date", StringType()),
    StructField("ua", StringType()),
    StructField("ui", LongType())])),
  StructField("what", StructType([
    StructField("q1", ArrayType(IntegerType())),
    StructField("q2", ArrayType(IntegerType())),
    StructField("sid", StringType()),
    StructField("url", StringType())])),
  StructField("where", StructType([
    StructField("o1", IntegerType()),
    StructField("o2", IntegerType())]))
])

data = [(1, ("2022-01-01", "ua", 1), ([1, 2, 3], [6], "sid", "url"), (7, 8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

Output:

+---+-------------------+--------------------------+------+
|log|user               |what                      |where |
+---+-------------------+--------------------------+------+
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+

With what.q1 exploded:

df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)

Output:

+---+-------------------+--------------------------+------+----------------+
|log|user               |what                      |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1               |
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2               |
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3               |
+---+-------------------+--------------------------+------+----------------+
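For reference, a rough Scala equivalent of the explode step above, as an untested sketch (it assumes the same schema and a DataFrame named df):

```scala
import org.apache.spark.sql.functions.{col, explode}

// One output row per element of the nested array column what.q1.
df.withColumn("q1_exploded", explode(col("what.q1")))
  .show(truncate = false)
```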
Haemophilic answered 1/9, 2022 at 6:23

© 2022 - 2024 — McMap. All rights reserved.