How to explode get_json_object in Apache Spark

I have the following string in one of my dataframe's columns:

row1:[{"key":"foo"},{"key":"bar"},{"key":"baz"}]
row2:[{"key":"foo"},{"key":"bar"}]
row3:null
etc

I found that Spark has a "get_json_object" function. So if I want to use a JSON path to extract data I would use:

 get_json_object($"json", s"$[0].key")

which returns:

"foo"
"foo"
null
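
For context, a minimal sketch of the setup I'm describing (the "id" and "json" column names match the code further down; the SparkSession named spark is just the usual one):

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named "spark"

val df = Seq(
  (1, """[{"key":"foo"},{"key":"bar"},{"key":"baz"}]"""),
  (2, """[{"key":"foo"},{"key":"bar"}]"""),
  (3, null)
).toDF("id", "json")

// shows foo, foo, null for the three rows, as above
df.select(get_json_object($"json", "$[0].key")).show()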

but I need the equivalent of Spark's "explode" function.

I found that I can use the "*" symbol in my JSON path:

 get_json_object($"json", s"$[*].key")

which doesn't do what I expect; it creates a string like:

[foo,bar,baz]
[foo,bar]

I found a solution in another Stack Overflow thread:

val jsonElements = (0 until 3).map(i => get_json_object($"json", s"$$[$i].key"))
df.select($"id", explode(array(jsonElements: _*)).alias("foo"))

This only partially solves my issue, because it presumes I already know the maximum number of elements my array can have. Spark's "from_json" function needs a schema, and my real JSON is so large and complex that writing the schema by hand would take an "infinite" amount of time.
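
For the simplified example above, I know the from_json route would look roughly like the sketch below; the problem is that the schema (here just an array of {"key": string} structs) has to be written by hand, which is exactly what I cannot do for my real JSON:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// hand-written schema for the toy example only
val keySchema = ArrayType(StructType(Seq(StructField("key", StringType))))

df.select($"id", explode(from_json($"json", keySchema)).alias("elem"))
  .select($"id", $"elem.key")
// note: explode drops rows whose array is null; explode_outer would keep them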

Disclaimer

I will not use any regular expressions/substrings/etc. to parse the JSON. The entire purpose of using a parser is to avoid that.

Complot answered 30/10, 2018 at 21:15 Comment(1)
get_json_object returns a string, not a parsed object, so that's not an option. Take a look at from_json. – Francisfrancisca

Just sticking to the Scala basics can solve this simply. Try case classes with Options to model the problem.

You can use any standard JSON parser. I use Lift (net.liftweb.json).

import net.liftweb.json.{DefaultFormats, parseOpt}

case class jsonElement(key: String, value: Option[String])
// assuming "key" always exists and "value" may or may not exist,
// so "value" is an Option / ignore the fields if you don't really care at all

val jsonKeys = inputRdd.map { eachRow =>
  implicit val formats = DefaultFormats // hate this but deal with Scala

  val parsedObject = parseOpt(eachRow).flatMap(_.extractOpt[List[jsonElement]])

  parsedObject match {
    case Some(parsedItem) => parsedItem.map(json => json.key)
    case None => List.empty[String]
  }
}

This gives an RDD of List[String] (the keys). Use filter(list => !list.isEmpty) if you want to remove the empty lists. You can take it from there.
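
For completeness, a quick sketch of how this is meant to be driven; inputRdd above is assumed to be an RDD[String] holding the raw JSON rows (e.g. pulled out of your dataframe's json column), and spark is the usual SparkSession:

// example input; in practice this comes from your dataframe
val inputRdd = spark.sparkContext.parallelize(Seq(
  """[{"key":"foo"},{"key":"bar"},{"key":"baz"}]""",
  """[{"key":"foo"},{"key":"bar"}]"""
))

// jsonKeys is an RDD[List[String]]; flatMap flattens it, which mimics explode
val flattened = jsonKeys.flatMap(identity)
flattened.collect() // Array(foo, bar, baz, foo, bar)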

Filose answered 30/10, 2018 at 22:28 Comment(1)
Thank you for your answer, but I'm looking for a Spark interface to do it. – Complot
