How to explode get_json_object in Apache Spark

I have the following string in one of my dataframe's columns:

row1:[{"key":"foo"},{"key":"bar"},{"key":"baz"}]
row2:[{"key":"foo"},{"key":"bar"}]
row3:null
etc

I found that Spark has a "get_json_object" function. So if I want to use a JSON path to extract data I would use:

 get_json_object($"json", s"$[0].key")

which returns:

"foo"
"foo"
null
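
For context, a minimal sketch of the setup I'm describing (the "id" and "json" column names match the code further down; the SparkSession named spark is just the usual one):

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named "spark"

val df = Seq(
  (1, """[{"key":"foo"},{"key":"bar"},{"key":"baz"}]"""),
  (2, """[{"key":"foo"},{"key":"bar"}]"""),
  (3, null)
).toDF("id", "json")

// shows foo, foo, null for the three rows, as above
df.select(get_json_object($"json", "$[0].key")).show()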

but I need the equivalent of Spark's "explode" function.

I found that I can use the "*" symbol in my JSON path:

 get_json_object($"json", s"$[*].key")

which doesn't do what I expect; it creates a string like:

[foo,bar,baz]
[foo,bar]

I found a solution in another Stack Overflow thread:

val jsonElements = (0 until 3).map(i => get_json_object($"json", s"$$[$i].key"))
df.select($"id", explode(array(jsonElements: _*)).alias("foo"))

This only partially solves my issue, because it presumes I already know the maximum number of elements my array can have. Spark's "from_json" function needs a schema, and my real JSON is so large and complex that writing the schema by hand would take an "infinite" amount of time.
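
For the simplified example above, I know the from_json route would look roughly like the sketch below; the problem is that the schema (here just an array of {"key": string} structs) has to be written by hand, which is exactly what I cannot do for my real JSON:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// hand-written schema for the toy example only
val keySchema = ArrayType(StructType(Seq(StructField("key", StringType))))

df.select($"id", explode(from_json($"json", keySchema)).alias("elem"))
  .select($"id", $"elem.key")
// note: explode drops rows whose array is null; explode_outer would keep them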

Disclaimer

I will not use any regular expressions/substrings/etc. to parse the JSON. The entire purpose of using a parser is to avoid that.

Complot answered 30/10, 2018 at 21:15 Comment(1)
get_json_object returns a string, not a parsed object, so that's not an option. Take a look at from_json. – Francisfrancisca

Just sticking to the Scala basics can solve this simply. Try case classes with Options to model the problem.

You can use any standard JSON parser. I use Lift (net.liftweb.json).

import net.liftweb.json.{DefaultFormats, parseOpt}

case class jsonElement(key: String, value: Option[String])
// assuming "key" always exists and "value" may or may not exist,
// so "value" is an Option / ignore the fields if you don't really care at all

val jsonKeys = inputRdd.map { eachRow =>
  implicit val formats = DefaultFormats // hate this but deal with Scala

  val parsedObject = parseOpt(eachRow).flatMap(_.extractOpt[List[jsonElement]])

  parsedObject match {
    case Some(parsedItem) => parsedItem.map(json => json.key)
    case None => List.empty[String]
  }
}

This gives an RDD of List[String] (the keys). Use filter(list => !list.isEmpty) if you want to remove the empty lists. You can take it from there.
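
For completeness, a quick sketch of how this is meant to be driven; inputRdd above is assumed to be an RDD[String] holding the raw JSON rows (e.g. pulled out of your dataframe's json column), and spark is the usual SparkSession:

// example input; in practice this comes from your dataframe
val inputRdd = spark.sparkContext.parallelize(Seq(
  """[{"key":"foo"},{"key":"bar"},{"key":"baz"}]""",
  """[{"key":"foo"},{"key":"bar"}]"""
))

// jsonKeys is an RDD[List[String]]; flatMap flattens it, which mimics explode
val flattened = jsonKeys.flatMap(identity)
flattened.collect() // Array(foo, bar, baz, foo, bar)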

Filose answered 30/10, 2018 at 22:28 Comment(1)
Thank you for your answer, but I'm looking for a Spark interface to do it. – Complot
