I have the following JSON strings in one of my DataFrame's columns:
row1:[{"key":"foo"},{"key":"bar"},{"key":"baz"}]
row2:[{"key":"foo"},{"key":"bar"}]
row3:null
etc
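(For reference, a DataFrame like mine can be reproduced with something like the following; the id and json column names are placeholders I picked for this question:)
import spark.implicits._  // assumes a SparkSession named spark

// minimal reproduction of the sample rows above
val df = Seq(
  (1, """[{"key":"foo"},{"key":"bar"},{"key":"baz"}]"""),
  (2, """[{"key":"foo"},{"key":"bar"}]"""),
  (3, null)
).toDF("id", "json")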
I found that Spark has a get_json_object function, so if I want to use a JsonPath expression to extract data I would use:
get_json_object($"json", s"$[0].key")
which would return:
"foo"
"foo"
null
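In a full select that looks like this (using the placeholder df from above):
df.select($"id", get_json_object($"json", "$[0].key").alias("key")).show()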
But I need the equivalent of Spark's explode function.
I found that I can use the "*" wildcard in my path:
get_json_object($"json", s"$[*].key")
which does not behave as expected; it creates a single string per row, like:
[foo,bar,baz]
[foo,bar]
I found a workaround in another Stack Overflow thread:
val jsonElements = (0 until 3).map(i => get_json_object($"json", s"$$[$i].key"))
df.select($"id", explode(array(jsonElements: _*)).alias("foo"))
This partially solves my issue, because this solution presumes that I know the maximum length my array can have. Spark's from_json function needs a schema, and I have a huge, complex JSON type that would take an "infinite" amount of time to write the schema for.
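(For this small sample the schema is easy enough to sketch, which is what makes the point: my real JSON is far more complex than this:)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// schema for just this sample column, not my real data
val schema = ArrayType(StructType(Seq(StructField("key", StringType))))

df.select($"id", explode(from_json($"json", schema)).alias("elem"))
  .select($"id", $"elem.key".alias("key"))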
Disclaimer
I will not use any regular expression/substring/etc. to parse JSON; the entire purpose of using a parser is to avoid exactly that.
get_json_object returns a string, not a parsed object, so that's not an option. Take a look at from_json. – Francisfrancisca