zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
import spark.implicits._  // needed for .as[String] and the $"..." column syntax
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python; the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string, and b would be a long.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> from pyspark.sql.functions import schema_of_json, from_json
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
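If you want to see how many rows were silently nulled out like this, one option is to keep the original string column next to the parsed struct and count the mismatches. A minimal sketch (check_df and the parsed column name are just for illustration):
from pyspark.sql.functions import col
# Keep both the raw string and the parsed struct, then count rows where the
# string was present but parsing against the narrow schema returned null.
check_df = df.withColumn("parsed", from_json("jsonData", json_schema))
check_df.filter(col("jsonData").isNotNull() & col("parsed").isNull()).count()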
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
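Because jsonData is now a struct with the full schema, you can query the nested fields directly; for example:
parsed_json_df.select("id", "jsonData.a", "jsonData.b").show()
You can also flatten the whole struct with parsed_json_df.select("id", "jsonData.*").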
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
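For example, to infer the schema from roughly 10% of the JSON strings (a sketch; the 0.1 value is just illustrative):
# Sample a fraction of the records when inferring the schema to cut the scan cost.
json_schema = spark.read.json(
    df.select("jsonData").rdd.map(lambda x: x[0]),
    samplingRatio=0.1,
).schema
If the shape of your data is stable over time, you can also infer the schema once, persist json_schema.json() somewhere, and rebuild it later with StructType.fromJson() so subsequent jobs skip the scan entirely.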
Here are the docs for spark.read.json(): Scala API / Python API