How to infer schema of serialized JSON column in Spark SQL?
Asked Answered
C

3

6

I have a table where there is 1 column which is serialized JSON. I want to apply schema inference on this JSON column. I don't know schema to pass as input for JSON extraction (e.g: from_json function).

I can do this in Scala like

val contextSchema = spark.read.json(data.select("context").as[String]).schema
val updatedData = data.withColumn("context", from_json(col("context"), contextSchema))

How can I transform this solution to a pure Spark-SQL?

Cortex answered 29/8, 2021 at 16:13 Comment(0)
C
2

For spark-sql use toDDL to generate schema then use the schema in from_json.

Example:

df.show(10,false)
//+---+-------------------+
//|seq|json               |
//+---+-------------------+
//|1  |{"id":1,"name":"a"}|
//+---+-------------------+

val sch=spark.read.json(df.select("json").as[String]).schema.toDDL
//sch: String = `id` BIGINT,`name` STRING

df.createOrReplaceTempView("tmp")

spark.sql(s"""select seq,jsn.* from (select *,from_json(json,"$sch") as jsn  from tmp)""").
show(10,false)
//+---+---+----+
//|seq|id |name|
//+---+---+----+
//|1  |1  |a   |
//+---+---+----+
Cloistral answered 29/8, 2021 at 17:22 Comment(1)
I am looking for pure Spark-SQL solution. Environment where I want to apply this only supports Spark-SQL statements.Cortex
A
1

You can use schema_of_json() function to infer JSON schema.

select from_json(<column_name>, schema_of_json(<sample_JSON>)) from <table> 
Anglofrench answered 30/8, 2021 at 8:15 Comment(3)
I don't have sample JSON. I am hoping it would be possible to get it table itself.Cortex
Is JSON structure/schema is fixed? In that case you can form schema and pass that to from_jsonAnglofrench
No. JSON structure/schema changes with input.Cortex
V
0

I found a workaround by

  1. converting it to RDD and read with spark dataframe
spark
  .read
  .option("inferSchema", True)
  .json(
      df.rdd.map(
          lambda rec: rec.context
      )
  )
  1. if field/path is known beforehand, we can use
df.select(json_tuple(col("context"),"<json path or attribute>").alias("field_name")).show()
Viens answered 18/10, 2023 at 14:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.