This is effectively the same as my previous question, but using Avro rather than JSON as the data format.
I'm working with a Spark DataFrame whose data could come from one of a few different schema versions:
// Version One
{"namespace": "com.example.avro",
 "type": "record",
 "name": "MeObject",
 "fields": [
     {"name": "A", "type": ["null", "int"], "default": null}
 ]
}

// Version Two
{"namespace": "com.example.avro",
 "type": "record",
 "name": "MeObject",
 "fields": [
     {"name": "A", "type": ["null", "int"], "default": null},
     {"name": "B", "type": ["null", "int"], "default": null}
 ]
}
I'm using Spark Avro to load the data:

DataFrame df = context.read()
        .format("com.databricks.spark.avro")
        .load("path/to/avro/file");
which may be a Version One file or a Version Two file. However, I'd like to be able to process it in an identical manner either way, with the unknown values set to null. The recommendation in my previous question was to set the schema explicitly, but I do not want to repeat myself by writing the schema both in an .avsc file and as Spark's StructType and friends. How can I convert the Avro schema (either the text file or the generated MeObject.getClassSchema()) into Spark's StructType?
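
For illustration, this is roughly the shape I'm after; toStructType is a hypothetical placeholder for the conversion I'm missing:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.StructType;

// Parse the schema from its .avsc text file (or use the generated
// MeObject.getClassSchema() instead); parse(File) throws IOException.
Schema avroSchema = new Schema.Parser().parse(new File("path/to/MeObject.avsc"));

// The missing piece: convert the Avro schema into Spark's StructType.
StructType sparkSchema = toStructType(avroSchema); // hypothetical helper

// With the schema fixed up front, a Version One file and a Version Two file
// read identically, and the absent column comes back as null.
DataFrame df = context.read()
        .format("com.databricks.spark.avro")
        .schema(sparkSchema)
        .load("path/to/avro/file");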
Spark Avro has a SchemaConverters class, but it is all private and returns some strange internal object.
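
From what I can tell from the spark-avro source, the conversion I want already exists in there. Roughly, if toSqlType were accessible (in the version I'm using it appears to be package-private, and it returns spark-avro's internal SchemaType wrapper rather than a StructType directly), I would expect to be able to write something like:

import com.databricks.spark.avro.SchemaConverters;
import org.apache.spark.sql.types.StructType;

// Not currently possible: toSqlType is private to the avro package.
// SchemaType pairs a Spark DataType with a nullable flag; for a record
// schema the DataType should be a StructType.
SchemaConverters.SchemaType sqlType = SchemaConverters.toSqlType(avroSchema);
StructType sparkSchema = (StructType) sqlType.dataType();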