I'm trying to create Hive/Impala tables based on Avro files in HDFS. The tool for doing the transformations is Spark.
I can't use `spark.read.format("avro")` to load the data into a dataframe, because that way the `doc` part (the description of each column) is lost. I can see the `doc` by doing:
input = sc.textFile("/path/to/avrofile")
avro_schema = input.first() # not sure what type it is
The problem is that it's a nested schema, and I'm not sure how to traverse it to map each `doc` to the column description in the dataframe. I'd like the `doc` to end up as the column description of the table. For example, the input schema looks like:
"fields": [
  {
    "name": "productName",
    "type": [
      "null",
      "string"
    ],
    "doc": "Real name of the product",
    "default": null
  },
  {
    "name": "currentSellers",
    "type": [
      "null",
      {
        "type": "record",
        "name": "sellers",
        "fields": [
          {
            "name": "location",
            "type": [
              "null",
              {
                "type": "record",
                "name": "sellerlocation",
                "fields": [
                  {
                    "name": "locationName",
                    "type": [
                      "null",
                      "string"
                    ],
                    "doc": "Name of the location",
                    "default": null
                  },
                  {
                    "name": "locationArea",
                    "type": [
                      "null",
                      "string"
                    ],
                    "doc": "Area of the location", #The comment needs to be added to table comments
                    "default": null
                    .... #These are nested fields
In the final table, one field name would be, for example, `currentSellers_locationName`, with the column description "Name of the location". Could someone please shed some light on how to parse the schema and add the `doc` to the description? Could you also explain what the bit below, outside of the fields, is about? Many thanks. Let me know if I can explain it better.
"name": "currentSellers",
"type": [
  "null",
  {
    "type": "record",
    "name": "sellers",
    "fields": [
      {
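On that last fragment: in Avro, a `type` given as a JSON array is a union, so `["null", {"type": "record", ...}]` means the field is either null or a nested record (here one named `sellers`). Traversing the schema therefore comes down to unwrapping each union to its non-null branch and recursing whenever that branch is a record. Below is a minimal sketch of that idea, not a definitive implementation. It assumes the schema is already available as a clean JSON string, that each union has at most one non-null branch, and that the flattened column name is the top-level field name plus the leaf name (a guess made only to reproduce the `currentSellers_locationName` example). The outer record name `product` and the helper names are hypothetical.

```python
import json

# Cut-down version of the schema from the question; the enclosing record
# name "product" is a hypothetical placeholder.
schema_json = """
{"type": "record", "name": "product", "fields": [
  {"name": "productName", "type": ["null", "string"],
   "doc": "Real name of the product", "default": null},
  {"name": "currentSellers", "type": ["null",
    {"type": "record", "name": "sellers", "fields": [
      {"name": "location", "type": ["null",
        {"type": "record", "name": "sellerlocation", "fields": [
          {"name": "locationName", "type": ["null", "string"],
           "doc": "Name of the location", "default": null},
          {"name": "locationArea", "type": ["null", "string"],
           "doc": "Area of the location", "default": null}
        ]}], "default": null}
    ]}], "default": null}
]}
"""


def unwrap_nullable(avro_type):
    """Return the non-null branch of a union like ["null", X]; pass other types through."""
    if isinstance(avro_type, list):
        branches = [t for t in avro_type if t != "null"]
        # Assumption: at most one non-null branch, as in the question's schema.
        return branches[0] if len(branches) == 1 else avro_type
    return avro_type


def iter_docs(fields, path=()):
    """Yield (path, doc) for every leaf field with a "doc", recursing into nested records."""
    for field in fields:
        field_path = path + (field["name"],)
        field_type = unwrap_nullable(field["type"])
        if isinstance(field_type, dict) and field_type.get("type") == "record":
            yield from iter_docs(field_type["fields"], field_path)
        elif "doc" in field:
            yield field_path, field["doc"]


def column_name(path):
    """Assumed naming rule: top-level field name + leaf name, joined by an underscore."""
    return path[0] if len(path) == 1 else f"{path[0]}_{path[-1]}"


schema = json.loads(schema_json)
docs = {column_name(path): doc for path, doc in iter_docs(schema["fields"])}
print(docs)
```

If the real flattening rule keeps every level (for instance `currentSellers_location_locationName`), only `column_name` needs to change, e.g. to `"_".join(path)`.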
`comments` to a column. Is this what you need? I'm not sure if comments can be added directly from the schema. Also, in Spark dataframes you can add `metadata` to columns, but again I don't think the metadata will be written as comments into the Hive table when the dataframe is written to it. Please correct me if my understanding of your problem is wrong. – Deannadeanne
`doc` that comes with avro, and added it to the corresponding column in the dataframe. – Bushnell
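If the comments do not survive the dataframe write, one possible fallback is to set them on the table afterwards with Hive's `ALTER TABLE ... CHANGE COLUMN ... COMMENT` statement. The sketch below only builds the statements as strings; it assumes you already have a mapping from column name to `doc` text and know each column's Hive type. The table name `products`, the type mapping, and the helper name are all hypothetical, and the statements would still need to be run against Hive by whatever means you normally use.

```python
def comment_statements(table, column_types, docs):
    """Build one Hive ALTER TABLE statement per column that has a doc string."""
    statements = []
    for column, doc in docs.items():
        # Assumption: backslash-escaping is enough for quotes inside the comment text.
        escaped = doc.replace("\\", "\\\\").replace("'", "\\'")
        statements.append(
            f"ALTER TABLE {table} CHANGE COLUMN `{column}` `{column}` "
            f"{column_types[column]} COMMENT '{escaped}'"
        )
    return statements


# Hypothetical inputs: column docs taken from the schema, plus the Hive types.
docs = {
    "productName": "Real name of the product",
    "currentSellers_locationName": "Name of the location",
}
column_types = {
    "productName": "STRING",
    "currentSellers_locationName": "STRING",
}

statements = comment_statements("products", column_types, docs)
for statement in statements:
    print(statement)
```

Repeating the column name keeps the name unchanged, and restating the type is required by the `CHANGE COLUMN` syntax even when only the comment is being set.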