Generating an Avro schema from a JSON document

Is there any tool able to create an Avro schema from a 'typical' JSON document?

For example:

{
  "records": [{"name": "X1", "age": 2}, {"name": "X2", "age": 4}]
}

I found http://jsonschema.net/reboot/#/, which generates a JSON Schema:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "id": "http://jsonschema.net#",
  "type": "object",
  "required": false,
  "properties": {
    "records": {
      "id": "#records",
      "type": "array",
      "required": false,
      "items": {
        "id": "#1",
        "type": "object",
        "required": false,
        "properties": {
          "name": {
            "id": "#name",
            "type": "string",
            "required": false
          },
          "age": {
            "id": "#age",
            "type": "integer",
            "required": false
          }
        }
      }
    }
  }
}

but I'd like an AVRO version.
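For reference, a hand-written Avro schema matching the example document might look like the following sketch. The record names (`TopLevel`, `Person`) are arbitrary placeholders, not anything the tools above produce:

```json
{
  "type": "record",
  "name": "TopLevel",
  "fields": [
    {
      "name": "records",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Person",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"}
          ]
        }
      }
    }
  ]
}
```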

Shrier answered 3/7, 2014 at 8:51 Comment(2)
Did you get an answer for this? If not, did you manually create the Avro schema from the JSON? :| – Onepiece
Me too, any luck anybody? It seems like this is a manual task; I need to generate Avro schema files for regularly generated JSON data files in an automation script :( – Medial

You can achieve that easily using Apache Spark and Python. First download a Spark distribution from http://spark.apache.org/downloads.html, then install the avro package for Python using pip. Then run pyspark with the spark-avro package:

./bin/pyspark --packages com.databricks:spark-avro_2.11:3.1.0

and use the following code (assuming input.json contains one or more JSON documents, one per line):

import os
import avro.datafile
import avro.io

# Spark infers the schema while converting the JSON to Avro
spark.read.json('input.json').coalesce(1).write.format("com.databricks.spark.avro").save("output.avro")

# Spark writes part files inside the output.avro directory
avrofile = list(filter(lambda file: file.startswith('part-r-00000'), os.listdir('output.avro')))[0]

# Avro container files are binary, so open in 'rb' mode
with open('output.avro/' + avrofile, 'rb') as af:
    reader = avro.datafile.DataFileReader(af, avro.io.DatumReader())
    print(reader.datum_reader.writers_schema)

For example, for an input file with the content:

{"string": "somestring", "number": 3.14, "structure": {"integer": 13}}
{"string": "somestring2", "structure": {"integer": 14}}

the script will print:

{"fields": [{"type": ["double", "null"], "name": "number"}, {"type": ["string", "null"], "name": "string"}, {"type": [{"type": "record", "namespace": "", "name": "structure", "fields": [{"type": ["long", "null"], "name": "integer"}]}, "null"], "name": "structure"}], "type": "record", "name": "topLevelRecord"}
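If pulling in Spark is too heavy for an automation script (as the comments above ask about), a rough version of the same inference can be sketched in plain Python with only the standard library. This is a deliberately simplified assumption-laden sketch, not a library function: it maps JSON types to Avro types one-to-one, names nested records after their field (the names are hypothetical), assumes homogeneous arrays, and does not merge schemas across multiple documents or emit nullable unions the way Spark does:

```python
import json

def infer_avro_schema(value, name="topLevelRecord"):
    """Infer a simplified Avro schema fragment for a parsed JSON value."""
    # bool must be checked before int: isinstance(True, int) is True in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        # assume a homogeneous array and infer from the first element
        return {"type": "array",
                "items": infer_avro_schema(value[0], name) if value else "null"}
    if isinstance(value, dict):
        # nested objects become records named after their field
        return {"type": "record", "name": name,
                "fields": [{"name": k, "type": infer_avro_schema(v, k)}
                           for k, v in value.items()]}
    raise TypeError(f"unsupported JSON value: {value!r}")

doc = json.loads('{"records": [{"name": "X1", "age": 2}, {"name": "X2", "age": 4}]}')
print(json.dumps(infer_avro_schema(doc), indent=2))
```

A real tool would additionally need union types for optional fields and schema merging across documents, which is exactly what the Spark route above gives you for free.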
Georgetta answered 1/12, 2016 at 8:6 Comment(0)

Using the latest Spark 3.1.2 and Python 3.9:

./bin/pyspark --packages org.apache.spark:spark-avro_2.12:3.1.2

import os
import avro.datafile
import avro.io

spark.read.json('input.json').coalesce(1).write.format("avro").save("output.avro")
avrofile = list(filter(lambda file: file.startswith('part-00000'), os.listdir('output.avro')))[0]

# open the part file inside the output directory, in binary mode
with open('output.avro/' + avrofile, 'rb') as af:
    reader = avro.datafile.DataFileReader(af, avro.io.DatumReader())
    print(reader.datum_reader.writers_schema)
Assam answered 16/8, 2021 at 13:6 Comment(0)

Try this site to generate an Avro schema from JSON:

https://toolslick.com/generation/metadata/avro-schema-from-json

Anchie answered 10/12, 2021 at 14:31 Comment(1)
Cool tool, but it's not free. – Coniology

© 2022 - 2024 — McMap. All rights reserved.