I am working with data from very long, nested JSON files. The problem is that the structure of these files is not always the same: some of them are missing columns that others have. I want to create a custom schema from an empty JSON file that contains all columns. If I later read JSON files with this pre-defined schema, the non-existing columns will be filled with null values (that's the plan, at least). What I did so far:
- loading a test JSON (one that does not contain all of the expected columns) into a DataFrame
- writing its schema into a JSON file (a sketch of these two steps follows the list)
- opening this JSON file in a text editor and adding the missing columns manually
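For reference, the first two steps look roughly like this (the paths are placeholders, and an existing SparkSession named spark is assumed):

import json

# read a sample file and let Spark infer its (incomplete) schema
jsonDF = spark.read.json('filepath/test.json')
# dump the inferred schema as JSON so it can be edited by hand
with open('filepath/spark-schema.json', 'w') as f:
    f.write(jsonDF.schema.json())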
The next thing I want to do is create a new schema by reading the JSON file back into my code, but I struggle with the syntax. Can I read the schema directly from the file itself? I have tried
schemaFromJson = StructType.fromJson(json.loads('filepath/spark-schema.json'))
but it gives me TypeError: __init__() missing 2 required positional arguments: 'doc' and 'pos'
Any idea what's wrong with my current code? Thanks a lot.
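My guess is that json.loads() expects the JSON text itself rather than a file path, so I would need to read the file first. Something like this untested sketch, with the path again being a placeholder:

import json
from pyspark.sql.types import StructType

# read the hand-edited schema file and parse it into a dict,
# then build a StructType from that dict
with open('filepath/spark-schema.json') as f:
    schemaFromJson = StructType.fromJson(json.load(f))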
edit: I came across this link: sparkbyexamples.com/pyspark/pyspark-structtype-and-structfield . Chapter 7 pretty much describes the problem I am having. I just don't understand how to get the JSON file I manually enhanced into schemaFromJson = StructType.fromJson(json.loads(schema.json)).
When I do:
jsonDF = spark.read.json(filesToLoad)
schema = jsonDF.schema.json()
schemaNew = StructType.fromJson(json.loads(schema))
jsonDF2 = spark.read.schema(schemaNew).json(filesToLoad)
The code runs through, but it's obviously not useful because jsonDF and jsonDF2 end up with the same content/schema. What I want to achieve is adding some columns to 'schema' which will then be reflected in 'schemaNew'.
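As an alternative to editing the schema file by hand, I wonder whether I could add the missing fields in code before reading. Something like this untested sketch, where the field name and type are just examples:

import json
from pyspark.sql.types import StructType, StructField, StringType

schemaNew = StructType.fromJson(json.loads(schema))
# add a column the source files may be missing (name/type are placeholders)
schemaNew = schemaNew.add(StructField('missingColumn', StringType(), True))
jsonDF2 = spark.read.schema(schemaNew).json(filesToLoad)
# the added column should now appear in jsonDF2, filled with nulls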