Convert nested JSON to CSV file in Python
Asked Answered
Z

5

10

I know this question has been asked many times. I tried several solutions but I couldn't solve my problem.

I have a large nested JSON file (1.4GB) and I would like to make it flat and then convert it to a CSV file.

The JSON structure is like this:

{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}

I know that this is easy to accomplish with pandas module but I am not familiar with it.

EDITED

The desired output should be something like this:

company_number, address_line_1, locality, country_of_residence, kind,

12345678, Address 1, Henley-On-Thamed, England, individual-person-with-significant-control

Note that this is just the short version. The output should have all the fields.

Zobe answered 16/12, 2016 at 9:20 Comment(7)
Can you show desired output?Benison
I have edited my postZobe
first you'll have to fix that error for yourself.. but I don't get the error, and json is loaded fineEnenstein
Possible duplicate of Parsing nested JSON and writing it to CSVEnenstein
did you check #20424973 and github.com/vinay20045/json-to-csv for example?Enenstein
I tried github.com/vinay20045/json-to-csv but I dont get the results that I expect. Maybe I am misusing it. I tried this way python2 json_to_csv.py 'data' 'small.json' 'test.csv' Zobe
Possible duplicate of multiple Json objects in one file extract by pythonGarmon
S
32

Please scroll down for the newer, faster solution

This is an older question, but I struggled the entire night to get a satisfactory result for a similar situation, and I came up with this:

import json
import pandas

def cross_join(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key', how='outer').drop('key', 1)

def json_to_dataframe(data_in):
    def to_frame(data, prev_key=None):
        if isinstance(data, dict):
            df = pandas.DataFrame()
            for key in data:
                df = cross_join(df, to_frame(data[key], prev_key + '.' + key))
        elif isinstance(data, list):
            df = pandas.DataFrame()
            for i in range(len(data)):
                df = pandas.concat([df, to_frame(data[i], prev_key)])
        else:
            df = pandas.DataFrame({prev_key[1:]: [data]})
        return df
    return to_frame(data_in)

if __name__ == '__main__':
    with open('somefile') as json_file:
        json_data = json.load(json_file)

    df = json_to_dataframe(json_data)
    df.to_csv('data.csv', mode='w')

Explanation:

The cross_join function is a neat way I found to do a cartesian product. (credit: here)

The json_to_dataframe function does the logic, using pandas dataframes. In my case, the json was deeply nested, and I wanted to split dictionary key:value pairs into columns, but the lists I wanted to transform into rows for a column -- hence the concat -- which I then cross join with the upper level, thus multiplying the records number so that each value from the list has its own row, while the previous columns are identical.

The recursiveness creates stacks that cross join with the one below, until the last one is returned.

Then with the dataframe in a table format, it's easy to convert to CSV with the "df.to_csv()" dataframe object method.

This should work with deeply nested JSON, being able to normalize all of it into rows by the logic described above.

I hope this will help someone, someday. Just trying to give back to this awesome community.

---------------------------------------------------------------------------------------------

LATER EDIT: NEW SOLUTION

I'm coming back to this as while the dataframe option kinda worked, it took the app minutes to parse not so large JSON data. Therefore I thought of doing what the dataframes do, but by myself:

from copy import deepcopy
import pandas


def cross_join(left, right):
    new_rows = [] if right else left
    for left_row in left:
        for right_row in right:
            temp_row = deepcopy(left_row)
            for key, value in right_row.items():
                temp_row[key] = value
            new_rows.append(deepcopy(temp_row))
    return new_rows


def flatten_list(data):
    for elem in data:
        if isinstance(elem, list):
            yield from flatten_list(elem)
        else:
            yield elem


def json_to_dataframe(data_in):
    def flatten_json(data, prev_heading=''):
        if isinstance(data, dict):
            rows = [{}]
            for key, value in data.items():
                rows = cross_join(rows, flatten_json(value, prev_heading + '.' + key))
        elif isinstance(data, list):
            rows = []
            for item in data:
                [rows.append(elem) for elem in flatten_list(flatten_json(item, prev_heading))]
        else:
            rows = [{prev_heading[1:]: data}]
        return rows

    return pandas.DataFrame(flatten_json(data_in))


if __name__ == '__main__':
    json_data = {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
            {
                "batter":
                    [
                        {"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"},
                        {"id": "1003", "type": "Blueberry"},
                        {"id": "1004", "type": "Devil's Food"}
                    ]
            },
        "topping":
            [
                {"id": "5001", "type": "None"},
                {"id": "5002", "type": "Glazed"},
                {"id": "5005", "type": "Sugar"},
                {"id": "5007", "type": "Powdered Sugar"},
                {"id": "5006", "type": "Chocolate with Sprinkles"},
                {"id": "5003", "type": "Chocolate"},
                {"id": "5004", "type": "Maple"}
            ],
        "something": []
    }
    df = json_to_dataframe(json_data)
    print(df)

OUTPUT:

      id   type  name   ppu batters.batter.id batters.batter.type topping.id              topping.type
0   0001  donut  Cake  0.55              1001             Regular       5001                      None
1   0001  donut  Cake  0.55              1001             Regular       5002                    Glazed
2   0001  donut  Cake  0.55              1001             Regular       5005                     Sugar
3   0001  donut  Cake  0.55              1001             Regular       5007            Powdered Sugar
4   0001  donut  Cake  0.55              1001             Regular       5006  Chocolate with Sprinkles
5   0001  donut  Cake  0.55              1001             Regular       5003                 Chocolate
6   0001  donut  Cake  0.55              1001             Regular       5004                     Maple
7   0001  donut  Cake  0.55              1002           Chocolate       5001                      None
8   0001  donut  Cake  0.55              1002           Chocolate       5002                    Glazed
9   0001  donut  Cake  0.55              1002           Chocolate       5005                     Sugar
10  0001  donut  Cake  0.55              1002           Chocolate       5007            Powdered Sugar
11  0001  donut  Cake  0.55              1002           Chocolate       5006  Chocolate with Sprinkles
12  0001  donut  Cake  0.55              1002           Chocolate       5003                 Chocolate
13  0001  donut  Cake  0.55              1002           Chocolate       5004                     Maple
14  0001  donut  Cake  0.55              1003           Blueberry       5001                      None
15  0001  donut  Cake  0.55              1003           Blueberry       5002                    Glazed
16  0001  donut  Cake  0.55              1003           Blueberry       5005                     Sugar
17  0001  donut  Cake  0.55              1003           Blueberry       5007            Powdered Sugar
18  0001  donut  Cake  0.55              1003           Blueberry       5006  Chocolate with Sprinkles
19  0001  donut  Cake  0.55              1003           Blueberry       5003                 Chocolate
20  0001  donut  Cake  0.55              1003           Blueberry       5004                     Maple
21  0001  donut  Cake  0.55              1004        Devil's Food       5001                      None
22  0001  donut  Cake  0.55              1004        Devil's Food       5002                    Glazed
23  0001  donut  Cake  0.55              1004        Devil's Food       5005                     Sugar
24  0001  donut  Cake  0.55              1004        Devil's Food       5007            Powdered Sugar
25  0001  donut  Cake  0.55              1004        Devil's Food       5006  Chocolate with Sprinkles
26  0001  donut  Cake  0.55              1004        Devil's Food       5003                 Chocolate
27  0001  donut  Cake  0.55              1004        Devil's Food       5004                     Maple

As per what the above does, well, the cross_join function does pretty much the same thing as in the dataframe solution, but without dataframes, thus being faster.

I added the flatten_list generator as I wanted to make sure that the JSON arrays are all nice and flattened, then provided as a single list of dictionaries comprising of the previous key from one iteration before assigned to each of the list's values. This pretty much mimics the pandas.concat behaviour in this case.

The logic in the main function, json_to_dataframe is then the same as before. All that needed to change was having the operations performed by dataframes as coded functions.

Also, in the dataframes solution I was not appending the previous heading to the nested object, but unless you are 100% sure you do not have conflicts in column names, then it is pretty much mandatory.

I hope this helps :).

EDIT: Modified the cross_join function to deal with the case when a nested list is empty, basically maintaining the previous result set unmodified. The output is unchanged even after adding the empty JSON list in the example JSON data. Thank you, @Nazmus Sakib for pointing it out.

Strachan answered 20/8, 2020 at 7:46 Comment(15)
Your code looks a bit awkward to me. I am not very familiar with python, but I'm pretty sure your indentation is incorrect. Can you please review this?Urgency
@Urgency Yeah, you're right. Thanks for mentioning it. Should be alright now.Strachan
This new solution works perfectly, converting complex json to data frame. I could then easily convert it to CSV. Thank you !Suitcase
This solution needs more appreciation!!! Works really well for me.Ferruginous
This is really awesome!!!Quod
Hi, I have slightly different JSON, have pasted below the block which is causing issue. ExpectOutput should have been with 2 total rows but due to crossjoin 4 rows are displayed. Please suggest what change can help to address such data. [Added JSON data in next comment]Stopgap
JSON Data: \n { "Load_id": "1", "Loads": [ { "time_list": [ 0.1727, 0.318 ], "data_list": [ 2.0, 1.0 ] } ] }Stopgap
@Stopgap Modying the list part in flatten_json similar to this should do it:-------------------------------------for item in data:-------------------------------------------------------------------------------------------------------if isinstance(item, dict) and all([isinstance(elem, list) for elem in item.values()]):-----------------------------------item = [dict(zip(item.keys(), zip_data)) for zip_data in zip(*item.values())]----------------------------------[rows.append(elem) for elem in flatten_list(flatten_json(item, prev_heading))]Strachan
@BogdanMircea Firstly, thanks and really appreciate that you took time and replied to my comment. I did update the code which you gave and worked fine for the sample data I had shared. Below is the update I did:Stopgap
#Added as per comment ----elif isinstance(data, dict) and all([isinstance(elem, list) for elem in data.values()]):- --------rows = [] --------item = [dict(zip(data.keys(), zip_data)) for zip_data in zip(*data.values())] --------[rows.append(elem) for elem in flatten_list(flatten_json(item, prev_heading))] ----elif isinstance(data, list): --------rows = [] --------for i in range(len(data)): ------------[rows.append(elem) for elem in flatten_list(flatten_json(data[i], prev_heading))] ----else: --------rows = [{prev_heading[1:]: data}]Stopgap
As i said had shared a sample of data, with this data it was causing same problem: {"process":[{"process_id":"101","task":[{"task_id":"1","load":[{"load_id":"1","time_list":[0.1727,0.318],"data_list":[2.0,1.0]},{"load_id":"2","time_list":[0.231,0.232],"data_list":[2.0,1.0]}]},{"task_id":"2","load":[{"load_id":"1","time_list":[15.0],"data_list":[7.0]}]}]}]}.Stopgap
But did a change in crossjoin and the code works now :) but not sure that is the right way to fix the issue. but it working fine currently for the data i have new_rows = [] if right else left<----> # Kumar: If condition added to skip cross join<----> if (len(right)== len(left)):<----> # Appended left and right as two new columns<----> for i in range(0,len(left)) :<----> temp_row = deepcopy(left[i])<----> for key, value in right[i].items():<----> temp_row[key] = value<----> new_rows.append(deepcopy(temp_row))<----> else:<---->Stopgap
@Stopgap We can probably keep going like this forever , so I'd suggest you ask a new question if you need help with heavily customizing this for your use case. That, of course, in case the solution you came up with is not satisfactory.Strachan
@BogdanMircea how to handle dict values where we have something like this - "parentField": { "childField": [ "string1", "string2", "string3" ] },Insensitive
What if we don't want to take cross join? instead just want to insert records with nested combinations?Illyricum
G
6

For the JSON data you have given, you could do this by parsing the JSON structure to just return a list of all the leaf nodes.

This assumes that your structure is consistent throughout, if each entry can have different fields, see the second approach.

For example:

import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = []
        for i in item.keys():
            leaves.extend(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = []
        for i in item:
            leaves.extend(get_leaves(i, key))
        return leaves
    else:
        return [(key, item)]


with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    write_header = True

    for entry in json.load(f_input):
        leaf_entries = sorted(get_leaves(entry))

        if write_header:
            csv_output.writerow([k for k, v in leaf_entries])
            write_header = False

        csv_output.writerow([v for k, v in leaf_entries])

If your JSON data is a list of entries in the format you have given, then you should get output as follows:

address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,year
Address 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977

If each entry can contain different (or possibly missing) fields, then a better approach would be to use a DictWriter. In this case, all of the entries would need to be processed to determine the complete list of possible fieldnames so that the correct header can be written.

import json
import csv

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = {}
        for i in item.keys():
            leaves.update(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = {}
        for i in item:
            leaves.update(get_leaves(i, key))
        return leaves
    else:
        return {key : item}


with open('json.txt') as f_input:
    json_data = json.load(f_input)

# First parse all entries to get the complete fieldname list
fieldnames = set()

for entry in json_data:
    fieldnames.update(get_leaves(entry).keys())

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
    csv_output.writeheader()
    csv_output.writerows(get_leaves(entry) for entry in json_data)
Garver answered 20/12, 2016 at 7:52 Comment(12)
I think this could cause problems if the nested key-values weren't consistent throughout the entire json file. If one of the structures is missing a field, the data in that row will be offset.Nursemaid
This code didn't worked for my json data. I can parse only this keys: "K6v8Ht6nXCjaO_ApNGr" Can you help me about that ? Please. My python version is 3.6.4Savonarola
@tpbafk, for Python 3.x you will need to make a minor change to the open() commands (I have updated the script), but without seeing your JSON, I would not be able to tell you the reason it is not parsing everything. Perhaps you should start a new question?Garver
agree with @Nursemaid code does not work if fields are missingMedulla
@SriniSydney if your entries contain different fields, then a better approach would be to use a DictWriter. The data would need to first be parsed to get the complete fieldname list. I have updated the answer to help show you how this could be done.Garver
Thanks @MartinEvans but it returns only the top level tag name. Example i tried copied belowMedulla
{"LOG_28MAY":[{"pk":"22","venue_name":"manchester","venue_code":"03839","fields":{"codename":"L01","name":"Can add log entry","content_type":"8","DAILY_LIST":["LOG_ID:12309","HOST_ID:1293123"]}},{"pk":"23","venue_name":"Birmingham","fields":{"codename":"Edit Log entry","content_type":"9","DAILY_LIST":["LOG_ID:230912309","HOST_ID:2494569","LOG_LOCATION_ID:20190627"]}}]}Medulla
Try changing the 2nd script to use json_data = json.load(f_input)['LOG_28MAY'] I suggest you start a new question.Garver
with the root tag the elements of the array are not retrieved. i.e elements of DAILY LIST are not fully populated. 1 missing from first row and 2 missing from second row. output belowMedulla
DAILY_LIST,codename,content_type,name,pk,venue_code,venue_name HOST_ID:1293123,L01,8,Can add log entry,22,03839,manchester LOG_LOCATION_ID:20190627,Edit Log entry,9,,23,,BirminghamMedulla
@SriniSydney, please copy all of this information and start your own question. You could add a comment to your question with my name and I will see it.Garver
@MartinEvans - new question #56452085Medulla
W
6

You can use the pandas library json_normalize function to flatten the struct, and then deal with it as you please. For example:

import pandas as pd
import json

raw = """[{
  "company_number": "12345678",
  "data": {
    "address": {
      "address_line_1": "Address 1",
      "locality": "Henley-On-Thames",
      "postal_code": "RG9 1DP",
      "premises": "161",
      "region": "Oxfordshire"
    },
    "country_of_residence": "England",
    "date_of_birth": {
      "month": 2,
      "year": 1977
    },
    "etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
    },
    "name": "John M Smith",
    "name_elements": {
      "forename": "John",
      "middle_name": "M",
      "surname": "Smith",
      "title": "Mrs"
    },
    "nationality": "Vietnamese",
    "natures_of_control": [
      "ownership-of-shares-50-to-75-percent"
    ],
    "notified_on": "2016-04-06"
  }
}]"""

data = json.loads(raw)
data = pd.json_normalize(data)
print(data.to_csv())

Which gives you:

,company_number,data.address.address_line_1,data.address.locality,data.address.postal_code,data.address.premises,data.address.region,data.country_of_residence,data.date_of_birth.month,data.date_of_birth.year,data.etag,data.kind,data.links.self,data.name,data.name_elements.forename,data.name_elements.middle_name,data.name_elements.surname,data.name_elements.title,data.nationality,data.natures_of_control,data.notified_on
0,12345678,Address 1,Henley-On-Thames,RG9 1DP,161,Oxfordshire,England,2,1977,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,individual-person-with-significant-control,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,John M Smith,John,M,Smith,Mrs,Vietnamese,['ownership-of-shares-50-to-75-percent'],2016-04-06
Worked answered 20/4, 2020 at 5:52 Comment(0)
H
3

Referring to the answer of Bogdan Mircea,

The code almost served my purpose! But it returns an empty dataframe whenever it encounters an empty list in a nested json.

You can easily overcome the issue by putting this in the code

elif isinstance(data, list):
        rows = []
        if(len(data) != 0):
            for i in range(len(data)):
                [rows.append(elem) for elem in flatten_list(flatten_json(data[i], prev_heading))]
        else:
            data.append(None)
            [rows.append(elem) for elem in flatten_list(flatten_json(data[0], prev_heading))]
Hohenlinden answered 23/3, 2021 at 12:37 Comment(1)
Great catch. Didn't encounter this when I had to deal with my problem, but I modified the cross_join function to deal with this and edited the answer. Thank you!Strachan
O
0
import pandas as pd
import json
import glob
from pandas.io.json import json_normalize

json_files = glob.glob("*.json")
dfs = []
for file in json_files:
    with open(file) as f:
        for line in f.readlines():
            df = pd.json_normalize(json.loads(line))
            list_= ['Item.dataId.S','Item.metadata.M.timestamp.S','Item.sensor.M.celcius.N','Item.sensor.M.water.N'']
            df = df.loc[:, df.columns.isin(list_)]
            dfs.append(df)
df_combine = pd.concat(dfs, sort=False)
df_combine.to_csv('json_to_raw.csv',index= None)
Oster answered 14/10, 2022 at 21:52 Comment(2)
useful link: medium.com/p/4d5899f3b621Oster
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Primula

© 2022 - 2025 — McMap. All rights reserved.