How to read a JSON file with nested objects as a pandas DataFrame?
Asked Answered
F

4

68

I am curious how I can use pandas to read nested json of the following structure:

{
    "number": "",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations": [
        {
            "depTimeDiffMin": "0",
            "name": "Spital am Pyhrn Bahnhof",
            "arrTime": "",
            "depTime": "06:32",
            "platform": "2",
            "stationIdx": "0",
            "arrTimeDiffMin": "",
            "track": "R 3932"
        },
        {
            "depTimeDiffMin": "0",
            "name": "Windischgarsten Bahnhof",
            "arrTime": "06:37",
            "depTime": "06:40",
            "platform": "2",
            "stationIdx": "1",
            "arrTimeDiffMin": "1",
            "track": ""
        },
        {
            "depTimeDiffMin": "",
            "name": "Linz/Donau Hbf",
            "arrTime": "08:24",
            "depTime": "",
            "platform": "1A-B",
            "stationIdx": "22",
            "arrTimeDiffMin": "1",
            "track": ""
        }
    ]
}

This here keeps the array as json. I would rather prefer it to be expanded into columns.

pd.read_json("/myJson.json", orient='records')

edit

Thanks for the first answers. I should refine my question: A flattening of the nested attributes in the array is not mandatory. It would be ok to just [A, B, C] concatenate the df.locations['name'].

My file contains multiple JSON objects (1 per line) I would like to keep number, date, name, and locations column. However, I would need to join the locations.

allLocations = ""
isFirst = True
for location in result.locations:
    if isFirst:
        isFirst = False
        allLocations = location['name']
    else:
        allLocations += "; " + location['name']
allLocations

My approach here does not seem to be efficient / pandas style.

Foal answered 14/11, 2016 at 12:31 Comment(0)
B
82

You can use json_normalize:

import json

with open('myJson.json') as data_file:    
    data = json.load(data_file)  

df = pd.json_normalize(data, 'locations', ['date', 'number', 'name'], 
                    record_prefix='locations_')
print (df)
  locations_arrTime locations_arrTimeDiffMin locations_depTime  \
0                                                        06:32   
1             06:37                        1             06:40   
2             08:24                        1                     

  locations_depTimeDiffMin           locations_name locations_platform  \
0                        0  Spital am Pyhrn Bahnhof                  2   
1                        0  Windischgarsten Bahnhof                  2   
2                                    Linz/Donau Hbf               1A-B   

  locations_stationIdx locations_track number    name        date  
0                    0          R 3932         R 3932  01.10.2016  
1                    1                         R 3932  01.10.2016  
2                   22                         R 3932  01.10.2016 

EDIT:

You can use read_json with parsing name by DataFrame constructor and last groupby with apply join:

df = pd.read_json("myJson.json")
df.locations = pd.DataFrame(df.locations.values.tolist())['name']
df = df.groupby(['date','name','number'])['locations'].apply(','.join).reset_index()
print (df)
        date    name number                                          locations
0 2016-01-10  R 3932         Spital am Pyhrn Bahnhof,Windischgarsten Bahnho... 
Brancusi answered 14/11, 2016 at 12:41 Comment(1)
Let us continue this discussion in chat.Gynaecocracy
L
10

Another option if anyone finds this, as I was working through a notebook. Read the file in as a df with

df = pd.read_json('filename.json')
df2 = pd.DataFrame.from_records(df['nest_level_1']['nest_level_2'])

Happy coding

Lumberjack answered 26/8, 2022 at 2:14 Comment(0)
C
0

A possible alternative to pandas.json_normalize is to build your own dataframe by extracting only the selected keys and values from the nested dictionary. The main reason for doing this is because json_normalize gets slow for very large json file (and might not always produce the output you want).

So, here is an alternative way to flatten the nested dictionary in pandas using glom. The aim is to extract selected keys and value from the nested dictionary and save them in a separate column of the pandas dataframe (:

Here is a step by step guide: https://medium.com/@enrico.alemani/flatten-nested-dictionaries-in-pandas-using-glom-7948345c88f5

import pandas as pd
from glom import glom
from ast import literal_eval


target = {
    "number": "",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations":
        {
            "depTimeDiffMin": "0",
            "name": "Spital am Pyhrn Bahnhof",
            "arrTime": "",
            "depTime": "06:32",
            "platform": "2",
            "stationIdx": "0",
            "arrTimeDiffMin": "",
            "track": "R 3932"
        }
}   



# Import data
df = pd.DataFrame([str(target)], columns=['target'])

# Extract id keys and save value into a separate pandas column
df['id'] = df['target'].apply(lambda row: glom(literal_eval(row), 'locations.name'))
Cavitation answered 23/6, 2020 at 11:15 Comment(0)
R
0

I have a mutiline Json having one json object every line {'a':'b','scope':{'eid':123213}} {'a':'d','scope':{'eid':1343213}}

NO comma seperated. Each line is indepoendent

i used following logic to read nested structure

threshold = pd.read_json(r"/content/data.json",lines=True)

threshold = pd.read_json(r"/content/data.json",lines=True)
threshold['entityId'] = pd.DataFrame.from_records(threshold['scope'])['entityId']
threshold.head()
Retroaction answered 14/3, 2023 at 20:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.