How to json_normalize a column in pandas with empty lists, without losing records
I am using pd.json_normalize to flatten the "sections" field in this data into rows. It works fine except for records where "sections" is an empty list.

Records like the one below get dropped entirely and are missing from the final flattened dataframe. I need at least one row per unique ID in the data (an ID may expand to many rows, up to one row per unique combination of ID, section_id, question_id, and answer_id as I unnest more fields):

     {'_id': '5f48f708fe22ca4d15fb3b55',
      'created_at': '2020-08-28T12:22:32Z',
      'sections': []}

Sample data:

sample = [{'_id': '5f48bee4c54cf6b5e8048274',
          'created_at': '2020-08-28T08:23:00Z',
          'sections': [{'comment': '',
            'type_fail': None,
            'answers': [{'comment': 'stuff',
              'feedback': [],
              'value': 10.0,
              'answer_type': 'default',
              'question_id': '5e59599c68369c24069630fd',
              'answer_id': '5e595a7c3fbb70448b6ff935'},
             {'comment': 'stuff',
              'feedback': [],
              'value': 10.0,
              'answer_type': 'default',
              'question_id': '5e598939cedcaf5b865ef99a',
              'answer_id': '5e598939cedcaf5b865ef998'}],
            'score': 20.0,
            'passed': True,
            '_id': '5e59599c68369c24069630fe',
            'custom_fields': []},
           {'comment': '',
            'type_fail': None,
            'answers': [{'comment': '',
              'feedback': [],
              'value': None,
              'answer_type': 'not_applicable',
              'question_id': '5e59894f68369c2398eb68a8',
              'answer_id': '5eaad4e5b513aed9a3c996a5'},
             {'comment': '',
              'feedback': [],
              'value': None,
              'answer_type': 'not_applicable',
              'question_id': '5e598967cedcaf5b865efe3e',
              'answer_id': '5eaad4ece3f1e0794372f8b2'},
             {'comment': "stuff",
              'feedback': [],
              'value': 0.0,
              'answer_type': 'default',
              'question_id': '5e598976cedcaf5b865effd1',
              'answer_id': '5e598976cedcaf5b865effd3'}],
            'score': 0.0,
            'passed': True,
            '_id': '5e59894f68369c2398eb68a9',
            'custom_fields': []}]},
         {'_id': '5f48f708fe22ca4d15fb3b55',
          'created_at': '2020-08-28T12:22:32Z',
          'sections': []}]

Tests:

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")

At this point I am missing the row for ID "5f48f708fe22ca4d15fb3b55", which I still need.

df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")

Can I alter this somehow to guarantee at least one row per ID? I'm dealing with millions of records and don't want to discover later that some IDs were missing from my final data. The only fallback I can think of is to normalize each dataframe and then left join it back to the original, roughly like the sketch below.
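Here is an untested sketch of what I mean, assuming sample is the list above (the column names are just what json_normalize produces here):

import pandas as pd

base = pd.json_normalize(sample)  # one row per ID; "sections" is still a nested list column
# record_path="sections" silently drops IDs whose "sections" is empty
flat = pd.json_normalize(sample, record_path="sections", meta=["_id", "created_at"], record_prefix="section_")
# the left join re-attaches the dropped IDs, with NaN in every section_* column
merged = base.drop(columns="sections").merge(flat, on=["_id", "created_at"], how="left")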

Ganiats answered 9/9, 2020 at 14:4
  • The best way to resolve the issue is to fix the dict.
  • If sections is an empty list, fill it with [{'answers': [{}]}].
for i, d in enumerate(sample):
    if not d['sections']:  # an empty "sections" list would drop this record during normalization
        sample[i]['sections'] = [{'answers': [{}]}]

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")

# display(df2)
  section_comment  section_type_fail                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               section_answers  section_score section_passed               section__id section_custom_fields                       _id            created_at
0                                NaN                                                                                                                                                                        [{'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e59599c68369c24069630fd', 'answer_id': '5e595a7c3fbb70448b6ff935'}, {'comment': 'stuff', 'feedback': [], 'value': 10.0, 'answer_type': 'default', 'question_id': '5e598939cedcaf5b865ef99a', 'answer_id': '5e598939cedcaf5b865ef998'}]           20.0           True  5e59599c68369c24069630fe                    []  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z
1                                NaN  [{'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e59894f68369c2398eb68a8', 'answer_id': '5eaad4e5b513aed9a3c996a5'}, {'comment': '', 'feedback': [], 'value': None, 'answer_type': 'not_applicable', 'question_id': '5e598967cedcaf5b865efe3e', 'answer_id': '5eaad4ece3f1e0794372f8b2'}, {'comment': 'stuff', 'feedback': [], 'value': 0.0, 'answer_type': 'default', 'question_id': '5e598976cedcaf5b865effd1', 'answer_id': '5e598976cedcaf5b865effd3'}]            0.0           True  5e59894f68369c2398eb68a9                    []  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z
2             NaN                NaN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [{}]            NaN            NaN                       NaN                   NaN  5f48f708fe22ca4d15fb3b55  2020-08-28T12:22:32Z

df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")

# display(df3)
  comment feedback  value     answer_type               question_id                 answer_id                       _id            created_at               section__id section_score section_passed section_type_fail section_comment
0   stuff       []   10.0         default  5e59599c68369c24069630fd  5e595a7c3fbb70448b6ff935  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59599c68369c24069630fe            20           True               NaN                
1   stuff       []   10.0         default  5e598939cedcaf5b865ef99a  5e598939cedcaf5b865ef998  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59599c68369c24069630fe            20           True               NaN                
2               []    NaN  not_applicable  5e59894f68369c2398eb68a8  5eaad4e5b513aed9a3c996a5  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59894f68369c2398eb68a9             0           True               NaN                
3               []    NaN  not_applicable  5e598967cedcaf5b865efe3e  5eaad4ece3f1e0794372f8b2  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59894f68369c2398eb68a9             0           True               NaN                
4   stuff       []    0.0         default  5e598976cedcaf5b865effd1  5e598976cedcaf5b865effd3  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z  5e59894f68369c2398eb68a9             0           True               NaN                
5     NaN      NaN    NaN             NaN                       NaN                       NaN  5f48f708fe22ca4d15fb3b55  2020-08-28T12:22:32Z                       NaN           NaN            NaN               NaN             NaN
Scene answered 14/9, 2020 at 0:55 Comment(4)
Filling empty values with a list containing an empty dict worked for me and let me keep using json_normalize on its own (rather than another library). The only other thing I had to account for: when you normalize one field while other fields are list-typed, I had to convert those list columns to strings (a rough sketch follows these comments). I now have a working solution, thanks. – Ganiats
@Ganiats I'm glad that worked for you. Working with incorrectly structured JSON files can be a pain. – Scene
@TrentonMcKinney Holy smokes! I just ran into this issue. Thank you for your solution here! – Ravishment
@ScottBoston Hey Scott! Glad this worked for you. JSON objects can be such a pain. Whoops! I said the same thing 1.5 years ago. I stand by that comment. – Scene
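A minimal sketch of that list-columns workaround, using df3 from the answer above ("feedback" is one such list-valued column):

# stringify any column whose cells hold lists, so later normalize/merge steps
# don't trip over unhashable list values
list_cols = [c for c in df3.columns if df3[c].map(lambda v: isinstance(v, list)).any()]
df3[list_cols] = df3[list_cols].astype(str)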

This is a known issue with json_normalize, and I haven't found a way around it using json_normalize itself. You can try the flatten_json package, something like this:

import flatten_json as fj

# flatten each record into a single flat dict; nested lists become numbered keys
dic = (fj.flatten(d) for d in sample)
df = pd.DataFrame(dic)
print(df)

                        _id            created_at sections_0_comment  ...            sections_1__id sections_1_custom_fields sections
0  5f48bee4c54cf6b5e8048274  2020-08-28T08:23:00Z                     ...  5e59894f68369c2398eb68a9                       []      NaN
1  5f48f708fe22ca4d15fb3b55  2020-08-28T12:22:32Z                NaN  ...                       NaN                      NaN       []
Nowhere answered 9/9, 2020 at 14:19 Comment(4)
Known issue as in there is a current issue report, or should I create one? I feel like there should at least be a warning, since you can be missing data without really knowing why. I just tested the flatten_json module, thanks. Is there any chance it can flatten/explode the data into rows instead of columns? This field contains custom fields that can get added to the results at any point; there are literally thousands of unique question and answer IDs, so having a column for each of them won't work for me. – Ganiats
There is an issue opened here: github.com/pandas-dev/pandas/issues/21830 – Nowhere
To flatten into rows, you have to loop through the dicts and flatten each one with the flatten function (see the sketch after this thread). – Nowhere
Cool, let me try that. I also created a new issue that seemed a bit more specific: github.com/pandas-dev/pandas/issues/36245, with an example of the exact output I'd expect when using meta columns. – Ganiats
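A rough sketch of that loop-and-flatten idea (hypothetical; one row per section, while keeping a row for IDs whose "sections" is empty):

import pandas as pd
import flatten_json as fj

rows = []
for record in sample:
    meta = {'_id': record['_id'], 'created_at': record['created_at']}
    if not record['sections']:
        rows.append(meta)  # keep at least one row per ID
        continue
    for section in record['sections']:
        # each section becomes one row; its nested answers become numbered columns
        rows.append({**meta, **fj.flatten(section)})

df = pd.DataFrame(rows)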
