I am using pd.json_normalize to flatten the "sections" field in this data into rows. It works fine except for records where "sections" is an empty list: that ID gets dropped entirely and is missing from the final flattened dataframe. I need at least one row per unique ID in the data (an ID may expand to many rows, up to one row per unique combination of ID, section_id, question_id, and answer_id as I unnest more fields). For example, this record currently disappears:
{'_id': '5f48f708fe22ca4d15fb3b55',
'created_at': '2020-08-28T12:22:32Z',
 'sections': []}
Sample data:
sample = [{'_id': '5f48bee4c54cf6b5e8048274',
           'created_at': '2020-08-28T08:23:00Z',
           'sections': [{'comment': '',
                         'type_fail': None,
                         'answers': [{'comment': 'stuff',
                                      'feedback': [],
                                      'value': 10.0,
                                      'answer_type': 'default',
                                      'question_id': '5e59599c68369c24069630fd',
                                      'answer_id': '5e595a7c3fbb70448b6ff935'},
                                     {'comment': 'stuff',
                                      'feedback': [],
                                      'value': 10.0,
                                      'answer_type': 'default',
                                      'question_id': '5e598939cedcaf5b865ef99a',
                                      'answer_id': '5e598939cedcaf5b865ef998'}],
                         'score': 20.0,
                         'passed': True,
                         '_id': '5e59599c68369c24069630fe',
                         'custom_fields': []},
                        {'comment': '',
                         'type_fail': None,
                         'answers': [{'comment': '',
                                      'feedback': [],
                                      'value': None,
                                      'answer_type': 'not_applicable',
                                      'question_id': '5e59894f68369c2398eb68a8',
                                      'answer_id': '5eaad4e5b513aed9a3c996a5'},
                                     {'comment': '',
                                      'feedback': [],
                                      'value': None,
                                      'answer_type': 'not_applicable',
                                      'question_id': '5e598967cedcaf5b865efe3e',
                                      'answer_id': '5eaad4ece3f1e0794372f8b2'},
                                     {'comment': 'stuff',
                                      'feedback': [],
                                      'value': 0.0,
                                      'answer_type': 'default',
                                      'question_id': '5e598976cedcaf5b865effd1',
                                      'answer_id': '5e598976cedcaf5b865effd3'}],
                         'score': 0.0,
                         'passed': True,
                         '_id': '5e59894f68369c2398eb68a9',
                         'custom_fields': []}]},
          {'_id': '5f48f708fe22ca4d15fb3b55',
           'created_at': '2020-08-28T12:22:32Z',
           'sections': []}]
Tests:
import pandas as pd

df = pd.json_normalize(sample)
df2 = pd.json_normalize(df.to_dict(orient="records"), meta=["_id", "created_at"], record_path="sections", record_prefix="section_")
At this point I'm already missing the row for ID "5f48f708fe22ca4d15fb3b55", which I still need.
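One workaround I can sketch (assuming it's acceptable to pre-process the records before normalizing): seed every empty "sections" list with a single empty dict, so json_normalize still emits one row for that ID, with NaN in all the section_* columns.

records = df.to_dict(orient="records")
for rec in records:
    # An empty "sections" list produces zero rows; [{}] produces exactly one
    # all-NaN section row, so the ID survives the flattening.
    if not rec["sections"]:
        rec["sections"] = [{}]
df2 = pd.json_normalize(records, meta=["_id", "created_at"],
                        record_path="sections", record_prefix="section_")

Continuing with the next level of the original test: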
df3 = pd.json_normalize(df2.to_dict(orient="records"), meta=["_id", "created_at", "section__id", "section_score", "section_passed", "section_type_fail", "section_comment"], record_path="section_answers", record_prefix="")
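If that padding idea is sound, it would have to be repeated at every level, because the padded row now carries NaN instead of a list in "section_answers" (again just a sketch):

records2 = df2.to_dict(orient="records")
for rec in records2:
    answers = rec.get("section_answers")
    # The padded row holds NaN (not a list) here; replace anything non-list
    # or empty so this level keeps the row as well.
    if not isinstance(answers, list) or not answers:
        rec["section_answers"] = [{}]
df3 = pd.json_normalize(records2,
                        meta=["_id", "created_at", "section__id", "section_score",
                              "section_passed", "section_type_fail", "section_comment"],
                        record_path="section_answers", record_prefix="")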
Can I alter this somehow to guarantee at least one row per ID? I'm dealing with millions of records and don't want to discover later that some IDs were missing from my final data. The only other solution I can think of is to normalize each dataframe and then left-join it back to the original, roughly like the sketch below.
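That fallback would look something like this (a sketch; it merges on the identifying columns present at both levels, so dropped IDs come back as rows of NaN):

# Re-attach any IDs that normalization dropped; their nested columns stay NaN.
all_ids = df[["_id", "created_at"]].drop_duplicates()
df3_full = all_ids.merge(df3, on=["_id", "created_at"], how="left")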