Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file?
If I save lists or lists of dictionaries as a string, I normally have to .apply(eval) the field if I read it into memory again in order for pandas to recognize the data as a list (so I can normalize it with pd.json_normalize).
column_a:
[
{"id": "something", "value": "else"},
{"id": "something2", "value": "else2"},
]
column_b:
["test", "test2", "test3"]
Just wondering if I should save this data as something else besides a string.
Edit - pasting a snippet of some raw JSON from Zendesk. The audits field has a field called events, which is a list of dictionaries. Inside that, there can be other lists of dictionaries as well (attachments, and inside each attachment a list of dictionaries called thumbnails).
Are you able to use pa.map_ to handle situations like this? I sometimes need to retrieve data from these nested fields which I do not even know exist initially. In my current parquet dataset, the events field is just a single column (string type) even though there are many nested fields within it.
udt = pa.map_(pa.string(), pa.string())
"audit": {
  "id": ,
  "ticket_id": ,
  "created_at": "",
  "author_id": ,
  "events": [
    {
      "id": ,
      "type": "",
      "author_id": ,
      "body": "",
      "plain_body": "",
      "public": false,
      "attachments": [
        {
          "url": "",
          "id": ,
          "file_name": "",
          "content_url": "",
          "content_type": "image/png",
          "size": 2888,
          "width": 100,
          "height": 30,
          "inline": false,
          "deleted": false,
          "thumbnails": [
            {
              "url": "",
              "id": ,
              "file_name": "",
              "content_url": "",
              "mapped_content_url": "",
              "content_type": "image/png",
              "size": 2075,
              "width": 80,
              "height": 24,
              "inline": false,
              "deleted": false
            }
          ]
        },