pyarrow data types for columns that have lists of dictionaries?
Asked Answered
T

1

7

Is there a special pyarrow data type I should use for columns which have lists of dictionaries when I save to a parquet file?

If I save lists or lists of dictionaries as a string, I normally have to .apply(eval) the field if I read it into memory again in order for pandas to recognize the data as a list (so I can normalize it with pd.json_normalize)

column_a:

[
 {"id": "something", "value": "else"},
 {"id": "something2", "value": "else2"},
]

column_b:

["test", "test2", "test3"]

Just wondering if I should save this data as something else besides a string.

Edit - pasting a snippet of some raw JSON from Zendesk. The audits field has a field called events which is a list of dictionaries. Inside that, there can be other lists of dictionaries as well (attachments and inside that there is a list of dictionaries called thumbnails)

Are you able to use pa.map_ to handle situations like this? I sometimes need to retrieve data from these nested fields which I do not even know exist initially. In my current parquet dataset, the events field is just a single column (string type) even though there are many nested fields within it.

udt = pa.map_(pa.string(), pa.string())

.

  "audit": {
    "id": ,
    "ticket_id": ,
    "created_at": "",
    "author_id": ,
    "events": [
      {
        "id": ,
        "type": "",
        "author_id": ,
        "body": "" ,
        "plain_body": "",
        "public": false,
        "attachments": [
          {
            "url": "",
            "id": ,
            "file_name": "",
            "content_url": "",
            "content_type": "image/png",
            "size": 2888,
            "width": 100,
            "height": 30,
            "inline": false,
            "deleted": false,
            "thumbnails": [
              {
                "url": "",
                "id": ,
                "file_name": "",
                "content_url": "",
                "mapped_content_url": "",
                "content_type": "image/png",
                "size": 2075,
                "width": 80,
                "height": 24,
                "inline": false,
                "deleted": false
              }
            ]
          },
Try answered 24/8, 2020 at 1:44 Comment(1)
Why the downvote? Seems like an interesting question. – Waldenses
B
6

Assuming you have a df with "dictionary" and string columns, and the dictionaries all have the same keys (id, value in your case):

df = pd.DataFrame({
        'col1': pd.Series([
            {"id": "something", "value": "else"}, 
            {"id": "something2", "value": "else2"}
        ]),
        'col2': pd.Series(['foo', 'bar'])
    }
)

udt = pa.struct([pa.field('id', pa.string()), pa.field('value', pa.string())])
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)
df = table.to_pandas()

If your dictionaries don't have the same keys or you don't know the keys of the dictionaries in advance, you can do this:

df = pd.DataFrame({
        'col1': pd.Series([
            [('id', 'something'), ('value', '"else')],
            [('id', 'something2'), ('value','else2')],
        ]),
        'col2': pd.Series(['foo', 'bar'])
    }
)

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)

Note that the format for col1 is different (it is using a list of pairs instead of a dict). Also you can't convert your table back to pandas as it is not supported (yet):

table.to_pandas()
>>> ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.


Bongbongo answered 24/8, 2020 at 12:11 Comment(3)
Cool - thanks. I suppose I won't be able to utilize this yet because I need to get the data back to pandas to flatten it out for csv files or visualizations typically. I edited my question with a more complex real life example - would this pa.map_ handle nested fields within nested fields? – Try
pa.map_ won't work because the values need to be all of the same type. You could define a pa.struct for thumbnail, then define a pa.struct for attachment that would have a pa.list_ of thumbnail. But considering the deeply nested nature of your data and the fact that there are a lot of repeated fields (many attachment/thumbnails in each record) they don't fit very well into a columnar format like arrow or a pandas dataframe. – Bongbongo
I cannot thank you enough for your insight of lists of tuples for map types. I cannot find documentation anywhere that a dict cannot be made into a map, and it seems unintuitive to me that it can't. πŸ‘πŸ‘ – Leukocyte

© 2022 - 2024 β€” McMap. All rights reserved.