Denormalize/flatten list of nested objects into dot separated key value pairs
Asked Answered
B

2

2

It would have simpler if my nested objects were dictionaries, but these are list of dictionaries. Example:

all_objs1 = [{
    'a': 1,
    'b': [{'ba': 2, 'bb': 3}, {'ba': 21, 'bb': 31}],
    'c': 4
}, {
    'a': 11,
    'b': [{'ba': 22, 'bb': 33, 'bc': [{'h': 1, 'e': 2}]}],
    'c': 44
}]

I expect output in following format:

[
  {'a': 1, 'b.ba': 2, 'b.bb': 3, 'c': 4},
  {'a': 1, 'b.ba': 21, 'b.bb': 31, 'c': 4},
  {'a': 11, 'b.ba': 22, 'b.bb': 33, 'bc.h': 1, 'bc.e': 2, 'c': 44},
]

Basically, number of flattened objects generated will be equal to (obj * depth)

With my current code:

def flatten(obj, flattened_obj, last_key=''):
  for k,v in obj.iteritems():
    if not isinstance(v, list):
      flattened_obj.update({last_key+k : v})
    else:
      last_key += k + '.'
      for nest_obj in v:
        flatten(nest_obj, flattened_obj, last_key)
        last_key = remove_last_key(last_key)

def remove_last_key(key_path):
    second_dot = key_path[:-1].rfind('.')
    if second_dot > 0:
      return key_path[:second_dot+1]
    return key_path

Output:


[
  {'a': 1, 'b.bb': 31, 'c': 4, 'b.ba': 21},
  {'a': 11, 'b.bc.e': 2, 'c': 44, 'b.bc.h': 1, 'b.bb': 33, 'b.ba': 22}
]

I am able to flatten the object (not accurate though), but I am not able to create a new object at each nested object. I can not use pandas library as my app is deployed on app engine.

Bucaramanga answered 21/9, 2017 at 10:43 Comment(4)
I see that a complex object (a dict) is always encapsulated in a list. What would happen if in the input you'd have e.g. "d" : {"da": 4, "db": 6}? Is that possible?Salvidor
No, nested object will always be objects in a list. Value is always going to be a either a list of objects or string/number.Bucaramanga
Another question: what if for example the 1st object in the input would have more that one list field? e.g. "d": [{"da": 1, "db": 2},{"da": 3, "db": 4}], that would yield 4 entries in the output for it?Salvidor
@CristiFati, first of all there can be only one key having list of further dictionaries. But these dictionaries can be further nested in same way. So in first object, if collection is there at "b", it can not have any other key with collection.Bucaramanga
S
3

code.py:

from itertools import product
from pprint import pprint as pp


all_objs = [{
    "a": 1,
    "b": [{"ba": 2, "bb": 3}, {"ba": 21, "bb": 31}],
    "c": 4,
    #"d": [{"da": 2}, {"da": 5}],
}, {
    "a": 11,
    "b": [{"ba": 22, "bb": 33, "bc": [{"h": 1, "e": 2}]}],
    "c": 44,
}]


def flatten_dict(obj, parent_key=None):
    base_dict = dict()
    complex_items = list()
    very_complex_items = list()
    for key, val in obj.items():
        new_key = ".".join((parent_key, key)) if parent_key is not None else key
        if isinstance(val, list):
            if len(val) > 1:
                very_complex_items.append((key, val))
            else:
                complex_items.append((key, val))
        else:
            base_dict[new_key] = val
    if not complex_items and not very_complex_items:
        return [base_dict]
    base_dicts = list()
    partial_dicts = list()
    for key, val in complex_items:
        partial_dicts.append(flatten_dict(val[0], parent_key=new_key))
    for product_tuple in product(*tuple(partial_dicts)):
        new_base_dict = base_dict.copy()
        for new_dict in product_tuple:
            new_base_dict.update(new_dict)
        base_dicts.append(new_base_dict)
    if not very_complex_items:
        return base_dicts
    ret = list()
    very_complex_keys = [item[0] for item in very_complex_items]
    very_complex_vals = tuple([item[1] for item in very_complex_items])
    for product_tuple in product(*very_complex_vals):
        for base_dict in base_dicts:
            new_dict = base_dict.copy()
            new_items = zip(very_complex_keys, product_tuple)
            for key, val in new_items:
                new_key = ".".join((parent_key, key)) if parent_key is not None else key
                new_dict.update(flatten_dict(val, parent_key=new_key)[0])
            ret.append(new_dict)
    return ret


def main():
    flatten = list()
    for obj in all_objs:
        flatten.extend(flatten_dict(obj))
    pp(flatten)


if __name__ == "__main__":
    main()

Notes:

  • As expected, recursion is used
  • It's general, it also works for the case that I mentioned in my 2nd comment (for one input dict having more than one key with a value consisting of a list with more than one element), that can be tested by decommenting the "d" key in all_objs. Also, theoretically it should support any depth
  • flatten_dict: takes an input dictionary and outputs a list of dictionaries (as the input dictionary might yield more than one output dictionary):
    • Every key having a "simple" (not list) value, goes into the output dictionar(y/ies) unchanged
    • At this point, a base output dictionary is complete (if the input dictionary will generate more than output dictionary, all will have the base dictionary keys/values, if it only generates one output dictionary, then that will be the base one)
    • Next, the keys with "problematic" values - that may generate more than output dictionary - (if any) are processed:
      • Keys having a list with a single element ("problematic") - each might generate more than one output dictionary:
        • Each of the values will be flattened (might yield more than one output dictionary); the corresponding key will be used in the process
        • Then, the cartesian product will be computed on all the flatten dictionary lists (for current input, there will only be one list with one element)
        • Now, each product item needs to be in a distinct output dictionary, so the base dictionary is duplicated and updated with the keys / values of every element in the product item (for current input, there will be only one element per product item)
        • The new dictionary is appended to a list
      • At this point a list of base dictionaries (might only be one) is complete, if no values consisting of lists with more than one element are present, this is the return list, else everything below has to be done for each base dictionary in the list
      • Keys having a list with a more elements ("very problematic") - each will generate more than one output dictionaries:
        • First, the cartesian product will be computed against all the values (lists with more than one element). In the current case, since since it's only one such list, each product item will only contain an element from that list
        • Then, for each product item element, its key will need to be established based on the lists order (for the current input, the product item will only contain one element, and also, there will only be one key)
        • Again, each product item needs to be in a distinct output dictionary, so the base dictionary is duplicated and updated with the keys / values, of the flattened product item
      • The new dictionary is appended to the output dictionaries list
  • Works with Python 3 and Python 2
  • Might be slow (especially for big input objects), as performance was not the goal. Also since it was built bottom-up (adding functionality as new cases were handled), it is pretty twisted (RO: intortocheated :) ), there might be a simpler implementation that I missed.

Output:

c:\Work\Dev\StackOverflow\q046341856>c:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe code.py
[{'a': 1, 'b.ba': 2, 'b.bb': 3, 'c': 4},
 {'a': 1, 'b.ba': 21, 'b.bb': 31, 'c': 4},
 {'a': 11, 'b.ba': 22, 'b.bb': 33, 'b.bc.e': 2, 'b.bc.h': 1, 'c': 44}]

@EDIT0:

  • Made it more general (although it's not visible for the current input): values containing only one element can yield more than output dictionary (when flattened), addressed that case (before I was only considering the 1st output dictionary, simply ignoring the rest)
  • Corrected a logical error that was masked out tuple unpacking combined with cartesian product: if not complex_items ... part

@EDIT1:

  • Modified the code to match a requirement change: the key in the flattened dictionary must have the full nesting path in the input dictionary
Salvidor answered 21/9, 2017 at 19:3 Comment(3)
{'a': 11, 'b.ba': 22, 'b.bb': 33, 'bc.e': 2, 'bc.h': 1, 'c': 44}] This output object is not correct, expected to be {'a': 11, 'b.ba': 22, 'b.bb': 33, 'b.bc.e': 2, 'b.bc.h': 1, 'c': 44}] All level nesting should maintain its complete path.Bucaramanga
solution worked very well with minor change. Thanks a lot :)Bucaramanga
Although it was in the actual output, the complete key path was not a requirement in the expected output ('bc.h'). I found that a little bit strange, but I thought that was the requirement. Anyway, when going deeper into recursion, the parent_key must be set to new_key instead of key. Will edit the answer accordingly.Salvidor
D
0

use this code to get your desired output. It generates output based on recursive call.

import json
from copy import deepcopy
def flatten(final_list, all_obj, temp_dct, last_key):


    for dct in all_obj:
        deep_temp_dct = deepcopy(temp_dct)
        for k, v in dct.items():
            if isinstance(v, list):
                final_list, deep_temp_dct = flatten(final_list, v, deep_temp_dct, k)
            else:
                prefix = ""
                if last_key : prefix = last_key + "."
                key = prefix+ k
                deep_temp_dct[key] = v
        if deep_temp_dct not in final_list:
            final_list.append(deep_temp_dct)

    return final_list, deep_temp_dct

final_list, _ = flatten([], all_objs1, {}, "")
print json.dumps(final_list, indent =4 )

let me know if it works for you.

Deceptive answered 21/9, 2017 at 12:49 Comment(2)
It doesn't work well with this sample: [{'a': 1, 'c': [{'aa': 11, 'bb': 22}, {'aa': 101, 'bb': 202}], 'b': 2}] <br/> Output comes out: [ { "a": 1, "c.aa": 11, "c.bb": 22 }, { "a": 1, "c.aa": 101, "c.bb": 202, "b": 2 } ] , i.e. missing "b": 2 in first generated flat object.Bucaramanga
@Madhukar, it sometimes doesn't iterate over key, value coming after the collection value.Bucaramanga

© 2022 - 2024 — McMap. All rights reserved.