Find all occurrences of a key in nested dictionaries and lists
Asked Answered
H

14

121

I have a dictionary like this:

{
    "id": "abcde",
    "key1": "blah",
    "key2": "blah blah",
    "nestedlist": [
        {
            "id": "qwerty",
            "nestednestedlist": [
                {
                    "id": "xyz",
                    "keyA": "blah blah blah"
                },
                {
                    "id": "fghi",
                    "keyZ": "blah blah blah"
                }
            ],
            "anothernestednestedlist": [
                {
                    "id": "asdf",
                    "keyQ": "blah blah"
                },
                {
                    "id": "yuiop",
                    "keyW": "blah"
                }
            ]
        }
    ]
}

Basically a dictionary with nested lists, dictionaries, and strings, of arbitrary depth.

What is the best way of traversing this to extract the values of every "id" key? I want to achieve the equivalent of an XPath query like "//id". The value of "id" is always a string.

So from my example, the output I need is basically:

["abcde", "qwerty", "xyz", "fghi", "asdf", "yuiop"]

Order is not important.

Heritor answered 21/3, 2012 at 15:26 Comment(2)
See also: #7681801 https://mcmap.net/q/110938/-search-for-a-key-in-a-nested-python-dictionarySlaby
Most of your solutions blow up if we pass None as input. Do you care about robustness? (since this is now being used as canonical question)Exhibition
P
109

I found this Q/A very interesting, since it provides several different solutions for the same problem. I took all these functions and tested them with a complex dictionary object. I had to take two functions out of the test, because they had to many fail results and they did not support returning lists or dicts as values, which i find essential, since a function should be prepared for almost any data to come.

So i pumped the other functions in 100.000 iterations through the timeit module and output came to following result:

0.11 usec/pass on gen_dict_extract(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6.03 usec/pass on find_all_items(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.15 usec/pass on findkeys(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1.79 usec/pass on get_recursively(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.14 usec/pass on find(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.36 usec/pass on dict_extract(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

All functions had the same needle to search for ('logging') and the same dictionary object, which is constructed like this:

o = { 'temparature': '50', 
      'logging': {
        'handlers': {
          'console': {
            'formatter': 'simple', 
            'class': 'logging.StreamHandler', 
            'stream': 'ext://sys.stdout', 
            'level': 'DEBUG'
          }
        },
        'loggers': {
          'simpleExample': {
            'handlers': ['console'], 
            'propagate': 'no', 
            'level': 'INFO'
          },
         'root': {
           'handlers': ['console'], 
           'level': 'DEBUG'
         }
       }, 
       'version': '1', 
       'formatters': {
         'simple': {
           'datefmt': "'%Y-%m-%d %H:%M:%S'", 
           'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
         }
       }
     }, 
     'treatment': {'second': 5, 'last': 4, 'first': 4},   
     'treatment_plan': [[4, 5, 4], [4, 5, 4], [5, 5, 5]]
}

All functions delivered the same result, but the time differences are dramatic! The function gen_dict_extract(k,o) is my function adapted from the functions here, actually it is pretty much like the find function from Alfe, with the main difference, that i am checking if the given object has iteritems function, in case strings are passed during recursion:

# python 2
def gen_dict_extract(key, var):
    if hasattr(var,'iteritems'): # hasattr(var,'items') for python 3
        for k, v in var.iteritems(): # var.items() for python 3
            if k == key:
                yield v
            if isinstance(v, dict):
                for result in gen_dict_extract(key, v):
                    yield result
            elif isinstance(v, list):
                for d in v:
                    for result in gen_dict_extract(key, d):
                        yield result

So this variant is the fastest and safest of the functions here. And find_all_items is incredibly slow and far off the second slowest get_recursivley while the rest, except dict_extract, is close to each other. The functions fun and keyHole only work if you are looking for strings.

Interesting learning aspect here :)

Pademelon answered 15/4, 2015 at 14:9 Comment(18)
This is the right answer for anyone but the OP's narrow case. This has worked on every nasty thing that jsonpickle can throw at it.Sacred
If you want to search for multiple keys as I did, just: (1) change to gen_dict_extract(keys, var) (2) put for key in keys: as line 2 & indent the rest (3) change the first yield to yield {key: v}Sacred
You're comparing apples to oranges. Running a function that returns a generator takes less time than running a function that returns a finished result. Try timeit on next(functionname(k, o) for all the generator solutions.Rayon
hasattr(var, 'items') for python3Ventral
Did you consider to strip the if hasattr part for a version using try to catch the exception in case the call fails (see pastebin.com/ZXvVtV0g for a possible implementation)? That would reduce the doubled lookup of the attribute iteritems (once for hasattr() and once for the call) and thus probably reduce the runtime (which seems important to you). Didn't make any benchmarks, though.Detective
I have to agree with @Rayon on this. Those timeit results are absolutely meaningless. It might have been the same dict and the same search term, but generators by their nature do nothing until you start iterating through them. If the dict object had 1 entry, or 99 Quadrillion your generator would "finish" in the exact same amount of time, because all it did was find the first result and then it stopped processing. The other solutions, namely "get_recursively", ACTUALLY found all occurrences of the key, as the OP had asked. As such, this answer is very misleading.Ottoman
I tried list(gen_dict_extract('logging',o)), but it returns an empty list, why is this?Copestone
For anyone visiting this page now that Python 3 has taken over, remember that iteritems has become items.Arakawa
I do not understand how this function actually solves all cases. Bruno mentions in response to kev's answer that this is the better solution. To me it seems more narrow: it does not look through lists (which don't have the items or in Py2 iteritems attributes). Yet, kev's solution handles this. What am I missing?Arakawa
@edge27 as Mike Williamson has mentioned, you are most likely using Python 3 and therefore have to use 'iter' instead of 'iteritems'Pademelon
@MikeWilliamson Lists are not handled, as they are values in a dictionary and this function only handles returning what ever is set with given key. So it could return a list as value, but searching for a value with given key within a list is not supportedPademelon
What is really bad is that you assume the top-level is a dictionary (and not a list), but even worse, you also unnecessarily assume that lists have primitives or dicts as values. Typical case of premature optimisation for speed, when the algorithm doesn't work for all cases (which it easily could)Kandi
Are you sure that your function works? - I am using your function and data and it returns empty list.Amory
@PatrickB. totally agree, cleary needs an update after five years, will write an update and add a notebook to allow playing around with as soon as I find the timePademelon
This example would support lists like ["a":{...}] if added this block at the end: elif isinstance(var, list): for l in var: for result in gen_dict_extract(key, l): yield resultKarriekarry
By the way, you can write yield from gen_dict_extract(key, d), instead of directly iterating over the resultBraunite
Why are you testing if it is a dictionary in two different ways? You do hasattr(var,'items') and later if isinstance(v, dict). Also, this fails if your initial var is a list. Also, yield rocks but comparing it to fully evaluated solutions is like comparing the blueprints to the house. And if you like performance so much, maybe you should not go with C-style for cycles where you keep comparing past finding. It's ok to criticise everybody else's solution, especially when (a) your solution is better and (b) it actually works.Mancy
@Mancy my post was never intended to be criticism to others solutions. In contrary, 9 years ago I was excited about the different approaches and just wanted to do a speed comparison as a learning experience to find the solution that best fits my use case back then. By now, the post definitely needs an update (which I promised 4 years ago) with py 3.11 and the valuable comments included.Pademelon
S
50
d = { "id" : "abcde",
    "key1" : "blah",
    "key2" : "blah blah",
    "nestedlist" : [ 
    { "id" : "qwerty",
        "nestednestedlist" : [ 
        { "id" : "xyz", "keyA" : "blah blah blah" },
        { "id" : "fghi", "keyZ" : "blah blah blah" }],
        "anothernestednestedlist" : [ 
        { "id" : "asdf", "keyQ" : "blah blah" },
        { "id" : "yuiop", "keyW" : "blah" }] } ] } 


def fun(d):
    if 'id' in d:
        yield d['id']
    for k in d:
        if isinstance(d[k], list):
            for i in d[k]:
                for j in fun(i):
                    yield j

>>> list(fun(d))
['abcde', 'qwerty', 'xyz', 'fghi', 'asdf', 'yuiop']
Spurgeon answered 21/3, 2012 at 15:40 Comment(8)
The only thing I would change is for k in d to for k,value in d.items() with the subsequent use of value instead of d[k].Outweigh
Thanks, this works great. Required very slight modification because my lists can contain strings as well as dicts (which I didn't mention), but otherwise perfect.Heritor
This fits a very narrow case, you owe it to yourself to consider the answer from "hexerei software" called gen_dict_extractSacred
I got the error "TypeError: argument of type 'NoneType' is not iterable"Copestone
For anyone visiting this page now that Python 3 has taken over, remember that iteritems has become items.Arakawa
This solution doesn't seem to support listsAustine
@AlexR Indeed it doesn't. If you do d['nestedlist'] = [d['nestedlist']] you'll get a TypError. This answer incorrectly, and unnecessarily, assumes lists cannot be either at the root level of the data structure and also that they cannot be directly nested.Kandi
I think that this solution does not support (nested) lists.Amory
P
42
d = { "id" : "abcde",
    "key1" : "blah",
    "key2" : "blah blah",
    "nestedlist" : [
    { "id" : "qwerty",
        "nestednestedlist" : [
        { "id" : "xyz", "keyA" : "blah blah blah" },
        { "id" : "fghi", "keyZ" : "blah blah blah" }],
        "anothernestednestedlist" : [
        { "id" : "asdf", "keyQ" : "blah blah" },
        { "id" : "yuiop", "keyW" : "blah" }] } ] }


def findkeys(node, kv):
    if isinstance(node, list):
        for i in node:
            for x in findkeys(i, kv):
               yield x
    elif isinstance(node, dict):
        if kv in node:
            yield node[kv]
        for j in node.values():
            for x in findkeys(j, kv):
                yield x

print(list(findkeys(d, 'id')))
Peanuts answered 9/11, 2013 at 3:14 Comment(4)
This example worked with every complex dictionary I tested. Well done.Triune
This should be the accepted answer, it can find keys that are within dictionaries that are nested within list of lists etc.Kandi
This works in Python3 as well, as long as the print statement at the end is modified. None of the solutions above this worked for an API response with lists nested inside dicts listed inside lists, etc, but this one worked beautifully.Requisite
This solution is epic! Really delivered for me, thank you!Thessalonian
D
29
def find(key, value):
  for k, v in value.items():
    if k == key:
      yield v
    elif isinstance(v, dict):
      for result in find(key, v):
        yield result
    elif isinstance(v, list):
      for d in v:
        for result in find(key, d):
          yield result

EDIT: @Anthon noticed that this will not work for directly nested lists. If you have this in your input, you can use this:

def find(key, value):
  for k, v in (value.items() if isinstance(value, dict) else
               enumerate(value) if isinstance(value, list) else []):
    if k == key:
      yield v
    elif isinstance(v, (dict, list)):
      for result in find(key, v):
        yield result

But I think the original version is easier to understand, so I will leave it.

Detective answered 21/3, 2012 at 15:48 Comment(11)
You can strip the dict-elif-branch if you like. your case doesn't seem to have these.Detective
This works great as well, but likewise runs into issues if it encounters a list that directly contains a string (which I forgot to include in my example). I think adding in an isinstance check for a dict before the last two lines solves this.Heritor
@Detective congratulations! see my answer, where i speed tested all functions, your function is outstanding, only overtaken by my adaption to your function, where i checked if given object has iteritems() functionPademelon
Thanks for the accolades, but I'd be prouder to get them for the cleanliness of my code than for its speed.Detective
95% of the time, yes. The remaining (rare) occasions are the ones in which some time limitation might force me to choose a faster version over a cleaner one. But I don't like this. It always means to put a load of work onto my successor who will have to maintain that code. It is a risk because my successor might get confused. I will have to write a lot of comments then, maybe a whole document explaining my motivations, timing experiments, their results etc. That's way more work for me and all colleagues to get it done properly. Cleaner is way simpler.Detective
@Detective - thanks for this answer. I had a need to extract all occurences of a string in a nested dict for a specific use case of Elasticsearch and this code was useful with a minor modification - #40586520Reader
This completely breaks on lists directly contained in lists.Kandi
@Kandi You are right. I could change the implementation to also support nested lists but then it gets more complex and harder to understand, and the OP's case didn't have those anyway. But thanks for point that out anyway.Detective
I am not sure why you think it gets more complex, as in fact the character count would decrease by at least 5% and your line-count would stay the same for that function. It is just re-arranging of the lines you have, indenting accordingly, and changing one elif to if.Kandi
I added another version solving your issue of directly nested lists. But I still believe the original version is easier to understand.Detective
I think that your second version indeed works for nested lists - well done ;)Amory
S
17

pip install nested-lookup does exactly what you are looking for:

document = [ { 'taco' : 42 } , { 'salsa' : [ { 'burrito' : { 'taco' : 69 } } ] } ]

>>> print(nested_lookup('taco', document))
[42, 69]
Szechwan answered 29/12, 2020 at 16:8 Comment(1)
Works the same as @Peanuts answer above, plus there's a bunch of other functions in nested-lookup library. For the above example use from nested_lookup import nested_lookupEntrust
R
9

I just wanted to iterate on @hexerei-software's excellent answer using yield from and accepting top-level lists.

def gen_dict_extract(var, key):
    if isinstance(var, dict):
        for k, v in var.items():
            if k == key:
                yield v
            if isinstance(v, (dict, list)):
                yield from gen_dict_extract(v, key)
    elif isinstance(var, list):
        for d in var:
            yield from gen_dict_extract(d, key)
Radbourne answered 21/5, 2018 at 7:34 Comment(2)
Excellent mod to @hexerei-software's answer: succinct and allows list-of-dicts! I'm using this along with @bruno-bronosky's suggestions in his comments to use for key in keys. Also I added to the 2nd isinstance to (list, tuple) for even more variety. ;)Supervisor
From a suggested edit: "Note: this appears to be about 50 times slower than the other gen_dict_extract"Homoeroticism
S
6

This function recursively searches a dictionary containing nested dictionaries and lists. It builds a list called fields_found, which contains the value for every time the field is found. The 'field' is the key I'm looking for in the dictionary and its nested lists and dictionaries.

def get_recursively(search_dict, field):
    """Takes a dict with nested lists and dicts,
    and searches all dicts for a key of the field
    provided.
    """
    fields_found = []

    for key, value in search_dict.iteritems():

        if key == field:
            fields_found.append(value)

        elif isinstance(value, dict):
            results = get_recursively(value, field)
            for result in results:
                fields_found.append(result)

        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    more_results = get_recursively(item, field)
                    for another_result in more_results:
                        fields_found.append(another_result)

    return fields_found
Sapp answered 27/11, 2013 at 23:9 Comment(1)
You could use fields_found.extend(more_results) instead of running another loop. Would look a bit cleaner in my opinion.Erbe
C
6

Another variation, which includes the nested path to the found results (note: this version doesn't consider lists):

def find_all_items(obj, key, keys=None):
    """
    Example of use:
    d = {'a': 1, 'b': 2, 'c': {'a': 3, 'd': 4, 'e': {'a': 9, 'b': 3}, 'j': {'c': 4}}}
    for k, v in find_all_items(d, 'a'):
        print "* {} = {} *".format('->'.join(k), v)    
    """
    ret = []
    if not keys:
        keys = []
    if key in obj:
        out_keys = keys + [key]
        ret.append((out_keys, obj[key]))
    for k, v in obj.items():
        if isinstance(v, dict):
            found_items = find_all_items(v, key, keys=(keys+[k]))
            ret += found_items
    return ret
Choose answered 27/10, 2014 at 21:27 Comment(0)
M
4

glom is a great library for searching and restructuring that can also do nested lookups using globs. Example:

In [1]: import glom

In [2]: data = { "id" : "abcde", "key1" : "blah", ... }  # OP example

In [3]: glom.glom(data, '**.id')
Out[3]: ['abcde', 'qwerty', 'xyz', 'fghi', 'asdf', 'yuiop']

The nesting levels are separated by dots (think of them as slashes in Unix globs), single star is a placeholder for one level, double star is a placeholder for multiple levels. In the above example, **.id means "id key on any level". More examples:

In [4]: glom.glom(d, ('*.*.id', glom.Flatten()))
Out[4]: ['qwerty']

This will go through all keys on third level only and extract the values for the id, whereas empty results are discarded (the nested results list is flattened).

Another example that collects id values specifically in the anothernestednestedlist list only:

In [5]: glom.glom(d, ('nestedlist', [('anothernestednestedlist', ['id'])]))
Out[5]: [['asdf', 'yuiop']]

However, the library can do a lot more than just that. Read the glom docs for more features.

Molybdenous answered 15/8, 2023 at 15:6 Comment(1)
THANK YOU SO MUCH! This glom is a life saver. I confess to reinventing a few wheels every now and then, just for fun, but this is great. And its author, Mahmoud, has a lot of other really cool stuff as well.Mancy
S
1

I was not able to get the solutions posted here to work out-of-the-box so I wanted to write something that is a little more flexible.

The recursive function below that should allow you to collect all values that meet some regex pattern for a given key in an arbitrarily deeply set of nested dictionaries and lists.

import re

def search(dictionary, search_pattern, output=None):
    """
    Search nested dictionaries and lists using a regex search
    pattern to match a key and return the corresponding value(s).
    """
    if output is None:
        output = []

    pattern = re.compile(search_pattern)
    
    for k, v in dictionary.items():
        pattern_found = pattern.search(k)
        if not pattern_found:
            if isinstance(v, list):
                for item in v:
                    if isinstance(item, dict):
                        search(item, search_pattern, output)
            if isinstance(v, dict):
                search(v, search_pattern, output)
        else:
            if pattern_found:
                output.append(v)

    return output

If you want to search for a specific term, you can always make your search pattern something like r'\bsome_term\b'.

Savonarola answered 24/5, 2021 at 4:34 Comment(1)
Fresh approach but does not work with nested <class 'dict'> such as mylist = { 'sad': [[1,1.2],[2,1.3],[3,1.4]], 'puppy': [[2,2.2],[3,2.3]], 'happy':[[3,4.5],[5,6.7,{'happy':[1,2,3]}]]}Entrust
L
0

Here is my stab at it:

def keyHole(k2b,o):
  # print "Checking for %s in "%k2b,o
  if isinstance(o, dict):
    for k, v in o.iteritems():
      if k == k2b and not hasattr(v, '__iter__'): yield v
      else:
        for r in  keyHole(k2b,v): yield r
  elif hasattr(o, '__iter__'):
    for r in [ keyHole(k2b,i) for i in o ]:
      for r2 in r: yield r2
  return

Ex.:

>>> findMe = {'Me':{'a':2,'Me':'bop'},'z':{'Me':4}}
>>> keyHole('Me',findMe)
<generator object keyHole at 0x105eccb90>
>>> [ x for x in keyHole('Me',findMe) ]
['bop', 4]
Lange answered 4/3, 2015 at 22:5 Comment(0)
A
0

Following up on @hexerei software's answer and @bruno-bronosky's comment, if you want to iterate over a list/set of keys:

def gen_dict_extract(var, keys):
   for key in keys:
      if hasattr(var, 'items'):
         for k, v in var.items():
            if k == key:
               yield v
            if isinstance(v, dict):
               for result in gen_dict_extract([key], v):
                  yield result
            elif isinstance(v, list):
               for d in v:
                  for result in gen_dict_extract([key], d):
                     yield result    

Note that I'm passing a list with a single element ([key]}, instead of the string key.

Aerification answered 17/2, 2020 at 11:37 Comment(0)
B
0

Since there's a maximum recursion depth in Python I would consider implementing an iterative approach for arbitrary size:

def get_ids(data: Dict, key: str) -> List:
    stack = [data]
    result = []
    while stack:
        elem = stack.pop()
        if isinstance(elem, dict):
            for k, v in elem.items():
                if k == key:
                    result.append(v)
                if isinstance(elem, (list, dict)):
                    stack.append(v)
        elif isinstance(elem, list):
            for obj in elem:
                stack.append(obj)
    return result
Braunite answered 11/7, 2022 at 10:39 Comment(0)
M
0

Intro

With the benefit of a decade of Python evolution (working on 3.11) and after commenting on the most voted solution, I developed a variant that, after the fact, I realised walked the very steps of many of the proposed solutions. Still, my goal was to add context to the response in both depth and breadth.

XPath solution

My solution delivers both a sort of XPath to the found node, but also surrounding context on demand. To this end, it accepts an iterable of keys to be output together with the target key. Obviously, it delivers key: value dictionary elements, for the benefit of readability:

def extract(var, key, context_keys=(), xpath=''):
    if isinstance(var, dict):
        if key in var:
            yield {f'{xpath}.{key}': var[key]} | {f'{xpath}.{key}': value for key, value in var.items() if key in context_keys}
        for subkey, value in var.items():
            yield from extract(value, key, context_keys, f'{xpath}.{subkey}')
    elif isinstance(var, list):
        for i, elem in enumerate(var):
            yield from extract(elem, key, context_keys, f'{xpath}[{i}]')

With this, looking for 'id' would retrieve:

[
  {
    ".id": "abcde"
  },
  {
    ".nestedlist[0].id": "qwerty"
  },
  {
    ".nestedlist[0].nestednestedlist[0].id": "xyz"
  },
  {
    ".nestedlist[0].nestednestedlist[1].id": "fghi"
  },
  {
    ".nestedlist[0].anothernestednestedlist[0].id": "asdf"
  },
  {
    ".nestedlist[0].anothernestednestedlist[1].id": "yuiop"
  }
]

Obviously, since all keys represent different XPaths, we could unify the entries into a single dictionary, with either reduce or comprehension, getting this:

{
  ".id": "abcde",
  ".nestedlist[0].id": "qwerty",
  ".nestedlist[0].nestednestedlist[0].id": "xyz",
  ".nestedlist[0].nestednestedlist[1].id": "fghi",
  ".nestedlist[0].anothernestednestedlist[0].id": "asdf",
  ".nestedlist[0].anothernestednestedlist[1].id": "yuiop"
}

The context could be used for something like this:

>>> pp(reduce(lambda acc, elem: acc | elem, extract(d, 'id', 'keyA')))
{
  ".id": "abcde",
  ".nestedlist[0].id": "qwerty",
  ".nestedlist[0].nestednestedlist[0].id": "xyz",
  ".nestedlist[0].nestednestedlist[0].keyA": "blah blah blah", # <-- HERE
  ".nestedlist[0].nestednestedlist[1].id": "fghi",
  ".nestedlist[0].anothernestednestedlist[0].id": "asdf",
  ".nestedlist[0].anothernestednestedlist[1].id": "yuiop"
}

I'm working on a solution to return an object that has the same structure of the original, but only the wanted keys. If I get it to work, I'll add it here.

Mancy answered 13/11, 2023 at 13:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.