json query that returns parent element and child data?

Asked 12/9, 2015 at 0:11 Answered 20/11, 2015 at 0:45

Given the following json:

{
    "README.rst": {
        "_status": {
            "md5": "952ee56fa6ce36c752117e79cc381df8"
        }
    },
    "docs/conf.py": {
        "_status": {
            "md5": "6e9c7d805a1d33f0719b14fe28554ab1"
        }
    }
}

is there a query language that can produce:

{
    "README.rst": "952ee56fa6ce36c752117e79cc381df8",
    "docs/conf.py": "6e9c7d805a1d33f0719b14fe28554ab1",
}

My best attempt so far with JMESPath (http://jmespath.org/) isn't very close:

>>> jmespath.search('*.*.md5[]', db)
['952ee56fa6ce36c752117e79cc381df8', '6e9c7d805a1d33f0719b14fe28554ab1']

I've gotten to the same point with ObjectPath (http://objectpath.org):

>>> t = Tree(db)
>>> list(t.execute('$..md5'))
['952ee56fa6ce36c752117e79cc381df8', '6e9c7d805a1d33f0719b14fe28554ab1']

I couldn't make any sense of JSONiq (do I really need to read a 105 page manual to do this?) This is my first time looking at json query languages..

Perpetual answered 12/9, 2015 at 0:11 Comment(2)

so you might be searching a key other than md5? and it can be arbitrarily deep? – Weisman 12/9, 2015 at 0:36

Does this happen to be in the context of a Django application? This can be done via rest_framework serializers but that is overkill outside of that context. – Colorful 18/9, 2015 at 16:29

Missed the python requirement, but if you are willing to call external program, this will still work. Please note, that jq >= 1.5 is required for this to work.

# If single "key" $p[0] has multiple md5 keys, this will reduce the array to one key.
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] | add '

# this will not create single object, but you'll see all key, md5 combinations
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] '

Get paths with "md5"-key '?'=ignore errors (like testing scalar for key). From resulting paths ($p) filter and surround result with '{}' = object. And then those are in an array ([] surrounding the whole expression) which is then "added/merged" together |add

https://stedolan.github.io/jq/

Tray answered 18/9, 2015 at 9:8 Comment(5)

This looks interesting, but I don't think your query is correct... this seems like it will find all leaf keys (not just md5) and if there are two leaf keys it will pick the value of the last one... – Perpetual 19/9, 2015 at 13:24

This is the closest I've gotten (also wrong..) [[paths(..) | select(index("md5"))] as $p | {(($p[])[0]): getpath($p[])}] | add. – Perpetual 19/9, 2015 at 14:38

@Perpetual You are right. First code had a "brain freeze and the select did nothing. I've now edited and tested with more complex test file. I'm not really a 'jq' expert, but it's been a really usefull tool. – Tray 21/9, 2015 at 12:51

New code removed the leaf requirement from search key. test with: jq-1.5 'paths(has("_status")?) as $p | { ($p[0]): getpath($p)["_status"]} ' – Tray 21/9, 2015 at 12:56

Excellent! (and also the only answer that answered the question asked <wink>) – Perpetual 21/9, 2015 at 19:6

not sure why you want a query language this is pretty easy

def find_key(data,key="md5"):
    for k,v in data.items():
       if k== key: return v
       if isinstance(v,dict):
          result = find_key(v,key)
          if result:return result

dict((k,find_key(v,"md5")) for k,v in json_result.items())

it's even easier if the value dict always has "_status" and "md5" as keys

dict((k,v["_status"]["md5"]) for k,v in json_result.items())

alternatively I think you could do something like

t = Tree(db)
>>> dict(zip(t.execute("$."),t.execute('$..md5'))

although I dont know that it would match them up quite right ...

Weisman answered 12/9, 2015 at 0:17 Comment(10)

I want a query language so I can give end-users ad-hoc search capabilities in a very large .json file. – Perpetual 12/9, 2015 at 0:19

I think you might want to look up ad-hoc ;-) Each file path has a number of "categories" as direct children (only _status in this example), and each "categry" has data (potentially deeply nested). I would like to create a command line tool that gives end-users (developers) a way to query this structure (think structured grep). – Perpetual 12/9, 2015 at 0:26

ad hoc ˌad ˈhäk/ adjective & adverb formed, arranged, or done for a particular purpose only. – Weisman 12/9, 2015 at 0:28

ahhh I think i now understand the pairing to md5 was only for demonstration and you want them to be able to arbirarilly decide what key to search for ? I still think the best solution is for you to write a python script ... (which is definitely an ad-hoc solution) ... – Weisman 12/9, 2015 at 0:30

I was using it in the "not pre-planned" meaning (dictionary.cambridge.org/dictionary/english/ad-hoc) – Perpetual 12/9, 2015 at 0:31

Any script I write would be similar to defining a query language, so instead of re-inventing the wheel I figured I'd use something existing and well-tested. For xml one would probably use xquery, even if it is perfectly possible to hand-parse xml too. – Perpetual 12/9, 2015 at 0:33

I see i was using it in the general CS term that means a solution designed for a particular purpose (ie pairing level0 json keys with deeper nested keys) but tbh if they are developers and dont know how to search through a json dict on their own they are not much of a developer – Weisman 12/9, 2015 at 0:33

Well.. take a look at the output for radon cc -s -j <project>, it's rather volumnious. Being able to pipe it through a declarative query makes things easier. – Perpetual 12/9, 2015 at 0:36

Its a pretty simple algorithm above that should work ... if you take the arguments from argv ... I think that this is so domain specific you will not find a prebuild solution ... if it was a hard algorithm i can see your concern ... but its not ... your biggest concern is if it nests dictionaries more than 1000 layers deep you may encounter a recursion overflow error – Weisman 12/9, 2015 at 0:45

I;ll upvote the question in the hopes that I am wrong and there is totally a way to do this with some other third party library ... – Weisman 12/9, 2015 at 0:57

Here is the JSONiq code that does the job:

{|
    for $key in keys($document)
    return {
        $key: $document.$key._status.md5
    }
|}

You can execute it here with the Zorba engine.

If the 105-page manual you mention is the specification, I do not recommend reading it as a JSONiq user. I would rather advise reading tutorials or books online, which give a more gentle introduction.

Salpinx answered 12/10, 2015 at 13:9 Comment(2)

Thanks! Can you suggest any tutorials that goes a bit in-depth? – Perpetual 12/10, 2015 at 17:1

Hi thebjorn, Sure! I revived an old tutorial here and updated it to the latest syntax. Also, there are a couple of online talks on youtube by Jonathan Robie and Chris Hillery. – Salpinx 14/10, 2015 at 9:18

Do in ObjectPath:

l = op.execute("[keys($.*), $..md5]")

you'll get:

[
  [
    "README.rst",
    "docs/conf.py"
  ],
  [
    "952ee56fa6ce36c752117e79cc381df8",
    "6e9c7d805a1d33f0719b14fe28554ab1"
  ]
]

then in Python:

dict(zip(l[0],l[1]))

to get:

{
    'README.rst': '952ee56fa6ce36c752117e79cc381df8', 
    'docs/conf.py': '6e9c7d805a1d33f0719b14fe28554ab1'
}

Hope that helps. :)

PS. I'm using OPs' keys() to show how to make full query that works anywhere in the document not only when keys are in the root of document.

PS2. I might add new function so that it would look like: object([keys($.*), $..md5]). Shoot me tweet http://twitter.com/adriankal if you want that.

Auburta answered 20/11, 2015 at 0:45 Comment(0)

A solution that implements a new query language:

def keylist(db):
    "Return all the keys in db."

    def _keylist(db, prefix, res):
        if prefix is None:
            prefix = []

        for key, val in db.items():
            if isinstance(val, dict):
                _keylist(val, prefix + [key], res)
            else:
                res.append(prefix + [key])

    res = []
    _keylist(db, [], res)
    return ['::'.join(key) for key in res]

def get_key(db, key):
    "Get path and value from key."

    def _get_key(db, key, path):
        k = key[0]
        if len(key) == 1:
            return path + [k, db[k]]
        return _get_key(db[k], key[1:], path + [k])

    return _get_key(db, key, [])

def search(query, db):
    "Convert query to regex and use it to search key space."
    keys = keylist(db)
    query = query.replace('*', r'(?:.*?)')
    matching = [key for key in keys if re.match(query, key)]
    res = [get_key(db, key.split('::')) for key in matching]
    return dict(('::'.join(r[:-1]), r[-1]) for r in res)

which gives me something that's pretty close to the requirements:

>>> pprint.pprint(search("*::md5", db))
{'README.rst::_status::md5': '952ee56fa6ce36c752117e79cc381df8',
 'docs/conf.py::_status::md5': '6e9c7d805a1d33f0719b14fe28554ab1'}

and a query language that looks like a glob/re hybrid (if we're making a new language, at least make it look familiar):

>>> pprint.pprint(search("docs*::md5", db))
{'docs/conf.py::_status::md5': '6e9c7d805a1d33f0719b14fe28554ab1'}

since the data contains file paths I've randomly used :: as a path separator. (I'm pretty sure it doesn't handle the full json grammar yet, but that should be mostly grunt work).

Perpetual answered 12/9, 2015 at 2:3 Comment(4)

i think its kinda funny that you didnt want to implement a new query language(at least partially because you were worried about bugs...) but then you use a much more complex algorithm than what i was suggesting(well somewhat more complex) ... that said if it solves your issue I am glad :) +1 for an answer that solves your problem – Weisman 12/9, 2015 at 2:43

Conceptually it's simple, it just transforms the problem into a form where the re module can be used to solve the search problem. Serializing the key-space and retrieving a tree-path are both relatively easy to write test for, so I'm relatively comfortable with the code quality issues. It is a very limited query language, not at all on the same level as ObjectPath or JMESPath, and as a solution I think it is unacceptable (why would anyone want to learn my very limited query language?) I'll set a bounty on it as soon as I'm able to see if there are any solutions using standard tools.. – Perpetual 12/9, 2015 at 16:32

a better question might be why would a developer want to learn a more complicated query language like JMES when about 4 lines of relativly straightforward python would suffice .... unless when you say its for developers you dont mean software developer .. – Weisman 12/9, 2015 at 17:8

I'm guessing because declarative is better than imperative. It's why we learn xpath and xquery, it's why document.querySelector() was added, not to mention sql. Describing what you want instead of how to get it turns out to be very useful. And 1 (short) line of a query language is more than a 80% savings in developer effort -- and that's just for a very simple case. For a more complex case, coming back to document.querySelectorAll("h2 ~ h2 + [href^=https] > img") after six months is much friendlier than the equivalent recursive tree-traversal code. – Perpetual 12/9, 2015 at 18:23

Missed the python requirement, but if you are willing to call external program, this will still work. Please note, that jq >= 1.5 is required for this to work.

# If single "key" $p[0] has multiple md5 keys, this will reduce the array to one key.
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] | add '

# this will not create single object, but you'll see all key, md5 combinations
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] '

https://stedolan.github.io/jq/

Tray answered 18/9, 2015 at 9:8 Comment(5)

This is the closest I've gotten (also wrong..) [[paths(..) | select(index("md5"))] as $p | {(($p[])[0]): getpath($p[])}] | add. – Perpetual 19/9, 2015 at 14:38

New code removed the leaf requirement from search key. test with: jq-1.5 'paths(has("_status")?) as $p | { ($p[0]): getpath($p)["_status"]} ' – Tray 21/9, 2015 at 12:56

Excellent! (and also the only answer that answered the question asked <wink>) – Perpetual 21/9, 2015 at 19:6

If your json is well structured, ie. assured you'll have _status and md5 sub-elements, you could just load the json up and use a list comprehension to spit out the items you're looking for.

>>> import json
>>> my_json = json.loads(json_string)
>>> print [(key, value['_status']['md5']) for key, value in my_json.iteritems()]
[(u'README.rst', u'952ee56fa6ce36c752117e79cc381df8'), (u'docs/conf.py', u'6e9c7d805a1d33f0719b14fe28554ab1')]

Sherillsherilyn answered 21/9, 2015 at 13:4 Comment(1)

This is the exact same solution as Joran Beasley's (starting with "it's even easier..") – Perpetual 21/9, 2015 at 19:33

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags