json query that returns parent element and child data?
Asked Answered
P

6

11

Given the following json:

{
    "README.rst": {
        "_status": {
            "md5": "952ee56fa6ce36c752117e79cc381df8"
        }
    },
    "docs/conf.py": {
        "_status": {
            "md5": "6e9c7d805a1d33f0719b14fe28554ab1"
        }
    }
}

is there a query language that can produce:

{
    "README.rst": "952ee56fa6ce36c752117e79cc381df8",
    "docs/conf.py": "6e9c7d805a1d33f0719b14fe28554ab1",
}

My best attempt so far with JMESPath (http://jmespath.org/) isn't very close:

>>> jmespath.search('*.*.md5[]', db)
['952ee56fa6ce36c752117e79cc381df8', '6e9c7d805a1d33f0719b14fe28554ab1']

I've gotten to the same point with ObjectPath (http://objectpath.org):

>>> t = Tree(db)
>>> list(t.execute('$..md5'))
['952ee56fa6ce36c752117e79cc381df8', '6e9c7d805a1d33f0719b14fe28554ab1']

I couldn't make any sense of JSONiq (do I really need to read a 105 page manual to do this?) This is my first time looking at json query languages..

Perpetual answered 12/9, 2015 at 0:11 Comment(2)
so you might be searching a key other than md5? and it can be arbitrarily deep?Weisman
Does this happen to be in the context of a Django application? This can be done via rest_framework serializers but that is overkill outside of that context.Colorful
T
2

Missed the python requirement, but if you are willing to call external program, this will still work. Please note, that jq >= 1.5 is required for this to work.

# If single "key" $p[0] has multiple md5 keys, this will reduce the array to one key.
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] | add '

# this will not create single object, but you'll see all key, md5 combinations
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] '

Get paths with "md5"-key '?'=ignore errors (like testing scalar for key). From resulting paths ($p) filter and surround result with '{}' = object. And then those are in an array ([] surrounding the whole expression) which is then "added/merged" together |add

https://stedolan.github.io/jq/

Tray answered 18/9, 2015 at 9:8 Comment(5)
This looks interesting, but I don't think your query is correct... this seems like it will find all leaf keys (not just md5) and if there are two leaf keys it will pick the value of the last one...Perpetual
This is the closest I've gotten (also wrong..) [[paths(..) | select(index("md5"))] as $p | {(($p[])[0]): getpath($p[])}] | add.Perpetual
@Perpetual You are right. First code had a "brain freeze and the select did nothing. I've now edited and tested with more complex test file. I'm not really a 'jq' expert, but it's been a really usefull tool.Tray
New code removed the leaf requirement from search key. test with: jq-1.5 'paths(has("_status")?) as $p | { ($p[0]): getpath($p)["_status"]} 'Tray
Excellent! (and also the only answer that answered the question asked <wink>)Perpetual
W
6

not sure why you want a query language this is pretty easy

def find_key(data,key="md5"):
    for k,v in data.items():
       if k== key: return v
       if isinstance(v,dict):
          result = find_key(v,key)
          if result:return result

dict((k,find_key(v,"md5")) for k,v in json_result.items()) 

it's even easier if the value dict always has "_status" and "md5" as keys

dict((k,v["_status"]["md5"]) for k,v in json_result.items()) 

alternatively I think you could do something like

t = Tree(db)
>>> dict(zip(t.execute("$."),t.execute('$..md5'))

although I dont know that it would match them up quite right ...

Weisman answered 12/9, 2015 at 0:17 Comment(10)
I want a query language so I can give end-users ad-hoc search capabilities in a very large .json file.Perpetual
I think you might want to look up ad-hoc ;-) Each file path has a number of "categories" as direct children (only _status in this example), and each "categry" has data (potentially deeply nested). I would like to create a command line tool that gives end-users (developers) a way to query this structure (think structured grep).Perpetual
ad hoc ˌad ˈhäk/ adjective & adverb formed, arranged, or done for a particular purpose only.Weisman
ahhh I think i now understand the pairing to md5 was only for demonstration and you want them to be able to arbirarilly decide what key to search for ? I still think the best solution is for you to write a python script ... (which is definitely an ad-hoc solution) ...Weisman
I was using it in the "not pre-planned" meaning (dictionary.cambridge.org/dictionary/english/ad-hoc)Perpetual
Any script I write would be similar to defining a query language, so instead of re-inventing the wheel I figured I'd use something existing and well-tested. For xml one would probably use xquery, even if it is perfectly possible to hand-parse xml too.Perpetual
I see i was using it in the general CS term that means a solution designed for a particular purpose (ie pairing level0 json keys with deeper nested keys) but tbh if they are developers and dont know how to search through a json dict on their own they are not much of a developerWeisman
Well.. take a look at the output for radon cc -s -j <project>, it's rather volumnious. Being able to pipe it through a declarative query makes things easier.Perpetual
Its a pretty simple algorithm above that should work ... if you take the arguments from argv ... I think that this is so domain specific you will not find a prebuild solution ... if it was a hard algorithm i can see your concern ... but its not ... your biggest concern is if it nests dictionaries more than 1000 layers deep you may encounter a recursion overflow errorWeisman
I;ll upvote the question in the hopes that I am wrong and there is totally a way to do this with some other third party library ...Weisman
S
4

Here is the JSONiq code that does the job:

{|
    for $key in keys($document)
    return {
        $key: $document.$key._status.md5
    }
|}

You can execute it here with the Zorba engine.

If the 105-page manual you mention is the specification, I do not recommend reading it as a JSONiq user. I would rather advise reading tutorials or books online, which give a more gentle introduction.

Salpinx answered 12/10, 2015 at 13:9 Comment(2)
Thanks! Can you suggest any tutorials that goes a bit in-depth?Perpetual
Hi thebjorn, Sure! I revived an old tutorial here and updated it to the latest syntax. Also, there are a couple of online talks on youtube by Jonathan Robie and Chris Hillery.Salpinx
A
3

Do in ObjectPath:

l = op.execute("[keys($.*), $..md5]")

you'll get:

[
  [
    "README.rst",
    "docs/conf.py"
  ],
  [
    "952ee56fa6ce36c752117e79cc381df8",
    "6e9c7d805a1d33f0719b14fe28554ab1"
  ]
]

then in Python:

dict(zip(l[0],l[1]))

to get:

{
    'README.rst': '952ee56fa6ce36c752117e79cc381df8', 
    'docs/conf.py': '6e9c7d805a1d33f0719b14fe28554ab1'
}

Hope that helps. :)

PS. I'm using OPs' keys() to show how to make full query that works anywhere in the document not only when keys are in the root of document.

PS2. I might add new function so that it would look like: object([keys($.*), $..md5]). Shoot me tweet http://twitter.com/adriankal if you want that.

Auburta answered 20/11, 2015 at 0:45 Comment(0)
P
2

A solution that implements a new query language:

def keylist(db):
    "Return all the keys in db."

    def _keylist(db, prefix, res):
        if prefix is None:
            prefix = []

        for key, val in db.items():
            if isinstance(val, dict):
                _keylist(val, prefix + [key], res)
            else:
                res.append(prefix + [key])

    res = []
    _keylist(db, [], res)
    return ['::'.join(key) for key in res]

def get_key(db, key):
    "Get path and value from key."

    def _get_key(db, key, path):
        k = key[0]
        if len(key) == 1:
            return path + [k, db[k]]
        return _get_key(db[k], key[1:], path + [k])

    return _get_key(db, key, [])

def search(query, db):
    "Convert query to regex and use it to search key space."
    keys = keylist(db)
    query = query.replace('*', r'(?:.*?)')
    matching = [key for key in keys if re.match(query, key)]
    res = [get_key(db, key.split('::')) for key in matching]
    return dict(('::'.join(r[:-1]), r[-1]) for r in res)

which gives me something that's pretty close to the requirements:

>>> pprint.pprint(search("*::md5", db))
{'README.rst::_status::md5': '952ee56fa6ce36c752117e79cc381df8',
 'docs/conf.py::_status::md5': '6e9c7d805a1d33f0719b14fe28554ab1'}

and a query language that looks like a glob/re hybrid (if we're making a new language, at least make it look familiar):

>>> pprint.pprint(search("docs*::md5", db))
{'docs/conf.py::_status::md5': '6e9c7d805a1d33f0719b14fe28554ab1'}

since the data contains file paths I've randomly used :: as a path separator. (I'm pretty sure it doesn't handle the full json grammar yet, but that should be mostly grunt work).

Perpetual answered 12/9, 2015 at 2:3 Comment(4)
i think its kinda funny that you didnt want to implement a new query language(at least partially because you were worried about bugs...) but then you use a much more complex algorithm than what i was suggesting(well somewhat more complex) ... that said if it solves your issue I am glad :) +1 for an answer that solves your problemWeisman
Conceptually it's simple, it just transforms the problem into a form where the re module can be used to solve the search problem. Serializing the key-space and retrieving a tree-path are both relatively easy to write test for, so I'm relatively comfortable with the code quality issues. It is a very limited query language, not at all on the same level as ObjectPath or JMESPath, and as a solution I think it is unacceptable (why would anyone want to learn my very limited query language?) I'll set a bounty on it as soon as I'm able to see if there are any solutions using standard tools..Perpetual
a better question might be why would a developer want to learn a more complicated query language like JMES when about 4 lines of relativly straightforward python would suffice .... unless when you say its for developers you dont mean software developer ..Weisman
I'm guessing because declarative is better than imperative. It's why we learn xpath and xquery, it's why document.querySelector() was added, not to mention sql. Describing what you want instead of how to get it turns out to be very useful. And 1 (short) line of a query language is more than a 80% savings in developer effort -- and that's just for a very simple case. For a more complex case, coming back to document.querySelectorAll("h2 ~ h2 + [href^=https] > img") after six months is much friendlier than the equivalent recursive tree-traversal code.Perpetual
T
2

Missed the python requirement, but if you are willing to call external program, this will still work. Please note, that jq >= 1.5 is required for this to work.

# If single "key" $p[0] has multiple md5 keys, this will reduce the array to one key.
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] | add '

# this will not create single object, but you'll see all key, md5 combinations
cat /tmp/test.json | \
jq-1.5 '[paths(has("md5")?) as $p | { ($p[0]): getpath($p)["md5"]}] '

Get paths with "md5"-key '?'=ignore errors (like testing scalar for key). From resulting paths ($p) filter and surround result with '{}' = object. And then those are in an array ([] surrounding the whole expression) which is then "added/merged" together |add

https://stedolan.github.io/jq/

Tray answered 18/9, 2015 at 9:8 Comment(5)
This looks interesting, but I don't think your query is correct... this seems like it will find all leaf keys (not just md5) and if there are two leaf keys it will pick the value of the last one...Perpetual
This is the closest I've gotten (also wrong..) [[paths(..) | select(index("md5"))] as $p | {(($p[])[0]): getpath($p[])}] | add.Perpetual
@Perpetual You are right. First code had a "brain freeze and the select did nothing. I've now edited and tested with more complex test file. I'm not really a 'jq' expert, but it's been a really usefull tool.Tray
New code removed the leaf requirement from search key. test with: jq-1.5 'paths(has("_status")?) as $p | { ($p[0]): getpath($p)["_status"]} 'Tray
Excellent! (and also the only answer that answered the question asked <wink>)Perpetual
S
1

If your json is well structured, ie. assured you'll have _status and md5 sub-elements, you could just load the json up and use a list comprehension to spit out the items you're looking for.

>>> import json
>>> my_json = json.loads(json_string)
>>> print [(key, value['_status']['md5']) for key, value in my_json.iteritems()]
[(u'README.rst', u'952ee56fa6ce36c752117e79cc381df8'), (u'docs/conf.py', u'6e9c7d805a1d33f0719b14fe28554ab1')]
Sherillsherilyn answered 21/9, 2015 at 13:4 Comment(1)
This is the exact same solution as Joran Beasley's (starting with "it's even easier..")Perpetual

© 2022 - 2024 — McMap. All rights reserved.