Flatten nested JSON using jq
Asked Answered
J

6

19

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.

I have 11 TB of json files which are both nested and contains dots in field names, meaning not elasticsearch (dots) nor solr (nested without the _childDocument_ notation) can digest it as is.

The other solutions would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr therefore I prefer the flatten solution (unless solr can digest those nested jsons as is??).

I will prefer elasticsearch only if the digestion process will take far less time than solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).

Kindly help.

EDIT:

I think the pair of examples 3&4 solves this for me: https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/

I'll try soon.

Jovita answered 31/5, 2016 at 8:34 Comment(0)
M
29

You can also use the following jq command to flatten nested JSON objects in this manner:

[paths(values) as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries

The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.

Metcalf answered 31/5, 2016 at 21:26 Comment(3)
This solution does not work when the JSON contains arrays. For example: {"a":{"b":[1]}} for which an error is raised: jq: error (at <stdin>:1): string (".") and number (0) cannot be addedAmidships
Great answer, though this filters out any values which evaluate to false, i.e. false, null, etc. This is because leaf_paths is a short-hand for paths(scalars), and while scalars does select those, paths only returns entries for which they aren't false. Long story short, substitute leaf_paths with paths(type != "object" and type != "array") to include everything.Espalier
To fix the error jq: error (at <stdin>:1): string (".") and number (0) cannot be added [leaf_paths as $path | {"key": [$path[] | tostring] | join("."), "value": getpath($path)}] | from_entriesWholesale
O
19

This is just a variant of Santiago's jq:

. as $in 
| reduce leaf_paths as $path ({};
     . + { ($path | map(tostring) | join(".")): $in | getpath($path) })

It avoids the overhead of the key/value construction and destruction.

(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)

Two important points about both these jq solutions:

  1. Arrays are also flattened. E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:

    {
      "a.b.0": 0,
      "a.b.1": 1,
      "a.b.2": 2
    }
    
  2. If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:

    {"a.b":0, "a": {"b": 1}}
    
Obtain answered 31/5, 2016 at 23:1 Comment(1)
@SteveAmerige - The answer has been updated so that it will work with jq 1.4 and later.Obtain
T
6

Here is a solution that uses tostream, select, join, reduce and setpath

  reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
     {}
     ; setpath($p; $v)
  )
Townsman answered 6/8, 2017 at 0:24 Comment(0)
D
3

I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.

$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}

$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}

jqg was tested using jq 1.6

Note: I am the author of the jqg script.

Dulci answered 13/5, 2021 at 16:22 Comment(0)
J
1

As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:

{
    "a.b":[1],
    "id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
    "_version_":1535841499921514496
}

EDIT 1: solr 6.0.1 with bin/solr -e cloud. collection name is flat, all the rest are default (with data-driven-schema which is also default).

EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.

EDIT 3: Is is also possible to parallel with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-" where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P it created duplicates in the collection)

Jovita answered 31/5, 2016 at 11:5 Comment(0)
N
0

As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars), it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:

. as $in
     | reduce paths(type != "object" and type != "array") as $path ({};
          . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
Narcoma answered 17/2, 2023 at 10:12 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.