Getting the object array index in jq
Asked Answered
H

4

28

I have a json object that looks like this (prodused by i3-msg -t get_workspaces.

[
  {
    "name": "1",
    "urgent": false
  },
  {
    "name": "2",
    "urgent": false
  },
  {
    "name": "something",
    "urgent": false
  }
]

I am trying to use jq to figure out which index number in the list is based on a select query. jq have something called index(), but it seams to support only strings?

Using something like i3-msg -t get_workspaces | jq '.[] | select(.name=="something")' gives me the object I want. But I want it's index. In this case 2 (starting counting at 0)

Is this possible using jq alone?

Hoist answered 31/1, 2017 at 13:9 Comment(0)
X
28

So I provided a strategy for a solution to the OP, which OP quickly accepted. Subsequently @peak and @Jeff Mercado offered better and more complete solutions. So I have turned this into a community wiki. Please improve this answer if you can.

A straightforward solution (pointed out by @peak) is to use the builtin function, index:

map(.name == "something") | index(true)

The jq documentation confusingly suggests that index operates on strings, but it operates on arrays as well. Thus index(true) returns the index of the first true in the array of booleans produced by the map. If there is no item satisfying the condition, the result is null.

jq expresions are evaluated in a "lazy" manner, but map will traverse the entire input array. We can verify this by rewriting the above code and introducing some debug statements:

[ .[] | debug | .name == "something" ] | index(true)

As suggested by @peak, the key to doing better is to use the break statement introduced in jq 1.5:

label $out | 
foreach .[] as $item (
  -1; 
  .+1; 
  if $item.name == "something" then 
    ., 
    break $out 
  else 
    empty
  end
) // null

Note that the // is no comment; it is the alternative operator. If the name is not found the foreach will return empty which will be converted to null by the alternative operator.

Another approach is to recursively process the array:

def get_index(name): 
  name as $name | 
  if (. == []) then
    null
  elif (.[0].name == $name) then 
    0 
  else 
    (.[1:] | get_index($name)) as $result |
    if ($result == null) then null else $result+1 end      
end;
get_index("something")

However this recursive implementation will use stack space proportional to the length of the array in the worst case as pointed out by @Jeff Mercado. In version 1.5 jq introduced Tail Call Optimization (TCO) which will allow us to optimize this away using a local helper function (note that this is minor adaptation to a solution provided by @Jeff Mercado so as to be consistent with the above examples):

def get_index(name): 
  name as $name | 
  def _get_index:
    if (.i >= .len) then
      null
    elif (.array[.i].name == $name) then
      .i
    else
      .i += 1 | _get_index
    end;
  { array: ., i: 0, len: length } | _get_index;
get_index("something")

According to @peak obtaining the length of an array in jq is a constant time operation, and apparently indexing an array is inexpensive as well. I will try to find a citation for this.

Now let's try to actually measure. Here is an example of measuring the simple solution:

#!/bin/bash

jq -n ' 

  def get_index(name): 
    name as $name |
    map(.name == $name) | index(true)
  ;

  def gen_input(n):  
    n as $n |
    if ($n == 0) then 
      []
    else
      gen_input($n-1) + [ { "name": $n, "urgent":false } ]
    end
  ;  

  2000 as $n |
  gen_input($n) as $i |
  [(0 | while (.<$n; [ ($i | get_index(.)), .+1 ][1]))][$n-1]
'

When I run this on my machine, I get the following:

$ time ./simple
1999

real    0m10.024s
user    0m10.023s
sys     0m0.008s

If I replace this with the "fast" version of get_index:

def get_index(name): 
  name as $name |
  label $out | 
  foreach .[] as $item (
    -1; 
    .+1; 
  if $item.name == $name then 
    ., 
    break $out 
  else 
    empty
  end
) // null;

Then I get:

$ time ./fast
1999

real    0m13.165s
user    0m13.173s
sys     0m0.000s

And if I replace it with the "fast" recursive version:

def get_index(name): 
  name as $name | 
  def _get_index:
    if (.i >= .len) then
      null
    elif (.array[.i].name == $name) then
      .i
    else
      .i += 1 | _get_index
    end;
  { array: ., i: 0, len: length } | _get_index;

I get:

$ time ./fast-recursive 
1999

real    0m52.628s
user    0m52.657s
sys     0m0.005s

Ouch! But we can do better. @peak mentioned an undocumented switch --debug-dump-disasm which lets you see how jq is compiling your code. With this you can see that modifying and passing the object to _indexof and then extracting the array, length, and index is expensive. Refactoring to just pass the index is a huge improvement, and a further refinement to avoid testing the index against the length makes it competitive with the iterative version:

def indexof($name):
  (.+[{name: $name}]) as $a | # add a "sentinel"
  length as $l | # note length sees original array
  def _indexof:
    if ($a[.].name == $name) then
      if (. != $l) then . else null end
    else
      .+1 | _indexof
    end
  ;


  0 | _indexof
;

I get:

$ time ./fast-recursive2
null

real    0m13.238s
user    0m13.243s
sys     0m0.005s

So it appears that if each element is equally likely, and you want an average case performance, you should stick with the simple implementation. (C-coded functions tend to be fast!)

Xylophagous answered 31/1, 2017 at 13:10 Comment(6)
Glad to see that I didn't missread the docs :) This works perfectly! Thanks!Hoist
@Jim D. -- Jeff Mercado's recursive implementation of indexof is very well-considered whereas your adaptation of it is not. The most serious problem, perhaps, is efficiency -- you might like to run some benchmarks. Your implementation suggests you have made some incorrect assumptions about how arrays are implemented in jq. At the least, it would be appropriate to acknowledge Jeff's efforts. Meanwhile, thanks for acknowledging mine.Levis
@Jim-D - Jeff is an experienced and highly competent jq programmer, so I'm afraid every deviation of your program from his has some kind of downside. The main source of inefficiency is undoubtedly the use of .[1:] as though jq were LISP. Incidentally, arrays are stored with their length, so if the input is an array, the cost of a call to length is trivial, no matter its size.Levis
@Jim-D - The jq wiki (github.com/stedolan/jq/wiki) provides a lot of important information, and jq itself has a disassembler option (jq --debug-dump-disasm ....).Levis
@peak, thanks; I was aware of the wiki, but not the disassembler function.Proselytize
The "simple" answer has a bug in that it doesn't work when there are more than 2000 elements. I was performance testing on 1000000. Also I cannot reproduce your test results. In my tests this "simple" version is much slower than all the others.Planography
L
12

The solution originally proposed by @Jim-D using foreach would only work as intended for arrays of JSON objects, and both the originally proposed solutions are very inefficient. Their behavior in the absence of an item satisfying the condition might also have been surprising.

Solution using index/1

If you just want a quick-and-easy solution, you can use the builtin function, index, as follows:

map(.name == "something") | index(true)

If there is no item satisfying the condition, then the result will be null.

Incidentally, if you wanted ALL indices for which the condition is true, then the above is easily transformed into a super-fast solution by simply changing index to indices:

map(.name == "something") | indices(true)

Efficient solution

Here is a generic and efficient function that returns the index (i.e. offset) of the first occurrence of the item in the input array for which (item|f) is truthy (neither null nor false), and null otherwise. (In jq, javascript, and many others, the index into arrays is always 0-based.)

# 0-based index of item in input array such that f is truthy, else null
def which(f):
  label $out
  | foreach .[] as $x (-1; .+1; if ($x|f) then ., break $out else empty end)
  // null ;

Example usage:

which(.name == "something")
Levis answered 31/1, 2017 at 17:13 Comment(0)
C
5

Converting an array to entries will give you access to both the index and value in the array of the items. You could use that to then find the value you're looking for and get its index.

def indexof(predicate):
    reduce to_entries[] as $i (null;
        if (. == null) and ($i.value | predicate) then
            $i.key
        else
            .
        end
    );
indexof(.name == "something")

This however does not short circuit and will go through the entire array to find the index. You'll want to return as soon as the first index has been found. Taking a more functional approach might be more appropriate.

def indexof(predicate):
    def _indexof:
        if .i >= .len then
            null
        elif (.arr[.i] | predicate) then
            .i
        else
            .i += 1 | _indexof
        end;
    { arr: ., i: 0, len: length } | _indexof;
indexof(.name == "something")

Note that the arguments are passed in to the inner function in this way to take advantage of some optimizations. Namely to take advantage of TCO, the function must not accept any additional parameters.

A still faster version can be obtained by recognizing that the array and its length do not vary:

def indexof(predicate):
  . as $in
  | length as $len
  |  def _indexof:
       if . >= $len then null
       elif ($in[.] | predicate) then .
       else . + 1 | _indexof
       end;
  0 | _indexof;
Cleaves answered 31/1, 2017 at 17:17 Comment(2)
Yes, TCO only applies to 0-arity functions. For readability, you might consider using a JSON object as input to _indexof. The efficiency should be about the same.Levis
If it will work equally fine using objects for the params, then great, it looks much more appealing with those changes.Cleaves
P
2

Here is another version which seems to be slightly faster than the optimized versions from @peak and @jeff-mercado:

label $out | . as $elements | range(length) |
select($elements[.].name == "something") | . , break $out

IMO it is easier to read although it still relies on the break (to get the first match only).

I was doing 100 iterations on a ~1,000,000 element array (with the last element being the one to match). I only counted the user and kernel times, not the wall clock time. On average this solution took 3.4s, @peak's solution took 3.5s, and @jeff-mercado's took 3.6s. This matched what I was seeing in one off runs although to be fair I did have a run where this solution to 3.6s on average so there is unlikely to be any statistical significant difference between each solution.

Planography answered 28/1, 2021 at 10:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.