How to split an array into chunks with jq?
E

5

24

I have a very large JSON file containing an array. Is it possible to use jq to split this array into several smaller arrays of a fixed size? Suppose my input were [1,2,3,4,5,6,7,8,9,10] and I wanted to split it into chunks of three elements. The desired output from jq would be:

[1,2,3]
[4,5,6]
[7,8,9]
[10]

In reality, my input array has nearly three million elements, all UUIDs.

Evacuation answered 19/7, 2018 at 0:33 Comment(1)
jq 'group_by(. % 3)' <<< '[1,2,3,4,5,6,7,8,9,10]' splits into three groups. Now if I could just get the length of the full input array…Preglacial
D
4

The following stream-oriented definition of window/3, due to Cédric Connes (github:connesc), generalizes _nwise. It illustrates a "boxing technique": each stream value is wrapped in a singleton array, so a bare null can mark the end of the stream without being confused with any value in the stream. It can therefore be used even if the stream contains the non-JSON value nan. A definition of _nwise/1 in terms of window/3 is also included.

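The first argument of window/3 is interpreted as a stream. $size is the window size and $step is the offset between the starts of consecutive windows, so windows overlap when $step is smaller than $size and values are skipped when it is larger. For example,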

window(1,2,3; 2; 1)

yields:

[1,2]
[2,3]

window/3 and _nwise/1

def window(values; $size; $step):
  def checkparam(name; value): if (value | isnormal) and value > 0 and (value | floor) == value then . else error("window \(name) must be a positive integer") end;
  checkparam("size"; $size)
| checkparam("step"; $step)
  # We need to detect the end of the loop in order to produce the terminal partial group (if any).
  # For that purpose, we introduce an artificial null sentinel, and wrap the input values into singleton arrays in order to distinguish them.
| foreach ((values | [.]), null) as $item (
    {index: -1, items: [], ready: false};
    (.index + 1) as $index
    # Extract items that must be reused from the previous iteration
    | if (.ready | not) then .items
      elif $step >= $size or $item == null then []
      else .items[-($size - $step):]
      end
    # Append the current item unless it must be skipped
    | if ($index % $step) < $size then . + $item
      else .
      end
    | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
    if .ready then .items else empty end
  );

def _nwise($n): window(.[]; $n; $n);

Source:

https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58
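
As a quick check of how $size and $step interact, here is a minimal sketch that runs window/3 on two short streams. It assumes jq 1.5 or later is on the PATH; the shell variable names (window_def, overlapping, skipping) are made up for this illustration and are not part of the gist.

```shell
# window/3 copied from the definition above (comments omitted).
window_def='
def window(values; $size; $step):
  def checkparam(name; value): if (value | isnormal) and value > 0 and (value | floor) == value then . else error("window \(name) must be a positive integer") end;
  checkparam("size"; $size)
| checkparam("step"; $step)
| foreach ((values | [.]), null) as $item (
    {index: -1, items: [], ready: false};
    (.index + 1) as $index
    | if (.ready | not) then .items
      elif $step >= $size or $item == null then []
      else .items[-($size - $step):]
      end
    | if ($index % $step) < $size then . + $item
      else .
      end
    | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
    if .ready then .items else empty end
  );
'

# $step < $size: consecutive windows overlap.
overlapping=$(jq -nc "$window_def"' window(range(1; 6); 3; 1)')
printf '%s\n' "$overlapping"
# [1,2,3]
# [2,3,4]
# [3,4,5]

# $step > $size: every third value is skipped, and the partial tail is still emitted.
skipping=$(jq -nc "$window_def"' window(range(1; 8); 2; 3)')
printf '%s\n' "$skipping"
# [1,2]
# [4,5]
# [7]
```

With $step equal to $size the windows neither overlap nor skip, which is the fixed-size chunking asked about in the question.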

Diamante answered 19/7, 2018 at 16:8 Comment(2)
Readers: you can use this with streaming by replacing the last line with def streamsplit($n): window(inputs | select(length == 2) | .[1]; $n; $n) and passing --stream. It'll run in 2844k max memory.Evacuation
@EchoNolan: Note that window/3 is a bit slower than the stream-oriented nwise/2 given elsewhere on this page. The same technique using inputs can be used, but in either case, remember to use the -n command-line option.Diamante
D
27

There is an (undocumented) builtin, _nwise, that meets the functional requirements:

$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'

[1,2,3]
[4,5,6]
[7,8,9]
[10]

Also:

$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)' 
[1,2,3]
[4,5,6]
[7,8,9]
[10]

Incidentally, _nwise can be used for both arrays and strings.

(I believe it's undocumented because there was some doubt about an appropriate name.)

TCO-version

Unfortunately, the builtin version is carelessly defined, and will not perform well for large arrays. Here is an optimized version (it should be about as efficient as a non-recursive version):

def nwise($n):
 def _nwise:
   if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
 _nwise;

For an array of size 3 million, this is quite performant: 3.91s on an old Mac, 162746368 max resident size.

Notice that this version (using tail-call optimized recursion) is actually faster than the version of nwise/2 using foreach shown elsewhere on this page.
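
A minimal usage sketch of the nwise/1 definition above, applied to both an array and a string, assuming jq 1.5 or later is on the PATH. The shell variable names (nwise_def, array_chunks, string_chunks) are only for illustration.

```shell
# nwise/1 as defined above.
nwise_def='
def nwise($n):
  def _nwise:
    if length <= $n then . else .[0:$n], (.[$n:] | _nwise) end;
  _nwise;
'

# Chunk an array of ten numbers into groups of three.
array_chunks=$(jq -nc "$nwise_def"' [range(1; 11)] | nwise(3)')
printf '%s\n' "$array_chunks"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]

# The same definition also slices strings, since length and .[i:j] work on them.
string_chunks=$(jq -nc "$nwise_def"' "abcdefgh" | nwise(3)')
printf '%s\n' "$string_chunks"
# "abc"
# "def"
# "gh"
```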

Diamante answered 19/7, 2018 at 2:49 Comment(0)
E
4

here's a simple one that worked for me:

def chunk(n):
    range(length/n|ceil) as $i | .[n*$i:n*$i+n];

example usage:

jq -n \
'def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
[range(5)] | chunk(2)'
[
  0,
  1
]
[
  2,
  3
]
[
  4
]

bonus: it doesn't use recursion and doesn't rely on _nwise, so it also works with jaq.

Exactly answered 12/10, 2022 at 22:32 Comment(0)
D
2

If the array is too large to fit comfortably in memory, then I'd adopt the strategy suggested by @CharlesDuffy -- that is, stream the array elements into a second invocation of jq using a stream-oriented version of nwise, such as:

def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    # The type check keeps non-numeric items (e.g. UUID strings) away from isnan.
    if (.[-1] | type == "number" and isnan) and length>1 then .[:-1]
    elif length == $n then .
    else empty
    end);

The "driver" for the above would be:

nwise(inputs; 3)

But please remember to use the -n command-line option.

To create the stream from an arbitrary array:

$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json 

So the shell pipeline might look like this:

$ jq -cn --stream '
    fromstream( inputs | (.[0] |= .[1:])
                | select(. != [[]]) )' huge.json |
  jq -n -f nwise.jq

This approach is quite performant. For grouping a stream of 3 million items into groups of 3 using nwise/2,

/usr/bin/time -lp

for the second invocation of jq gives:

user         5.63
sys          0.04
   1261568  maximum resident set size

Caveat: this definition uses nan as an end-of-stream marker. Since nan is not a JSON value, this cannot be a problem for handling JSON streams.
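
For a small end-to-end check, the two-stage pipeline can be run on a short array of strings standing in for the UUIDs. This is a sketch under a few assumptions: jq 1.5 or later is on the PATH, the input is piped in rather than read from huge.json, and the definition is passed inline instead of through nwise.jq. The type check before isnan is an addition in this sketch so that string items are never handed to isnan; the variable names are illustrative.

```shell
# nwise/2 as in the answer, plus a type check before isnan (added for this sketch).
nwise_stream_def='
def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    if (.[-1] | type == "number" and isnan) and length > 1 then .[:-1]
    elif length == $n then .
    else empty
    end);
'

chunks=$(
  printf '%s' '["a","b","c","d","e","f","g"]' |
  # Stage 1: turn the top-level array into a stream of its elements.
  jq -cn --stream 'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' |
  # Stage 2: regroup the element stream into arrays of three.
  jq -nc "$nwise_stream_def"' nwise(inputs; 3)'
)
printf '%s\n' "$chunks"
# ["a","b","c"]
# ["d","e","f"]
# ["g"]
```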

Diamante answered 19/7, 2018 at 3:19 Comment(5)
Ahh -- if you think it adds something useful, I'll pull it back in.Sanbo
Since you are upfront about the hackery, I'd suggest keeping it -- it shows how to use while, try, and input :-)Diamante
It looks like we can get a trailing empty list emitted when . == [nan] (if the total number of items divides evenly into $n), when I'm testing this experimentally. Perhaps the line checking for end-of-stream should be if .[-1] | isnan then if (. | length) > 1 then .[:-1] else empty end? See gist.github.com/charles-dyfis-net/… for test procedure.Sanbo
@CharlesDuffy - Thanks for identifying the problem, which I've fixed in a slightly different way in case it's no-slower.Diamante
To break the output into separate files, change the second jq invocation to e.g. jq -cn -f nwise.jq | awk '{print > "doc00" NR ".json"}' per https://mcmap.net/q/582836/-split-json-array-into-separate-files-objectsJaclynjaco
S
1

The below is hackery, to be sure -- but memory-efficient hackery, even with an arbitrarily long list:

jq -c --stream 'select(length==2)|.[1]' <huge.json \
| jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'

The first piece of the pipeline streams in your input JSON file, emitting one line per element, assuming the array consists of atomic values (where [] and {} are here included as atomic values). Because it runs in streaming mode it doesn't need to store the entire content in memory, despite being a single document.

The second piece of the pipeline repeatedly reads up to three items and assembles them into a list.

This should avoid needing more than three pieces of data in memory at a time.
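
To see what each stage produces, the pipeline can be run on the ten-element array from the question. A minimal sketch, assuming jq is on the PATH; the input is supplied with printf instead of huge.json, and the variable names elements and chunks are illustrative only.

```shell
# Stage 1: --stream emits [path, value] events; keeping events of length 2
# and taking .[1] yields one array element per line.
elements=$(printf '%s' '[1,2,3,4,5,6,7,8,9,10]' |
  jq -c --stream 'select(length==2)|.[1]')
printf '%s\n' "$elements" | head -n 3
# 1
# 2
# 3

# Stage 2: each iteration takes one item from inputs and tries to read two more;
# at end of input, `try input` produces nothing, so the last group may be shorter.
chunks=$(printf '%s\n' "$elements" |
  jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])')
printf '%s\n' "$chunks"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```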

Sanbo answered 19/7, 2018 at 1:11 Comment(3)
Using while here results in an error. It would be much better to write: jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'Diamante
I was ignoring the error at the end of processing, but that's a definite improvement. Thank you for the refinement.Sanbo
See elsewhere on this page for the invocation of jq to use for streaming arrays in general.Diamante
