Using jq on a large json file (50GB)

I want to use jq on a 50GB file. Needless to say, the machine's memory can't handle it: jq runs out of memory.

I tried several options, including --stream, but it didn't help. Can someone tell me what I'm doing wrong, and how to fix it?

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' file.json |   jq -cr .data[] >> out.json

The file contains data like this:

{"data":[{"id":"id1","value":"value1"},{"id":"id2","value":"value2"},{"id":"id3","value":"value3"}...]}

I want to read each value of the array in the data field and write it line by line to another file, like this:

{"id":"id1","value":"value1"}
{"id":"id2","value":"value2"}
{"id":"id3","value":"value3"}

Right now the command runs out of memory and gets killed.

Sunglasses asked 29/6, 2021 at 14:10 Comment(7)
As soon as you call fromstream you're asking jq to read the stream and make a data structure in memory from it. Using --stream isn't useful unless you somehow filter the stream down to something more manageable before calling fromstream (if you ever do so at all); see the sketch after these comments for what that event stream looks like. As for advice on how to do that... it would help if you described the actual problem you're trying to solve in more detail.Calli
How to fix what? What are your error messages / misbehavior / example of content / example of needed output?Matriarchy
Is your goal to extract the data key from the top-level object in the original? Is the value found there small enough to fit in RAM? Are you trying to do something else?Calli
How do you mean the machine "can't handle it"? Do you get an error? Is it slow?Workable
"I want to use jq on a 50GB file" -- isn't that the wrong tool for this job? why does it have to be jq? I generally think that any problem that is stated as "want to use X to solve problem Y" is putting the cart before the horse. It's very rare that the tool to use is already dictated.Bandage
@ChristianFritz what other tool can get the job done? And it's running out of memory.Sunglasses
I would look at the "big data" tools. I haven't kept up with their development, but maybe look at hadoop or apache spark. Given that the data is an array you can probably also just use mongodb. Look at the mongoimport command for loading the file into mongo. Once it's in, everything else will be trivial.Bandage
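
To make the first comment above concrete: with --stream, jq emits a flat sequence of [path, value] events, plus bare [path] events when an array or object closes, instead of building the whole document in memory, and those events can be filtered before anything is reassembled. Roughly, for the sample input from the question (the exact closing events may vary slightly between jq versions):

jq -cn --stream 'inputs' file.json | head

[["data",0,"id"],"id1"]
[["data",0,"value"],"value1"]
[["data",0,"value"]]
[["data",1,"id"],"id2"]
[["data",1,"value"],"value2"]
[["data",1,"value"]]
...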

For your example, the following would suffice:

jq -cn --stream 'fromstream( inputs|(.[0] |= .[2:]) | select(. != [[]]) )'

If you only want the .data array to be itemized (ignoring any other top-level keys), replace inputs in the above with:

inputs|select(first|first=="data")

For the record, you could also use gojq (the Go implementation of jq) in exactly the same way.
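
Putting the pieces together with the file names used in the question (file.json and out.json come from the question, not from this answer), the full invocation would look roughly like this:

jq -cn --stream '
  fromstream( inputs
              | (.[0] |= .[2:])
              | select(. != [[]]) )
' file.json > out.json

Here .[0] |= .[2:] strips the leading "data" key and the array index from each event's path, and select(. != [[]]) drops the closing events whose paths become empty after that truncation, so fromstream reassembles and emits one array element at a time (e.g. {"id":"id1","value":"value1"}) without ever holding the whole array in memory.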

Lambdoid answered 13/1, 2023 at 15:29 Comment(0)

what other tool can get the job done?

jm, a command-line wrapper I wrote for “JSON Machine”, is very easy to use and often more economical than jq’s streaming parser. In the present case, to itemize .data, you would write:

jm --pointer '/data'

Or similarly, using the Python-based script in the same repository:

jm.py -i data.item file.json

Assuming there is just one top-level key, another alternative in this particular case would be:

jstream -d 2 <  file.json
Lambdoid answered 13/1, 2023 at 15:59 Comment(0)

Assuming that your large file contains many JSON objects, you can process them one by one and extract the field .data[].

This way, the memory consumption is limited by the size of the largest JSON object in the input, not by the sum of the sizes of all of them.

Or is your problem that a single JSON object is so large that the memory is insufficient?

echo '
{ "key":"A1", "property":"A2", "data":[1,2,3] }{"key":"B1","property":"B2","data":[4,5,6]}
{ 
   "key":"C1", 
   "property":"C2", 
   "data":[7,8,9] 
}' | jq -cr '.data[]'

result

1
2
3
4
5
6
7
8
9
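
For the output format the question actually asks for (whole array elements rather than their scalar contents), the per-object equivalent would be the following, with the caveat that it only works if each top-level object individually fits in memory:

jq -c '.data[]' file.json > out.json

Since the question's file appears to be one single 50GB object, that caveat is exactly the problem, and the --stream-based approaches above are the way to go in that case.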
Job answered 29/6, 2021 at 21:54 Comment(0)
