Using python ijson to read a large json file with multiple json objects

I'm trying to parse a large (~100 MB) JSON file using the ijson package, which lets me process the file incrementally instead of loading it all at once. However, after writing some code like this,

with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == "name":
            print(value)

I found that the code parses only the first line and ignores the rest of the file.

Here is what a portion of my JSON file looks like:

{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}

I suspect ijson parses only one JSON object.

Can someone please suggest how to work around this?

Sandie answered 13/5, 2016 at 2:26 Comment(3)
Possible duplicate of #10716128 - Abyssinia
Well, the chunk you provided looks like a set of JSON documents, one per line. You should read the lines one by one and parse each of them separately. - Heresiarch
BTW, since each line is short, you don't need ijson; you can use json.loads(). - Heresiarch

Since the provided chunk looks like a set of lines, each containing an independent JSON document, it should be parsed accordingly:

# each JSON document is small, so there's no need for incremental parsing
import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        # data['name'], data['value'], data['timestamp'] now
        # contain the corresponding values
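
A slightly more defensive variant of the loop above can be sketched as a generator. This is only a sketch under assumptions: the helper name iter_json_lines is made up for illustration, blank lines are assumed to be safe to skip, and the sample data is copied from the question and fed through io.StringIO in place of a real file.

```python
import io
import json


def iter_json_lines(lines):
    """Yield one decoded object per non-blank line (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line was malformed instead of a bare decode error
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc


# In-memory stand-in for the real file, using records from the question
sample = io.StringIO(
    '{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}\n'
    '{"name":"engine_speed","value":772,"timestamp":1364323939.027000}\n'
    '\n'
    '{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}\n'
)
records = list(iter_json_lines(sample))
names = [record["name"] for record in records]
```

With a real file you would pass the open file object instead, e.g. for record in iter_json_lines(f). Only one line is held in memory at a time, because the generator consumes the file lazily.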
Heresiarch answered 13/5, 2016 at 3:8 Comment(3)
Thanks for answering. Won't this load the whole file into RAM? If it loads only one line at a time, that's awesome. - Sandie
Certainly, for line in f: reads one line at a time. Check #17246760 - Heresiarch
How can I handle custom encoding and decoding in ijson? With json I can do this rather easily via the cls= argument; how is it done in ijson? Any links? Thanks! - Dedifferentiation

Unfortunately, the ijson library (v2.3 as of March 2018) does not handle parsing multiple JSON objects. It can only handle one top-level object, and if you attempt to parse a second one, you get an error: "ijson.common.JSONError: Additional data". See bug reports here:

It's a big limitation. However, as long as there is a line break (newline character) after each JSON object, you can parse the objects independently, one line at a time, like this:

import io
import ijson

with open(filename, encoding="UTF-8") as json_file:
    cursor = 0  # running offset, counted in characters (not bytes)
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        # Wrap the line in a file-like object, since ijson expects one
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        # "event" avoids shadowing the built-in type()
        for prefix, event, value in json_parser:
            print("prefix=", prefix, "type=", event, "value=", value)
        cursor += len(line)

You are still streaming the file rather than loading it entirely into memory, so this works on large JSON files. It uses the line-streaming technique from "How to jump to a particular line in a huge text file?" and enumerate() from "Accessing the index in 'for' loops?".
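
If the objects are not reliably separated by newlines, the per-line approach no longer applies. One possible fallback, sketched here with the standard library only (not ijson), is json.JSONDecoder.raw_decode, which decodes one value from the start of a string and reports where it ended. The helper name iter_concatenated_json and the chunk size are illustrative assumptions. The sketch also assumes every top-level value is an object or array, and it buffers each object fully before decoding, so it suits many small objects rather than a single huge one.

```python
import io
import json


def iter_concatenated_json(fp, chunk_size=65536):
    """Yield top-level JSON objects/arrays from a text stream (hypothetical helper)."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # skip whitespace between values
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of input reached with an incomplete value left over
                break  # value is incomplete so far; read another chunk
            yield obj
            buffer = buffer[end:]
        if not chunk:
            break


# Tiny chunk size to force values to span several reads
source = io.StringIO('{"a": 1}{"b": [1, 2]}\n {"c": "x"}')
objects = list(iter_concatenated_json(source, chunk_size=7))
```

Re-trying the decode after every chunk is wasteful for large objects, which is why this is only a fallback for inputs made of many small records.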

Ruddie answered 1/3, 2018 at 21:33 Comment(2)
Thanks @Mr-IDE. I was finally able to read something from my 5.5 GB datasets using ijson, and managed to extract some info such as dataID, status, values, and location. Question: how can I read all the needed info at once, for instance "location"? - Medin
As of March 2024, ijson has a multiple_values option that allows parsing such files. - Doleful
