Opening a large JSON file

I have a 1.7 GB JSON file, and when I try to open it with json.load() I get a memory error. How can I read the JSON file in Python?

My JSON file is a big array of objects containing specific keys.

Edit: Of course, if each item in the (outermost) array appears on a single line, then we could read the file one line at a time, along the lines of:

>>> for line in open('file.json', 'r'):
...     do_something_with(line)
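For instance, here is a minimal sketch of that idea, assuming each array element sits on its own line (possibly followed by a trailing comma) and the enclosing brackets are on lines of their own; do_something_with is just a placeholder for your own processing:

import json

with open('file.json', 'r') as f:
    for line in f:
        line = line.strip().rstrip(',')
        # skip the outermost array's brackets and blank lines
        if line in ('[', ']', ''):
            continue
        obj = json.loads(line)   # parse one object at a time
        do_something_with(obj)   # placeholder for per-object processing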
Billboard answered 23/5, 2012 at 7:49 Comment(3)
Why do you have such a huge JSON file? A format that is read into memory as a whole is unsuitable for structures this large. Consider storing your data in a database.Balladry
What are you trying to do with the data? Where does it come from?Brimful
I probably should have stored them in different files, but I did not :( I want to use the data for sentiment analysis.Billboard

You want an incremental json parser like yajl and one of its python bindings. An incremental parser reads as little as possible from the input and invokes a callback when something meaningful is decoded. For example, to pull only numbers from a big json file:

from yajl import YajlContentHandler, YajlParser

list_of_numbers = []

class ContentHandler(YajlContentHandler):
    def yajl_number(self, ctx, val):
        list_of_numbers.append(float(val))

parser = YajlParser(ContentHandler())
parser.parse(some_file)

See http://pykler.github.com/yajl-py/ for more info.
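For the case in the question, one big array of objects, the handler needs callbacks that notice where each object starts and ends. Below is a rough sketch, not taken from the yajl-py docs: it assumes the ContentHandler exposes yajl_start_map / yajl_map_key / yajl_end_map / yajl_string callbacks (check the yajl-py documentation for the exact names), handles only flat objects with string values, and uses handle_object as a placeholder:

from yajl import YajlContentHandler, YajlParser

class ObjectCollector(YajlContentHandler):
    def __init__(self):
        self.current = None   # object currently being rebuilt
        self.key = None       # last key seen

    def yajl_start_map(self, ctx):
        self.current = {}

    def yajl_map_key(self, ctx, key):
        self.key = key

    def yajl_string(self, ctx, val):
        if self.current is not None:
            self.current[self.key] = val

    def yajl_end_map(self, ctx):
        handle_object(self.current)   # placeholder: process one object, then let it go
        self.current = None

parser = YajlParser(ObjectCollector())
with open('file.json', 'rb') as f:
    parser.parse(f)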

Dacosta answered 23/5, 2012 at 7:53 Comment(9)
My json file is one big array of objects; does that help me to parse it?Billboard
@HirakSarkar: yes. You'll need to define appropriate callbacks in your ContentHandler class.Dacosta
I am having a weird problem: even after installing yajl with easy_install, the python shell gives an error that the yajl module is not there. What should I do? My python version is 2.6Billboard
You need to install yajl first and make sure "libyajl.so" is in your library path.Dacosta
find libyajl.so is not returning anything; how could I add it to $PYTHONPATH? I mean, if I knew the path I could probably sys.path.append it, but where do I find it?Billboard
The "so" file must be in /usr/lib or wherever your system keeps its shared libraries.Dacosta
I really do not know what to do. I have seen another module, ijson, which is a wrapper around yajl, but I could not run it either; it gives me the following error: raise Exception('YAJL shared object not found.') Exception: YAJL shared object not found.Billboard
Ok, I solved the installation problem. Here is what I did: I created a file at /etc/ld.so.conf.d/ named libyajl.lib, put the yajl x.x.x/lib path in it, and ran ldconfig. After that I installed yajl-py from source, and now import yajl works fine in python, apart from a version-mismatch warning. I mailed the developer about that problem; he said it was ok.Billboard
I found yajl to be slower than a simple parse using module json. Obviously that doesn't scale to very large files, but for json that consists of e.g. one json object per line, you might try just splitting it up by line and then using module json on each line.Afrit

I have found another python wrapper around the yajl library: ijson.

It works better for me than yajl-py for the following reasons:

  • yajl-py did not detect the yajl library on my system; I had to hack the code to make it work
  • ijson's code is more compact and easier to use
  • ijson can work with both yajl v1 and yajl v2, and it even has a pure python yajl replacement
  • ijson has a very nice ObjectBuilder, which helps extract not just events but meaningful sub-objects from the parsed stream, at the level you specify (see the sketch below)
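
A minimal usage sketch with ijson, assuming the file is one big top-level array of objects (the 'item' prefix selects each element of that array; do_something_with is a placeholder):

import ijson

with open('file.json', 'rb') as f:
    # each iteration yields one fully built Python dict, without loading the whole file
    for obj in ijson.items(f, 'item'):
        do_something_with(obj)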
Fastigiate answered 17/4, 2015 at 22:25 Comment(0)

I've used Dask for large telemetry JSON-Lines (newline-delimited) files...
The nice thing about Dask is that it does a lot of the work for you.
With it, you can read the data, process it, and write it to disk without reading it all into memory.
Dask will also parallelize for you and use multiple cores (threads)...

More info on Dask bags here:
https://examples.dask.org/bag.html

import ujson as json #ujson for speed and handling NaNs which are not covered by JSON spec
import dask.bag as db

def update_dict(d):
    d.update({'new_key':'new_value', 'a':1, 'b':2, 'c':0})
    d['c'] = d['a'] + d['b']
    return d

def read_jsonl(filepaths):
    """Read's a JSON-L file with a Dask Bag

    :param filepaths: list of filepath strings OR a string with wildcard
    :returns: a dask bag of dictionaries, each dict a JSON object
    """
    return db.read_text(filepaths).map(json.loads)



filepaths = ['file1.jsonl.gz','file2.jsonl.gz']
#OR
filepaths = 'file*.jsonl.gz' #wildcard to match multiple files

#(optional) if you want Dask to use multiple processes instead of threads
# from dask.distributed import Client, progress
# client = Client(threads_per_worker=1, n_workers=6) #6 workers for 6 cores
# print(client)

#define bag containing our data with the JSON parser
dask_bag = read_jsonl(filepaths)

#modify our data
#note, this doesn't execute yet, it just adds the step to Dask's task graph
#reassign, since map() returns a new bag rather than modifying in place
dask_bag = dask_bag.map(update_dict)

#(optional) if you're only reading one huge file but want to split the data into multiple files you can use repartition on the bag
# dask_bag = dask_bag.repartition(10)

#write our modified data back to disk, this is when Dask actually performs execution
dask_bag.map(json.dumps).to_textfiles('file_mod*.jsonl.gz') #dask will automatically apply compression if you use .gz
Touchline answered 28/10, 2020 at 23:27 Comment(0)

I found yajl (hence ijson) to be much slower than module json when a large data file was accessed from local disk. Here is a module that claims to perform better than yajl/ijson (still slower than json), when used with Cython:

http://pietrobattiston.it/jsaone

As the author points out, performance may be better than json when the file is received over the network since an incremental parser can start parsing sooner.

Afrit answered 12/8, 2015 at 21:15 Comment(0)

There's a CLI wrapper around ijson that I created precisely for ease of processing very large JSON documents.

In your case you could simply pipe the "big array of objects" to jm.py, and it will emit each top-level object on a separate line for piping into another process.
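For illustration, assuming jm.py is on your PATH and writes one JSON object per line to stdout (the invocation below is a guess; check the repository's documentation for the actual options), a downstream Python consumer could look like this:

import json
import subprocess

# stream the top-level array elements out of big.json, one per line (illustrative invocation)
proc = subprocess.Popen(['jm.py', 'big.json'], stdout=subprocess.PIPE, text=True)

for line in proc.stdout:
    obj = json.loads(line)   # one top-level object per line
    # ... process obj here ...

proc.wait()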

jm.py has various options which you might also find relevant.

The same repository has a similar script, jm, which I mention as it seems typically to be significantly faster, but it is PHP-based.

Kaput answered 21/12, 2022 at 19:2 Comment(0)
