Load an element from a large JSON file with Python
Here is my JSON file. I want to load the data lists from it, one by one, and only those, and then, for example, plot them...

This is just an example; I am dealing with a large data set, and I cannot load the whole file (that would cause a memory error).

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city"},
      {"name": "Thames", "type": "river"}, 
      {"par": 2, "data": [1,7,4,7,5,7,7,6]}, 
      {"par": 2, "data": [1,0,4,1,5,1,1,1]}, 
      {"par": 2, "data": [1,0,0,0,5,0,0,0]}
        ],
    "america": [
      {"name": "Texas", "type": "state"}
    ]
  }
}

Here is what I tried:

import ijson
filename = "testfile.json"

f = open(filename)
mylist = ijson.items(f, 'earth.europe[2].data.item')
print mylist

It returns nothing, even when I try to convert it into a list:

[]
Mychael answered 30/10, 2016 at 15:50 Comment(4)
I didn't include the code I used, because I don't think it's a good way to do it... import ijson as ijson; filename = "myfile.json"; with open(myfile, 'r') as f: voila = ijson.items(f, 'earth.data.item'); print voila - Gilbye
I've elaborated; do you have an idea? - Gilbye
Thanks for updating, I've reopened this. - Kweisui
I think I will delete it... no one responds... - Gilbye
You need to specify a valid prefix; ijson prefixes are either keys in a dictionary or the word item for list entries. You can't select a specific list item (so [2] doesn't work).

If you want the value of the data key of each dictionary in the europe list, then the prefix is:

earth.europe.item.data
# ^ ------------------- outermost key must be 'earth'
#       ^ ------------- next key must be 'europe'
#              ^ ------ any value in the array
#                   ^   the value for the 'data' key

This produces each such list:

>>> l = ijson.items(f, 'earth.europe.item.data')
>>> for data in l:
...     print data
...
[1, 7, 4, 7, 5, 7, 7, 6]
[1, 0, 4, 1, 5, 1, 1, 1]
[1, 0, 0, 0, 5, 0, 0, 0]

You can't put wildcards in that, so you can't get earth.*.item.data for example.

If you need to do more complex prefix matching, you'd have to use the ijson.parse() function and handle the events it produces. You can reuse the ijson.ObjectBuilder() class to turn the events you are interested in into Python objects:

parser = ijson.parse(f)
for prefix, event, value in parser:
    if event != 'start_array':
        continue
    if prefix.startswith('earth.') and prefix.endswith('.item.data'):
        # e.g. prefix == 'earth.europe.item.data'
        continent = prefix.split('.', 2)[1]
        builder = ijson.ObjectBuilder()
        builder.event(event, value)
        # feed events to the builder until the matching array closes
        for nprefix, event, value in parser:
            if (nprefix, event) == (prefix, 'end_array'):
                break
            builder.event(event, value)
        data = builder.value
        print continent, data

This will print every array that lives in a list under a 'data' key (i.e. under a prefix ending in '.item.data') beneath the top-level 'earth' key. It also extracts the continent key.

Kweisui answered 2/11, 2016 at 18:2 Comment(14)
Thanks a lot! Even if I will have to concentrate a bit on the second part, that's the best explanation I have found on the internet :)) - Gilbye
I think you have answered this question, but is there any way to load those data lists one by one? Because if I want to process them, I have to store them (for example) in a list, and the same problem happens again: a "memory error". Any idea? - Gilbye
@JeanneDiderot: yes, where I use print right now, you can process just that one list, then discard it. Or you could wrap the whole thing into a function and use yield continent, data to have it produce each data list one by one as you iterate; again, if you then don't keep extra references to the list, it'll be freed again. - Kweisui
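A minimal sketch of the generator approach described in the comment above, reusing the parse loop from the answer (the function name iter_data is just for illustration):

import ijson

def iter_data(f):
    # Generator version of the parse loop above: each (continent, data)
    # pair is yielded one at a time, so only one list is in memory.
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if event != 'start_array':
            continue
        if prefix.startswith('earth.') and prefix.endswith('.item.data'):
            continent = prefix.split('.', 2)[1]
            builder = ijson.ObjectBuilder()
            builder.event(event, value)
            for nprefix, event, value in parser:
                if (nprefix, event) == (prefix, 'end_array'):
                    break
                builder.event(event, value)
            yield continent, builder.value

with open('testfile.json') as f:
    for continent, data in iter_data(f):
        print continent, data  # process one list, then let it be freed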
Ok, it works, but very slowly... I guess that's because of the file's size. But one thing is strange: if I want to load ONE element (for example Paris), it is very, very slow (as slow as for a long array). And more generally, even if your explanations are good, ijson doesn't seem very fast... - Gilbye
@JeanneDiderot: the default backend for ijson is the pure-Python parser, which is slow. Install YAJL 2.x and use import ijson.backends.yajl2_cffi as ijson to import a much, much faster backend. - Kweisui
@JeanneDiderot: see lloyd.github.io/yajl; different platforms may already have an installable package available. I used brew install yajl on my Mac. - Kweisui
I have Windows 10. Do I have to install Git, and then enter: $ git clone git://github.com/lloyd/yajl ? - Gilbye
@JeanneDiderot: there is a ready Windows binary here: github.com/LearningRegistry/LearningRegistry/wiki/…. No idea what version that is. There may be others. - Kweisui
@JeanneDiderot: do experiment with the different documented backends; if you can only get yajl 1.x, that'll still be faster than the pure-Python version. If you can get 2.x and can install the cffi package, then you get the fastest option of all. - Kweisui
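A small sketch of that backend selection; the try/except fallback to the default backend is my own suggestion, not something from the comments:

# Prefer the fast yajl2 + cffi backend when it is installed,
# otherwise fall back to ijson's default (pure-Python) backend.
try:
    import ijson.backends.yajl2_cffi as ijson
except ImportError:
    import ijson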
You will get tired of me... I downloaded "yajl-2.1.0.zip", but I don't succeed in installing it... and "brew install yajl" is not recognized as an internal command. :(( BUT I did succeed in installing cffi!! - Gilbye
@JeanneDiderot: perhaps you need to ask a question on Super User then. brew wouldn't work; that's a Mac OS X tool. You'll either need to compile the project (no need to install git, there are download links, but you would need Visual Studio), or you need to find a compiled version (like the zip file) and install it in the right location. What the right location is, I don't know; I don't use Windows, sorry. - Kweisui
Ok, I did. Any idea of another way to read and process the data faster? - Gilbye
@AgapeGal'lo: sorry, I'm not aware of other options than to break up your data set into smaller JSON files by some other means, or to use a streaming parser; on Python, all the libraries that support streaming use yajl. - Kweisui
I tested my program again. The difficulty is maybe not where I thought. It's very strange (or maybe interesting...). The structure is much the same as in the example. There are about 800 points in each "data", but loading the "2" of the "par" dictionary takes 130 times longer!! I use this code: object = ijson.items(f, 'earth.europe.par'); for i in object: speed = np.float(list(object)[0]) (as there is only one element, it works). But the bigger the file is, the more time it takes (in an unreasonable way...) to extract this single float! - Gilbye
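For what it's worth, per the prefix rules in the accepted answer, list entries need an item segment, so a par value in the sample file above would be addressed like this (a sketch against that sample, not the commenter's real file):

import ijson

with open('testfile.json') as f:
    # 'item' matches each entry of the europe list; entries without
    # a 'par' key simply produce no values for this prefix.
    for par in ijson.items(f, 'earth.europe.item.par'):
        print par  # prints 2 three times for the sample file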
Given the structure of your JSON, I would do this:

import json

filename = "test.json"

with open(filename) as data_file:
    data = json.load(data_file)  # note: this loads the entire file into memory
print data['earth']['europe'][2]['data']
print type(data['earth']['europe'][2]['data'])
Recollected answered 30/10, 2016 at 18:51 Comment(1)
No, I want to load only the data lists from the JSON file, not the whole file. The problem is that I have a 500 MB file and Python gives me a "memory error" when I try to load everything. - Gilbye
So, I will explain how I finally solved this problem. The first answer works, but you have to know that loading elements one by one with ijson is very slow... and at the end, you still do not have the file loaded in memory.

The important information is that Windows limits your memory per process to 2 or 4 GB, depending on which Windows you use (32-bit or 64-bit). If you use Python(x,y), that will be 2 GB (it only exists in 32-bit). Either way, that's very, very low!

I solved this problem by installing a virtual Linux machine inside my Windows, and it works. Here are the main steps to do so:

  1. Install VirtualBox
  2. Install Ubuntu (for example)
  3. Install a scientific Python distribution on it, e.g. with SciPy
  4. Create a shared folder between the two "computers" (you will find tutorials on Google)
  5. Execute your code on your Ubuntu "computer": it should work ;)

NB: Do not forget to allocate sufficient RAM and disk space to your virtual machine.

This worked for me; I no longer get this "memory error" problem.

Mychael answered 8/11, 2016 at 14:53 Comment(5)
No, only until you have an even larger JSON file. Streamed parsing is still the better option. If you were willing to install a virtual machine with Linux for this, why not also try to use ijson with yajl as the backend? - Kweisui
Because the BIG advantage of this method is that at the end you have the file loaded. And often, when you do data processing, you want to modify the parameters of the analysis, and it goes much faster if you already have the file loaded. If the file is really too big (more than a few GB), I would clearly recommend your method. But my files are "only" 1-2 GB... and I think that's the case for many people who ask this question. - Gilbye
At any rate, this isn't really an answer to the question you posted here, which appeared to concern the use of the ijson library. You are answering the question 'how to load a large JSON file', which is a problem that may have led to the actual question posted. :-) - Kweisui
You're clearly right! I've re-marked your answer as the accepted one ;) It's clearly the most complete! - Gilbye
Thanks, that's much appreciated. Not just for me, but also for future visitors who probably come here to see how to use ijson specifically. :-) - Kweisui
