Python json.loads shows ValueError: Extra data
C

12

230

I am getting some data from a JSON file "new.json", and I want to filter some data and store it into a new JSON file. Here is my code:

import json
with open('new.json') as infile:
    data = json.load(infile)
for item in data:
    iden = item.get("id")
    a = item.get("a")
    b = item.get("b")
    c = item.get("c")
    if c == 'XYZ' or "XYZ" in item["text"]:
        filename = 'abc.json'
    try:
        outfile = open(filename,'ab')
    except:
        outfile = open(filename,'wb')
    obj_json={}
    obj_json["ID"] = iden
    obj_json["VAL_A"] = a
    obj_json["VAL_B"] = b

And I am getting an error, the traceback is:

  File "rtfav.py", line 3, in <module>
    data = json.load(infile)
  File "/usr/lib64/python2.7/json/__init__.py", line 278, in load
    **kw)
  File "/usr/lib64/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 369, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 88 column 2 - line 50607 column 2 (char 3077 - 1868399)

Here is a sample of the data in new.json; there are about 1500 more such dictionaries in the file:

{
    "contributors": null, 
    "truncated": false, 
    "text": "@HomeShop18 #DreamJob to professional rafter", 
    "in_reply_to_status_id": null, 
    "id": 421584490452893696, 
    "favorite_count": 0, 
    "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Mobile Web (M2)</a>", 
    "retweeted": false, 
    "coordinates": null, 
    "entities": {
        "symbols": [], 
        "user_mentions": [
            {
                "id": 183093247, 
                "indices": [
                    0, 
                    11
                ], 
                "id_str": "183093247", 
                "screen_name": "HomeShop18", 
                "name": "HomeShop18"
            }
        ], 
        "hashtags": [
            {
                "indices": [
                    12, 
                    21
                ], 
                "text": "DreamJob"
            }
        ], 
        "urls": []
    }, 
    "in_reply_to_screen_name": "HomeShop18", 
    "id_str": "421584490452893696", 
    "retweet_count": 0, 
    "in_reply_to_user_id": 183093247, 
    "favorited": false, 
    "user": {
        "follow_request_sent": null, 
        "profile_use_background_image": true, 
        "default_profile_image": false, 
        "id": 2254546045, 
        "verified": false, 
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/413952088880594944/rcdr59OY_normal.jpeg", 
        "profile_sidebar_fill_color": "171106", 
        "profile_text_color": "8A7302", 
        "followers_count": 87, 
        "profile_sidebar_border_color": "BCB302", 
        "id_str": "2254546045", 
        "profile_background_color": "0F0A02", 
        "listed_count": 1, 
        "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", 
        "utc_offset": null, 
        "statuses_count": 9793, 
        "description": "Rafter. Rafting is what I do. Me aur mera Tablet.  Technocrat of Future", 
        "friends_count": 231, 
        "location": "", 
        "profile_link_color": "473623", 
        "profile_image_url": "http://pbs.twimg.com/profile_images/413952088880594944/rcdr59OY_normal.jpeg", 
        "following": null, 
        "geo_enabled": false, 
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/2254546045/1388065343", 
        "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", 
        "name": "Jayy", 
        "lang": "en", 
        "profile_background_tile": false, 
        "favourites_count": 41, 
        "screen_name": "JzayyPsingh", 
        "notifications": null, 
        "url": null, 
        "created_at": "Fri Dec 20 05:46:00 +0000 2013", 
        "contributors_enabled": false, 
        "time_zone": null, 
        "protected": false, 
        "default_profile": false, 
        "is_translator": false
    }, 
    "geo": null, 
    "in_reply_to_user_id_str": "183093247", 
    "lang": "en", 
    "created_at": "Fri Jan 10 10:09:09 +0000 2014", 
    "filter_level": "medium", 
    "in_reply_to_status_id_str": null, 
    "place": null
} 
Cereal answered 11/1, 2014 at 5:36 Comment(4)
This is the error you get whenever the input JSON has more than one object per line. Many of the answers here assume there is only one object per line, or construct examples obeying that, but would break if that wasn't the case. - Kuebbing
@Kuebbing: Can you explain "more than one object per line"? - Quintin
@Kuebbing I think you meant "more than one line per object"? - Anu
Yes, "more than one line per object", silly me... - Kuebbing
S
208

As you can see in the following example, json.loads (and json.load) cannot decode multiple JSON objects in a single call.

>>> json.loads('{}')
{}
>>> json.loads('{}{}') # == json.loads(json.dumps({}) + json.dumps({}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 368, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 3 - line 1 column 5 (char 2 - 4)

If you want to dump multiple dictionaries, wrap them in a list and dump the list once (instead of dumping the dictionaries one after another):

>>> dict1 = {}
>>> dict2 = {}
>>> json.dumps([dict1, dict2])
'[{}, {}]'
>>> json.loads(json.dumps([dict1, dict2]))
[{}, {}]
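
Applied to the scenario in the question, this could look roughly like the sketch below: collect the filtered records in a list and write that list once with json.dump, so that the output file holds a single JSON value. The keys ID, VAL_A and VAL_B come from the question; the sample records and the temporary output path are illustrative assumptions, not the asker's real data.

```python
import json
import os
import tempfile

# Illustrative stand-ins for the parsed input records (assumed data).
records = [
    {"id": 1, "a": "x", "b": "y", "c": "XYZ"},
    {"id": 2, "a": "p", "b": "q", "c": "other"},
]

# Collect every matching record first ...
filtered = [
    {"ID": r["id"], "VAL_A": r["a"], "VAL_B": r["b"]}
    for r in records
    if r["c"] == "XYZ"
]

# ... then dump the whole list once, so the file contains one JSON array.
out_path = os.path.join(tempfile.mkdtemp(), "abc.json")
with open(out_path, "w") as outfile:
    json.dump(filtered, outfile)

# A single json.load call can now read the file back.
with open(out_path) as infile:
    loaded = json.load(infile)
print(loaded)
```

Appending to the file with repeated json.dump calls would instead produce several JSON values back to back, which is exactly what triggers "Extra data" on the next load.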
Spiracle answered 11/1, 2014 at 5:39 Comment(18)
Can you please explain again with reference to the code I gave above? I am a newbie, and at times take long to grasp such things. - Cereal
@ApoorvAshutosh, It seems like new.json contains one JSON document followed by additional data. json.load and json.loads can only decode a single JSON document; they raise a ValueError when they encounter additional data, as you can see. - Spiracle
I have pasted a sample from new.json. I am only filtering out some data from it, so I don't understand where the extra data comes from. - Cereal
@ApoorvAshutosh, You said in the edited question that there are 1500 more such dictionaries. That's the additional data. If you're the one who created new.json, just put a single JSON document in the file. - Spiracle
@ApoorvAshutosh, If you need to dump multiple dictionaries as JSON, wrap them in a list and dump the list. - Spiracle
The issue here is not about loading data into a JSON file; that has already happened. Can you tell me how to retrieve data from it? I already have a file that has dictionaries in it. I now have to retrieve each of those dictionaries. https://mcmap.net/q/119922/-python-json-parser-closed - Cereal
@ApoorvAshutosh, BTW, a trailing ',' is missing in the JSON in the new question (at the line "x": []), which makes it invalid JSON. - Spiracle
Sure, asap. And could you look into one more thing, as I said: how to read from a file with multiple dictionaries? - Cereal
@ApoorvAshutosh, I'm researching that issue. I will post an answer there once the research is done. - Spiracle
That's just a sample; I mentioned it in a comment. - Cereal
@ApoorvAshutosh, Please post a valid sample! - Spiracle
@ApoorvAshutosh, No, I mean the sample in the new question. - Spiracle
It's for this very sample; the structure of the dictionaries is basically the same. However, I'll edit that question with this very sample. - Cereal
@ApoorvAshutosh, I posted an answer that works around the issue. Check it out. - Spiracle
Can I ask why it still works when I use json.dump instead of json.dumps? I am using Python 3.5.2. - Sera
@ShuruiLiu, Please post a separate question. - Spiracle
As someone who has an issue like this with JSON from a web scrape: I ran it through a linter to see if it is valid JSON. It seems that it is, so why would this error still occur? - Coarse
I was trying this option, but I found another useful way to get all items: file.readlines(), which returns a list of lines. - Choreograph
L
209

Iterate over the file, loading each line as JSON in the loop:

import json

tweets = []
with open('tweets.json', 'r') as file:
    for line in file:
        tweets.append(json.loads(line))

This avoids storing intermediate python objects. It works as long as each line of the file holds exactly one complete JSON object (for example, when the file was written one full tweet per line in append mode).
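
A slightly more defensive variant of the same loop is sketched below. It skips blank lines and reports which line failed to parse, which helps to tell a true one-object-per-line file from a pretty-printed one. The helper name read_json_lines and the in-memory sample are assumptions made for illustration; io.StringIO merely stands in for an open file.

```python
import io
import json


def read_json_lines(lines):
    """Parse an iterable of text lines, one JSON value per non-blank line."""
    items = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            items.append(json.loads(line))
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            raise ValueError(
                "line %d is not a complete JSON value: %s" % (lineno, exc)
            )
    return items


# io.StringIO stands in for an open file here.
fake_file = io.StringIO('{"id": 1}\n\n{"id": 2}\n')
tweets = read_json_lines(fake_file)
print(tweets)
```

If the input is pretty-printed across several lines, as in the question, the first line is just "{" and the helper fails immediately with a message naming line 1.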

Lighten answered 28/3, 2015 at 1:27 Comment(3)
The accepted answer addresses how to fix the source of the problem if you control the process of exporting, but if you are using someone else's data and you just have to deal with it, this is a great low-overhead method. - Duralumin
Many datasets (e.g. the Yelp dataset) are nowadays provided as a "set" of JSON objects, and your approach is convenient for loading them. - Traditional
This only works for inputs that have one complete JSON object per line. That is a common input format (it is not JSON, but a related format sometimes called either JSONL or NDJSON), but it is not what is shown in the OP. - Anu
C
70

I came across this because I was trying to load a JSON file dumped from MongoDB. It was giving me this error:

JSONDecodeError: Extra data: line 2 column 1

The MongoDB JSON dump has one object per line, so what worked for me is:

import json
with open('data.json', 'r') as f:  # the with block closes the file afterwards
    data = [json.loads(line) for line in f]
Cooker answered 13/8, 2018 at 21:19 Comment(1)
I still get json.decoder.JSONDecodeError: Extra data: line 1 column 954 (char 953) with this answer's code. My data file must have a different problem. - Featherbrain
P
17

This may also happen if your JSON file contains more than one JSON record. A JSON record looks like this:

[{"some data": value, "next key": "another value"}]

It opens with a bracket [ and closes with a bracket ]; inside the brackets are pairs of braces { }. There can be many pairs of braces, but the record ends with a single closing bracket ]. If your JSON file contains more than one of those:

[{"some data": value, "next key": "another value"}]
[{"2nd record data": value, "2nd record key": "another value"}]

then loads() will fail.

I verified this with my own file that was failing.

import json

# Read the whole file and parse it as a single JSON document
with open("1_guests.json", 'r') as guestFile:
    guestData = guestFile.read()
gdfJson = json.loads(guestData)

This works because 1_guests.json has exactly one record [ ]. The original file I was using, all_guests.json, had 6 records separated by newlines. I deleted 5 of them (after checking that each was bookended by brackets) and saved the file under a new name. Then the loads statement worked.

The error was:

   raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 10 column 1 (char 261900 - 6964758)

PS. I use the word "record", but that's not the official name. Also, if your file separates records with newline characters like mine, you can loop through it and loads() one record at a time into a JSON variable.
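
The loop mentioned in the PS might look like the following sketch. It assumes, as in this answer, that every line is one complete JSON array; the sample string and the key "guest" are made up for illustration and stand in for the contents of a file such as all_guests.json.

```python
import json

# Two "records" in this answer's sense: one JSON array per line (assumed sample data).
raw = '[{"guest": "A"}, {"guest": "B"}]\n[{"guest": "C"}]\n'

all_guests = []
for line in raw.splitlines():
    if line.strip():
        # Each line is a complete JSON array, so extend rather than append.
        all_guests.extend(json.loads(line))

print(len(all_guests))
```

Using extend keeps the result a flat list of dictionaries; append would instead give a list of lists, one per line.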

Parrot answered 29/10, 2015 at 0:19 Comment(2)
Is there a way to get json.loads to read newline-delimited json chunks? That is, to act like [json.loads(x) for x in text.split('\n')]? Related: Is there a guarantee that json.dumps will not include literal newlines in its output with default indenting? - React
@Ben, by default json.dumps will change newlines in text content to "\n", keeping your json to a single line. - Poundfoolish
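
The behaviour described in the last comment can be checked with a short experiment like the one below. The sample record is an assumption; the point is only that json.dumps with its default indent=None escapes newlines inside strings and emits no line breaks of its own, whereas passing indent does add them.

```python
import json

record = {"text": "first line\nsecond line"}

compact = json.dumps(record)           # default: indent=None
pretty = json.dumps(record, indent=2)  # pretty-printed

# The newline inside the string is written as the two characters \ and n.
print("\n" in compact)
# With indent, real line breaks separate the members.
print("\n" in pretty)

# Either form round-trips back to the same object.
assert json.loads(compact) == record
assert json.loads(pretty) == record
```

This is why files written with one default json.dumps call per line can be read back line by line, while pretty-printed output cannot.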
D
16

I got the same error when my JSON file looked like this:

{"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"}
{"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}

I found that it was malformed, so I changed it to:

{
  "datas":[
    {"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"},
    {"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}
  ]
}
Davisson answered 15/5, 2019 at 9:32 Comment(4)
loading just like yours, json.load(infile) - Davisson
For the record, if this is the entire JSON file, an outer map is redundant. The root can be an array, which lets you simplify the second JSON to just be an array. No need for a useless key in a useless map if you're storing array data - just throw it in a root array. - Birdt
@Zoe oh that's interesting, could you provide us with an example? - Davisson
It's not exactly hard. Just wrap the two maps in an array: [{"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"}, {"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}]. Parsing is identical, access is obj[0], obj[1], ... (read: just like accessing a normal array), and the objects you get are identical. The one you have in your answer would require obj["datas"][0], so it's functionally identical. - Birdt
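
As a small illustration of the point made in these comments, the sketch below parses both layouts and compares the results. The two strings are shortened versions of the data in this answer; the variable names are assumptions.

```python
import json

# Wrapper object with a "datas" key, as in the answer.
wrapped = '{"datas": [{"id": "1101010"}, {"id": "1101020"}]}'
# Root array, as suggested in the comments.
as_array = '[{"id": "1101010"}, {"id": "1101020"}]'

from_wrapped = json.loads(wrapped)["datas"]  # needs the extra key lookup
from_array = json.loads(as_array)            # already the list itself

print(from_array[0]["id"])
print(from_wrapped == from_array)
```

Both are valid single JSON documents, so either one avoids the "Extra data" error; the root array just saves one level of nesting.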
B
12

One-liner for your problem:

data = [json.loads(line) for line in open('tweets.json', 'r')]
Baronetcy answered 22/2, 2019 at 5:17 Comment(1)
This is not a general solution: it assumes the input has one JSON object per line, and breaks if it doesn't. - Kuebbing
C
9

If you want to solve it in a two-liner you can do it like this:

with open('data.json') as f:
    data = [json.loads(line) for line in f]
Constraint answered 4/12, 2018 at 18:34 Comment(0)
P
5

I don't think saving the dicts in a list, as proposed by @falsetru, is the ideal solution here.

A better way is to iterate through the dicts and write each one to the .json file on its own line.

Our 2 dictionaries are

d1 = {'a':1}

d2 = {'b':2}

You can write them to a .json file, one per line:

import json
with open('sample.json','a') as sample:
    for d in [d1, d2]:  # "d" rather than "dict", which would shadow the built-in
        sample.write('{}\n'.format(json.dumps(d)))

And you can read the file back without any issues:

with open('sample.json','r') as sample:
    for line in sample:
        line = json.loads(line.strip())

Simple and efficient.

Pavlodar answered 8/3, 2019 at 15:6 Comment(1)
This is not a general solution: it assumes the input has one JSON object per line, and breaks if it doesn't. - Kuebbing
C
4

My JSON file was formatted exactly like the one in the question, but none of the solutions here worked. I finally found a workaround in another Stack Overflow thread. Since this post is the first result in a Google search, I am putting that answer here so that people who come to this post in the future can find it more easily.

As stated there, a valid JSON file needs a "[" at the beginning and a "]" at the end of the file. Moreover, each JSON item except the last must end with "}," instead of "}". All of these brackets go in without the quotation marks! The code in the linked answer converts the malformed JSON file into this correct format.

https://mcmap.net/q/119924/-can-39-t-parse-json-file-json-decoder-jsondecodeerror-extra-data

Catamenia answered 15/8, 2020 at 12:0 Comment(0)
E
4

The error is due to the \n symbol if you use the read() method of the file descriptor... so don't bypass the problem by using readlines() & co., but just remove that character!

import json

path = 'some_path'    # placeholder; the file contains e.g. {"c": 4}, possibly spread over several lines
path2 = 'other_path'  # placeholder output path (not defined in the original snippet)

new_d = {'new': 5}
with open(path, 'r') as fd:
    d_old_str = fd.read().replace('\n', '') # remove all \n
    old_d = json.loads(d_old_str)

# merge old_d into new_d (Python 3.9+; otherwise use new_d.update(old_d))
new_d |= old_d

with open(path2, 'w') as fd:
    fd.write(json.dumps(new_d)) # save the dictionary to file (in case needed)

... and if you really, really want to use readlines(), here is an alternative solution:

new_d = {'new': 5}
with open('some_path', 'r') as fd:
    d_old_str = ''.join(fd.readlines()) # concatenate the lines
    old_d = json.loads(d_old_str)

# then as above
Edie answered 5/9, 2021 at 12:48 Comment(0)
B
3

If you have a document (file, string, etc) with multiple json objects that are not delimited (e.g., no newline between them) you can use json.JSONDecoder().raw_decode(). This returns where the parsing ended, so you can call it again starting at that point. It does not raise the extra data error.
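
A minimal sketch of that approach follows. The generator name iter_json_objects and the sample string are assumptions for illustration. One detail worth noting: raw_decode does not skip leading whitespace the way json.loads does, so the loop advances past whitespace itself before each call.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the decoded value and the index where parsing stopped.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Objects spread over several lines, separated by a newline or by nothing at all.
sample = '{"id": 1,\n "text": "a"}\n{"id": 2, "text": "b"}{"id": 3}'
objects = list(iter_json_objects(sample))
print([o["id"] for o in objects])
```

Because it does not depend on line boundaries, this also handles pretty-printed objects written back to back, which is the layout shown in the question. For a file, the text can be obtained with a single read() call, at the cost of holding the whole file in memory.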

Behl answered 19/9, 2023 at 20:10 Comment(0)
C
2

If your data is from a source outside your control, you can use this:

import json
from json import JSONDecodeError

def load_multi_json(line: str) -> list:
    """
    Fix some files with multiple objects on one line
    """
    try:
        return [json.loads(line)]
    except JSONDecodeError as err:
        if err.msg == 'Extra data':
            # err.pos is where the first complete object ends
            head = [json.loads(line[0:err.pos])]
            tail = load_multi_json(line[err.pos:])
            return head + tail
        else:
            raise
Conceivable answered 13/7, 2022 at 2:13 Comment(0)
