Parsing large JSON file in .NET

So far I have used the "JsonConvert.DeserializeObject(json)" method of Json.NET, which worked quite well, and to be honest, I didn't need anything more than this.

I am working on a background (console) application which constantly downloads the JSON content from different URLs, then deserializes the result into a list of .NET objects.

using (WebClient client = new WebClient())
{
    string json = client.DownloadString(stringUrl);

    var result = JsonConvert.DeserializeObject<List<Contact>>(json);
}

The simple code snippet above probably doesn't seem perfect, but it does the job. When the file is large (15,000 contacts, a 48 MB file), JsonConvert.DeserializeObject fails: that line throws a JsonReaderException.

The downloaded JSON content is an array, and this is what a sample looks like. Contact is a container class for the deserialized JSON object.

[
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  }
]

My initial guess is that it runs out of memory. Just out of curiosity, I tried to parse it as a JArray, which caused the same exception.

I have started to dive into Json.NET documentation and read similar threads. As I haven't managed to produce a working solution yet, I decided to post a question here.

UPDATE: While deserializing line by line, I got the same error: " [. Path '', line 600003, position 1." So I downloaded two of the files and checked them in Notepad++. I noticed that if the array has more than 12,000 elements, the array is closed with "]" after the 12,000th element and another array starts. In other words, the JSON looks exactly like this:

[
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  }
]
[
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  },
  {
    "firstname": "sometext",
    "lastname": "sometext"
  }
]
Perseverance answered 26/8, 2015 at 13:3 Comment(14)
and the line throws an exception type of JsonReaderException. What is the exception message? Any inner exception?Insure
"Additional text encountered after finished reading JSON content: [. Path '', line 600003, position 1." - this is the exception messagePerseverance
@Yavarski Are you sure your JSON is valid?Observable
@Yavarski As you can see, it is not related to the size of the json. There are some extra characters at the end of your json..Insure
Are you saying one input is 48MB or you are combining several inputs into one that reaches 48MB?Silence
There's something wrong with the format.Coruscate
Consider using Async. It improves performance for the processes.Brawn
I am using a third-party API which generates a link to the list of contacts (a JSON array). The file I get is a JSON file and it is constructed as posted above. @YuvalItzchakov, I believe it's valid JSON because I have repeated this for 100 different URLs and never had an issue. However, the JSON arrays contained fewer than 10,000 contacts in all of them.Perseverance
@DStanley it's a downloadable link. For instance, the current file I work with is about 48 MB. What I assume is that the reader runs out of memory while reading the JSON, probably in the middle of it, and that's why the exception is thrown with that message. I may be totally wrong, but this is what comes to my mind for now.Perseverance
If you think you're running out of memory, you could try processing the JSON incrementally instead of deserializing into one giant list. See Deserialize json array stream one item at a time.Rab
Can you try specifying the encoding to UTF8? There might be some special characters messing with the json format. You can do this by using client.Encoding = Encoding.UTF8;Weariful
Thanks @BrianRogers it really helped. I am updating the question now.Perseverance
Your source data is two arrays, but you are telling it to deserialize into a single array (List<Contact>). Since you're already going line by line, you should merge the two arrays.Calise
If it runs out of memory, shouldn't an OutOfMemoryException be thrown? I don't think JSON.NET would be so stupid to catch that kind of exception and return invalid data.Bohlen
R
59

As you've correctly diagnosed in your update, the issue is that the JSON has a closing ] followed immediately by an opening [ to start the next set. This format makes the JSON invalid when taken as a whole, and that is why Json.NET throws an error.

Fortunately, this problem seems to come up often enough that Json.NET actually has a special setting to deal with it. If you use a JsonTextReader directly to read the JSON, you can set the SupportMultipleContent flag to true, and then use a loop to deserialize each item individually.

This should allow you to process the non-standard JSON successfully and in a memory-efficient manner, regardless of how many arrays there are or how many items are in each array.

    using (WebClient client = new WebClient())
    using (Stream stream = client.OpenRead(stringUrl))
    using (StreamReader streamReader = new StreamReader(stream))
    using (JsonTextReader reader = new JsonTextReader(streamReader))
    {
        reader.SupportMultipleContent = true;

        var serializer = new JsonSerializer();
        while (reader.Read())
        {
            if (reader.TokenType == JsonToken.StartObject)
            {
                Contact c = serializer.Deserialize<Contact>(reader);
                Console.WriteLine(c.FirstName + " " + c.LastName);
            }
        }
    }

Full demo here: https://dotnetfiddle.net/2TQa8p
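
For a quick local check without a network call, the same loop can be run against an in-memory string. This is only a sketch: the Contact class with its JsonProperty attributes, the MultiArrayDemo class, and the ReadContacts helper are names made up for illustration, and the sample data is invented. It assumes the Newtonsoft.Json package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical container class matching the sample JSON in the question.
public class Contact
{
    [JsonProperty("firstname")]
    public string FirstName { get; set; }

    [JsonProperty("lastname")]
    public string LastName { get; set; }
}

public static class MultiArrayDemo
{
    // Lazily yields every object found in one or more back-to-back JSON arrays.
    public static IEnumerable<Contact> ReadContacts(TextReader textReader)
    {
        using (var reader = new JsonTextReader(textReader))
        {
            // Without this flag, the second "[" triggers the
            // "Additional text encountered" JsonReaderException.
            reader.SupportMultipleContent = true;

            var serializer = new JsonSerializer();
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<Contact>(reader);
                }
            }
        }
    }

    public static void Main()
    {
        // Two arrays back to back, as in the question's update (invented data).
        string json =
            "[{\"firstname\":\"a\",\"lastname\":\"b\"}]\n" +
            "[{\"firstname\":\"c\",\"lastname\":\"d\"}]";

        foreach (Contact c in ReadContacts(new StringReader(json)))
        {
            Console.WriteLine(c.FirstName + " " + c.LastName);
        }
    }
}
```

Because ReadContacts is an iterator, only one Contact is materialized at a time; call ToList() on the result only if the whole set really has to be held in memory.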

Rab answered 26/8, 2015 at 22:36 Comment(1)
I was this close to build my own parser. this is awesome, thanks Brian.Boardman
I
26

Json.NET supports deserializing directly from a stream. Here is a way to deserialize your JSON using a StreamReader reading the JSON string one piece at a time instead of having the entire JSON string loaded into memory.

using (WebClient client = new WebClient())
{
    using (StreamReader sr = new StreamReader(client.OpenRead(stringUrl)))
    {
        using (JsonReader reader = new JsonTextReader(sr))
        {
            JsonSerializer serializer = new JsonSerializer();

            // read the json from a stream
            // json size doesn't matter because only a small piece is read at a time from the HTTP request
            IList<Contact> result = serializer.Deserialize<List<Contact>>(reader);
        }
    }
}

Reference: JSON.NET Performance Tips

Interpretative answered 26/8, 2015 at 20:57 Comment(1)
This code may not load the entire stream into memory, but will certainly load the entire list of contacts into memory. Unless the Contact object throws away large amounts of data from the stream, you've just pushed your memory problem downstream.Jackass
F
6

I have done a similar thing in Python for a 5 GB file. I downloaded the file to a temporary location and read it line by line to form a JSON object, similar to how SAX works.

For C# using Json.NET, you can download the file, open it with a StreamReader, pass that reader to a JsonTextReader, and parse it into a JToken (a JObject or JArray, depending on the root) using JToken.ReadFrom(your JsonTextReader object).
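
A minimal sketch of that approach might look like the following. The JTokenFileDemo class and ReadTopLevelTokens helper are hypothetical names, and the file is assumed to already be on disk (for example, saved with WebClient.DownloadFile). SupportMultipleContent is switched on here only so that the back-to-back arrays from the question do not throw. Note that JToken.ReadFrom builds each top-level value fully in memory, so one huge array is still loaded in one piece.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JTokenFileDemo
{
    // Yields each top-level JSON value in the file as a JToken
    // (a JArray for the arrays shown in the question).
    public static IEnumerable<JToken> ReadTopLevelTokens(string path)
    {
        using (var streamReader = new StreamReader(path))
        using (var reader = new JsonTextReader(streamReader))
        {
            // Allow several root-level values in one file.
            reader.SupportMultipleContent = true;

            while (reader.Read())
            {
                // The reader sits on the first token of a value;
                // ReadFrom consumes that value up to its closing token.
                yield return JToken.ReadFrom(reader);
            }
        }
    }

    public static void Main()
    {
        // Stand-in for the downloaded file (invented data).
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "[{\"firstname\":\"a\",\"lastname\":\"b\"}]\n" +
            "[{\"firstname\":\"c\",\"lastname\":\"d\"}]");

        foreach (JToken token in ReadTopLevelTokens(path))
        {
            foreach (JToken item in (JArray)token)
            {
                Console.WriteLine((string)item["firstname"] + " " + (string)item["lastname"]);
            }
        }

        File.Delete(path);
    }
}
```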

Fivestar answered 26/8, 2015 at 13:21 Comment(2)
It makes sense. I will try this and post the updates here. Thanks a mil.Perseverance
Look for Kristian's answer below. He has a code implementation; it's a pretty similar concept to what I explained above, but I like Kristian's approach better :)Fivestar
H
1

This might still be relevant to some now that the "new" System.Text.Json is out.

await using FileStream file = File.OpenRead("files/data.json");
var options = new JsonSerializerOptions {
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase
};

// Switch the JsonNode type with one of your own if
// you have a specific type you want to deserialize to.
IAsyncEnumerable<JsonNode?> enumerable = JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(file, options);

await foreach (JsonNode? obj in enumerable) {
    var firstname = obj?["firstname"]?.GetValue<string>();
}
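
As a sketch of the typed variant mentioned in the comment above: the Contact class, TypedStreamDemo, and ReadContactsAsync are hypothetical names, and the sample data is invented. One assumption worth calling out: the question's keys are all lowercase ("firstname"), which a camel-case policy would not match for a property named FirstName, so this sketch uses PropertyNameCaseInsensitive instead. As far as I know, this overload expects a single root-level array, so on its own it would not handle the back-to-back arrays from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical container class matching the sample JSON in the question.
public class Contact
{
    public string? FirstName { get; set; }
    public string? LastName { get; set; }
}

public static class TypedStreamDemo
{
    // Streams the elements of one root-level JSON array as Contact objects.
    public static IAsyncEnumerable<Contact?> ReadContactsAsync(Stream utf8Json)
    {
        var options = new JsonSerializerOptions
        {
            // Lets "firstname" bind to FirstName regardless of casing.
            PropertyNameCaseInsensitive = true
        };
        return JsonSerializer.DeserializeAsyncEnumerable<Contact>(utf8Json, options);
    }

    public static async Task Main()
    {
        // In-memory stand-in for the file stream (invented data).
        const string json =
            "[{\"firstname\":\"a\",\"lastname\":\"b\"}," +
            "{\"firstname\":\"c\",\"lastname\":\"d\"}]";

        await using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));

        await foreach (Contact? c in ReadContactsAsync(stream))
        {
            Console.WriteLine($"{c?.FirstName} {c?.LastName}");
        }
    }
}
```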

If you're interested in more, such as how to parse zipped JSON, there's this blog post that I wrote: Parsing 60GB Json Files using Streams in .NET.

Haw answered 23/8, 2022 at 8:36 Comment(3)
this is copy/paste from Medium.Pedropedrotti
@Pedropedrotti Sure it is, I literally wrote that medium article.Haw
@Haw that comment made my day :). Thanks for the great article.Betulaceous
F
0

Here is another easy way to parse a large JSON file, using Cinchoo ETL, an open source library (it uses JSON.NET under the hood to parse the JSON in a streaming manner).

using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead("*** YOUR JSON FILE URL ***"))
using (StreamReader streamReader = new StreamReader(stream))
using (var r = new ChoJSONReader<MyObject>(streamReader))
{
    foreach (var rec in r)
        Console.WriteLine(rec.Dump());
}

Sample fiddle: https://dotnetfiddle.net/i5qJ5R

Disclaimer: I'm the author of this library.

Fixed answered 27/5 at 1:54 Comment(0)
