Reading large JSON file in Deno

I often find myself reading a large JSON file (usually an array of objects) then manipulating each object and writing back to a new file.

To achieve this in Node (at least the data-reading portion), I usually do something like the following with the stream-json module.

const fs = require('fs');
const StreamArray = require('stream-json/streamers/StreamArray');

const pipeline = fs.createReadStream('sample.json')
  .pipe(StreamArray.withParser());

pipeline.on('data', data => {
    //do something with each object in file
});

I've recently discovered Deno and would love to be able to do this workflow with Deno.

It looks like the readJson method from the Standard Library reads the entire contents of the file into memory, so I don't know if it would be a good fit for processing a large file.
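For comparison, the non-streaming version on current Deno is only a couple of lines (a minimal sketch; it parses everything in memory at once, which is exactly what I'd like to avoid for a large file):

const objects = JSON.parse(await Deno.readTextFile('sample.json'));
for (const object of objects) {
    //do something with each object
}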

Is there a way this can be done by streaming the data from the file using some of the lower level methods that are built into Deno?

Hollister answered 23/9, 2019 at 21:21 Comment(1)
I don't think Deno has a streaming API yet, but it's one of the design goals. – Yurt

Circling back on this now that Deno 1.0 is out and in case anyone else is interested in doing something like this. I was able to piece together a small class that works for my use case. It's not nearly as robust as something like the stream-json package but it handles large JSON arrays just fine.

import { EventEmitter } from "https://deno.land/std/node/events.ts";

export class JSONStream extends EventEmitter {

    private openBraceCount = 0;
    private tempUint8Array: number[] = [];
    private decoder = new TextDecoder();

    constructor(private filepath: string) {
        super();
        //fire and forget; parsed objects are delivered via 'object' events
        this.stream();
    }

    async stream() {
        console.time("Run Time");
        const file = await Deno.open(this.filepath);
        //creates an async iterator from the reader, default buffer size is 32kb
        //(later Deno versions replace Deno.iter with iterateReader from std/streams)
        for await (const buffer of Deno.iter(file)) {

            for (let i = 0, len = buffer.length; i < len; i++) {
                const uint8 = buffer[i];

                //skip whitespace between values
                if (uint8 === 10 || uint8 === 13 || uint8 === 32) continue;

                //open brace: at depth 0 a new top-level object starts
                if (uint8 === 123) {
                    if (this.openBraceCount === 0) this.tempUint8Array = [];
                    this.openBraceCount++;
                }

                this.tempUint8Array.push(uint8);

                //close brace: back at depth 0 means one complete object
                if (uint8 === 125) {
                    this.openBraceCount--;
                    if (this.openBraceCount === 0) {
                        const uint8Ary = new Uint8Array(this.tempUint8Array);
                        const jsonString = this.decoder.decode(uint8Ary);
                        const object = JSON.parse(jsonString);
                        this.emit('object', object);
                    }
                }
            }
        }
        file.close();
        console.timeEnd("Run Time");
    }
}

Example usage

const stream = new JSONStream('test.json');

stream.on('object', (object: any) => {
    // do something with each object
});
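
The question also asks about writing each manipulated object back to a new file; that part isn't covered by the class above, but a minimal sketch could hook into the same event (the out.ndjson filename and the processed flag are placeholders):

const out = Deno.openSync('out.ndjson', { write: true, create: true, truncate: true });
const encoder = new TextEncoder();

stream.on('object', (object: any) => {
    object.processed = true; //placeholder manipulation
    //one JSON object per output line; writeSync keeps writes from interleaving
    out.writeSync(encoder.encode(JSON.stringify(object) + '\n'));
});

One caveat: the class doesn't emit an end event, so closing out at the right moment is left to the caller.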

Processing a ~4.8 MB JSON file with ~20,000 small objects in it:

[
    {
      "id": 1,
      "title": "in voluptate sit officia non nesciunt quis",
      "urls": {
         "main": "https://www.placeholder.com/600/1b9d08",
         "thumbnail": "https://www.placeholder.com/150/1b9d08"
      }
    },
    {
      "id": 2,
      "title": "error quasi sunt cupiditate voluptate ea odit beatae",
      "urls": {
          "main": "https://www.placeholder.com/600/1b9d08",
          "thumbnail": "https://www.placeholder.com/150/1b9d08"
      }
    }
    ...
]

Took 127 ms.

❯ deno run -A parser.ts
Run Time: 127ms
Hollister answered 21/5, 2020 at 4:30 Comment(2)
Thanks for the sample code; it seems it doesn't handle JSON strings that contain unbalanced {} pairs? I wonder whether there's a more battle-tested version available as of now in 2021, since this was answered May 21 '20. – Balbriggan
@Balbriggan see https://mcmap.net/q/1335962/-reading-large-json-file-in-deno for a full-blown streaming JSON parser library. – Anthracosis

I think a package like stream-json would be as useful on Deno as it is on Node.js, so one way to go would be to grab the source code of that package and make it work on Deno. (This answer may be outdated soon: there are lots of people out there who do such things, and it won't take long until someone – maybe you – makes their result public and importable into any Deno script.)

Alternatively, although this doesn't directly answer your question, a common pattern for handling large JSON data sets is to use files that contain one JSON object per line (newline-delimited JSON). Hadoop and Spark, AWS S3 Select, and probably many others use this format. If you can get your input data into that format, it opens up a lot more tooling. You could then stream the data with the readString('\n') method of BufReader in Deno's standard library: https://github.com/denoland/deno_std/blob/master/io/bufio.ts

This has the additional advantage of reducing dependency on third-party packages. Example code:

    import { BufReader } from "https://deno.land/std/io/bufio.ts";

    async function stream_file(filename: string) {
        const file = await Deno.open(filename);
        const bufReader = new BufReader(file);
        console.log('Reading data...');
        let line: string | null;
        let lineCount = 0;
        // readString returns null at end of file on Deno 1.0+ (pre-1.0 std used Deno.EOF)
        while ((line = await bufReader.readString('\n')) !== null) {
            lineCount++;
            // do something with `line`.
        }
        file.close();
        console.log(`${lineCount} lines read.`)
    }
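
If the input starts out as one big JSON array, a minimal one-off conversion sketch (not part of the original answer; it loads the whole input at once, so it is a preparation step rather than a streaming solution):

    const objects = JSON.parse(await Deno.readTextFile('sample.json'));
    const ndjson = objects.map((obj: unknown) => JSON.stringify(obj)).join('\n') + '\n';
    await Deno.writeTextFile('sample.ndjson', ndjson);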
Rage answered 22/10, 2019 at 7:58 Comment(1)
Hi Robert, I found your answer useful, and I want to know if it is possible to write line by line with bufio. I checked the documentation for Go's bufio and there seems to be a writeString method, but it doesn't exist in the Deno std module. Do you know how a polyfill should look? – Kroll
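
Regarding the comment above: std's BufWriter has no writeString, but encoding the string by hand gives the same effect. A minimal sketch, assuming the same-era BufWriter API (which takes raw bytes):

import { BufWriter } from "https://deno.land/std/io/bufio.ts";

const file = await Deno.open('out.txt', { write: true, create: true, truncate: true });
const writer = new BufWriter(file);
const encoder = new TextEncoder();

await writer.write(encoder.encode('one line\n'));
await writer.write(encoder.encode('another line\n'));
await writer.flush(); //buffered bytes only reach the file on flush (or when the buffer fills)
file.close();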

This is the code I used for a file with 13,147,089 lines of text. Notice it's the same as Robert's code but uses readLine() instead of readString('\n'). Per the std docs: readLine() is a low-level line-reading primitive; most callers should use readString('\n') instead, or use a Scanner.

import { BufReader, ReadLineResult } from "https://deno.land/std/io/bufio.ts";

export async function stream_file(filename: string) {
  const file = await Deno.open(filename);
  const bufReader = new BufReader(file);
  console.log("Reading data...");
  let result: ReadLineResult | null;
  let lineCount = 0;
  //readLine returns null at end of file on Deno 1.0+ (pre-1.0 std used Deno.EOF)
  while ((result = await bufReader.readLine()) !== null) {
    lineCount++;
    //result.line is a Uint8Array; decode it when the text is needed:
    //const line = new TextDecoder().decode(result.line);
  }
  file.close();
  console.log(`${lineCount} lines read.`);
}
Minyan answered 15/2, 2020 at 20:39 Comment(0)

July 2021 update: I had the same need and found no workable solution, so I wrote a library that solves exactly this problem for Deno: https://github.com/xtao-org/jsonhilo

Can be used like a typical SAX-based parser:

import {JsonHigh} from 'https://deno.land/x/jsonhilo/mod.js'
const stream = JsonHigh({
  openArray: () => console.log('<array>'),
  openObject: () => console.log('<object>'),
  closeArray: () => console.log('</array>'),
  closeObject: () => console.log('</object>'),
  key: (key) => console.log(`<key>${key}</key>`),
  value: (value) => console.log(`<value type="${typeof value}">${value}</value>`),
})
stream.push('{"tuple": [null, true, false, 1.2e-3, "[demo]"]}')

/* OUTPUT:
<object>
<key>tuple</key>
<array>
<value type="object">null</value>
<value type="boolean">true</value>
<value type="boolean">false</value>
<value type="number">0.0012</value>
<value type="string">[demo]</value>
</array>
</object>
*/
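
To drive the same handlers from a large file instead of a single string, a hedged sketch (not from the original answer; it assumes a recent Deno where open files expose a readable stream, and that push accepts arbitrary partial chunks, which is the point of a streaming parser):

const file = await Deno.open('test.json')
const decoder = new TextDecoder()
for await (const chunk of file.readable) {
  //stream: true keeps multi-byte characters that span chunk boundaries intact
  stream.push(decoder.decode(chunk, { stream: true }))
}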

It also has a unique low-level interface that enables very fast, lossless parsing (benchmarks here: https://github.com/xtao-org/jsonhilo-benchmarks).

It's released under MIT, so enjoy! I hope it solves your problems. :)

Anthracosis answered 23/7, 2021 at 22:10 Comment(0)
