Querying a large dataset in-browser using WebAssembly
For argument's sake, let's say that a browser allows WebAssembly applications 4GB of memory. Ignoring compression and other data-storage considerations: if a user had a 3GB local CSV file, we could query that data entirely in memory using WebAssembly (or JavaScript, of course). For example, if the user's data was of the following format:

ID  Country  Amount
1   US       12
2   GB       11
3   DE       7

Then in a few lines of code we could write a basic algorithm that filters to ID=2, i.e., the SQL equivalent of SELECT * FROM table WHERE id=2.
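As a minimal sketch (assuming the whole file has already been read into a string, with comma-separated columns matching the sample table; the function name is my own), the in-memory filter could look like this:

```javascript
// In-memory equivalent of SELECT * FROM table WHERE id = 2, assuming
// the entire CSV fits in memory as a single string.
function filterById(csvText, id) {
  const [header, ...rows] = csvText.trim().split("\n");
  const cols = header.split(",");
  return rows
    .map(line => line.split(","))
    .filter(fields => Number(fields[0]) === id)
    // Zip each row back up with the header names.
    .map(fields => Object.fromEntries(cols.map((c, i) => [c, fields[i]])));
}

const csv = "ID,Country,Amount\n1,US,12\n2,GB,11\n3,DE,7";
console.log(filterById(csv, 2)); // one row: ID 2, GB, 11
```

This is trivial precisely because the data fits in memory; the rest of the question is about what happens when it does not.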

Now, my question is whether it's possible in any browser (possibly with experimental flags and/or certain user preferences selected) to run a query against a file that would not fit in memory even if properly compressed. For example, in this blog post, a ~500GB file is loaded and then queried. I know that the 500GB of data is not loaded entirely into memory, and there's probably a column-oriented data structure so that only certain columns need to be read, but either way the OS has access to the file system, so files much larger than available memory can be used.

Is this possible to do in any way within a WebAssembly browser application? If so, what would be an outline of how it could be done? I know this question might require some research, so when it's available for a bounty I can add a 500-point bounty to encourage answers. (Note that the underlying language being used is C++ compiled to wasm, but I don't think that should matter for this question.)

I suppose one possibility might be along the lines of something like: https://rreverser.com/webassembly-shell-with-a-real-filesystem-access-in-a-browser/.

Olag answered 20/9, 2021 at 22:57 Comment(7)
This post is tagged with [rust], but if your underlying language is C++ would use of Rust crates and Rust wasm bindings be valuable to you?Compensatory
@Compensatory that's fine, either C++ or Rust is fine.Olag
I know that the 500GB of data is not loaded entirely in memory, and there's probably a column-oriented data structure so that only certain columns need to be read, - As you don't need to read the entire file, what exactly is the problem ?Poulterer
Although I'm aware this is unsatisfying, I think there's really not much point to doing this. From WASM in the browser, the only way you are going to interact with files is by importing browser functions into WASM and using the existing browser functionality. In this case, that's going to be probably FileReader and Blob. Since those APIs are asynchronous, you are also going to need to export/import a bunch of the Promise API. In the end, you'll be doing a lot of marshalling work for something that is much simpler to do in JS. If WASM had its own file API, things might be different...Autoionization
See also #51047646 for using FileReader from WASM (via Rust).Autoionization
@Olag I finally found the time to make the Rust implementation too ;)Leveller
@Leveller awesome, thank you.Olag
JavaScript File API

By studying the File API it turns out that when reading a file, the browser will always hand you a Blob. This gives the impression that the whole file is fetched by the browser into RAM. The Blob also has a .stream() function that returns a ReadableStream for streaming that very same Blob.

It turns out (at least in Chrome) that the Blob you are handed is virtual, and the underlying file is not loaded until requested. Neither slicing the file object nor instantiating a reader loads the entire file:

file.slice(file.size - 100)
(await reader.read()).value.slice(0, 100)

Here is a test Sandbox and the source code.

The example lets you select a file and will display the last 100 characters (using .slice()) and the first 100 using the ReadableStream (note that the stream does not have seek functionality).
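A minimal sketch of what the sandbox does, assuming Node >= 18's Blob as a stand-in for a user-selected File (both expose the same slice()/stream() API; the function name headAndTail is my own):

```javascript
// Read the tail via a lazy slice() and the head via the ReadableStream.
// Neither operation materializes the whole Blob in memory.
async function headAndTail(blob) {
  // Last 100 characters: only this byte range is actually read.
  const tail = await blob.slice(Math.max(0, blob.size - 100)).text();

  // First 100 characters: read the first chunk from the stream.
  // The stream cannot seek; it always starts at the beginning.
  const reader = blob.stream().getReader();
  const { value } = await reader.read();
  await reader.cancel();
  const head = new TextDecoder().decode(value).slice(0, 100);

  return { head, tail };
}
```

In the browser you would call `headAndTail(input.files[0])` from the file input's change handler instead of constructing a Blob yourself.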

I've tested this up to 10GB (the largest .csv I have lying around) and no additional RAM gets consumed by the browser.

This answers the first part of the question. With the capability to stream (or perform chunked access on) a file without consuming RAM, you can consume an arbitrarily large file and search for your content (binary search or a table scan).
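As a sketch of such a chunked table scan (the scanCsv name, chunk size, and predicate are all illustrative; byte-offset slicing assumes ASCII data, since a slice boundary could otherwise cut a multi-byte UTF-8 character, which a robust version would handle with a streaming TextDecoder):

```javascript
// Scan a CSV Blob chunk-by-chunk without ever holding the whole file in
// memory: only `chunkSize` bytes are materialized at a time, and the
// partial line at the end of each chunk is carried over to the next one.
async function scanCsv(blob, matches, chunkSize = 1 << 20) {
  const hits = [];
  let carry = "";
  for (let offset = 0; offset < blob.size; offset += chunkSize) {
    // slice() is lazy: only this byte range is actually read.
    const text = carry + await blob.slice(offset, offset + chunkSize).text();
    const lines = text.split("\n");
    carry = lines.pop(); // possibly incomplete last line
    for (const line of lines) if (matches(line)) hits.push(line);
  }
  if (carry && matches(carry)) hits.push(carry); // final line (no trailing \n)
  return hits;
}
```

With a sorted file you could replace the linear scan with a binary search over byte offsets, slicing a small window at each probe and re-aligning to the nearest newline.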

WebAssembly

In Rust using stdweb there is no .read() function (hence the content cannot be streamed). But File does have a .slice() function to slice the underlying blob (same as in JavaScript). This is a minimal working example:

#[macro_use]
extern crate stdweb;

use stdweb::js_export;

use std::convert::From;
use stdweb::web::IBlob;
use stdweb::web::File;
use stdweb::web::FileReader;
use stdweb::web::FileReaderResult;

#[js_export]
fn read_file(file: File) {
    // Slice only the first 2048 bytes; as in JavaScript, this does not
    // load the rest of the file into memory.
    let blob = file.slice(..2048);
    let len = stdweb::Number::from(blob.len() as f64);

    // Hand the slice back to JavaScript to inspect it.
    js! {
        var _len = @{len};
        console.log("length=" + _len);
        var _blob = @{blob};
        console.log(_blob);
    }
}

fn main() {
}
And the HTML page that feeds the selected file into the exported function:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>WASM</title>
</head>
<body>
    <input type="file" id="field" />

    <script src="the_compiled_wasm_binding.js"></script>
    <script>
        async function onChange(e) {
            const files = e.target.files;
            if (files.length === 0) return;
            const file = files[0];

            // Slice
            Rust.the_compiled_wasm_binding.then(module => {
                module.read_file(file);
            })
        }

        document.getElementById("field").onchange = onChange;
    </script>
</body>
</html>

The .slice() function behaves the same as in JavaScript (the entire file is NOT loaded into RAM), hence you can load chunks of the file in WASM and perform a search.

Please note that stdweb's implementation of slice() uses slice_blob(), which internally performs:

js! (
    return @{reference}.slice(@{start}, @{end}, @{content_type});
).try_into().unwrap()

As you can see, it uses JavaScript under the hood, so there is no optimization to be gained here.

Conclusions

IMHO the file-reading implementation is more efficient in JavaScript, because:

  • the stdweb::File API uses raw JavaScript under the hood (hence it cannot be faster), and
  • stdweb::File has fewer capabilities than its JavaScript counterpart (no streaming, among other missing functions).

The searching algorithm, on the other hand, could/should indeed be implemented in WASM. The algorithm can be handed a chunk (a Blob) directly for processing.

Leveller answered 23/9, 2021 at 19:30 Comment(5)
thanks for this. Could you please elaborate on this point: IMHO this implementation is more effective in javascript? For example, do you just mean the js (Chrome for example) c++ implementation of string-search is going to be much faster than passing blobs to WASM as the 'transfer' takes a long time? source.chromium.org/chromium/chromium/src/+/main:third_party/…Olag
@Olag (I edited the answer) The file reading part will be as efficient in JS as in WASM (using JS under-the-hood). So I would suggest reading chunks in JS by streaming them with FileReader (not available in stdweb) and then passing Blobs to WASM. If your search is as simple as a string search then I think you need no WASM at all (the string search is already optimized). For complex table scans / binary searches then go with WASM (but still only for Blob processing).Leveller
do you know if c/c++ via emscripten also just uses javascript under the hood for reading of the file and so it makes no difference whether the file is read in C++ or js?Olag
Maybe not, see emscripten.org/docs/api_reference/Filesystem-API.html#workerfs . But File with emscripten is even more messy. To make any assumption I would need to dive in the sources as I did for stdweb.Leveller
Need help! When I tried to compile rust code (with dependencies stdweb = "0.4.20", wasm-bindgen = "0.2.63") I got error: cannot find function __web_free in module stdweb::private. Mainly #[js_export] is not working as it is not finding stdweb::js_export.Aldis
