Querying a large dataset in-browser using WebAssembly
For argument's sake, let's say that a browser allows WebAssembly applications 4GB of memory. Ignoring compression and other data-storage considerations: if a user had a 3GB local CSV file, we could query that data entirely in memory using WebAssembly (or JavaScript, of course). For example, if the user's data was of the following format:

ID  Country  Amount
1   US       12
2   GB       11
3   DE       7

Then in a few lines of code we could write a basic algorithm that filters to ID=2, i.e., the SQL equivalent of SELECT * FROM table WHERE id=2.
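As a minimal sketch (assuming the whole file has already been read into a string, with comma-separated columns matching the sample table; the function name is my own), the in-memory filter could look like this:

```javascript
// In-memory equivalent of SELECT * FROM table WHERE id = 2, assuming
// the entire CSV fits in memory as a single string.
function filterById(csvText, id) {
  const [header, ...rows] = csvText.trim().split("\n");
  const cols = header.split(",");
  return rows
    .map(line => line.split(","))
    .filter(fields => Number(fields[0]) === id)
    // Zip each row back up with the header names.
    .map(fields => Object.fromEntries(cols.map((c, i) => [c, fields[i]])));
}

const csv = "ID,Country,Amount\n1,US,12\n2,GB,11\n3,DE,7";
console.log(filterById(csv, 2)); // one row: ID 2, GB, 11
```

This is trivial precisely because the data fits in memory; the rest of the question is about what happens when it does not.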

Now, my question is whether it's possible in any browser (possibly with experimental flags and/or certain user preferences selected) to run a query against a file that would not fit in memory even if properly compressed. For example, in this blog post, a ~500GB file is loaded and then queried. I know that the 500GB of data is not loaded entirely into memory, and there's probably a column-oriented data structure so that only certain columns need to be read, but either way the OS has access to the file system, so files much larger than available memory can be used.

Is this possible to do in any way within a WebAssembly browser application? If so, what would be an outline of how it could be done? I know this question might require some research, so when it's available for a bounty I can add a 500-point bounty to encourage answers. (Note that the underlying language being used is C++ compiled to wasm, but I don't think that should matter for this question.)

I suppose one possibility might be along the lines of something like: https://rreverser.com/webassembly-shell-with-a-real-filesystem-access-in-a-browser/.

Olag answered 20/9, 2021 at 22:57 Comment(7)
This post is tagged with [rust], but if your underlying language is C++ would use of Rust crates and Rust wasm bindings be valuable to you?Compensatory
@Compensatory that's fine, either C++ or Rust is fine.Olag
I know that the 500GB of data is not loaded entirely in memory, and there's probably a column-oriented data structure so that only certain columns need to be read, - As you don't need to read the entire file, what exactly is the problem ?Poulterer
Although I'm aware this is unsatisfying, I think there's really not much point to doing this. From WASM in the browser, the only way you are going to interact with files is by importing browser functions into WASM and using the existing browser functionality. In this case, that's going to be probably FileReader and Blob. Since those APIs are asynchronous, you are also going to need to export/import a bunch of the Promise API. In the end, you'll be doing a lot of marshalling work for something that is much simpler to do in JS. If WASM had its own file API, things might be different...Autoionization
See also #51047646 for using FileReader from WASM (via Rust).Autoionization
@Olag I finally found the time to make the Rust implementation too ;)Leveller
@Leveller awesome, thank you.Olag
JavaScript File API

By studying the File API it turns out that when reading a file, the browser will always hand you a Blob. This gives the impression that the whole file is fetched by the browser into RAM. The Blob also has a .stream() function that returns a ReadableStream for streaming that very same Blob.

It turns out (at least in Chrome) that the Blob you are handed is virtual, and the underlying file is not loaded until requested. Neither slicing the file object nor instantiating a reader loads the entire file:

file.slice(file.size - 100)
(await reader.read()).value.slice(0, 100)

Here is a test Sandbox and the source code.

The example lets you select a file and will display the last 100 characters (using .slice()) and the first 100 using the ReadableStream (note that the stream does not have seek functionality).
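A minimal sketch of what the sandbox does, assuming Node >= 18's Blob as a stand-in for a user-selected File (both expose the same slice()/stream() API; the function name headAndTail is my own):

```javascript
// Read the tail via a lazy slice() and the head via the ReadableStream.
// Neither operation materializes the whole Blob in memory.
async function headAndTail(blob) {
  // Last 100 characters: only this byte range is actually read.
  const tail = await blob.slice(Math.max(0, blob.size - 100)).text();

  // First 100 characters: read the first chunk from the stream.
  // The stream cannot seek; it always starts at the beginning.
  const reader = blob.stream().getReader();
  const { value } = await reader.read();
  await reader.cancel();
  const head = new TextDecoder().decode(value).slice(0, 100);

  return { head, tail };
}
```

In the browser you would call `headAndTail(input.files[0])` from the file input's change handler instead of constructing a Blob yourself.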

I've tested this up to 10GB (the largest .csv I have lying around) and no additional RAM gets consumed by the browser.

This answers the first part of the question. With the capability to stream (or perform chunked access on) a file without consuming RAM, you can consume an arbitrarily large file and search for your content (binary search or a table scan).
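As a sketch of such a chunked table scan (the scanCsv name, chunk size, and predicate are all illustrative; byte-offset slicing assumes ASCII data, since a slice boundary could otherwise cut a multi-byte UTF-8 character, which a robust version would handle with a streaming TextDecoder):

```javascript
// Scan a CSV Blob chunk-by-chunk without ever holding the whole file in
// memory: only `chunkSize` bytes are materialized at a time, and the
// partial line at the end of each chunk is carried over to the next one.
async function scanCsv(blob, matches, chunkSize = 1 << 20) {
  const hits = [];
  let carry = "";
  for (let offset = 0; offset < blob.size; offset += chunkSize) {
    // slice() is lazy: only this byte range is actually read.
    const text = carry + await blob.slice(offset, offset + chunkSize).text();
    const lines = text.split("\n");
    carry = lines.pop(); // possibly incomplete last line
    for (const line of lines) if (matches(line)) hits.push(line);
  }
  if (carry && matches(carry)) hits.push(carry); // final line (no trailing \n)
  return hits;
}
```

With a sorted file you could replace the linear scan with a binary search over byte offsets, slicing a small window at each probe and re-aligning to the nearest newline.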

WebAssembly

In Rust using stdweb there is no .read() function (hence the content cannot be streamed). But File does have a .slice() function to slice the underlying blob (same as in JavaScript). This is a minimal working example:

#[macro_use]
extern crate stdweb;

use stdweb::js_export;

use std::convert::From;
use stdweb::web::IBlob;
use stdweb::web::File;
use stdweb::web::FileReader;
use stdweb::web::FileReaderResult;

#[js_export]
fn read_file(file: File) {
    // Slice only the first 2048 bytes; as in JavaScript, this does not
    // load the rest of the file into memory.
    let blob = file.slice(..2048);
    let len = stdweb::Number::from(blob.len() as f64);

    // Hand the slice back to JavaScript to inspect it.
    js! {
        var _len = @{len};
        console.log("length=" + _len);
        var _blob = @{blob};
        console.log(_blob);
    }
}

fn main() {
}
And the HTML page that feeds the selected file into the exported function:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>WASM</title>
</head>
<body>
    <input type="file" id="field" />

    <script src="the_compiled_wasm_binding.js"></script>
    <script>
        async function onChange(e) {
            const files = e.target.files;
            if (files.length === 0) return;
            const file = files[0];

            // Slice
            Rust.the_compiled_wasm_binding.then(module => {
                module.read_file(file);
            })
        }

        document.getElementById("field").onchange = onChange;
    </script>
</body>
</html>

The .slice() function behaves the same as in JavaScript (the entire file is NOT loaded into RAM), hence you can load chunks of the file in WASM and perform a search.

Please note that stdweb's implementation of slice() uses slice_blob(), which internally performs:

js! (
    return @{reference}.slice(@{start}, @{end}, @{content_type});
).try_into().unwrap()

As you can see, it uses JavaScript under the hood, so there is no optimization to be gained here.

Conclusions

IMHO the file-reading implementation is more efficient in JavaScript, because:

  • the stdweb::File API uses raw JavaScript under the hood (hence it cannot be faster), and
  • stdweb::File has fewer capabilities than its JavaScript counterpart (no streaming, among other missing functions).

The searching algorithm, on the other hand, could/should indeed be implemented in WASM. The algorithm can be handed a chunk (a Blob) directly for processing.

Leveller answered 23/9, 2021 at 19:30 Comment(5)
thanks for this. Could you please elaborate on this point: IMHO this implementation is more effective in javascript? For example, do you just mean the js (Chrome for example) c++ implementation of string-search is going to be much faster than passing blobs to WASM as the 'transfer' takes a long time? source.chromium.org/chromium/chromium/src/+/main:third_party/…Olag
@Olag (I edited the answer) The file reading part will be as efficient in JS as in WASM (using JS under-the-hood). So I would suggest reading chunks in JS by streaming them with FileReader (not available in stdweb) and then passing Blobs to WASM. If your search is as simple as a string search then I think you need no WASM at all (the string search is already optimized). For complex table scans / binary searches then go with WASM (but still only for Blob processing).Leveller
do you know if c/c++ via emscripten also just uses javascript under the hood for reading of the file and so it makes no difference whether the file is read in C++ or js?Olag
Maybe not, see emscripten.org/docs/api_reference/Filesystem-API.html#workerfs . But File with emscripten is even more messy. To make any assumption I would need to dive in the sources as I did for stdweb.Leveller
Need help! When I tried to compile rust code (with dependencies stdweb = "0.4.20", wasm-bindgen = "0.2.63") I got error: cannot find function __web_free in module stdweb::private. Mainly #[js_export] is not working as it is not finding stdweb::js_export.Aldis
