How do I iterate over a large input file?
B

5

5

I'm trying to get access to an Iterator over the contents of a file uploaded via an input field.

I can pass the JS file into Wasm just fine via web-sys, but I cannot for the life of me figure out how to access anything other than the length and name of the passed file in Rust.

I think I could pass the whole file into Wasm as a ByteArray and iterate over that, but I would prefer to iterate straight over the file contents without copying, since the files themselves will be quite large (~1 GB).

I found in the Mozilla JS docs that I should be able to access the underlying file blob, get a ReadableStream from it via the .stream() method, and get a Reader from that stream, which can then be iterated over. But in web-sys, the .getReader() method of the ReadableStream returns a plain JsValue which I can't do anything useful with.

Am I missing something here or is this functionality simply missing in web-sys or is there some other way to do this? Maybe create the Iterator in JS and pass that to Rust?

Blase answered 11/6, 2021 at 23:42 Comment(4)
Have you tried casting the JsValue into a usable type using .dyn_into::<ReadableStream>().unwrap()? If you have any examples of code you tried, we can start from that. And maybe link the reference you mentioned... - Burkett
Or something like let reader: Reader = rstream.getReader().try_into().unwrap(); - Burkett
There is no Reader in web-sys. - Call
There's a bug report about this for wasm-bindgen with some pointers in it. - Konya
M
3

I managed to get a working example using read_as_binary_string.

Here's the code

lib.rs

use js_sys::JsString;
use std::cell::RefCell;
use std::rc::Rc;
use wasm_bindgen::prelude::*;
use wasm_bindgen::JsCast;
use web_sys::{console, Event, FileReader, HtmlInputElement};

#[wasm_bindgen(start)]
pub fn main_wasm() {
    let my_file: Rc<RefCell<Vec<u8>>> = Rc::new(RefCell::new(Vec::new()));
    set_file_reader(&my_file);
}

fn set_file_reader(file: &Rc<RefCell<Vec<u8>>>) {
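    // `FileReader::new()` already returns a `FileReader`, so no cast is needed.
    let filereader = FileReader::new().unwrap();
    // Second handle to the shared buffer, moved into the closure below.
    let my_file = Rc::clone(file);

    // Runs when the read finishes: copy the binary string into the shared buffer.
    let onload = Closure::wrap(Box::new(move |event: Event| {
        let element = event.target().unwrap().dyn_into::<FileReader>().unwrap();
        let data = element.result().unwrap();
        let file_string: JsString = data.dyn_into::<JsString>().unwrap();
        // Each UTF-16 code unit of a binary string holds one byte (0..=255).
        let file_vec: Vec<u8> = file_string.iter().map(|x| x as u8).collect();
        *my_file.borrow_mut() = file_vec;
        console::log_1(&format!("file loaded: {:?}", file_string).into());
    }) as Box<dyn FnMut(_)>);

    // Register the handler, then leak the closure so it outlives this function.
    filereader.set_onloadend(Some(onload.as_ref().unchecked_ref()));
    onload.forget();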

    let fileinput: HtmlInputElement = web_sys::window()
        .unwrap()
        .document()
        .expect("should have a document.")
        .create_element("input")
        .unwrap()
        .dyn_into::<HtmlInputElement>()
        .unwrap();

    fileinput.set_id("file-upload");
    fileinput.set_type("file");

    web_sys::window()
        .unwrap()
        .document()
        .unwrap()
        .body()
        .expect("document should have a body")
        .append_child(&fileinput)
        .unwrap();

    let callback = Closure::wrap(Box::new(move |event: Event| {
        let element = event
            .target()
            .unwrap()
            .dyn_into::<HtmlInputElement>()
            .unwrap();
        let filelist = element.files().unwrap();

        let _file = filelist.get(0).expect("should have a file handle.");
        filereader.read_as_binary_string(&_file).unwrap();
    }) as Box<dyn FnMut(_)>);

    fileinput
        .add_event_listener_with_callback("change", callback.as_ref().unchecked_ref())
        .unwrap();
    callback.forget();
}
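
A side note on the `x as u8` cast above: it silently truncates any code unit above 0xFF. For a binary string every unit should already fit in one byte, so the cast is harmless here, but a checked conversion makes that assumption explicit. A minimal sketch, assuming the input is an iterator of `u16` code units such as the one `JsString::iter()` yields; `binary_string_to_bytes` is a hypothetical helper name, not part of js-sys:

```rust
use std::convert::TryFrom; // needed on edition 2018; already in the 2021 prelude

/// Converts the UTF-16 code units of a "binary string" into bytes.
/// Returns `None` if any unit does not fit into a single byte.
/// (Hypothetical helper for illustration, not a js-sys API.)
fn binary_string_to_bytes<I>(units: I) -> Option<Vec<u8>>
where
    I: IntoIterator<Item = u16>,
{
    units
        .into_iter()
        // `u8::try_from` fails for values above 0xFF instead of truncating.
        .map(|unit| u8::try_from(unit).ok())
        // Collecting into `Option<Vec<_>>` stops at the first `None`.
        .collect()
}
```

In the handler above this could replace the `map(|x| x as u8)` line, e.g. `binary_string_to_bytes(file_string.iter())`, with the `None` case handled however suits the application.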

index.html

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
  </head>
  <body>
    <noscript
      >This page contains webassembly and javascript content, please enable
      javascript in your browser.</noscript
    >
    <script src="./stack.js"></script>
    <script>
      wasm_bindgen("./stack_bg.wasm");
    </script>
  </body>
</html>

and the Cargo.toml

[package]
name = "stack"
version = "0.1.0"
authors = [""]
edition = "2018"


[lib]
crate-type = ["cdylib", "rlib"]

[dependencies]
js-sys = "0.3.55"

wee_alloc = { version = "0.4.2", optional = true }


[dependencies.web-sys]
version = "0.3.4"
features = [
  'Document',
  'Window',
  'console',
  'Event',
  'FileReader',
  'File',
  'FileList',
  'HtmlInputElement']

[dev-dependencies]
wasm-bindgen-test = "0.2"

[dependencies.wasm-bindgen]
version = "0.2.70"

[profile.release]
# Tell `rustc` to optimize for small code size.
opt-level = "s"
debug = false


You can check the example working here: http://rustwasmfileinput.glitch.me/
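
Keep in mind that this reads the whole file into memory in one go, which may not be what you want for a ~1 GB upload. One possible workaround is to read the file in fixed-size slices and pass each slice to the FileReader in turn; as far as I know, web-sys exposes `Blob.slice()` as `slice_with_f64_and_f64`, but treat that name as an assumption and check the docs. The browser-independent part is computing the byte ranges. A minimal sketch, where `chunk_ranges` is a hypothetical helper name:

```rust
/// Yields half-open `(start, end)` byte ranges that cover `total` bytes
/// in pieces of at most `chunk` bytes each.
/// (Hypothetical helper for illustration, not part of web-sys.)
fn chunk_ranges(total: u64, chunk: u64) -> impl Iterator<Item = (u64, u64)> {
    // `step_by(0)` would panic, so reject a zero chunk size up front.
    assert!(chunk > 0, "chunk size must be non-zero");
    (0..total)
        .step_by(chunk as usize)
        // The last range is clamped so it never runs past the end of the file.
        .map(move |start| (start, (start + chunk).min(total)))
}
```

For example, `chunk_ranges(10, 4)` yields `(0, 4)`, `(4, 8)` and `(8, 10)`; each pair would be the start and end arguments of one slice.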

Monaghan answered 16/1, 2022 at 18:5 Comment(1)
Do add comments in the code. - Studer
M
2

Your best bet would be to use the wasm_streams crate, which bridges the Web stream APIs (like the ReadableStream you get from the .stream() method) to Rust async stream APIs.

The official example uses the Fetch API as a source, but this part is relevant for your File use case as well: https://github.com/MattiasBuelens/wasm-streams/blob/f6dacf58a8826dc67923ab4a3bae87635690ca64/examples/fetch_as_stream.rs#L25-L33

let body = ReadableStream::from_raw(raw_body.dyn_into().unwrap_throw());

// Convert the JS ReadableStream to a Rust stream
let mut stream = body.into_stream();

// Consume the stream, logging each individual chunk
while let Some(Ok(chunk)) = stream.next().await {
    console::log_1(&chunk);
}
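
One thing to watch with the `while let Some(Ok(chunk))` pattern: the loop also ends, silently, on the first `Err`. If a read error on a large file should be noticed, it may be better to propagate it. A minimal sketch of that idea, using a plain `Iterator` of `Result`s as a stand-in for the async stream; `total_bytes` is a hypothetical function name:

```rust
/// Sums the chunk lengths, stopping at (and returning) the first error.
/// A plain `Iterator` stands in for the async stream here.
/// (Hypothetical helper for illustration, not part of wasm_streams.)
fn total_bytes<I, E>(chunks: I) -> Result<usize, E>
where
    I: IntoIterator<Item = Result<Vec<u8>, E>>,
{
    let mut total = 0;
    for chunk in chunks {
        // `?` returns the error to the caller instead of ending the loop quietly.
        total += chunk?.len();
    }
    Ok(total)
}
```

The same shape carries over to the async version: `while let Some(chunk) = stream.next().await { let chunk = chunk?; ... }` inside a function that returns a `Result`.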
Manhole answered 1/7, 2021 at 18:51 Comment(0)
C
1

I think you can do something similar using FileReader.

Here is an example, where I log the text content of a file:

use wasm_bindgen::prelude::*;
use web_sys::{Event, FileReader, HtmlInputElement};

use wasm_bindgen::JsCast;

#[wasm_bindgen]
extern "C" {
    #[wasm_bindgen(js_namespace = console)]
    fn log(s: &str);
}

#[wasm_bindgen(start)]
pub fn main() -> Result<(), JsValue> {
    let window = web_sys::window().expect("no global `window` exists");
    let document = window.document().expect("should have a document on window");
    let body = document.body().expect("document should have a body");

    // `FileReader::new()` already returns a `FileReader`, so no cast is needed.
    let filereader = FileReader::new()?;

    let closure = Closure::wrap(Box::new(move |event: Event| {
        let element = event.target().unwrap().dyn_into::<FileReader>().unwrap();
        let data = element.result().unwrap();
        // `read_as_text` produces a JS string, so convert it to a Rust `String` directly.
        let rust_str: String = data.as_string().unwrap_or_default();
        log(rust_str.as_str());
    }) as Box<dyn FnMut(_)>);
 
    filereader.set_onloadend(Some(closure.as_ref().unchecked_ref()));
    closure.forget();

    let fileinput: HtmlInputElement = document.create_element("input").unwrap().dyn_into::<HtmlInputElement>()?;
    fileinput.set_type("file");

    let closure = Closure::wrap(Box::new(move |event: Event| {
        let element = event.target().unwrap().dyn_into::<HtmlInputElement>().unwrap();
        let filelist = element.files().unwrap();

        let file = filelist.get(0).unwrap();

        filereader.read_as_text(&file).unwrap();
        //log(filelist.length().to_string().as_str());
    }) as Box<dyn FnMut(_)>);
    fileinput.add_event_listener_with_callback("change", closure.as_ref().unchecked_ref())?;
    closure.forget();

    body.append_child(&fileinput)?;

    Ok(())
}

And the HTML:

<html>
  <head>
    <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  </head>
  <body>
    <script type="module">
      import init from './pkg/without_a_bundler.js';

      async function run() {
        await init();
      }

      run();
    </script>
  </body>
</html>

and Cargo.toml

[package]
name = "without-a-bundler"
version = "0.1.0"
authors = [""]
edition = "2018"

[lib]
crate-type = ["cdylib"]

[dependencies]
js-sys = "0.3.51"
wasm-bindgen = "0.2.74"

[dependencies.web-sys]
version = "0.3.4"
features = [
  'Blob',
  'BlobEvent',
  'Document',
  'Element',
  'Event',
  'File',
  'FileList',
  'FileReader',
  'HtmlElement',
  'HtmlInputElement',
  'Node',
  'ReadableStream',
  'Window',
]

However, I have no idea how to use get_reader() of ReadableStream. According to the linked documentation, it should return either a ReadableStreamDefaultReader or a ReadableStreamBYOBReader. The latter is experimental, so it is understandable that it is missing from web-sys, but I do not know why ReadableStreamDefaultReader is missing as well.

Call answered 13/6, 2021 at 10:22 Comment(0)
O
0

You should use ReadableStreamDefaultReader::new().

let stream: ReadableStream = response.body().unwrap();
let reader = ReadableStreamDefaultReader::new(&stream)?;

Then you can call ReadableStreamDefaultReader.read() the same way as in JS.

You will also need a struct to deserialize the result into:

#[derive(serde::Serialize, serde::Deserialize)]
struct ReadableStreamDefaultReadResult<T> {
    pub value: T,
    pub done: bool,
}

Here is an example of usage:

loop {
    let reader_promise = JsFuture::from(reader.read());
    let result = reader_promise.await?;

    let result: ReadableStreamDefaultReadResult<Option<Vec<u8>>> =
        serde_wasm_bindgen::from_value(result).unwrap();

    if result.done {
        break;
    }

    // `result.value` now holds the current chunk of bytes; process it here
}
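
Chunks arrive at arbitrary boundaries, so if you want to iterate over records such as lines, you need to carry the unfinished tail of one chunk over to the next. A minimal sketch of that, assuming newline-delimited data; `LineSplitter` is a hypothetical helper and not part of web-sys:

```rust
/// Buffers incoming chunks and hands out complete lines.
/// (Hypothetical helper for illustration, not part of web-sys.)
#[derive(Default)]
struct LineSplitter {
    /// Bytes of a line whose terminating `\n` has not been seen yet.
    pending: Vec<u8>,
}

impl LineSplitter {
    /// Feeds one chunk and returns every line it completes,
    /// without the trailing `\n`.
    fn push(&mut self, chunk: &[u8]) -> Vec<Vec<u8>> {
        let mut lines = Vec::new();
        for &byte in chunk {
            if byte == b'\n' {
                // Hand out the finished line and start a fresh buffer.
                lines.push(std::mem::take(&mut self.pending));
            } else {
                self.pending.push(byte);
            }
        }
        lines
    }

    /// Call once the reader reports `done`; returns the final
    /// unterminated line, if there is one.
    fn finish(self) -> Option<Vec<u8>> {
        if self.pending.is_empty() {
            None
        } else {
            Some(self.pending)
        }
    }
}
```

Inside the loop above you would call `push` with each chunk from `result.value`, and call `finish` after the `break`. Only the current chunk and one partial line are held at any time, so memory use stays small regardless of file size.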

Overcheck answered 24/6, 2022 at 18:20 Comment(0)
H
0

I will share here a repository I came across yesterday, which illustrates exactly this point.

In practice, it reads the first byte of the file. Considering that the result is instantaneous, it clearly does not load the whole file into memory but is reading from the File Handle.

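Also, you can use the FileSystemSyncAccessHandle, which provides a fine-grained read() method that accepts an offset. Note that it is only available inside dedicated Web Workers and only for files in the origin private file system.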

Headstrong answered 21/9, 2023 at 12:31 Comment(0)
