Reading a file line by line in JavaScript on the client side

Could you please help me with the following issue.

Goal

Read a file on the client side (in the browser via JS and HTML5 classes) line by line, without loading the whole file into memory.

Scenario

I'm working on a web page which should parse files on the client side. Currently, I'm reading the file as described in this article.

HTML:

<input type="file" id="files" name="files[]" />

JavaScript:

$("#files").on('change', function(evt){
    // creating FileReader
    var reader = new FileReader();

    // assigning handler
    reader.onloadend = function(evt) {      
        var lines = evt.target.result.split(/\r?\n/);

        lines.forEach(function (line) {
            parseLine(line);
        }); 
    };

    // getting File instance
    var file = evt.target.files[0];

    // start reading
    reader.readAsText(file);
});

The problem is that FileReader reads the whole file at once, which crashes the tab for big files (size >= 300 MB). Using reader.onprogress doesn't solve the problem, as it just accumulates the result until it hits the memory limit.

Reinventing the wheel

I've done some research on the internet and have found no simple way to do this (there are a bunch of articles describing this exact functionality, but on the server side for node.js).

The only way I see to solve it is the following:

  1. Split the file into chunks (via the File.slice(startByte, endByte) method)
  2. Find the last newline character ('\n') in that chunk
  3. Read that chunk, except the part after the last newline character, convert it to a string and split it into lines
  4. Read the next chunk starting from the last newline character found in step 2

But I'd rather use something that already exists, to avoid entropy growth.
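
For reference, a minimal sketch of steps 1-4 above (the helper name readFileByLines and the fixed chunk size are my own, and it assumes no single line is longer than one chunk):

function readFileByLines(file, onLine, onDone) {
    var CHUNK_SIZE = 1024 * 1024;               // 1 MB slices
    var decoder = new TextDecoder("utf-8");
    var offset = 0;

    function readNextChunk() {
        if (offset >= file.size) { onDone(); return; }

        var reader = new FileReader();
        reader.onload = function (e) {
            var bytes = new Uint8Array(e.target.result);
            var isLastChunk = offset + bytes.length >= file.size;

            // step 2: find the last newline (byte 10) in this chunk
            var cut = bytes.lastIndexOf(10);
            if (cut === -1 || isLastChunk) cut = bytes.length - 1;

            // step 3: decode only up to that newline and split into lines
            var lines = decoder.decode(bytes.subarray(0, cut + 1)).split(/\r?\n/);
            if (lines[lines.length - 1] === "") lines.pop(); // drop the empty string after a trailing '\n'
            lines.forEach(onLine);

            // step 4: the next chunk starts right after the last newline used
            offset += cut + 1;
            readNextChunk();
        };
        // step 1: slice the next chunk out of the File
        reader.readAsArrayBuffer(file.slice(offset, offset + CHUNK_SIZE));
    }

    readNextChunk();
}

Usage would then be something like readFileByLines(file, parseLine, function () { console.log("done"); });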

Joleenjolene answered 9/7, 2014 at 7:21 Comment(0)

Eventually I created a new line-by-line reader, which is totally different from the previous one.

Features are:

  • Index-based access to File (sequential and random)
  • Optimized for repeated random reading (milestones with byte offsets are saved for lines already navigated in the past), so after you've read the whole file once, accessing line 43422145 will be almost as fast as accessing line 12.
  • Searching in file: find next and find all.
  • Exact index, offset and length of matches, so you can easily highlight them

Check this jsFiddle for examples.

Usage:

// Initialization
var file; // HTML5 File object
var navigator = new FileNavigator(file);

// Read some amount of lines (best performance for sequential file reading)
navigator.readSomeLines(startingFromIndex, function (err, index, lines, eof, progress) { ... });

// Read exact amount of lines
navigator.readLines(startingFromIndex, count, function (err, index, lines, eof, progress) { ... });

// Find first from index
navigator.find(pattern, startingFromIndex, function (err, index, match) { ... });

// Find all matching lines
navigator.findAll(new RegExp(pattern), indexToStartWith, limitOfMatches, function (err, index, limitHit, results) { ... });

Performance is the same as the previous solution. You can measure it by invoking 'Read' in the jsFiddle.

GitHub: https://github.com/anpur/client-line-navigator/wiki

Joleenjolene answered 28/11, 2014 at 14:57 Comment(1)
npm package coming soon - Joleenjolene

Update: check the LineNavigator from my second answer instead; that reader is way better.

I've made my own reader, which fulfills my needs.

Performance

As the issue is related only to huge files, performance was the most important part.

[performance comparison chart]

As you can see, performance is almost the same as a direct read (as described in the question above). Currently I'm trying to make it better, as the biggest time consumer is the async call used to avoid hitting the call stack limit, which is not necessary for the processing itself. Update: the performance issue is solved.

Quality

Following cases were tested:

  • Empty file
  • Single line file
  • Files with and without a newline character at the end
  • Check parsed lines
  • Multiple runs on same page
  • No lines are lost and there are no ordering problems

Code & Usage

HTML:

<input type="file" id="file-test" name="files[]" />
<div id="output-test"></div>

Usage:

$("#file-test").on('change', function(evt) {
    var startProcessing = new Date();
    var index = 0;
    var file = evt.target.files[0];
    var reader = new FileLineStreamer();
    $("#output-test").html("");

    reader.open(file, function (lines, err) {
        if (err != null) {
            $("#output-test").append('<span style="color:red;">' + err + "</span><br />");
            return;
        }
        if (lines == null) {
            var millisecondsSpent = new Date() - startProcessing;
            $("#output-test").append("<strong>" + index + " lines processed</strong> Milliseconds spent: " + millisecondsSpent + "<br />");
            return;
        }

        // output every line
        lines.forEach(function (line) {
            index++;
            //$("#output-test").append(index + ": " + line + "<br />");
        });
        
        reader.getNextBatch();
    });
    
    reader.getNextBatch();  
});

Code:

function FileLineStreamer() {   
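    // Strategy: the file is sliced into big chunks; before a chunk is decoded as text,
    // a small "loophole" slice at the expected chunk end is read and scanned for the
    // last newline, so that every chunk is cut exactly on a line boundary.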
    var loopholeReader = new FileReader();
    var chunkReader = new FileReader(); 
    var delimiter = "\n".charCodeAt(0); 
    
    var expectedChunkSize = 15000000; // Slice size to read
    var loopholeSize = 200;         // Slice size to search for line end

    var file = null;
    var fileSize;   
    var loopholeStart;
    var loopholeEnd;
    var chunkStart;
    var chunkEnd;
    var lines;
    var thisForClosure = this;
    var handler;
    
    // Reading of loophole ended
    loopholeReader.onloadend = function(evt) {
        // Read error
        if (evt.target.readyState != FileReader.DONE) {
            handler(null, new Error("Not able to read loophole (start: )"));
            return;
        }
        var view = new DataView(evt.target.result);
        
        var realLoopholeSize = loopholeEnd - loopholeStart;     
        
        // Scan the loophole backwards for the last delimiter (newline) byte
        for(var i = realLoopholeSize - 1; i >= 0; i--) {
            if (view.getInt8(i) == delimiter) {
                chunkEnd = loopholeStart + i + 1;
                var blob = file.slice(chunkStart, chunkEnd);
                chunkReader.readAsText(blob);
                return;
            }
        }
        
        // No delimiter found, looking in the next loophole
        loopholeStart = loopholeEnd;
        loopholeEnd = Math.min(loopholeStart + loopholeSize, fileSize);
        thisForClosure.getNextBatch();
    };
    
    // Reading of chunk ended
    chunkReader.onloadend = function(evt) {
        // Read error
        if (evt.target.readyState != FileReader.DONE) {
            handler(null, new Error("Not able to read loophole"));
            return;
        }
        
        lines = evt.target.result.split(/\r?\n/);       
        // Remove last new line in the end of chunk
        if (lines.length > 0 && lines[lines.length - 1] == "") {
            lines.pop();
        }
        
        chunkStart = chunkEnd;
        chunkEnd = Math.min(chunkStart + expectedChunkSize, fileSize);
        loopholeStart = Math.min(chunkEnd, fileSize);
        loopholeEnd = Math.min(loopholeStart + loopholeSize, fileSize);
                
        thisForClosure.getNextBatch();
    };
    
    this.getProgress = function () {
        if (file == null)
            return 0;
        if (chunkStart == fileSize)
            return 100;         
        return Math.round(100 * (chunkStart / fileSize));
    }

    // Public: open file for reading
    this.open = function (fileToOpen, linesProcessed) {
        file = fileToOpen;
        fileSize = file.size;
        loopholeStart = Math.min(expectedChunkSize, fileSize);
        loopholeEnd = Math.min(loopholeStart + loopholeSize, fileSize);
        chunkStart = 0;
        chunkEnd = 0;
        lines = null;
        handler = linesProcessed;
    };

    // Public: start getting new line async
    this.getNextBatch = function() {
        // File wasn't open
        if (file == null) {     
            handler(null, new Error("You must open a file first"));
            return;
        }
        // Some lines available
        if (lines != null) {
            var linesForClosure = lines;
            setTimeout(function() { handler(linesForClosure, null) }, 0);
            lines = null;
            return;
        }
        // End of File
        if (chunkStart == fileSize) {
            handler(null, null);
            return;
        }
        // File part bigger than expectedChunkSize is left
        if (loopholeStart < fileSize) {
            var blob = file.slice(loopholeStart, loopholeEnd);
            loopholeReader.readAsArrayBuffer(blob);
        }
        // All file can be read at once
        else {
            chunkEnd = fileSize;
            var blob = file.slice(chunkStart, fileSize);
            chunkReader.readAsText(blob);
        }
    };
};
Joleenjolene answered 16/7, 2014 at 10:10 Comment(2)
Updated, faster version coming soon (with milestones to speed up random access to already read parts). - Joleenjolene
You can find the actual, proper version here: github.com/anpur/line-navigator - Joleenjolene

I have written a module named line-reader-browser for the same purpose. It uses Promises.

Syntax (TypeScript):

import { LineReader } from "line-reader-browser"

// file is the JavaScript File object returned from the input element
// chunkSize (optional) is the number of bytes to read from the file at a time; defaults to 8 * 1024
declare const file: File
declare const chunkSize: number
const lr = new LineReader(file, chunkSize)

// context is optional. It can be used inside processLineFn
const context = {}
lr.forEachLine(processLineFn, context)
  .then((context) => console.log("Done!", context))

// context is the same object that was passed to forEachLine
function processLineFn(line: string, index: number, context: any) {
   console.log(index, line)
}

Usage:

import { LineReader } from "line-reader-browser"

document.querySelector("input").onchange = () => {
   const input = document.querySelector("input")
   if (!input.files.length) return
   const lr = new LineReader(input.files[0], 4 * 1024)
   lr.forEachLine((line: string, i) => console.log(i, line)).then(() => console.log("Done!"))
}

Try the following code snippet to see the module working.

<html>
   <head>
      <title>Testing line-reader-browser</title>
   </head>
   <body>
      <input type="file">
      <script src="https://cdn.rawgit.com/Vikasg7/line-reader-browser/master/dist/tests/bundle.js"></script>
   </body>
</html>

Hope it saves someone's time!
Costumier answered 21/7, 2017 at 0:37 Comment(0)
