Node.js: read a very large file (~10 GB), process it line by line, then write to another file
I have a 10 GB log file in a particular format. I want to process this file line by line and then write the output to another file after applying some transformations. I am using Node.js for this operation.

Though this method works, it takes an extremely long time: I was able to do this within 30-45 minutes in Java, but in Node.js it is taking more than 160 minutes to do the same job. The code follows.

This is the initial code, which reads each line from the input:

var fs = require('fs');
var lazy = require('lazy');

var path = '../10GB_input_file.txt';
var output_file = '../output.txt';

function fileopsmain(){

    fs.exists(output_file, function(exists){
        if(exists) {
            fs.unlink(output_file, function (err) {
                if (err) throw err;
                console.log('successfully deleted ' + output_file);
            });
        }
    });

    new lazy(fs.createReadStream(path, {bufferSize: 128 * 4096}))
        .lines
        .forEach(function(line){
            var line_arr = line.toString().split(';');
            perform_line_ops(line_arr, line_arr[6], line_arr[7], line_arr[10]);
        }
    );

}

This is the method that performs some operations on each line and passes the result to the write method, which writes it into the output file:

function perform_line_ops(line_arr, range_start, range_end, daynums){

    var _new_lines = '';
    for(var i = 0; i < daynums; i++){
        // perform some operation to modify the line, then pass it on to be written
    }

    write_line_ops(_new_lines);
}

The following method is used to write data into the new file:

function write_line_ops(line) {
    if(line != null && line != ''){
        fs.appendFileSync(output_file, line);
    }
}

I want to bring this time down to 15-20 minutes. Is it possible to do so?

Also, for the record, I'm running this on an Intel i7 processor with 8 GB of RAM.

Jidda asked 17/7, 2015 at 15:29 Comment(10)
One operative question is whether the lazy module is reading the entire file into memory before processing it rather than streaming it line by line. You might be interested in the node-byline module. – Upgrade
The first step, if I were working on this, would be to time each step on a much smaller file to see what exactly is causing the slowdown. From there, you can begin to optimize that portion of the code. – Unmoor
@Upgrade No, the lazy module is not loading the entire file into memory; I am monitoring the memory usage as it runs. – Jidda
@Kevin B I'm doing the same: I'm working on a 400 MB file, which gets processed in ~2.5 minutes. Though I'm not exactly sure what is causing the problem here. – Jidda
What I would suggest is that you bound the problem first. Create a simple test app that just creates a read stream and reads through the entire file, without worrying about lines and without writing to disk. See how long that takes. If that is quick, you can add one piece of the puzzle at a time and track your progress as you go. Next, add piping it to a new file and see the performance. If the original reading is slow, then the problem is lower down in Node.js streaming and you will have to go lower level to fix the performance. – Upgrade
@Upgrade Thanks for the suggestion, but what I am trying to find out is whether I am using the correct approach here. Because if not, then at least I can be guided in the right direction. – Jidda
So, I would prove with several test apps whether plain Node.js streaming is performant enough before introducing lazy into the equation, so you know which sub-system is causing the issue. – Upgrade
And I'm advising you on how to decide whether this is an approach that will perform well enough. It all depends upon the performance profile of the tools you are using. There's nothing architecturally wrong with your approach unless the tools you are using simply aren't fast enough. – Upgrade
@Upgrade OK =). Then what tools do you suggest for performing this task? – Jidda
I suggested you write a simple test app to see if plain Node.js streams are fast enough for you. Remove all the other variables (like lazy and line processing) from the equation. Run a simple test app on your large file to just read it chunk by chunk using streams. I think I've already described this multiple times. You have to do a few tests to see what will work for you. – Upgrade
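
For reference, a minimal sketch of the read-only baseline test suggested in the comments above, assuming the same input path as in the question:

var fs = require('fs');

// baseline test: stream through the whole file without splitting lines
// or writing anything, and time how long the raw read takes
var start = Date.now();
var bytes = 0;

fs.createReadStream('../10GB_input_file.txt')
    .on('data', function(chunk) {
        bytes += chunk.length;
    })
    .on('end', function() {
        var seconds = (Date.now() - start) / 1000;
        console.log('read ' + bytes + ' bytes in ' + seconds + ' s');
    });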

You can do this easily without a module. For example:

var fs = require('fs');
var inspect = require('util').inspect;

var buffer = '';
var rs = fs.createReadStream('foo.log');
rs.on('data', function(chunk) {
  var lines = (buffer + chunk).split(/\r?\n/g);
  buffer = lines.pop();
  for (var i = 0; i < lines.length; ++i) {
    // do something with `lines[i]`
    console.log('found line: ' + inspect(lines[i]));
  }
});
rs.on('end', function() {
  // optionally process `buffer` here if you want to treat leftover data without
  // a newline as a "line"
  console.log('ended on non-empty buffer: ' + inspect(buffer));
});
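
For the write side of the same pattern, here is a minimal sketch that feeds a write stream and pauses the reader while the writer catches up; `transform()` and the output file name are hypothetical placeholders for the question's per-line logic and output path:

var fs = require('fs');

var rs = fs.createReadStream('foo.log', { encoding: 'utf8' });
var ws = fs.createWriteStream('out.log');
var buffer = '';

// hypothetical stand-in for the real per-line transformation
function transform(line) {
  return line;
}

rs.on('data', function(chunk) {
  var lines = (buffer + chunk).split(/\r?\n/g);
  buffer = lines.pop();
  var out = '';
  for (var i = 0; i < lines.length; ++i) {
    out += transform(lines[i]) + '\n';
  }
  // write() returns false when the writer's internal buffer is full;
  // pause the reader and resume once the writer has drained
  if (!ws.write(out)) {
    rs.pause();
    ws.once('drain', function() {
      rs.resume();
    });
  }
});

rs.on('end', function() {
  if (buffer.length > 0) {
    ws.write(transform(buffer) + '\n');
  }
  ws.end();
});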
Squamulose answered 17/7, 2015 at 16:33 Comment(7)
Yes, one can write their own line-handling code. But if you look at the OP's needs, they need to be able to access a number of lines at once, so now your code needs to add buffering of groups of lines and so on. The whole point of the OP's approach is to attempt to use existing tools that solve these problems for you rather than write your own from scratch. And it's not clear how this solves the OP's problem. – Upgrade
The OP's code is reading the file line by line, which is also exactly what my code is doing. My point is that in this particular case, doing it yourself is very simple while also ensuring that the entire file is not being buffered at once before processing. – Squamulose
You've made a wild guess at what is causing the performance issue and offered one piece of an alternate solution. It seems to me we don't really know yet where the performance issue is without running some tests. – Upgrade
@Squamulose I'm testing this code side by side, so this can be kept as an open question for now. This solution may be helpful. I will surely let everyone know the result of the above approach. – Jidda
@Squamulose It still takes the same amount of time. I guess there must be some delay in writing to the file. – Jidda
@Jidda What node/io.js version are you using? Also, you might show the Java code you're using to process the logs. – Squamulose
@Squamulose The Java code is simple, using I/O streams, and the logic applied for the transformations is the same. No third-party libraries are used in it. – Jidda

I can't guess where the possible bottleneck is in your code.

  • Can you add the library or the source code of the lazy function?
  • How many operations does your perform_line_ops do? (if/else, switch/case, function calls)

I've created an example based on your code. I know this does not answer your question directly, but maybe it helps you understand how Node handles such a case.

const fs = require('fs')
const path = require('path')

const inputFile = path.resolve(__dirname, '../input_file.txt')
const outputFile = path.resolve(__dirname, '../output_file.txt')

function bootstrap() {
    // fs.exists is deprecated
    // check if output file exists
    // https://nodejs.org/api/fs.html#fs_fs_exists_path_callback
    fs.exists(outputFile, (exists) => {
        if (exists) {
            // output file exists, delete it
            // https://nodejs.org/api/fs.html#fs_fs_unlink_path_callback
            fs.unlink(outputFile, (err) => {
                if (err) {
                    throw err
                }

                console.info(`successfully deleted: ${outputFile}`)
                checkInputFile()
            })
        } else {
            // output file doesn't exist, move on
            checkInputFile()
        }
    })
}

function checkInputFile() {
    // check if input file can be read
    // https://nodejs.org/api/fs.html#fs_fs_access_path_mode_callback
    fs.access(inputFile, fs.constants.R_OK, (err) => {
        if (err) {
            // file can't be read, throw error
            throw err
        }

        // file can be read, move on
        loadInputFile()
    })
}

function saveToOutput() {
    // create write stream
    // https://nodejs.org/api/fs.html#fs_fs_createwritestream_path_options
    const stream = fs.createWriteStream(outputFile, {
        flags: 'w'
    })

    // return wrapper function which simply writes data into the stream
    return (data) => {
        // check if the stream is writable
        if (stream.writable) {
            if (data === null) {
                stream.end()
            } else if (data instanceof Array) {
                // join the lines and add a trailing newline so
                // consecutive writes don't run together
                stream.write(data.join('\n') + '\n')
            } else {
                stream.write(data)
            }
        }
    }
}

function parseLine(line, respond) {
    respond([line])
}

function loadInputFile() {
    // create write stream
    const saveOutput = saveToOutput()
    // create read stream
    // https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
    const stream = fs.createReadStream(inputFile, {
        autoClose: true,
        encoding: 'utf8',
        flags: 'r'
    })

    let buffer = null

    stream.on('data', (chunk) => {
        // append the buffer to the current chunk
        const lines = (buffer !== null)
            ? (buffer + chunk).split('\n')
            : chunk.split('\n')

        const lineLength = lines.length
        let lineIndex = -1

        // save last line for later (last line can be incomplete)
        buffer = lines[lineLength - 1]

        // loop through all lines
        // but don't include the last line
        while (++lineIndex < lineLength - 1) {
            parseLine(lines[lineIndex], saveOutput)
        }
    })

    stream.on('end', () => {
        if (buffer !== null && buffer.length > 0) {
            // parse the last line
            parseLine(buffer, saveOutput)
        }

        // Passing null signals the end of the stream (EOF)
        saveOutput(null)
    })
}

// kick off the parsing process
bootstrap()
Discrepancy answered 28/4, 2017 at 10:44 Comment(0)

I know this is old but...

At a guess, appendFileSync() issues a write() to the file system and waits for the response. Lots of small writes are generally expensive; presuming you used a BufferedWriter in Java, you might get faster results in Node by skipping some of those write()s.

Use one of the async writes and see if Node buffers sensibly, or write the lines to a large Node Buffer until it is full and always write a full (or nearly full) Buffer. By tuning the buffer size you could verify whether the number of writes affects performance; I suspect it does.
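
A minimal sketch of that batching idea, assuming lines are produced one at a time; `writeLine` and `finish` are hypothetical stand-ins for the question's `write_line_ops` and an end-of-input hook, and the 1 MB threshold is just a starting point to tune:

const fs = require('fs');

// batch lines in memory and flush them in one write once the batch reaches
// FLUSH_SIZE, instead of issuing one appendFileSync() call per line
const FLUSH_SIZE = 1024 * 1024; // tune this to see how the number of writes affects performance
const out = fs.createWriteStream('../output.txt', { flags: 'w' });
let pending = '';

function writeLine(line) {
  pending += line;
  if (pending.length >= FLUSH_SIZE) {
    out.write(pending);
    pending = '';
  }
}

function finish() {
  if (pending.length > 0) {
    out.write(pending);
  }
  out.end();
}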

Fluxion answered 17/11, 2017 at 21:37 Comment(0)

The execution is slow because you're not using Node's asynchronous operations. In essence, you're executing the code like this:

> read some lines
> transform
> write some lines
> repeat

You could be doing everything at once, or at least overlapping the reading and writing. Some of the examples in the answers here do that, but the syntax is fairly complicated. Using scramjet you can do it in a couple of simple lines:

const fs = require('fs');
const {StringStream} = require('scramjet');

fs.createReadStream(path, {bufferSize: 128 * 4096})
    .pipe(new StringStream({maxParallel: 128}))   // I assume this is a utf-8 file
    .split("\n")                                  // split per line
    .parse((line) => line.split(';'))             // parse each line
    .map(([line_arr, range_start, range_end, daynums]) => {
        return simplyReturnYourResultForTheOtherFileHere(
            line_arr, range_start, range_end, daynums
        );                                        // run your code; return a promise if you're doing async work
    })
    .stringify((result) => result.toString())
    .pipe(fs.createWriteStream(output_file))
    .on("finish", () => console.log("done"))
    .on("error", (e) => console.log("error"));

This will probably run much faster.
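
As a usage note, scramjet is installed from npm (npm install scramjet); `path` and `output_file` in the snippet are assumed to be the same variables defined in the question.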

Munoz answered 19/11, 2017 at 22:14 Comment(0)
