Parse Remote CSV File using Nodejs / Papa Parse?
I am currently working on parsing a remote csv product feed from a Node app and would like to use Papa Parse to do that (as I have had success with it in the browser in the past).

Papa Parse Github: https://github.com/mholt/PapaParse

My initial attempts and web searching haven't turned up exactly how this would be done. The Papa readme says that Papa Parse is now compatible with Node, and as such Baby Parse (which used to provide some of the Node parsing functionality) has been deprecated.

Here's a link to the Node section of the docs for anyone stumbling on this issue in the future: https://github.com/mholt/PapaParse#papa-parse-for-node

From that doc paragraph it looks like Papa Parse in Node can parse a readable stream instead of a File. My question is:

Is there any way to use readable streams so that Papa can download and parse a remote CSV in Node, somewhat like Papa in the browser uses XMLHttpRequest to accomplish the same goal?

For future visibility: for those searching on this topic (and to avoid repeating a similar question), attempting to use the remote-file parsing functionality described here: http://papaparse.com/docs#remote-files will result in the following error in your console:

"Unhandled rejection ReferenceError: XMLHttpRequest is not defined"

I have opened an issue on the official repository and will update this Question as I learn more about the problems that need to be solved.

Contrapuntist answered 14/12, 2017 at 22:45 Comment(0)

Actually you could use a lightweight stream transformation library called scramjet - parsing CSV straight from an HTTP stream is one of my main examples. It also uses PapaParse to parse CSVs.

Everything you wrote above, with any transforms in between, can be done in just a couple of lines:

const {StringStream} = require("scramjet");
const request = require("request");

request.get("https://srv.example.com/main.csv")      // fetch the CSV
    .pipe(new StringStream())                        // pipe it into a scramjet StringStream
    .CSVParse()                                      // parse rows into objects
    .consume(object => console.log("Row:", object))  // do whatever you like with each row
    .then(() => console.log("all done"));

In your own example you're saving the file to disk, which is not necessary even with PapaParse.

Jubal answered 18/3, 2018 at 20:22 Comment(8)
Happy to hear this. :)Encephalic
.pipe(new StringStream) - when I used this I got the following error: [ts] Cannot use 'new' with an expression whose type lacks a call or construct signature.Frasch
Add parentheses, like this: new StringStream().Encephalic
Where does the .csvParse() method come from?Montalvo
That's a method exposed by scramjet.StringStream. See scramjet.org for more info.Encephalic
I keep getting the error request.get(...).pipe(...).csvParse is not a function...?Libava
Hi @SamSverko I'd be happy to help - can you open a new question and link to it here? I'd need to see a little more of your code to respond.Encephalic
@SamSverko now I've noticed - indeed it's CSVParse - I'll correct this above.Encephalic

After lots of tinkering I finally got a working example of this using asynchronous streams and with no additional libraries (except fs/request). It works for remote and local files.

I needed to create a data stream, as well as a PapaParse stream (using papa.NODE_STREAM_INPUT as the first argument to papa.parse()), then pipe the data into the PapaParse stream. Event listeners need to be implemented for the data and finish events on the PapaParse stream. You can then use the parsed data inside your handler for the finish event.

See the example below:

const papa = require("papaparse");
const request = require("request");

const options = {/* options */};

const dataStream = request.get("https://example.com/myfile.csv");
const parseStream = papa.parse(papa.NODE_STREAM_INPUT, options);

dataStream.pipe(parseStream);

let data = [];
parseStream.on("data", chunk => {
    data.push(chunk);
});

parseStream.on("finish", () => {
    console.log(data);
    console.log(data.length);
});

The data event on the parseStream runs once for each row in the CSV (though I'm not sure this behaviour is guaranteed). Hope this helps someone!

To use a local file instead of a remote file, you can do the same thing except the dataStream would be created using fs:

const dataStream = fs.createReadStream("./myfile.csv");

(You may want to use path.join and __dirname to specify a path relative to the file's location rather than relative to where the script was run.)

Funky answered 3/4, 2020 at 13:56 Comment(5)
David if this works (haven't tried it) it should be the accepted answer! Nice work man!Contrapuntist
This should be the accepted answer since it actually answers the question with papa parse, both for remote and local filesCoke
the best answerOversubtlety
Really helpful - just wondering, to handle errors would you use parseStream.on("error", ...)?Mulholland
Thank you very much. I would not have found this without you.Vosges

OK, so I think I have an answer to this, but I guess only time will tell. Note that my file is a .txt with tab delimiters.

var fs = require('fs');
var Papa = require('papaparse');
var file = './rawData/myfile.txt';
// When the file is local we need to read its contents first.
// This step may not be necessary when uploading via a UI.
var content = fs.readFileSync(file, "utf8");

var rows;
Papa.parse(content, {
    header: false,
    delimiter: "\t",
    complete: function(results) {
        // console.log("Finished:", results.data);
        rows = results.data;
    }
});
Tafia answered 9/3, 2018 at 14:23 Comment(0)

I am adding this answer (and will update it as I progress) in case anyone else is still looking into this.

It seems like previous users have ended up downloading the file first and then processing it. This SHOULD NOT be necessary, since Papa Parse should be able to process a read stream, and it should be possible to pipe an 'http' GET into that stream.

Here is one instance of someone discussing what I am trying to do and falling back to downloading the file and then parsing it: https://forums.meteor.com/t/processing-large-csvs-in-meteor-js-with-papaparse/32705/4

Note: in the above, Baby Parse is discussed; now that Papa Parse works with Node, Baby Parse has been deprecated.

Download File Workaround

While downloading and then Parsing with Papa Parse is not an answer to my question, it is the only workaround I have as of now and someone else may want to use this methodology.

My code to download and then parse currently looks something like this:

// Papa Parse for parsing CSV files
var Papa = require('papaparse');
// HTTP and FS to enable Papa Parse to download remote CSVs via Node streams.
var http = require('http');
var fs = require('fs');

var destinationFile = "yourdestination.csv";

var download = function(url, dest, cb) {
  var file = fs.createWriteStream(dest);
  var request = http.get(url, function(response) {
    response.pipe(file);
    file.on('finish', function() {
      file.close(cb);  // close() is async; call cb after close completes.
    });
  }).on('error', function(err) { // Handle errors
    fs.unlink(dest, function() {}); // Delete the file async. (We don't check the result.)
    if (cb) cb(err.message);
  });
};

// Parse the downloaded file once the download has finished.
// Note: in Node, Papa.parse expects a stream or a string, not a file path.
var parseMe = function() {
  Papa.parse(fs.createReadStream(destinationFile), {
    header: true,
    dynamicTyping: true,
    step: function(row) {
      console.log("Row:", row.data);
    },
    complete: function() {
      console.log("All done!");
    }
  });
};

download(feedURL, destinationFile, parseMe);
Contrapuntist answered 15/12, 2017 at 22:56 Comment(0)

The callback to https.get actually receives a readable stream (the response), so here is a simple solution:

const https = require("https");
const Papa = require("papaparse");

const config = {
   header: true,
   complete: (results) => console.log(results.data),
};

let streamHttp;
try {
   streamHttp = await new Promise((resolve, reject) =>
      https.get("https://example.com/yourcsv.csv", (res) => {
         resolve(res);
      }).on("error", reject)
   );
} catch (e) {
   console.log(e);
}

Papa.parse(streamHttp, config);
Fictile answered 24/7, 2019 at 15:36 Comment(0)
const Papa = require("papaparse");
const { StringStream } = require("scramjet");
const request = require("request");

const req = request
  .get("https://example.com/yourcsv.csv")
  .pipe(new StringStream());

Papa.parse(req, {
  header: true,
  complete: (result) => {
    console.log(result);
  },
});
Bowstring answered 7/8, 2020 at 14:24 Comment(1)
Congratulations on posting your first answer! It would be good to provide some context or guidance on why your answer is appropriate to the question.Pneumograph

David Liao's solution worked for me; I tweaked it a little since I am using a local file. He did not include an example of how to solve file access in Node if you get the Error: ENOENT: no such file or directory message in your console.

To test your actual working directory and to understand where you must point your path, console log the following; this gave me a better understanding of the file location: console.log(process.cwd()).

const fs = require('fs');
const papa = require('papaparse');
const request = require('request');
const path = require('path');

const options = {
  /* options */
};

const fileName = path.resolve(__dirname, 'ADD YOUR ABSOLUTE FILE LOCATION HERE');
const dataStream = fs.createReadStream(fileName);
const parseStream = papa.parse(papa.NODE_STREAM_INPUT, options);

dataStream.pipe(parseStream);

let data = [];
parseStream.on('data', chunk => {
  data.push(chunk);
});

parseStream.on('finish', () => {
  console.log(data);
  console.log(data.length);
});
Indemonstrable answered 11/1, 2023 at 14:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.