Passing string stored in memory to pdftotext, antiword, catdoc, etc

Is it possible to call CLI tools like pdftotext, antiword, and catdoc (text extractor scripts) and pass them a string instead of a file?

Currently, I read PDF files by calling pdftotext with child_process.spawn. I spawn a new process and store the result in a new variable. Everything works fine.

I'd like to pass the binary data returned by fs.readFile instead of the file path itself:

fs.readFile('./my.pdf', (error, binary) => {
    // Call pdftotext with child_process.spawn passing the binary.
    let event = child_process.spawn('pdftotext', [
        // Args here!
    ]);
});

How can I do that?

Crepuscular answered 22/7, 2016 at 19:9 Comment(0)

It's definitely possible, if the command can handle piped input.

spawn returns a ChildProcess object; you can pass a string (or binary data) held in memory to the command by writing to its stdin. One way is to convert the data to a Readable stream first and then pipe that stream into the command's stdin.

fs.createReadStream creates a Readable stream from a file.
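If the data is already in memory (for example, the Buffer handed to the fs.readFile callback), it can also be written to stdin directly, since stdin is a Writable stream. The following is a minimal sketch of that idea, not a definitive implementation; runWithInput is a hypothetical helper name, and it assumes the command reads its input from stdin and writes its result to stdout.

```javascript
const { spawn } = require('child_process')

// Hypothetical helper: run `command`, feed `input` (a Buffer or string) to its
// stdin, and resolve with everything the command wrote to stdout.
function runWithInput(command, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args)
    const chunks = []

    child.on('error', reject)        // e.g. the command could not be started
    child.stdin.on('error', reject)  // e.g. EPIPE if the command exits early
    child.stdout.on('data', chunk => chunks.push(chunk))
    child.on('close', code => {
      if (code === 0) resolve(Buffer.concat(chunks).toString())
      else reject(new Error(`${command} exited with code ${code}`))
    })

    // Write the whole in-memory value, then close stdin so the command sees EOF.
    child.stdin.end(input)
  })
}
```

With the question's setup, this could be called from the fs.readFile callback as runWithInput('pdftotext', ['-', '-'], binary).then(console.log), assuming pdftotext is installed and on the PATH.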

The following example downloads a PDF file, pipes its content to pdftotext, and then shows the first few bytes of the result.

const source = 'http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf'
const http = require('http')
const spawn = require('child_process').spawn

download(source).then(pdftotext)
.then(result => console.log(result.slice(0, 77)))

function download(url) {
  // Resolves with the response stream; rejects on connection errors.
  return new Promise((resolve, reject) =>
    http.get(url, resolve).on('error', reject))
}

function pdftotext(binaryStream) {
  //read input from stdin and write to stdout
  const command = spawn('pdftotext', ['-', '-'])
  binaryStream.pipe(command.stdin)

  return new Promise(resolve => {
    const result = []
    command.stdout.on('data', chunk => result.push(chunk.toString()))
    command.stdout.on('end', () => resolve(result.join('')))
  })
}

For CLIs that have no option to read from stdin, you can use named pipes.

Edit: Added another example with named pipes.

Once the named pipes are created, you can use them like files. The following example creates temporary named pipes to send input and receive output, and shows the first few bytes of the result.

const fs = require('fs')
const spawn = require('child_process').spawn

pipeCommand({
  name: 'wvText',
  input: fs.createReadStream('document.doc'),
}).then(result => console.log(result.slice(0, 77)))

function createPipe(name) {
  return new Promise(resolve =>
    spawn('mkfifo', [name]).on('exit', () => resolve()))
}

function pipeCommand({name, input}) {
  const inpipe = 'input.pipe'
  const outpipe = 'output.pipe'
  return Promise.all([inpipe, outpipe].map(createPipe)).then(() => {
    const result = []
    fs.createReadStream(outpipe)
    .on('data', chunk => result.push(chunk.toString()))
    .on('error', console.log)

    const command = spawn(name, [inpipe, outpipe]).on('error', console.log)
    input.pipe(fs.createWriteStream(inpipe).on('error', console.log))
    return new Promise(resolve =>
      command.on('exit', () => {
        // fs.unlink requires a callback in current Node versions, so remove the pipes synchronously
        [inpipe, outpipe].forEach(name => fs.unlinkSync(name))
        resolve(result.join(''))
      }))
  })
}
Beverage answered 23/8, 2016 at 13:8 Comment(12)
Hi @DarkKnight, thanks a lot! If I'm not asking too much, could you provide a working example with named pipes? It turns out that I'm using other scripts that don't support the other method. - Crepuscular
All of the tools you mentioned can accept stdin by specifying -. I added another example anyway. - Beverage
Hi DarkKnight, somehow I'm now seeing events.js:160 throw er; // Unhandled 'error' event ^ Error: EPIPE: broken pipe, write at Error (native). Do you know what could cause this? - Crepuscular
Most likely the command exits before consuming all the input data (and EOF). To inspect it further, you can add .on('error' ... to the streams and the child process. - Beverage
It's returning { Error: EPIPE: broken pipe, write at Error (native) errno: -32, code: 'EPIPE', syscall: 'write' }. Is it working on your machine? - Crepuscular
All of the examples work (node v6.4.0 on Linux). What command and input data did you use? Does it work with a regular input file? - Beverage
It works with a regular input file. I'm using antiword with your script. I'm also trying to do the following: codepen.io/anon/pen/EyBLWa?editors=0110 (which is an example of what I will do in my app). - Crepuscular
If I try to read a "real" file, everything works just fine. - Crepuscular
With cat document.doc > input.pipe & antiword input.pipe, antiword says: I can't get the size of 'input.pipe'. It tries to seek, but pipes are non-seekable. - Beverage
So it's an antiword problem. Is it fixable? - Crepuscular
There is something really strange: I can't execute antiword, but I can read the file with fs.readFile? lol.. :X - Crepuscular
This is really not that strange, @FXAMN. Pipes are not seekable, while ordinary files are. When reading a PDF file, you often need to seek, unless you are OK with reading the whole file into memory (in which case you may have a hard time processing large PDFs). Many tools accept input from a pipe, but they write it immediately into a file and only then process it. - Toback
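
For tools that need a seekable input, as the last comments describe, one workaround is to do what those tools do internally: write the in-memory data to a temporary regular file and pass its path to the command. The following is a rough sketch under that assumption; runWithTempFile and the 'extract-' directory prefix are hypothetical names, and it assumes the command takes the input path as its last argument and prints its result to stdout.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')
const { execFile } = require('child_process')

// Hypothetical helper: write `buffer` to a temporary regular (seekable) file,
// run `command` with that file's path appended to `args`, then clean up.
function runWithTempFile(command, args, buffer) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'extract-'))
  const file = path.join(dir, 'input')
  fs.writeFileSync(file, buffer)

  return new Promise((resolve, reject) => {
    // execFile buffers stdout in memory (limited by its maxBuffer option).
    execFile(command, [...args, file], (error, stdout) => {
      // Remove the temporary file and directory whether or not the command succeeded.
      fs.unlinkSync(file)
      fs.rmdirSync(dir)
      if (error) reject(error)
      else resolve(stdout)
    })
  })
}
```

For example, runWithTempFile('antiword', [], binary).then(console.log) would hand antiword an ordinary file it can seek in, assuming antiword is installed and binary holds the contents of a .doc file.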
