How to extract text from a PDF in JavaScript
A

10

74

I wonder if it is possible to get the text inside a PDF file using only JavaScript. If so, can anyone show me how?

I know there are some server-side Java, C#, etc. libraries, but I would prefer not to use a server. Thanks.

Aubrette answered 12/10, 2009 at 12:26 Comment(0)
I
95

pdf.js has developed considerably over the years, so I would like to give a new answer: it can be done locally, without involving any server or external service. Newer versions of pdf.js provide a function, page.getTextContent(), which returns the text content of a page. I've done it successfully with the following code.

  1. What you get in each step is a promise. You need to code this way: .then( function(){...}) to proceed to the next step.

     1) PDFJS.getDocument( data ).then( function(pdf) {

     2) pdf.getPage(i).then( function(page){

     3) page.getTextContent().then( function(textContent){

  2. What you finally get is an array of text blocks, textContent.bidiTexts[], each with a str property. You concatenate them to get the text of one page. The text blocks' coordinates are used to judge whether a newline or a space needs to be inserted. (This may not be totally robust, but from my tests it seems OK.)

  3. The input parameter data needs to be either a URL or ArrayBuffer data. I used the readAsArrayBuffer(file) function of the FileReader API to get the data.

Note: According to other users, the library has since been updated and the code below no longer works as written. According to the comment by async5 below, you need to replace textContent.bidiTexts with textContent.items.
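
Because the property that holds the text blocks was renamed between releases, a small helper can read whichever one is present. This is only a sketch: `getTextItems` is a hypothetical name, and it assumes just the two shapes mentioned in the note above (`items` in newer builds, `bidiTexts` in older ones).

```javascript
// Hypothetical helper (not part of pdf.js): returns the array of text blocks
// from a textContent object, whichever property name this build uses.
function getTextItems(textContent) {
  if (!textContent) return [];
  if (Array.isArray(textContent.items)) return textContent.items;         // newer builds
  if (Array.isArray(textContent.bidiTexts)) return textContent.bidiTexts; // older builds
  return [];
}

// Plain objects standing in for real pdf.js results.
const newerShape = { items: [{ str: 'Hello' }, { str: 'world' }] };
const olderShape = { bidiTexts: [{ str: 'Hello' }] };

console.log(getTextItems(newerShape).map(function (item) { return item.str; }).join(' '));
console.log(getTextItems(olderShape).length);
```

Code that loops over `getTextItems(textContent)` instead of a fixed property name should then keep working across both shapes.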

    function Pdf2TextClass(){
     var self = this;
     this.complete = 0;

    /**
     *
     * @param data ArrayBuffer of the pdf file content
     * @param callbackPageDone To inform the progress each time
     *        when a page is finished. The callback function's input parameters are:
     *        1) number of pages done;
     *        2) total number of pages in file.
     * @param callbackAllDone The input parameter of callback function is 
     *        the result of extracted text from pdf file.
     *
     */
     this.pdfToText = function(data, callbackPageDone, callbackAllDone){
     console.assert( data  instanceof ArrayBuffer  || typeof data == 'string' );
     PDFJS.getDocument( data ).then( function(pdf) {
     var div = document.getElementById('viewer');
    
     var total = pdf.numPages;
     callbackPageDone( 0, total );        
     var layers = {};        
     for (var i = 1; i <= total; i++){
        pdf.getPage(i).then( function(page){
        var n = page.pageNumber;
        page.getTextContent().then( function(textContent){
          if( null != textContent.bidiTexts ){
            var page_text = "";
            var last_block = null;
            for( var k = 0; k < textContent.bidiTexts.length; k++ ){
                var block = textContent.bidiTexts[k];
                if( last_block != null && last_block.str[last_block.str.length-1] != ' '){
                    if( block.x < last_block.x )
                        page_text += "\r\n"; 
                    else if ( last_block.y != block.y && ( last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null ))
                        page_text += ' ';
                }
                page_text += block.str;
                last_block = block;
            }

            textContent != null && console.log("page " + n + " finished."); //" content: \n" + page_text);
            layers[n] =  page_text + "\n\n";
          }
          ++ self.complete;
          callbackPageDone( self.complete, total );
          if (self.complete == total){
            window.setTimeout(function(){
              var full_text = "";
              var num_pages = Object.keys(layers).length;
              for( var j = 1; j <= num_pages; j++)
                  full_text += layers[j] ;
              callbackAllDone(full_text);
            }, 1000);              
          }
        }); // end  of page.getTextContent().then
      }); // end of page.then
    } // of for
  });
 }; // end of pdfToText()
}; // end of class
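
The same three promise steps can also be written with async/await, which removes the need for the completion counter and the one-second timeout: `Promise.all` resolves only once every page is done and keeps the results in page order. The sketch below is a variant under stated assumptions, not the code of this answer. `extractAllText` is a hypothetical name, it reads the newer `items` property, and it takes an already-loaded document object (anything exposing `numPages`, `getPage(n)` and `getTextContent()`), so the stub at the end stands in for a real PDF.

```javascript
// Hypothetical variant: collects the text of every page of an already-loaded
// document object. Only numPages, getPage(n) and getTextContent() are used.
async function extractAllText(pdf) {
  const pagePromises = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    pagePromises.push(
      pdf.getPage(n).then(async function (page) {
        const textContent = await page.getTextContent();
        // Assumes the newer `items` property; older builds used `bidiTexts`.
        const items = textContent.items || [];
        return items.map(function (item) { return item.str; }).join(' ');
      })
    );
  }
  // Promise.all preserves the order of pagePromises, so page 1 comes first
  // even if a later page finishes earlier.
  const pages = await Promise.all(pagePromises);
  return pages.join('\n\n');
}

// Stub document for demonstration only (not pdf.js): page 1 resolves later
// than page 2 to show that the page order is still preserved.
const stubPdf = {
  numPages: 2,
  getPage: function (n) {
    return new Promise(function (resolve) {
      setTimeout(function () {
        resolve({
          pageNumber: n,
          getTextContent: function () {
            return Promise.resolve({ items: [{ str: 'page' }, { str: String(n) }] });
          }
        });
      }, n === 1 ? 20 : 0);
    });
  }
};

extractAllText(stubPdf).then(function (text) { console.log(text); });
```

With the real library, the argument would be the document object that getDocument resolves to. Joining items with a single space ignores their coordinates, so this sketch does not reproduce the newline logic above.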
Infinitive answered 11/12, 2013 at 14:54 Comment(8)
Ancient question but excellent answer. Do you have any idea how to get the textLayer to render whole words instead of individual characters in separate divs? I'm getting quite a big performance hit from overlaying the text layer, because there are so many absolutely positioned divs. If you'd prefer this as a separate StackOverflow question, I'll make one. - Gearing
@Infinitive I have been trying to extract text from a PDF using your function, but I am unable to extract the text: full_text is an empty string at the end. Can you please help? - Conceivable
I couldn't get this to work either (the API has changed). Added my own example below. - Adenine
Adding a few more examples to the answer: github.com/mozilla/pdf.js/blob/master/examples/text-only/… and github.com/mozilla/pdf.js/blob/master/examples/node/getinfo.js - Beyond
Replace textContent.bidiTexts with textContent.items. - Beyond
Is there a way to use it with an input file that comes from the user? - Pavel
@Itsik Mauyhas: That is a separate question. What pdfToText() expects is an ArrayBuffer. In theory, you need to read the file (see https://mcmap.net/q/73560/-how-to-read-a-local-text-file-in-the-browser) and then convert the loaded text string into an ArrayBuffer (see https://mcmap.net/q/20884/-converting-between-strings-and-arraybuffers). I haven't tried it. Hope it helps. - Infinitive
@ItsikMauyhas: I don't think that comment is a different question, since the asker didn't say it was supposed to be for an online file. See my answer below (https://mcmap.net/q/270058/-how-to-extract-text-from-a-pdf-in-javascript), which uses nodejs to parse local files. - Biodynamics
A
13

I couldn't get gm2008's example to work (the internal data structure of pdf.js has apparently changed), so I wrote my own fully promise-based solution that doesn't use any DOM elements, query selectors, or canvas, using the updated pdf.js from the example at Mozilla.

It takes a file path as input, since I'm using it with node-webkit. You need to make sure the cmaps are downloaded and that cMapUrl points to them, and you need pdf.js and pdf.worker.js to get this working.

    /**
     * Extract text from PDFs with PDF.js
     * Uses the demo pdf.js from https://mozilla.github.io/pdf.js/getting_started/
     */
    this.pdfToText = function(data) {

        PDFJS.workerSrc = 'js/vendor/pdf.worker.js';
        PDFJS.cMapUrl = 'js/vendor/pdfjs/cmaps/';
        PDFJS.cMapPacked = true;

        return PDFJS.getDocument(data).then(function(pdf) {
            var pages = [];
            for (var i = 0; i < pdf.numPages; i++) {
                pages.push(i);
            }
            return Promise.all(pages.map(function(pageNumber) {
                return pdf.getPage(pageNumber + 1).then(function(page) {
                    return page.getTextContent().then(function(textContent) {
                        return textContent.items.map(function(item) {
                            return item.str;
                        }).join(' ');
                    });
                });
            })).then(function(pages) {
                return pages.join("\r\n");
            });
        });
    }

usage:

    self.pdfToText(files[0].path).then(function(result) {
        console.log("PDF done!", result);
    });
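
The later answers on this page call `getDocument(...).promise.then(...)` rather than `getDocument(...).then(...)`, which suggests that newer builds return a loading task instead of a promise. A small adapter can accept either form. This is a sketch under that assumption; `asDocumentPromise` is a hypothetical name, not a pdf.js function.

```javascript
// Hypothetical adapter: accepts the return value of getDocument() and always
// hands back a promise for the document.
function asDocumentPromise(loadingTask) {
  if (loadingTask && loadingTask.promise && typeof loadingTask.promise.then === 'function') {
    return loadingTask.promise;         // newer style: { promise: Promise }
  }
  return Promise.resolve(loadingTask);  // older style: already a promise or thenable
}

// Demonstration with stand-in values instead of a real getDocument() call.
const fakeDocument = { numPages: 3 };

asDocumentPromise({ promise: Promise.resolve(fakeDocument) })
  .then(function (pdf) { console.log('newer style, pages:', pdf.numPages); });

asDocumentPromise(Promise.resolve(fakeDocument))
  .then(function (pdf) { console.log('older style, pages:', pdf.numPages); });
```

In the function above, the call would then read `asDocumentPromise(PDFJS.getDocument(data)).then(...)`.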
Adenine answered 17/3, 2015 at 22:48 Comment(2)
See also github.com/mozilla/pdf.js/blob/master/examples/text-only/… and github.com/mozilla/pdf.js/blob/master/examples/node/getinfo.js - Beyond
"PDFJS.getDocument(...).then is not a function" - Mentalist
R
9

Just leaving here a full working sample.

<html>
    <head>
        <script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
    </head>
    <body>
        <input id="pdffile" name="pdffile" type="file" />
        <button id="btn" onclick="convert()">Process</button>
        <div id="result"></div>
    </body>
</html>

<script>

    function convert() {
        var fr=new FileReader();
        var pdff = new Pdf2TextClass();
        fr.onload=function(){
            pdff.pdfToText(fr.result, null, (text) => { document.getElementById('result').innerText += text; });
        }
        fr.readAsDataURL(document.getElementById('pdffile').files[0])
        
    }

    function Pdf2TextClass() {
        var self = this;
        this.complete = 0;

        this.pdfToText = function (data, callbackPageDone, callbackAllDone) {
            console.assert(data instanceof ArrayBuffer || typeof data == 'string');
            var loadingTask = pdfjsLib.getDocument(data);
            loadingTask.promise.then(function (pdf) {


                // numPages is the public property; _pdfInfo is internal.
                var total = pdf.numPages;
                //callbackPageDone( 0, total );
                var layers = {};
                for (var i = 1; i <= total; i++) {
                    pdf.getPage(i).then(function (page) {
                        var n = page.pageNumber;
                        page.getTextContent().then(function (textContent) {

                            //console.log(textContent.items[0]);0
                            if (null != textContent.items) {
                                var page_text = "";
                                var last_block = null;
                                for (var k = 0; k < textContent.items.length; k++) {
                                    var block = textContent.items[k];
                                    if (last_block != null && last_block.str[last_block.str.length - 1] != ' ') {
                                        if (block.x < last_block.x)
                                            page_text += "\r\n";
                                        else if (last_block.y != block.y && (last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null))
                                            page_text += ' ';
                                    }
                                    page_text += block.str;
                                    last_block = block;
                                }

                                textContent != null && console.log("page " + n + " finished."); //" content: \n" + page_text);
                                layers[n] = page_text + "\n\n";
                            }
                            ++self.complete;
                            //callbackPageDone( self.complete, total );
                            if (self.complete == total) {
                                window.setTimeout(function () {
                                    var full_text = "";
                                    var num_pages = Object.keys(layers).length;
                                    for (var j = 1; j <= num_pages; j++)
                                        full_text += layers[j];
                                    callbackAllDone(full_text);
                                }, 1000);
                            }
                        }); // end  of page.getTextContent().then
                    }); // end of page.then
                } // of for
            });
        }; // end of pdfToText()
    }; // end of class

</script>
Rubalcava answered 2/10, 2021 at 12:13 Comment(1)
> Deprecated API usage: No "GlobalWorkerOptions.workerSrc" specified. - Behling
S
7

Here's some JavaScript code that does what you want using Pdf.js from http://hublog.hubmed.org/archives/001948.html:

var input = document.getElementById("input");  
var processor = document.getElementById("processor");  
var output = document.getElementById("output");  

// listen for messages from the processor  
window.addEventListener("message", function(event){  
  if (event.source != processor.contentWindow) return;  

  switch (event.data){  
    // "ready" = the processor is ready, so fetch the PDF file  
    case "ready":  
      var xhr = new XMLHttpRequest;  
      xhr.open('GET', input.getAttribute("src"), true);  
      xhr.responseType = "arraybuffer";  
      xhr.onload = function(event) {  
        processor.contentWindow.postMessage(this.response, "*");  
      };  
      xhr.send();  
    break;  

    // anything else = the processor has returned the text of the PDF  
    default:  
      output.textContent = event.data.replace(/\s+/g, " ");  
    break;  
  }  
}, true);

...and here's an example:

http://git.macropus.org/2011/11/pdftotext/example/

Substructure answered 14/9, 2012 at 17:1 Comment(2)
While those links may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - Elisabetta
Hi, I'm trying this, but it still requires the file to be uploaded to a server. How can I process files locally, client-side? - Despatch
B
4

Note: This code assumes you're using nodejs. That means you're parsing a local file instead of one from a web page since the original question doesn't explicitly ask about parsing pdfs on a web page.

@gm2008's answer was a great starting point (please read it and its comments for more info), but it needed some updates (08/19) and had some unused code. I also like examples that are more complete. There's more refactoring and tweaking that could be done (e.g. with await), but for now it's as close to that original answer as it could be.

As before, this uses Mozilla's PDFjs library. The npmjs package is at https://www.npmjs.com/package/pdfjs-dist.

In my experience, this doesn't do well in finding where to put spaces, but that's a problem for another time.

[Edit: I believe the update to the use of .transform has restored the whitespace as it originally behaved.]

// This file is called myPDFfileToText.js and is in the root folder
let PDFJS = require('pdfjs-dist');

let pathToPDF = 'path/to/myPDFfileToText.pdf';

// `new` is required: without it, `this` inside Pdf2TextObj can be undefined.
let toText = new Pdf2TextObj();
let onPageDone = function() {}; // don't want to do anything between pages
let onFinish = function(fullText) { console.log(fullText) };
toText.pdfToText(pathToPDF, onPageDone, onFinish);

function Pdf2TextObj() {
    let self = this;
    this.complete = 0;

    /**
     *
     * @param path Path to the pdf file.
     * @param callbackPageDone To inform the progress each time
     *        when a page is finished. The callback function's input parameters are:
     *        1) number of pages done.
     *        2) total number of pages in file.
     *        3) the `page` object itself or null.
     * @param callbackAllDone Called after all text has been collected. Input parameters:
     *        1) full text of parsed pdf.
     *
     */
    this.pdfToText = function(path, callbackPageDone, callbackAllDone) {
        // console.assert(typeof path == 'string');
        PDFJS.getDocument(path).promise.then(function(pdf) {

            let total = pdf.numPages;
            callbackPageDone(0, total, null);

            let pages = {};
            // For some (pdf?) reason these don't all come in consecutive
            // order. That's why they're stored as an object and then
            // processed one final time at the end.
            for (let pagei = 1; pagei <= total; pagei++) {
                pdf.getPage(pagei).then(function(page) {
                    let pageNumber = page.pageNumber;
                    page.getTextContent().then(function(textContent) {
                        if (null != textContent.items) {
                            let page_text = "";
                            let last_item = null;
                            for (let itemsi = 0; itemsi < textContent.items.length; itemsi++) {
                                let item = textContent.items[itemsi];
                                // I think to add whitespace properly would be more complex and
                                // would require two loops.
                                if (last_item != null && last_item.str[last_item.str.length - 1] != ' ') {
                                    // transform is [a, b, c, d, e, f]: index 4 is the
                                    // x offset and index 5 is the y offset.
                                    let itemY = item.transform[5]
                                    let lastItemY = last_item.transform[5]
                                    let itemX = item.transform[4]
                                    let lastItemX = last_item.transform[4]
                                    if (itemY < lastItemY)
                                        page_text += "\r\n";
                                    else if (itemX != lastItemX && (last_item.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null))
                                        page_text += ' ';
                                } // ends if may need to add whitespace

                                page_text += item.str;
                                last_item = item;
                            } // ends for every item of text

                            textContent != null && console.log("page " + pageNumber + " finished.") // " content: \n" + page_text);
                            pages[pageNumber] = page_text + "\n\n";
                        } // ends if has items

                        ++self.complete;

                        callbackPageDone(self.complete, total, page);


                        // If all done, put pages in order and combine all
                        // text, then pass that to the callback
                        if (self.complete == total) {
                            // Using `setTimeout()` isn't a stable way of making sure 
                            // the process has finished. Watch out for missed pages.
                            // A future version might do this with promises.
                            setTimeout(function() {
                                let full_text = "";
                                let num_pages = Object.keys(pages).length;
                                for (let pageNum = 1; pageNum <= num_pages; pageNum++)
                                    full_text += pages[pageNum];
                                callbackAllDone(full_text);
                            }, 1000);
                        }
                    }); // ends page.getTextContent().then
                }); // ends page.then
            } // ends for every page
        });
    }; // Ends pdfToText()

    return self;
}; // Ends object factory

Run in the terminal:

node myPDFfileToText.js
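
As for the spacing problem mentioned above, one possible approach is to group the items into lines by their vertical position before joining them. The sketch below is an assumption, not part of this answer: `itemsToLines` and its `tolerance` parameter are hypothetical, and it relies on `transform[4]` and `transform[5]` holding each item's x and y offsets.

```javascript
// Hypothetical helper: groups text items into lines by their y offset and
// joins the items of each line from left to right.
function itemsToLines(items, tolerance) {
  const tol = tolerance === undefined ? 2 : tolerance;
  const lines = []; // each entry: { y: number, items: [] }
  items.forEach(function (item) {
    const y = item.transform[5];
    // Reuse an existing line if this item sits at (nearly) the same height.
    let line = lines.find(function (l) { return Math.abs(l.y - y) <= tol; });
    if (!line) {
      line = { y: y, items: [] };
      lines.push(line);
    }
    line.items.push(item);
  });
  // In the default PDF coordinate system y grows upwards, so a larger y is
  // higher on the page.
  lines.sort(function (a, b) { return b.y - a.y; });
  return lines.map(function (line) {
    return line.items
      .slice()
      .sort(function (a, b) { return a.transform[4] - b.transform[4]; })
      .map(function (item) { return item.str; })
      .join(' ')
      .replace(/\s+/g, ' ')
      .trim();
  }).join('\n');
}

// Plain objects standing in for pdf.js text items.
const sampleItems = [
  { str: 'world', transform: [1, 0, 0, 1, 60, 700] },
  { str: 'Hello', transform: [1, 0, 0, 1, 10, 700] },
  { str: 'Second line', transform: [1, 0, 0, 1, 10, 680] }
];
console.log(itemsToLines(sampleItems));
```

Joining with a space can still split a word that the PDF stores as several items, so this remains a heuristic; comparing the gap between neighbouring items with their width would be a further refinement.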

Biodynamics answered 13/8, 2019 at 14:42 Comment(1)
"Cannot set property 'complete' of undefined" - Mentalist
S
2

Updated 02/2021

<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
    <script>
    
function Pdf2TextClass(){
    var self = this;
    this.complete = 0;

    this.pdfToText = function(data, callbackPageDone, callbackAllDone){
    console.assert( data  instanceof ArrayBuffer  || typeof data == 'string' );
    var loadingTask = pdfjsLib.getDocument(data);
    loadingTask.promise.then(function(pdf) {


    // numPages is the public property; _pdfInfo is internal.
    var total = pdf.numPages;
    //callbackPageDone( 0, total );
    var layers = {};
    for (var i = 1; i <= total; i++){
       pdf.getPage(i).then( function(page){
       var n = page.pageNumber;
       page.getTextContent().then( function(textContent){
       
       //console.log(textContent.items[0]);0
         if( null != textContent.items ){
           var page_text = "";
           var last_block = null;
           for( var k = 0; k < textContent.items.length; k++ ){
               var block = textContent.items[k];
               if( last_block != null && last_block.str[last_block.str.length-1] != ' '){
                   if( block.x < last_block.x )
                       page_text += "\r\n"; 
                   else if ( last_block.y != block.y && ( last_block.str.match(/^(\s?[a-zA-Z])$|^(.+\s[a-zA-Z])$/) == null ))
                       page_text += ' ';
               }
               page_text += block.str;
               last_block = block;
           }

           textContent != null && console.log("page " + n + " finished."); //" content: \n" + page_text);
           layers[n] =  page_text + "\n\n";
         }
         ++ self.complete;
         //callbackPageDone( self.complete, total );
         if (self.complete == total){
           window.setTimeout(function(){
             var full_text = "";
             var num_pages = Object.keys(layers).length;
             for( var j = 1; j <= num_pages; j++)
                 full_text += layers[j] ;
             console.log(full_text);
           }, 1000);              
         }
       }); // end  of page.getTextContent().then
     }); // end of page.then
   } // of for
 });
}; // end of pdfToText()
}; // end of class
var pdff = new Pdf2TextClass();
pdff.pdfToText('PDF_URL');
    </script>
Shotten answered 15/2, 2021 at 5:15 Comment(0)
R
1

@SchizoDuckie's solution, made shorter:

import { getDocument as loadPdf } from 'pdfjs-dist';

...

async function pdfToTxt(file: File): Promise<string> {

  const pdf = await loadPdf(await file.arrayBuffer()).promise;

  return Promise.all([...Array(pdf.numPages).keys()]
    .map(async num => (await (await pdf.getPage(num + 1)).getTextContent())
      .items.map(item => (<any>item).str).join(' ')))
    .then(pages => pages.join('\n'));

}
Run answered 5/4, 2023 at 22:18 Comment(4)
I needed to add import "pdfjs-dist/build/pdf.worker.entry" for this to work. github.com/mozilla/pdf.js/issues/10478#issuecomment-1560704162. Thank you Tom for your solution. - Supplication
I get the error "Attempted import error: 'getDocument' is not exported from 'pdfjs-dist' (imported as 'loadPdf')." I am using the next.js app router, and this is in an API route. The d.ts file for the library shows getDocument being exported. I don't know why it's not working. Found a related issue: github.com/vercel/next.js/issues/58313 - Pivotal
@Run any chance you can provide an example of how you are reading the PDF file and passing it into the pdfToTxt() function? - Libbielibbna
There is a type="file" <input> field on the page. In its onchange event you can access its files property, which is of FileList type. Calling item(0) on this file list gives you the file, which you can pass to the function. - Run
T
0
npm install pdf-parse

required file:

/node_modules/pdf-parse/lib/pdf.js/v2.0.550/build/pdf.js

that loads:

pdf.worker.js

usage:

var pdf = await pdfjsLib.getDocument({ data: new Uint8Array(buffer) }).promise;
var numPages = pdf.numPages;
var texts = [];

for (let i = 1; i <= numPages; i++) {
    let page = await pdf.getPage(i);
    let textContent = await page.getTextContent();
    let textItems = textContent.items;
    let pageText = textItems.map(item => item.str).join(" ").replace(/\s+/g," ");
    texts.push(pageText);
}

console.log(texts); 

buffer can come from:

var input = document.querySelector("input[type=file]");
input.onchange = function () {
    var file = this.files[0];
    var reader = new FileReader();
    reader.onload = async () => {
        var buffer = reader.result;
        // use buffer
    }

    if (file && file.type == "application/pdf") {
        reader.readAsArrayBuffer(file);
    } 

}
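
The `buffer` passed to `new Uint8Array(buffer)` above may be an `ArrayBuffer` (from `FileReader`) or, in Node, a `Buffer` read from disk. A small normalizer keeps the call site the same in both cases. This is a sketch; `toUint8Array` is a hypothetical name, not part of pdf-parse or pdf.js.

```javascript
// Hypothetical helper: normalizes binary input to a Uint8Array.
function toUint8Array(data) {
  if (data instanceof ArrayBuffer) {
    return new Uint8Array(data);
  }
  if (ArrayBuffer.isView(data)) {
    // Covers typed arrays, DataView and Node's Buffer (a Uint8Array subclass).
    // The result is a view over exactly the bytes in use, without copying.
    return new Uint8Array(data.buffer, data.byteOffset, data.byteLength);
  }
  throw new TypeError('Expected an ArrayBuffer or a typed array');
}

// A PDF file starts with the bytes "%PDF"; check them on a stand-in buffer.
const bytes = toUint8Array(new Uint8Array([0x25, 0x50, 0x44, 0x46]).buffer);
const header = String.fromCharCode.apply(null, bytes.subarray(0, 4));
console.log(header === '%PDF');
```

The result can then be passed as the `data` field, as in `getDocument({ data: toUint8Array(buffer) })`.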
Triglyceride answered 13/3 at 22:46 Comment(0)
B
-2

For all the people who actually want to use it on a node server:

/**
 * Created by velten on 25.04.16.
 */
"use strict";
let pdfUrl = "http://example.com/example.pdf";
let request = require('request');
let PDFParser = require('pdf2json');
// pipe() needs a parser instance, not the exported class
let pdfParser = new PDFParser();

let pdfPipe = request({url: pdfUrl, encoding:null}).pipe(pdfParser);

pdfPipe.on("pdfParser_dataError", err => console.error(err) );
pdfPipe.on("pdfParser_dataReady", pdf => {
    //optionally:
    //let pdf = pdfParser.getMergedTextBlocksIfNeeded();

    let count1 = 0;
    //count the text blocks on every page
    for (let page of pdf.formImage.Pages) {
        count1 += page.Texts.length;
    }

    console.log(count1);
    pdfParser.destroy();
});
Bunche answered 27/4, 2016 at 6:56 Comment(2)
"dest.on is not a function" - Mentalist
@Mentalist foo.bar is also not a function ;) - Bunche
L
-3

It is possible but:

  • you would have to use the server anyway; there's no way to get the content of a file on the user's computer without transferring it to the server and back
  • I don't think anyone has written such a library yet

So if you have some free time, you can learn the PDF format and write such a library yourself, or you can just use a server-side library, of course.

Licking answered 12/10, 2009 at 12:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.