JavaScript library to read doc and docx on client

Asked 22/6, 2017 at 12:1 Answered 9/7, 2020 at 21:16

I am looking for a JavaScript library that can read .doc and .docx files. I only need the text content; I am not interested in pictures, formulas, or other special structures in the MS Word file.

It would be great if the library worked with the JavaScript FileReader, as shown in the code below.

function readExcel(currfile) {
  var reader = new FileReader();

  reader.onload = (function (_file) {
      return function (e) {
          //here should the magic happen
      };
  })(currfile);

  reader.onabort = function (e) {
      alert('File read canceled');
  };

  reader.readAsBinaryString(currfile);
}

I searched the internet, but I could not find what I was looking for.

Bout answered 22/6, 2017 at 12:1 Comment(2)

I'm not aware of any JS libraries that can display doc/docx contents on front end only. But if you fetch these files from a backend, you can extract the text content of doc/docx files in the backend before sending the text content to the front end by using Apache Tika, e.g. Tika#parseToString() method. – Mathura 22/6, 2017 at 12:14

Thanks for your reply, but my backend is Microsoft Dynamics NAV, so sadly your solution does not work for me. As further information: it has to be a JS AddIn for NAV. – Bout 22/6, 2017 at 13:21

You can use docxtemplater for this. It is normally used for templating, but it can also simply return the text of the document:

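// content: the binary content of the .docx file, e.g. e.target.result in the FileReader onload handler
const zip = new PizZip(content);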

// This will parse the template, and will throw an error if the template is
// invalid, for example, if the template is "{user" (no closing tag)
const doc = new Docxtemplater(zip, {
    paragraphLoop: true,
    linebreaks: true,
});

const text = doc.getFullText();

See the documentation for installation information (I'm the maintainer of this project).
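To connect this with the FileReader from the question, the snippet can be wrapped roughly as follows. This is a sketch rather than the library's documented recipe: docxToText and readDocxFile are hypothetical names, and PizZip and Docxtemplater are passed in as parameters so that nothing is assumed about how they are loaded on the page.

```javascript
// Sketch only. PizZip and Docxtemplater are passed in as arguments so this
// snippet makes no assumption about script tags, bundlers, or globals.
function docxToText(content, PizZip, Docxtemplater) {
  // Same calls as in the answer: unzip, parse, then ask for the plain text.
  const zip = new PizZip(content);
  const doc = new Docxtemplater(zip, {
    paragraphLoop: true,
    linebreaks: true,
  });
  return doc.getFullText();
}

// Hypothetical Promise wrapper around the FileReader from the question.
// Browser only: FileReader does not exist in plain Node.
function readDocxFile(file, PizZip, Docxtemplater) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();

    reader.onload = (e) => {
      try {
        resolve(docxToText(e.target.result, PizZip, Docxtemplater));
      } catch (err) {
        // Thrown, for example, when the file is not a valid .docx archive.
        reject(err);
      }
    };
    reader.onerror = () => reject(reader.error);
    reader.onabort = () => reject(new Error("File read canceled"));

    // Same read mode as in the question.
    reader.readAsBinaryString(file);
  });
}
```

With a file input on the page, a call could look like readDocxFile(input.files[0], PizZip, Docxtemplater).then(console.log), where input is whatever file input element you use.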

However, it only handles .docx files, not .doc files.
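Because of that limitation, it can help to check what kind of file was picked before trying to parse it. A minimal sketch, assuming the first bytes of the file are available as an ArrayBuffer or Uint8Array: .docx files are ZIP archives and start with the bytes 50 4B 03 04, while legacy .doc files are OLE2 compound files and start with D0 CF 11 E0 A1 B1 1A E1. sniffWordFormat is a hypothetical helper name. Any ZIP archive (an .xlsx file, for instance) passes the first check, so this only rules files out; it does not prove the file is a Word document.

```javascript
// Leading bytes ("magic numbers") of the two container formats.
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const OLE2_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// True when `bytes` begins with every byte listed in `magic`.
function startsWithBytes(bytes, magic) {
  if (bytes.length < magic.length) return false;
  return magic.every((value, index) => bytes[index] === value);
}

// Hypothetical helper: returns "docx", "doc", or "unknown".
// "docx" really means "ZIP container, so possibly a .docx";
// "doc" really means "OLE2 container, so possibly a legacy .doc".
function sniffWordFormat(input) {
  const bytes = input instanceof Uint8Array ? input : new Uint8Array(input);
  if (startsWithBytes(bytes, ZIP_MAGIC)) return "docx";
  if (startsWithBytes(bytes, OLE2_MAGIC)) return "doc";
  return "unknown";
}
```

In the browser, the bytes could come from a FileReader that reads the file with readAsArrayBuffer; reading only file.slice(0, 8) is enough for this check.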

Supinator answered 23/6, 2017 at 11:2 Comment(7)

Thanks, that is what I was looking for. You did great work. – Bout 26/6, 2017 at 11:39

I get an error when I use this as a zip file zip.file('yo.docx', element.data, {base64: true}); – Wheelwright 3/3, 2018 at 12:57

What kind of error? Are you using JSZip version 2? If you are using JSZip version 3, it will fail. – Supinator 3/3, 2018 at 15:18

@Supinator where can I find documentation for doc object? – Millesimal 11/7, 2021 at 12:52

Here : docxtemplater.readthedocs.io/en/latest/generate.html – Supinator 11/7, 2021 at 17:20

Seems the link is dead now. Is this project still maintained? – Inactive 8/12, 2023 at 15:10

Here is the up-to-date documentation: docxtemplater.com/docs/get-started-node – Supinator 8/12, 2023 at 20:30

Now you can extract the text content from doc/docx files without installing external dependencies.

You can use the Node library called any-text.

Currently, it supports a number of file extensions such as PDF, XLSX, XLS, CSV, etc.

Usage is very simple:

Install the library as a dependency (or dev dependency):

npm i -D any-text

Use the getText method to read the text content:

var reader = require('any-text');

reader.getText(`path-to-file`).then(function (data) {
  console.log(data);
});

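You can also use the async/await notation. Note that await has to run inside an async function when the file is loaded with require:

var reader = require('any-text');

(async function () {
  const text = await reader.getText(`path-to-file`);

  console.log(text);
})();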
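If the paths come from user input, a small guard in front of getText can give a clearer error for files that are probably not supported. This is only a sketch: getTextIfKnown and fileExtension are hypothetical helpers, the reader object is passed in as a parameter, and the extension list contains only the formats named in this answer, so the library's real list may be longer or different.

```javascript
// Extensions mentioned in this answer. This is an assumption, not the
// library's authoritative list.
const KNOWN_EXTENSIONS = ["doc", "docx", "pdf", "xlsx", "xls", "csv"];

// Returns the lower-cased extension without the dot, or "" if there is none.
function fileExtension(filePath) {
  const name = filePath.split(/[\\/]/).pop();
  const dot = name.lastIndexOf(".");
  // dot > 0 also treats dotfiles such as ".bashrc" as having no extension.
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// `reader` is expected to be the object returned by require('any-text').
async function getTextIfKnown(reader, filePath) {
  const ext = fileExtension(filePath);
  if (!KNOWN_EXTENSIONS.includes(ext)) {
    throw new Error(`Unsupported extension: "${ext}"`);
  }
  return reader.getText(filePath);
}
```

Because getTextIfKnown is an async function, the error surfaces as a rejected promise, so it can be handled with the same .then/.catch or try/await style as getText itself.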

Sample Test

var reader = require('any-text');

const chai = require('chai');
const expect = chai.expect;

describe('file reader checks', () => {
  it('check doc file content', async () => {
    expect(
      await reader.getText(`${process.cwd()}/test/files/dummy.doc`)
    ).to.contains('Lorem ipsum');
  });
});

I hope it will help!

Cheongsam answered 9/7, 2020 at 21:16 Comment(2)

It is customary to inform the users that you are the author of the library mentioned in your answer. – Kimberleekimberley 28/1, 2021 at 12:54

this question is client-side specific – Equerry 6/12, 2023 at 11:46
