Get docx file contents using javascript/jquery

Asked 10/2, 2015 at 19:27 Answered 17/10, 2022 at 6:37

I want to open and read a docx file using client-side technologies (HTML/JS).

I have found a JavaScript library named docx.js, but I cannot find any documentation for it. (http://blog.innovatejs.com/?p=184)

The goal is to make a browser based search tool for docx files and txt files.

Chesterton answered 10/2, 2015 at 19:27 Comment(3)

Is this helpful? github.com/PinZhang/docx.js-demo – Gehlenite 10/2, 2015 at 19:36

@KennyJohnson that demo served here seems to not work: pinzhang.github.io/docx.js-demo – Beilul 8/2, 2017 at 7:9

My apologies. I don't remember if I tested the demo. The asker stated he couldn't find any documentation for it. I remember posting this link for the documentation, but I can't find any at that link now. (This was posted nearly 2 years ago). – Gehlenite 8/2, 2017 at 20:11

With docxtemplater, you can easily get the full text of a Word document (docx only) by using the doc.getFullText() method.

HTML code:

<body>
    <button onclick="gettext()">Get document text</button>
</body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/docxtemplater/3.26.2/docxtemplater.js"></script>
<script src="https://unpkg.com/pizzip/dist/pizzip.js"></script>
<script src="https://unpkg.com/pizzip/dist/pizzip-utils.js"></script>
<script>
    function loadFile(url, callback) {
        PizZipUtils.getBinaryContent(url, callback);
    }
    function gettext() {
        loadFile(
            "https://docxtemplater.com/tag-example.docx",
            function (error, content) {
                if (error) {
                    throw error;
                }
                var zip = new PizZip(content);
                var doc = new window.docxtemplater(zip);
                var text = doc.getFullText();
                console.log(text);
                alert("Text is " + text);
            }
        );
    }
</script>
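
Since the question is about files on the user's machine, here is a rough sketch of the same idea for a file picked through a file input instead of a URL. It assumes the three script tags above are loaded, that PizZip accepts an ArrayBuffer (which is what PizZipUtils.getBinaryContent normally hands to the callback), and that the page has an input with the hypothetical id "docx-input". The function names are made up for illustration.

```javascript
// Assumes PizZip and docxtemplater were loaded by the <script> tags above,
// and that the page contains <input type="file" id="docx-input"> (hypothetical id).

// Turn the raw bytes of a .docx file into its plain text.
function extractDocxText(arrayBuffer) {
    var zip = new PizZip(arrayBuffer);
    var doc = new window.docxtemplater(zip);
    return doc.getFullText();
}

// Read the file the user picked and resolve with its text.
function readPickedDocx(file) {
    return file.arrayBuffer().then(extractDocxText);
}

// Wire up the input only when running in a page that actually has it.
if (typeof document !== "undefined") {
    var input = document.getElementById("docx-input");
    if (input) {
        input.addEventListener("change", function () {
            if (input.files.length === 0) {
                return;
            }
            readPickedDocx(input.files[0]).then(
                function (text) { console.log(text); },
                function (error) { console.error(error); }
            );
        });
    }
}
```

Nothing is uploaded here: the file is read and unzipped entirely in the browser.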

Lowlife answered 11/2, 2015 at 15:28 Comment(11)

thank you for the reply. will look into it. although it seems to solve the issue. – Chesterton 12/2, 2015 at 13:12

Your code is not working with jszip version 3.0.0. Would you please update it? – Endoplasm 13/6, 2016 at 7:30

Docxtemplater still depends on JSZip 2.x; you can still install that version, so it should work. In future versions, docxtemplater will work with JSZip 3.x – Lowlife 13/6, 2016 at 7:31

Why does that API squash all the newlines? – Cage 26/7, 2017 at 5:30

It is how it works, to just return a single string, or we would have to use formatting (array of strings or HTML) – Lowlife 28/7, 2017 at 12:55

You could use pandoc for that : Convert docx to html for example : github.com/jgm/pandoc – Lowlife 11/4, 2019 at 12:42

Use DocxGen() instead – Anthony 25/10, 2021 at 19:30

Uncaught Error: The constructor with parameters has been removed in JSZip 3.0, please check the upgrade guide. Docxgen is old – Anthony 25/10, 2021 at 19:34

Hi, thanks for the answer. Is there a way we could get the line breaks as well? getFullText seems to have no line breaks. Thanks – Kurtiskurtosis 14/6, 2022 at 21:50

Hello @James, I've released a new enhanced code part here that will get the different paragraphs. docxtemplater.com/faq/… – Lowlife 16/6, 2022 at 16:56

@edi9999, thanks for the link, but the problem is that it is the node.js version, which seems to run on the server side. Any idea for client-side use, in the user's browser only? Thanks – Kurtiskurtosis 16/6, 2022 at 17:2

I know this is an old post, but docxtemplater has moved on and the accepted answer no longer works. This worked for me:

// Node.js only: requires the adm-zip and cheerio packages
const AdmZip = require("adm-zip");
const cheerio = require("cheerio");

function loadDocx(filename) {
  // Read document.xml from docx document
  const zip = new AdmZip(filename);
  const xml = zip.readAsText("word/document.xml");
  // Load xml DOM
  const $ = cheerio.load(xml, {
    normalizeWhitespace: true,
    xmlMode: true
  });
  // Extract the text of every <w:t> run
  const out = [];
  $("w\\:t").each((i, el) => {
    out.push($(el).text());
  });
  return out;
}

Photocopy answered 5/6, 2019 at 11:36 Comment(2)

Life saver, thanks for this! – Hedron 19/12, 2021 at 21:11

Is this node JS? What is cheerio? – Tactless 12/4, 2022 at 22:55

You can try docxyz.

let {Document} = require('docxyz');
let fileName = 'yourfile.docx';
let document = new Document(fileName);
let text = document.text;
console.log(text);

To get the text without tables:

let {Document} = require('docxyz');
let fileName = 'yourfile.docx';
let document = new Document(fileName);
let a = [];
for(let paragraph of document.paragraphs){
    a.push(paragraph.text);
}
let text = a.join('\n');
console.log(text);
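
For the search tool the question describes, once the paragraph texts are in an array like `a` above, a case-insensitive search could be sketched as follows. `searchParagraphs` is a hypothetical helper, not part of docxyz, and the sample array only stands in for real document content.

```javascript
// Hypothetical helper: return every paragraph that contains the query,
// together with its position in the document.
function searchParagraphs(paragraphs, query) {
    const needle = query.trim().toLowerCase();
    if (needle === "") {
        return [];
    }
    const hits = [];
    paragraphs.forEach((text, index) => {
        if (text.toLowerCase().includes(needle)) {
            hits.push({ index, text });
        }
    });
    return hits;
}

// Stand-in data; in practice this would be the array built from document.paragraphs.
const sample = ["Hello John", "how are you ?", "Goodbye, john."];
console.log(searchParagraphs(sample, "john"));
```

The same function would work for txt files if the file content is first split into lines, for example with `content.split(/\r?\n/)`.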

Thallophyte answered 17/10, 2022 at 6:37 Comment(0)

This solution will give you an array of strings, one element for each paragraph in the docx:

const PizZip = require("pizzip");
const { DOMParser } = require("@xmldom/xmldom");
const fs = require("fs");
const path = require("path");

function str2xml(str) {
    if (str.charCodeAt(0) === 65279) {
        // Strip the leading byte order mark (BOM)
        str = str.slice(1);
    }
    return new DOMParser().parseFromString(str, "text/xml");
}

function getParagraphs(content) {
    const zip = new PizZip(content);
    const xml = str2xml(zip.files["word/document.xml"].asText());
    const paragraphsXml = xml.getElementsByTagName("w:p");
    const paragraphs = [];

    for (let i = 0, len = paragraphsXml.length; i < len; i++) {
        let fullText = "";
        const textsXml =
            paragraphsXml[i].getElementsByTagName("w:t");
        for (let j = 0, len2 = textsXml.length; j < len2; j++) {
            const textXml = textsXml[j];
            // An empty <w:t/> has no child nodes, so check before reading
            if (textXml.childNodes.length > 0) {
                fullText += textXml.childNodes[0].nodeValue;
            }
        }

        paragraphs.push(fullText);
    }
    return paragraphs;
}

// Load the docx file as binary content
const content = fs.readFileSync(
    path.resolve(__dirname, "examples/cond-image.docx"),
    "binary"
);

// Will print ['Hello John', 'how are you ?'] if the document has two paragraphs.
console.log(getParagraphs(content));

Source : https://docxtemplater.com/faq/#how-can-i-retrieve-the-docx-content-as-text
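
For browser-only use without @xmldom/xmldom or fs, a dependency-free variation is possible. The following is only a rough sketch, not the docxtemplater FAQ code: it works on the word/document.xml string and uses regular expressions instead of an XML parser, so it assumes paragraphs are not nested (text boxes can break that assumption) and that attribute values contain no ">" character. `decodeXml` and `getParagraphsFromXml` are hypothetical names.

```javascript
// Minimal sketch: pull paragraph text out of the word/document.xml string
// with regular expressions instead of an XML parser.

const XML_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

// Decode the predefined XML entities and numeric character references.
function decodeXml(text) {
    return text.replace(
        /&(#x[0-9a-fA-F]+|#\d+|amp|lt|gt|quot|apos);/g,
        (match, name) => {
            if (name[0] !== "#") {
                return XML_ENTITIES[name];
            }
            const code =
                name[1] === "x"
                    ? parseInt(name.slice(2), 16)
                    : parseInt(name.slice(1), 10);
            return String.fromCodePoint(code);
        }
    );
}

// Hypothetical helper: one string per <w:p>, built from its <w:t> runs.
function getParagraphsFromXml(documentXml) {
    // Matches <w:p>, <w:p attr="..."> and self-closing <w:p/>, but not <w:pPr>.
    const paragraphRe = /<w:p(?:\s[^>]*?)?(?:\/>|>([\s\S]*?)<\/w:p>)/g;
    // Matches <w:t> runs, but not <w:tab/>, <w:tbl> or <w:tc>.
    const textRe = /<w:t(?:\s[^>]*?)?(?:\/>|>([\s\S]*?)<\/w:t>)/g;
    const paragraphs = [];
    for (const paragraph of documentXml.matchAll(paragraphRe)) {
        const body = paragraph[1] || "";
        let fullText = "";
        for (const run of body.matchAll(textRe)) {
            fullText += decodeXml(run[1] || "");
        }
        paragraphs.push(fullText);
    }
    return paragraphs;
}

// Stand-in for the content of word/document.xml.
const sampleXml =
    "<w:body>" +
    "<w:p><w:r><w:t>Hello John</w:t></w:r></w:p>" +
    "<w:p><w:r><w:t>how are you ?</w:t></w:r></w:p>" +
    "</w:body>";
console.log(getParagraphsFromXml(sampleXml));
```

In a browser, the XML string could come from `new PizZip(content).files["word/document.xml"].asText()`, as in `getParagraphs` above, with `content` read from a file input rather than from disk.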

Lowlife answered 16/6, 2022 at 17:1 Comment(0)

If you want to be able to display the docx files in a web browser, you might be interested in Native Documents' recently released commercial Word File Editor; try it at https://nativedocuments.com/test_drive.html

You'll get much better layout fidelity if you do it this way, than if you try to convert to (X)HTML and view it that way.

It is designed specifically for embedding in a webapp, so there is an API for loading documents, and it will sit happily within the security context of your webapp.

Disclosure: I have a commercial interest in Native Documents

Prothonotary answered 26/4, 2018 at 23:0 Comment(0)
