Get docx file contents using javascript/jquery

5

7

I want to open and read a docx file using client-side technologies (HTML/JS).

I found a JavaScript library named docx.js, but I cannot find any documentation for it. (http://blog.innovatejs.com/?p=184)

The goal is to build a browser-based search tool for docx and txt files.

Chesterton answered 10/2, 2015 at 19:27 Comment(3)
Is this helpful? github.com/PinZhang/docx.js-demo - Gehlenite
@KennyJohnson that demo served here does not seem to work: pinzhang.github.io/docx.js-demo - Beilul
My apologies. I don't remember whether I tested the demo. The asker stated he couldn't find any documentation for it. I remember posting this link for the documentation, but I can't find any at that link now. (This was posted nearly 2 years ago.) - Gehlenite
9

With docxtemplater, you can easily get the full text of a Word document (docx only) by using the doc.getFullText() method.

HTML code:

<body>
    <button onclick="gettext()">Get document text</button>
</body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/docxtemplater/3.26.2/docxtemplater.js"></script>
<script src="https://unpkg.com/pizzip/dist/pizzip.js"></script>
<script src="https://unpkg.com/pizzip/dist/pizzip-utils.js"></script>
<script>
    function loadFile(url, callback) {
        PizZipUtils.getBinaryContent(url, callback);
    }
    function gettext() {
        loadFile(
            "https://docxtemplater.com/tag-example.docx",
            function (error, content) {
                if (error) {
                    throw error;
                }
                var zip = new PizZip(content);
                var doc = new window.docxtemplater(zip);
                var text = doc.getFullText();
                console.log(text);
                alert("Text is " + text);
            }
        );
    }
</script>
Lowlife answered 11/2, 2015 at 15:28 Comment(11)
Thank you for the reply. I will look into it, although it seems to solve the issue. - Chesterton
Your code is not working with jszip version 3.0.0. Would you please update it? - Endoplasm
Docxtemplater still depends on [email protected], so you can still install that version and it should work. In future versions, docxtemplater will work with JSZip 3.x - Lowlife
Why does that API squash all the newlines? - Cage
That is how it works: it just returns a single string. Otherwise we would have to use formatting (an array of strings or HTML). - Lowlife
You could use pandoc for that, for example to convert docx to HTML: github.com/jgm/pandoc - Lowlife
Use DocxGen() instead - Anthony
Uncaught Error: The constructor with parameters has been removed in JSZip 3.0, please check the upgrade guide. Docxgen is old - Anthony
Hi, thanks for the answer. Is there a way to get the line breaks as well? getFullText seems to have no line breaks. Thanks - Kurtiskurtosis
Hello @James, I've released a new enhanced code part here that will get the different paragraphs. docxtemplater.com/faq/… - Lowlife
@edi9999, thanks for the link, but the problem is that it is the Node.js version, which seems to run on the server side. Any idea how to do this client side, using only the user's browser? Thanks - Kurtiskurtosis
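
On the client-side question raised in the comments: below is a minimal, dependency-free sketch that should run in a browser or in Node.js. It assumes you already have the contents of word/document.xml as a string (for example from a zip library), and the function name paragraphsFromDocumentXml is hypothetical. It uses regular expressions rather than a real XML parser, so it skips empty paragraphs and will misbehave on nested paragraphs such as text boxes; treat it as a starting point, not a complete solution.

```javascript
// Hypothetical helper: split the XML of word/document.xml into paragraph strings.
// Assumption: `xml` is the full text of word/document.xml, already unzipped.
// Note: the regexes use lookbehind, which needs a reasonably recent JS engine.
function paragraphsFromDocumentXml(xml) {
    // Matches <w:p> or <w:p attr="...">, but not <w:pPr> and not self-closing <w:p/>
    const paragraphRe = /<w:p(?:\s[^>]*)?(?<!\/)>([\s\S]*?)<\/w:p>/g;
    // Matches <w:t> or <w:t xml:space="preserve">, but not <w:tab/>, <w:tbl>, <w:tc>, <w:tr>
    const textRe = /<w:t(?:\s[^>]*)?(?<!\/)>([\s\S]*?)<\/w:t>/g;

    // Decode the five predefined XML entities (&amp; last to avoid double decoding)
    const decode = (s) =>
        s.replace(/&lt;/g, "<")
            .replace(/&gt;/g, ">")
            .replace(/&quot;/g, '"')
            .replace(/&apos;/g, "'")
            .replace(/&amp;/g, "&");

    const paragraphs = [];
    for (const paragraph of xml.matchAll(paragraphRe)) {
        // Concatenate every text run inside this paragraph
        let text = "";
        for (const run of paragraph[1].matchAll(textRe)) {
            text += run[1];
        }
        paragraphs.push(decode(text));
    }
    return paragraphs;
}
```

Joining the result with "\n" gives the document text with a line break between paragraphs, which getFullText() does not provide.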
5

I know this is an old post, but docxtemplater has moved on and the accepted answer no longer works. This worked for me (Node.js, using the adm-zip and cheerio packages):

// Node.js: requires the adm-zip and cheerio packages
const AdmZip = require("adm-zip");
const cheerio = require("cheerio");

function loadDocx(filename) {
  // Read word/document.xml from the docx archive
  const zip = new AdmZip(filename);
  const xml = zip.readAsText("word/document.xml");
  // Parse the XML; xmlMode keeps namespaced tags such as w:t intact
  const $ = cheerio.load(xml, {
    normalizeWhitespace: true,
    xmlMode: true
  });
  // Collect the text of every <w:t> (text run) element
  const out = [];
  $("w\\:t").each((i, el) => {
    out.push($(el).text());
  });
  return out;
}
Photocopy answered 5/6, 2019 at 11:36 Comment(2)
Life saver, thanks for this! - Hedron
Is this Node.js? What is cheerio? - Tactless
1

You can try docxyz.

let {Document} = require('docxyz');
let fileName = 'yourfile.docx';
let document = new Document(fileName);
let text = document.text;
console.log(text);

Without tables:

let {Document} = require('docxyz');
let fileName = 'yourfile.docx';
let document = new Document(fileName);
let a = [];
for(let paragraph of document.paragraphs){
    a.push(paragraph.text);
}
let text = a.join('\n');
console.log(text);
Thallophyte answered 17/10, 2022 at 6:37 Comment(0)
0

This solution will give you an array of strings, one element for each paragraph in the docx:

const PizZip = require("pizzip");
const { DOMParser, XMLSerializer } = require("@xmldom/xmldom");
const fs = require("fs");
const path = require("path");

function str2xml(str) {
    if (str.charCodeAt(0) === 65279) {
        // Strip the byte order mark (U+FEFF) if present
        str = str.slice(1);
    }
    return new DOMParser().parseFromString(str, "text/xml");
}

function getParagraphs(content) {
    const zip = new PizZip(content);
    const xml = str2xml(zip.files["word/document.xml"].asText());
    const paragraphsXml = xml.getElementsByTagName("w:p");
    const paragraphs = [];

    for (let i = 0, len = paragraphsXml.length; i < len; i++) {
        let fullText = "";
        const textsXml =
            paragraphsXml[i].getElementsByTagName("w:t");
        for (let j = 0, len2 = textsXml.length; j < len2; j++) {
            const textXml = textsXml[j];
            // Skip empty <w:t/> elements, which have no child text node
            if (textXml.childNodes.length > 0) {
                fullText += textXml.childNodes[0].nodeValue;
            }
        }

        paragraphs.push(fullText);
    }
    return paragraphs;
}

// Load the docx file as binary content
const content = fs.readFileSync(
    path.resolve(__dirname, "examples/cond-image.docx"),
    "binary"
);

// Will print ['Hello John', 'how are you ?'] if the document has two paragraphs.
console.log(getParagraphs(content));

Source : https://docxtemplater.com/faq/#how-can-i-retrieve-the-docx-content-as-text
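
Since the stated goal of the question is a search tool, here is a minimal sketch of a search step over a paragraphs array like the one getParagraphs returns. The function name searchParagraphs and the shape of the result objects are assumptions for illustration; it does a plain case-insensitive substring match and nothing more.

```javascript
// Hypothetical helper: find the paragraphs that contain `query`.
// `paragraphs` is an array of strings, e.g. the result of getParagraphs(content).
function searchParagraphs(paragraphs, query) {
    const needle = query.trim().toLowerCase();
    // An empty query matches nothing rather than everything
    if (needle === "") {
        return [];
    }
    const hits = [];
    paragraphs.forEach((text, index) => {
        if (text.toLowerCase().includes(needle)) {
            // Keep the paragraph index so the caller can show where the hit is
            hits.push({ index, text });
        }
    });
    return hits;
}
```

The same function should work for a txt file if its contents are first split into lines, for example with text.split("\n").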

Lowlife answered 16/6, 2022 at 17:1 Comment(0)
-1

If you want to be able to display the docx files in a web browser, you might be interested in Native Documents' recently released commercial Word File Editor; try it at https://nativedocuments.com/test_drive.html

You'll get much better layout fidelity this way than if you convert the document to (X)HTML and view that.

It is designed specifically for embedding in a webapp, so there is an API for loading documents, and it will sit happily within the security context of your webapp.

Disclosure: I have a commercial interest in Native Documents

Prothonotary answered 26/4, 2018 at 23:0 Comment(0)
