Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx?
I've found this - wondered if there were any other suggestions?
If you want the pure plain text (my requirement), then all you need is
unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
I found this at commandlinefu.
It unzips the .docx file, extracts the actual document XML, then strips all the XML tags. Obviously all formatting is lost.
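If you would rather not shell out, the same idea can be sketched in plain Java with the JDK's zip support. This is only a rough equivalent of the pipeline above, not a replacement for a real parser: the class and method names are made up for illustration, it assumes word/document.xml is UTF-8 encoded, and it drops ASCII control characters (line breaks included) as an approximation of the second sed expression.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: a sketch of "unzip -p some.docx word/document.xml | sed ..."
class DocxPlainText {

    // Reads word/document.xml out of the .docx, which is an ordinary ZIP archive.
    // Assumption: the entry is UTF-8 encoded.
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Strips every XML tag, then removes ASCII control characters.
    static String toPlainText(String xml) {
        String withoutTags = xml.replaceAll("<[^>]+>", "");
        return withoutTags.replaceAll("\\p{Cntrl}+", "");
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java DocxPlainText <file.docx>");
            return;
        }
        System.out.println(toPlainText(readDocumentXml(args[0])));
    }
}
```

Like the sed version, this does not decode XML entities such as &amp;amp; and it loses all formatting.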
unzip -p document.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
Note the additional sed expression, which replaces the XML representation of a paragraph end with an actual newline character, and I edited the last sed expression so it does not strip newline characters. This makes the above command far more useful for diffing Word documents. – Hypaethral
s/<w:br\/>/\n/g; too ;) – Distorted
One option is libreoffice/openoffice in headless mode (make sure all other instances of libreoffice are closed first):
libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc
For more details see e.g. this link: http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/
For a list of libreoffice filters see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters
Since the openoffice command line syntax is a bit too complicated, there is a handy wrapper which can make the process easier: unoconv.
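If you need to run the headless conversion from a program rather than a terminal, something like the following might do. It is a minimal sketch, not tested against every LibreOffice version: the class and method names are invented for illustration, the binary name is passed in because it differs between systems, and --outdir is used to choose where the .txt file is written.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the headless conversion command.
class HeadlessConvert {

    // Builds the argument list. Passing arguments as a list avoids shell quoting
    // problems with the filter string, which contains spaces and parentheses.
    static List<String> buildCommand(String binary, Path input, Path outDir) {
        return List.of(
                binary,
                "--headless",
                "--convert-to", "txt:Text (encoded):UTF8",
                "--outdir", outDir.toString(),
                input.toString());
    }

    // Runs the conversion and returns the exit code of the process.
    static int convert(String binary, Path input, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(binary, input, outDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: java HeadlessConvert <input_file> <output_dir>");
            return;
        }
        // Assumption: the binary is on the PATH under the name "libreoffice".
        System.exit(convert("libreoffice", Path.of(args[0]), Path.of(args[1])));
    }
}
```

The same caveat applies as on the command line: close other running LibreOffice instances first.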
Another option is Apache POI, a well-supported Java library which, unlike antiword, can read, create and convert .doc, .docx, .xls, .xlsx, .ppt and .pptx files.
Here is the simplest possible Java code for converting a .doc or .docx document to plain text:
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.xmlbeans.XmlException;

public class WordToTextConverter {

    public static void main(String[] args) {
        // Check the arguments up front instead of catching ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.out.println("Usage: java WordToTextConverter <word_file> <text_file>");
            return;
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String dest) {
        // try-with-resources closes the streams and the extractor even if extraction fails.
        try (FileInputStream fs = new FileInputStream(src);
             POITextExtractor extractor = ExtractorFactory.createExtractor(fs);
             FileWriter fw = new FileWriter(dest)) {
            fw.write(extractor.getText());
        } catch (IOException | OpenXML4JException | XmlException e) {
            e.printStackTrace();
        }
    }
}
# Maven dependencies (pom.xml):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.wordconv</groupId>
    <artifactId>my.wordconv.converter</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.17</version>
        </dependency>
    </dependencies>
</project>
NOTE: You will need to add the Apache POI libraries to the classpath. On Ubuntu/Debian the libraries can be installed with sudo apt-get install libapache-poi-java; this will install them under /usr/share/java. For other systems you'll need to download the library and unpack the archive to a folder that you should use instead of /usr/share/java. If you use Maven/Gradle (the recommended option), then include the org.apache.poi dependencies as shown in the code snippet.
The same code will work for both .doc and .docx, as the required converter implementation will be chosen by inspecting the binary stream.
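To give a feel for what "inspecting the binary stream" can mean, here is a small sketch that tells the two container formats apart by their leading signature bytes. It is not POI's actual implementation, and the names are made up for illustration. It only identifies the container: an OLE2 file could also be an .xls or .ppt, and a ZIP file could be any archive, so a real library has to look further inside.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the first bytes of a file.
class WordFormatSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature, used by legacy binary formats such as .doc.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; .docx files are ZIP archives.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};

    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java WordFormatSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            // Eight bytes are enough for the longer of the two signatures.
            System.out.println(sniff(in.readNBytes(8)));
        }
    }
}
```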
Compile the class above (assuming it's in the default package, and the Apache POI jars are under /usr/share/java):
javac -cp "/usr/share/java/*:." WordToTextConverter.java
Run the conversion:
java -cp "/usr/share/java/*:." WordToTextConverter doc.docx doc.txt
A clonable gradle project which pulls all necessary dependencies and generates the wrapper shell script (with gradle installDist).
Error: Please reverify input parameters..., which disappeared when I switched to --convert-to "txt:Text (encoded):UTF8", so I'd recommend that (even if you don't have non-ASCII characters). – Myrticemyrtie
brew install libreoffice. Then, the command that worked was soffice --headless ... instead of libreoffice --headless ... . Although this question is closed, it's the very first Google result, so it might be worth adding this to the answer to help us hapless searchers. – Mamoun
/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --help – Pacificas
brew cask info libreoffice points to the formula at github.com/Homebrew/homebrew-cask/blob/master/Casks/… where you can see it additionally puts a wrapper script under /usr/local/bin/soffice. It's useful to know what exactly is going on just in case the formula gets removed, or in case you need a newer version than the one provided by brew. – Pacificas
Try Apache Tika. It supports most document formats (every MS Office format, OpenOffice/LibreOffice formats, PDF, etc.) using Java-based libraries (among others, Apache POI). It's very simple to use:
java -jar tika-app-1.4.jar --text ./my-document.doc
My favorite is antiword:
And here's a similar project which claims support for docx:
I find wv to be better than catdoc or antiword. It can deal with .docx and convert to text or html. Here is a function I added to my .bashrc to temporarily view the file in the terminal. Change it as required.
# open a Word document in less (i.e. worl document.doc)
worl() {
    DOC=$(mktemp /tmp/output.XXXXXXXXXX)
    # quote the arguments so file names with spaces work
    wvText "$1" "$DOC"
    less "$DOC"
    rm "$DOC"
}
brew install wv && brew install elinks. – Cambridge
I recently dealt with this issue and found the OpenOffice/LibreOffice command-line tools to be unreliable in production (thousands of docs processed, dozens concurrently).
Ultimately, I built a lightweight wrapper, DocRipper, that is much faster and grabs all text from .doc, .docx and .pdf without formatting. DocRipper uses Antiword, grep and pdftotext to grab text and return it.
Why not migrate this question to Software Recommendations? I also searched for software for similar tasks and did not find the best answer there. But I can recommend pandoc as the best solution, which even converts tables correctly. So I suggest reopening the question. – Risorgimento