Extract text from doc and docx

9

16

I would like to know how I can read the contents of a .doc or .docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution in another language, please let me know, as long as it works on a Linux web server.

Skiffle answered 4/4, 2011 at 15:39 Comment(1)
See #4587716, #173746 and #188952 for potential solutions. - Ecumenicist
15

This is a .DOCX-only solution. For .DOC or .PDF you'll need to use something else, such as pdf2text.php for PDF.

function docx2text($filename) {
    return readZippedXML($filename, "word/document.xml");
}

function readZippedXML($archiveFile, $dataFile) {
    // Create new ZIP archive
    $zip = new ZipArchive;

    // Open received archive file
    if (true === $zip->open($archiveFile)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }

    // In case of failure return empty string
    return "";
}

echo docx2text("test.docx"); // Save this content to a file
Lacteous answered 10/9, 2011 at 11:12 Comment(4)
It does not work with the .doc extension. The file does not have word/document.xml; instead it has _rels/.rels.xml. What should I do in that case? - Recitativo
You helped me a lot. I was thinking about how to count the number of words in a docx in PHP, and I just didn't think about strip_tags. - Pecos
This doesn't seem to handle carriage returns correctly. The word at the end of a paragraph is merged into the first word of the next paragraph. It seems to need something along the lines of: $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content); $content = str_replace('</w:r></w:p>', "\r\n", $content); as proposed in M Khalid Junaid's answer. - Holoblastic
I've used $xml->formatOutput = true; before loading the XML, and it helped with carriage returns. - Sandy
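The paragraph-merging problem raised in the comments above can also be avoided by walking the XML instead of stripping tags. The following is only a minimal sketch, not the answerer's method: documentXmlToText() and docxToTextByParagraph() are hypothetical names, and it assumes the standard WordprocessingML namespace. It emits one line per w:p element, ignores tabs and explicit line breaks, and may repeat text that sits in text boxes nested inside a paragraph.

```php
<?php
// Sketch: one output line per paragraph (w:p), built from its text runs (w:t).
// Function names are illustrative and do not come from the answer above.

const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

function documentXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $dom = new DOMDocument();
    // Suppress parser noise; entity substitution is deliberately not enabled.
    if (!@$dom->loadXML($xml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $lines = [];
    foreach ($dom->getElementsByTagNameNS(W_NS, 'p') as $paragraph) {
        $text = '';
        foreach ($paragraph->getElementsByTagNameNS(W_NS, 't') as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }
    return implode("\n", $lines);
}

function docxToTextByParagraph(string $filename): string
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return '';
    }
    // getFromName() returns false when the entry is missing.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : documentXmlToText($xml);
}
```

Calling docxToTextByParagraph("test.docx") would then return the paragraphs separated by "\n", which also makes word counting with str_word_count() more reliable.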
15

Here I have added a solution to get the text from .doc and .docx Word files.

How to extract text from Word files (.doc, .docx) in PHP

For .doc

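private function read_doc() {
    // Open in binary mode; .doc is a binary format
    $fileHandle = fopen($this->filename, "rb");
    if ($fileHandle === false) {
        return "";
    }
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    // Split on carriage returns (0x0D)
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks that contain NUL bytes (binary data)
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Keep only a conservative set of ASCII characters
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}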

For .docx

private function read_docx() {
    $striped_content = '';
    $content = '';

    // Note: the procedural zip_* functions are deprecated as of PHP 8.0
    $zip = zip_open($this->filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {
        // Only the main document part is of interest; check the name first
        // so that entries are never opened without being closed
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    } // end while

    zip_close($zip);

    // Table cell boundaries become spaces, paragraph ends become line breaks
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}
Cough answered 22/10, 2013 at 20:10 Comment(2)
Hi, I used this to open .doc files, but I only get random characters. Any ideas what I'm doing wrong? - Labile
Thank you, .doc files are working fine, but .docx files are not working with the above code. The MIME type of my .docx file shows as 'application/msword'. Am I missing anything? - Gabfest
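Regarding the last comment: a reported MIME type or file extension does not always match the real container. As a rough sketch (detectWordFormat() is a hypothetical helper, not part of the answer above), the first bytes of the file can be checked instead. A .docx is a ZIP archive and starts with "PK\x03\x04", while a legacy .doc is an OLE2 compound file with its own 8-byte signature. Note that other formats share these containers (.odt and .xlsx are ZIP; .xls and .ppt are OLE2), so this only narrows the choice down.

```php
<?php
// Sketch: classify a file by its leading bytes rather than its extension.
// detectWordFormat() is a hypothetical name used for illustration only.

function detectWordFormat(string $path): string
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return 'unknown';
    }
    $head = (string) fread($handle, 8);
    fclose($handle);

    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        return 'zip';  // ZIP container: candidate for the .docx reader
    }
    if ($head === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2'; // OLE2 compound file: candidate for the .doc reader
    }
    return 'unknown';
}
```

A caller could route 'zip' results to read_docx() and 'ole2' results to read_doc(), whatever the extension or MIME type says.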
7

Parse .docx, .odt, .doc and .rtf documents

I wrote a library that parses docx, odt and rtf documents, based on answers here and elsewhere.

The major improvement I have made to the .docx and .odt parsing is that the library processes the XML that describes the document and attempts to map it to HTML tags, such as em and strong. This means that if you're using the library for a CMS, text formatting is not lost.

You can get it here

Floris answered 5/4, 2016 at 14:58 Comment(5)
Awesome! Using this to be able to merge .doc and .docx in a PDF created with mPDF. - Bellda
This single class performs better than the other libraries :) - Impresa
Can this library also get the images? - Tman
@Akintunde-Rotimi I can have a look for you. - Floris
Thank you Luke. I also posted on the GitHub repo. - Tman
6

My solution is Antiword for .doc and docx2txt for .docx.

Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:

Antiword: make global_install
docx2txt: make install

Then, to use these tools to extract the text into a string in PHP:

//for .doc
$text = shell_exec('/usr/local/bin/antiword -w 0 ' . 
    escapeshellarg($docFilePath));

//for .docx
$text = shell_exec('/usr/local/bin/docx2txt.pl ' . 
    escapeshellarg($docxFilePath) . ' -');

docx2txt requires Perl.

no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had instances where words that should be separated had no space between them. That is not good when you want to run a full-text search over the documents you're processing.

Drakensberg answered 15/1, 2013 at 22:54 Comment(0)
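The two shell_exec calls above can be wrapped in a single helper that picks the tool by file extension. This is only a sketch: buildExtractCommand() and extractText() are hypothetical names, and the binary paths are the ones from the answer, which may differ on your system.

```php
<?php
// Sketch: choose the external extractor from the file extension.
// Helper names are illustrative; paths are taken from the answer above.

function buildExtractCommand(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'doc':
            return '/usr/local/bin/antiword -w 0 ' . escapeshellarg($path);
        case 'docx':
            return '/usr/local/bin/docx2txt.pl ' . escapeshellarg($path) . ' -';
        default:
            return null; // unsupported extension
    }
}

function extractText(string $path): ?string
{
    $command = buildExtractCommand($path);
    if ($command === null) {
        return null;
    }
    // shell_exec() returns null or false when the command fails or prints nothing.
    $output = shell_exec($command);
    return is_string($output) ? $output : null;
}
```

Keeping the command construction separate from the execution makes it easy to check the exact command line before running it on a server.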
1

Try Apache POI. It works well for Java. I suppose you won't have any difficulties installing Java on Linux.

Versed answered 5/5, 2011 at 7:35 Comment(0)
1

I would suggest extracting the text with Apache Tika. It can extract content from many file types, including .doc, .docx and PDF.

Commons answered 5/7, 2020 at 8:37 Comment(0)
0

I used docx2txt to extract the content of a docx file. My code is as follows:

if ($extention == "docx")
{
    $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
    // Keep the command on one logical line; a line break inside the quoted string would split the command
    $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl '
        . escapeshellarg($docxFilePath) . ' -');
}
Zirkle answered 1/3, 2014 at 15:21 Comment(0)
0

I made a few small improvements to the .doc-to-text converter function:

private function read_doc() {
    $line_array = array();
    $fileHandle = fopen( $this->filename, "rb" );
    $line       = @fread( $fileHandle, filesize( $this->filename ) );
    fclose( $fileHandle );
    $lines      = explode( chr( 0x0D ), $line );
    foreach ( $lines as $thisline ) {
        // Skip chunks that contain NUL bytes (binary data); keep empty lines
        $pos = strpos( $thisline, chr( 0x00 ) );
        if ( $pos === false ) {
            $line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
        }
    }

    return implode( "\n", $line_array );
}

Now it preserves empty lines, so the resulting text file keeps the document's line-by-line layout.

Angeliaangelic answered 30/1, 2016 at 5:28 Comment(0)
0

You can use Apache Tika as a complete solution; it provides a REST API.

Another good library is RawText, as it can run OCR on images and extract text from any document. It's non-free, and it works over a REST API.

Sample code for extracting your file with RawText:

$result = $rawText->extract($your_file);
Swinson answered 16/5, 2017 at 7:32 Comment(0)
