Reading DOC file in php
I'm trying to read .doc and .docx files in PHP. Everything works, except that the last line of the output is full of garbage characters. Please help. Here is the code, which was written by someone else.

function parseWord($userDoc)
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0)) {
        } else {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

$userDoc = "k.doc";

Here is a screenshot. [image]

Machellemachete answered 9/9, 2011 at 7:53 Comment(2)
#188952 - Irv
Did you manage to solve this, @Machellemachete? I've tried https://mcmap.net/q/440522/-reading-writing-a-ms-word-file-in-php but it resulted in a partial file. - Deneb

DOC files are not plain text.

Try a library such as PHPWord (old CodePlex site).

nb: This answer has been updated multiple times as PHPWord has changed hosting and functionality.
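
To illustrate why a .doc cannot be treated as plain text, here is a minimal sketch that inspects the first bytes of a file. The helper name `detect_word_container` is hypothetical and not part of PHPWord. It only identifies the container: the OLE2 signature is shared by other legacy Office files, and the ZIP signature by any ZIP archive, so the result is a hint rather than proof of a Word document.

```php
<?php
// Hypothetical helper (not part of PHPWord): guess the container format
// of a file from its leading bytes.
function detect_word_container(string $bytes): string
{
    // OLE2 compound file signature, used by legacy binary .doc files
    // (and by other legacy Office formats).
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole2';
    }
    // ZIP local file header signature, used by .docx packages
    // (and by every other ZIP archive).
    if (strncmp($bytes, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    return 'unknown';
}

// Example call, assuming a file named k.doc exists:
// $head = file_get_contents('k.doc', false, null, 0, 8);
// echo detect_word_container($head);
```

A file reported as `ole2` needs a binary-format reader, while a `zip` result can be opened as an archive.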

Karnak answered 9/9, 2011 at 8:5 Comment(5)
And .docx is a zipped package of multiple files. - Connally
@Karnak I want to count all the words in a .doc file. Is that possible with PHPWord? - Machellemachete
Also, please note that PHPWord is NOT by Microsoft: it is a free, open-source library written by a group of independent developers, and it is not supported or endorsed by MS in any way. - Eldwin
Note also that PHPWord now reads BIFF format .doc files, as well as OfficeOpenXML .docx files and a number of other formats. - Eldwin
Note also that PHPWord is now officially hosted on GitHub, not on CodePlex. - Eldwin

You can read .docx files with plain PHP, because a .docx is a ZIP archive of XML files, but you can't read binary .doc files this way. Here is the code to read .docx files:

function read_file_docx($filename){

    $striped_content = '';
    $content = '';

    if(!$filename || !file_exists($filename)) return false;

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}
$filename = "filepath";// or /var/www/html/file.docx

$content = read_file_docx($filename);
if($content !== false) {

    echo nl2br($content);
}
else {
    echo 'Couldn\'t read the file. Please check that file.';
}
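
The `zip_open()` family of functions used above is deprecated as of PHP 8.0. A minimal sketch of the same idea with the `ZipArchive` class follows, assuming the zip extension is installed. The function name `read_docx_text` is made up for this example, and the tag replacements are the same heuristics as in the code above, not a full OOXML parser.

```php
<?php
// Sketch: extract plain text from the main part of a .docx using ZipArchive.
// Returns the text as a string, or false on any failure.
function read_docx_text($filename)
{
    if (!$filename || !is_file($filename)) {
        return false;
    }

    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise
    if ($zip->open($filename) !== true) {
        return false;
    }

    // getFromName() returns false when the entry is missing
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    // Same heuristics as the original answer: a space between table cells,
    // a line break at the end of each paragraph, then drop all XML tags.
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    $xml = str_replace('</w:r></w:p>', "\r\n", $xml);

    return strip_tags($xml);
}
```

For the word count asked about in the comments, passing the returned string to `str_word_count()` gives a rough figure, with the caveat that what it treats as a word character depends on the locale.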
Sumer answered 12/11, 2012 at 7:51 Comment(3)
Welcome to SO. It is good practice here to explain why your solution should be used, not just how. That makes your answer more valuable and helps future readers understand how it works. I also suggest having a look at our FAQ: stackoverflow.com/faq. - Accentuate
Thank you for your answer, but how do I write to that file? - Turbofan
@Sumer This only reads text from the document. How can I get the images as well? Images as binary data would also work. - Shae

I am using this function and it works well for me :) Try it:

function read_doc_file($filename) {
    if (!file_exists($filename)) {
        return false;
    }

    // Open in binary mode, since .doc is a binary format
    if (($fh = fopen($filename, 'rb')) === false) {
        return false;
    }

    $headers = fread($fh, 0xA00);

    // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
    $n1 = (ord($headers[0x21C]) - 1);

    // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
    $n2 = ((ord($headers[0x21D]) - 8) * 256);

    // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
    $n3 = ((ord($headers[0x21E]) * 256) * 256);

    // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
    $n4 = (((ord($headers[0x21F]) * 256) * 256) * 256);

    // Total length of text in the document
    $textLength = ($n1 + $n2 + $n3 + $n4);

    $extracted_plaintext = fread($fh, $textLength);
    fclose($fh);

    // simple print character stream without new lines
    //echo $extracted_plaintext;

    // if you want to see your paragraphs in a new line, do this
    // (need more spacing after each paragraph? use another nl2br)
    return nl2br($extracted_plaintext);
}
Kianakiang answered 7/10, 2013 at 12:11 Comment(3)
This function works for reading the doc file, but I guess only for UTF-encoded files. Can you please tell me why other encodings do not work? I tried to read some files with this function and it does not work for all of them. The only difference I can see is the encoding. - Iona
See my answer below for encoding issues. - Britney
After half an hour of searching for a simple way to read a doc file, I finally found this answer, which solved my problem. - Dishabille

Decoding in pure PHP never worked for me, so here is my solution: http://wvware.sourceforge.net/

Install package

sudo apt-get install wv elinks

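Use it in PHP:

$output = str_replace('.doc', '.txt', $filename);
// Escape both paths so spaces or shell metacharacters cannot alter the command
shell_exec('/usr/bin/wvText ' . escapeshellarg($filename) . ' ' . escapeshellarg($output));
$text = file_get_contents($output);
# Convert to UTF-8 if needed
if (!mb_detect_encoding($text, 'UTF-8', true))
{
    // Same conversion as utf8_encode(), which is deprecated as of PHP 8.2
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
unlink($output);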
Exposed answered 21/3, 2016 at 13:40 Comment(1)
Thank you so much @Exposed, this is the only way to make it work as I wanted :) - Trusty

I also used it, but for accented characters (and single quotes like ' ) it would output � instead, so my PDO MySQL connection didn't like it. I finally figured it out by adding

mb_convert_encoding($extracted_plaintext,'UTF-8');

So the final version should read:

function getRawWordText($filename) {
    if(file_exists($filename)) {
        if(($fh = fopen($filename, 'r')) !== false ) {
            $headers = fread($fh, 0xA00);
            $n1 = ( ord($headers[0x21C]) - 1 );// 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );// 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );// 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );// 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $textLength = ($n1 + $n2 + $n3 + $n4);// Total length of text in the document
            $extracted_plaintext = fread($fh, $textLength);
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext,'UTF-8');
             // if you want to see your paragraphs in a new line, do this
             // return nl2br($extracted_plaintext);
             return ($extracted_plaintext);
        } else {
            return false;
        }
    } else {
        return false;
    }  
}

This works fine for reading Word .doc files into a MySQL database that uses the utf8_general_ci collation :)

Hope this helps someone else
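
As a variation on the encoding fix above, here is a small sketch that names the source encoding explicitly instead of relying on the default. The helper `to_utf8` is hypothetical, and it rests on one assumption that should be checked against real files: any text that is not already valid UTF-8 is treated as Windows-1252, which is common for Western-language .doc files but not universal.

```php
<?php
// Hypothetical helper: normalise extracted text to UTF-8.
// Assumption: bytes that are not valid UTF-8 are Windows-1252.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        // Already valid UTF-8, so leave it untouched
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
```

It could be applied to `$extracted_plaintext` just before returning it from `getRawWordText()`.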

Britney answered 9/9, 2011 at 7:54 Comment(0)

I'm using soffice (the LibreOffice command-line binary) to convert the .doc file to .txt, and then I read the converted text file:

soffice --convert-to txt test.doc

you can see more in here
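
The conversion above could be driven from PHP roughly as follows. This is a sketch under stated assumptions: `soffice` is on the PATH, the web server user is allowed to run it, and the function names `build_soffice_command` and `doc_to_text` are invented for this example. The `--headless` and `--outdir` options are added so that no GUI is started and the output location is predictable.

```php
<?php
// Hypothetical helper: build the LibreOffice command line for converting
// a document to plain text. Assumes `soffice` is on the PATH.
function build_soffice_command(string $docPath, string $outDir): string
{
    return 'soffice --headless --convert-to txt:Text --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: convert the document and return its text,
// or false if no output file was produced.
function doc_to_text(string $docPath)
{
    $outDir = sys_get_temp_dir();
    shell_exec(build_soffice_command($docPath, $outDir));

    // soffice writes <basename>.txt into the output directory
    $txtPath = $outDir . DIRECTORY_SEPARATOR
        . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
    if (!is_file($txtPath)) {
        return false;
    }

    $text = file_get_contents($txtPath);
    unlink($txtPath);
    return $text;
}
```

Both paths go through `escapeshellarg()` so that file names with spaces or shell metacharacters cannot change the command.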

Spieler answered 5/11, 2018 at 13:43 Comment(0)
