Reading/Writing a MS Word file in PHP
F

16

33

Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object? I know that I can:

$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file);

but Word will read it as an HTML file not a native .doc file.

Fronniah answered 9/10, 2008 at 18:9 Comment(1)
I find it HIGHLY unlikely that you could achieve this without using COM.Preen
M
29

Reading binary Word documents would involve writing a parser according to the published file format specifications for the DOC format. I don't think this is a really feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files - they are compatible with both the 2003 and 2007 versions of Word. For reading, you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML Document in Word 2007). For writing, you just have to follow the openly available XML schema. I've never used this format for writing Office documents from PHP, but I'm using it to read an Excel worksheet (saved as XML Spreadsheet 2003, naturally) and display its data on a web page. As the files are plain XML data, it's no problem to navigate within them and figure out how to extract the data you need.
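As a rough illustration of reading an XML Spreadsheet 2003 file, here is a minimal sketch using SimpleXML. It assumes the usual `urn:schemas-microsoft-com:office:spreadsheet` namespace and the `Row`/`Cell`/`Data` layout; the function name `spreadsheetRows` and the sample data are made up for the example, and real files carry styles and other elements that this ignores.

```php
<?php
// Sample XML Spreadsheet 2003 content (hypothetical data for illustration).
$sampleXml = <<<'XML'
<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row><Cell><Data ss:Type="String">Name</Data></Cell><Cell><Data ss:Type="String">Qty</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">Widget</Data></Cell><Cell><Data ss:Type="Number">42</Data></Cell></Row>
  </Table>
 </Worksheet>
</Workbook>
XML;

// Returns the cell values of every row as an array of string arrays.
function spreadsheetRows(string $xml): array
{
    $ns  = 'urn:schemas-microsoft-com:office:spreadsheet';
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    $doc->registerXPathNamespace('ss', $ns);

    $rows = [];
    foreach ($doc->xpath('//ss:Row') as $row) {
        // Prefixes have to be registered again on each element used for XPath.
        $row->registerXPathNamespace('ss', $ns);
        $cells = [];
        foreach ($row->xpath('ss:Cell/ss:Data') as $data) {
            $cells[] = (string) $data;
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

Calling `spreadsheetRows(file_get_contents('sheet.xml'))` on a file saved as XML Spreadsheet 2003 should give one array per row, which can then be printed as an HTML table.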

The other option - a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) - would be to resort to OpenXML. As databyss pointed out here, the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think - it just depends on how much time you'll invest.

Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.

Matteroffact answered 5/11, 2008 at 13:4 Comment(1)
It seems the people at PHPExcel have made PHPWord to create Word documents.Incogitable
F
18

This works with versions before Office 2007 and it's pure PHP, no COM crap. I'm still trying to figure out 2007.

<?php



/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

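function parseWord($userDoc)
{
    // Read the whole file in binary mode
    $fileHandle = fopen($userDoc, "rb");
    if ($fileHandle === false) {
        return "";
    }
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);

    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Keep only non-empty slices that contain no NUL byte
        if (strpos($thisline, chr(0x00)) === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Strip everything except basic ASCII letters, digits and punctuation
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}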

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;


?>
Factorial answered 5/11, 2008 at 12:35 Comment(2)
Do not use this if you want to preserve Umlaute.Rind
I find some special characters that cannot be parsed in this function.Penholder
H
8

You can use Antiword. It is a free MS Word reader for Linux and most other popular operating systems.

$document_file = 'c:\file.doc';
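// escapeshellarg() keeps spaces or shell metacharacters in the path from breaking the command
$text_from_doc = shell_exec('/usr/local/bin/antiword '.escapeshellarg($document_file));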
Herschelherself answered 23/5, 2009 at 0:57 Comment(5)
The problem with this type of solution is that it assumes that one is able to install software on the server.Fronniah
It's been a long time, but correct me if I'm wrong: C:\file.doc is a Windows path and /usr/local/bin is a Linux/Unix directory?Respond
@UnkwnTech: as long as the program doesn't require elevated permission, most programs can be installed in any directory that you do have permission to write to. You can then use the full path to refer to the program, or add the install directory to your PATH variable.Embranchment
@LieRyan you missed the point: if you're running this in a shared hosting environment, you most often can't install any software regardless of the directory.Fronniah
@UnkwnTech: by installing, I meant simply copying it to any directory you have write permission on and setting the execute bit. This works in any shared hosting environment that gives you ssh access or at least the ability to execute scripts (i.e. the only environment where this wouldn't work is static-file-only hosting, but then you wouldn't be talking about PHP anyway). If you only have ftp access and no ssh, it's still possible, though you may need to write a few PHP scripts to set the execute bit.Embranchment
N
6

I don't know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
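To give an idea of what "an XML document in the correct format" looks like, here is a minimal sketch that builds a Word 2003 XML (WordprocessingML) document from plain paragraphs. The helper name `buildWordMl` is made up for the example, and the output is the barest structure I would expect Word to accept (no styles, fonts or page setup), so treat it as a starting point only.

```php
<?php
// Builds a bare-bones Word 2003 XML document with one <w:p> per paragraph.
function buildWordMl(array $paragraphs): string
{
    $body = '';
    foreach ($paragraphs as $paragraph) {
        // Escape &, < and > so that the text cannot break the XML.
        $safe  = htmlspecialchars($paragraph, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $body .= "<w:p><w:r><w:t>{$safe}</w:t></w:r></w:p>";
    }

    return '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . "\n"
        // This processing instruction tells Windows to open the file with Word.
        . '<?mso-application progid="Word.Document"?>' . "\n"
        . '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
        . "<w:body>{$body}</w:body>"
        . '</w:wordDocument>';
}

// Usage (writes a file, so it is left commented out here):
// file_put_contents('hello.xml', buildWordMl(['Hello from PHP', 'Fish & chips']));
```

Saving the result with an .xml extension and opening it in Word 2003 or 2007 should show the paragraphs as a normal document.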

Nanon answered 10/10, 2008 at 0:23 Comment(0)
Q
6

Just updating the code

<?php

/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
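    $fileHandle = fopen($userDoc, "rb");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    for($i=1536; $i<$tam; $i++)
    {
        $line .= $word_text[$i];

        // Compare against the NUL byte itself; "== 0" compares a string with an
        // integer and gives different results on PHP 7 and PHP 8
        if( $word_text[$i] === chr(0x00) )
        {
            $nulos++;
        }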
        else
        {
            $nulos=0;
            $caracteres++;
        }

        if( $nulos>1996)
        {   
            break;  
        }
    }

    //echo $caracteres;

    $lines = explode(chr(0x0D),$line);
    //$outtext = "<pre>";

    $outtext = "";
    foreach($lines as $thisline)
    {
        $tam = strlen($thisline);
        if( !$tam )
        {
            continue;
        }

        $new_line = ""; 
        for($i=0; $i<$tam; $i++)
        {
            $onechar = $thisline[$i];
            if( $onechar > chr(240) )
            {
                continue;
            }

            if( $onechar >= chr(0x20) )
            {
                $caracteres++;
                $new_line .= $onechar;
            }

            if( $onechar == chr(0x14) )
            {
                $new_line .= "</a>";
            }

            if( $onechar == chr(0x07) )
            {
                $new_line .= "\t";
                if( isset($thisline[$i+1]) )
                {
                    if( $thisline[$i+1] == chr(0x07) )
                    {
                        $new_line .= "\n";
                    }
                }
            }
        }
        // replace with a hyperlink
        $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
        $new_line = str_replace("\o" ,">",$new_line); 
        $new_line .= "\n";

        // image links
        $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
        $new_line = str_replace("\*" ,"><br>",$new_line); 
        $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 


        $outtext .= nl2br($new_line);
    }

 return $outtext;
} 

$userDoc = "Cultura.doc";
$text = parseWord($userDoc);

echo $text;


?>
Quaquaversal answered 4/4, 2011 at 2:43 Comment(4)
Although interesting, this failed to find the start of a Word97 document, and cut the document off. I found it's in the 1536 and 1996 numbers, which should be determined by parsing, not arbitrary hardcoding. As well, the special chars like smart quotes, ellipses, em-dash, and special single quotes all were stripped, and I saw a lot of ampersands throughout the output. So, this is an interesting start, but needs a lot of refinement.Reason
You may also want to reference this tutorial on how to convert special MS Word characters: toao.net/48-replacing-smart-quotes-and-em-dashes-in-mysqlReason
the function produces some strange chars: "Œ’ÛJA†ïßaÈ}7Û"ÒÙÞH¡w"ë„™ìw̤ھ½..."Sideway
@Reason change the $nulos limit to a higher number to avoid the break.History
M
5

Most probably you won't be able to read Word documents without COM.

Writing was covered in this topic

Michaels answered 10/10, 2008 at 2:17 Comment(0)
S
3

2007 might be a bit complicated as well.

The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.

Rename a .docx file to .zip and you'll see what I mean.

So if you can work within zip files in PHP, you should be on the right path.
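As a minimal sketch of that idea, the following uses PHP's ZipArchive and DOM extensions to pull the plain text out of word/document.xml. The function name `docxToText` is made up for the example; it assumes the main body lives at that path and ignores headers, footers, footnotes and all formatting.

```php
<?php
// Extracts plain text from a .docx file, one line per paragraph.
// Returns null if the file cannot be opened or has no word/document.xml.
function docxToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml, LIBXML_NONET)) {
        return null;
    }

    // Namespace used by the WordprocessingML elements (w:p, w:r, w:t).
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $paragraphs = [];
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $p) {
        $text = '';
        // A paragraph is split into runs; the visible text sits in <w:t>.
        foreach ($p->getElementsByTagNameNS($ns, 't') as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}
```

Walking the `<w:t>` elements instead of calling strip_tags() keeps paragraph boundaries intact and leaves entity decoding to the XML parser.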

Surreptitious answered 9/10, 2008 at 18:9 Comment(0)
A
2

www.phplivedocx.org is a SOAP-based service, which means that you always need to be online to test it, and it does not have enough usage examples. Strangely, I found out only after 2 days of downloading (it additionally requires Zend Framework) that it's a SOAP-based program (curse me!). I think that without COM it's just not possible on a Linux server, and the only option is to convert the .doc file into another format that PHP can parse.

Adjoint answered 13/9, 2009 at 17:45 Comment(0)
T
2

Source gotten from

Use the following class directly to read a Word document

class DocxConversion{
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    private function read_doc() {
        // Open in binary mode and release the handle once the data is read
        $fileHandle = fopen($this->filename, "rb");
        $line = @fread($fileHandle, filesize($this->filename));
        fclose($fileHandle);
        $lines = explode(chr(0x0D),$line);
        $outtext = "";
        foreach($lines as $thisline)
          {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE)||(strlen($thisline)==0))
              {
              } else {
                $outtext .= $thisline." ";
              }
          }
         $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
        return $outtext;
    }

    private function read_docx(){

        $striped_content = '';
        $content = '';

        // The zip_* functions are deprecated as of PHP 8.0, so use ZipArchive
        $zip = new ZipArchive;

        if ($zip->open($this->filename) !== true) return false;

        // word/document.xml holds the main body of a .docx file
        $content = $zip->getFromName("word/document.xml");

        $zip->close();

        if ($content === false) return false;

        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }

 /************************excel sheet************************************/

function xlsx_to_text($input_file){
    $xml_filename = "xl/sharedStrings.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}

/*************************power point files*****************************/
function pptx_to_text($input_file){
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        $slide_number = 1; //loop through slide files
        while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text .= strip_tags($xml_handle->saveXML());
            $slide_number++;
        }
        if($slide_number == 1){
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}


    public function convertToText() {

        if(isset($this->filename) && !file_exists($this->filename)) {
            return "File Not exists";
        }

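        $fileArray = pathinfo($this->filename);
        // Files without an extension have no 'extension' key; compare case-insensitively
        $file_ext  = isset($fileArray['extension']) ? strtolower($fileArray['extension']) : '';
        if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
        {
            if($file_ext == "doc") {
                return $this->read_doc();
            } elseif($file_ext == "docx") {
                return $this->read_docx();
            } elseif($file_ext == "xlsx") {
                // xlsx_to_text() and pptx_to_text() expect the file path as an argument
                return $this->xlsx_to_text($this->filename);
            }elseif($file_ext == "pptx") {
                return $this->pptx_to_text($this->filename);
            }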
        } else {
            return "Invalid File Type";
        }
    }

}

$docObj = new DocxConversion("test.docx"); //replace your document name with correct extension doc or docx 
echo $docText= $docObj->convertToText();
Tibold answered 3/7, 2019 at 10:25 Comment(0)
V
1

Office 2007 .docx should be possible since it's an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven't seen many libraries written to match them yet.

Versicolor answered 10/10, 2008 at 2:45 Comment(0)
J
1

I don't know what you are going to use it for, but I needed .doc support for search indexing. What I did was use a little command-line tool called "catdoc", which converts the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and such, this is not your tool.
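For reference, calling it from PHP could look roughly like the sketch below. It assumes catdoc is installed and on the PATH of the web server user, that shell_exec() is not disabled, and a Unix-like shell; the helper names `catdocCommand` and `docToPlainText` are made up for the example.

```php
<?php
// Builds the shell command; escapeshellarg() protects against spaces
// and shell metacharacters in the file name.
function catdocCommand(string $docPath, string $binary = 'catdoc'): string
{
    return escapeshellcmd($binary) . ' ' . escapeshellarg($docPath);
}

// Returns the plain text of a .doc file, or null if the file is not
// readable or the command produced no output.
function docToPlainText(string $docPath): ?string
{
    if (!is_readable($docPath)) {
        return null;
    }
    $output = shell_exec(catdocCommand($docPath));
    return is_string($output) ? $output : null;
}
```

The returned text can then be handed to whatever search indexer you use.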

Juryrig answered 10/10, 2008 at 15:25 Comment(0)
B
1

phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.

See the project web site at:

http://www.phplivedocx.org

Ballon answered 14/5, 2009 at 7:3 Comment(1)
Reference Link is deadLeapt
G
1

One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX. You can see how it works by having a look at its online tutorial. You can insert or extract content, or even merge multiple Word files into a single one.

Gurgitation answered 28/9, 2012 at 16:44 Comment(0)
C
0

Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
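To show what that plaintext looks like, here is a minimal sketch that writes a small RTF document from PHP. The helper names `rtfEscape` and `buildRtf` are made up for the example, and it assumes ASCII input only: non-ASCII characters need RTF's \u or \' escapes, which this does not handle.

```php
<?php
// Escapes the three characters that have special meaning in RTF and
// turns newlines into paragraph breaks. ASCII input is assumed.
function rtfEscape(string $text): string
{
    $text = str_replace(['\\', '{', '}'], ['\\\\', '\\{', '\\}'], $text);
    return str_replace("\n", "\\par\n", $text);
}

// Wraps the text in a minimal RTF header: one font, 12pt (\fs24 is in half-points).
function buildRtf(string $text): string
{
    return '{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}\f0\fs24 '
        . rtfEscape($text)
        . '}';
}

// Usage (writes a file, so it is left commented out here):
// file_put_contents('hello.rtf', buildRtf("First line\nSecond line"));
```

Word opens the resulting .rtf file directly and can save it as .doc from there.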

Carlocarload answered 24/1, 2009 at 5:9 Comment(1)
Circumstance is irrelevant; the question was whether or not it was possible, but thanks.Fronniah
M
0

I'm also working on the same kind of project (an online word processor), but I've chosen C#.NET and ASP.NET. From the survey I did, I learned that

by using the Open XML SDK and VSTO [Visual Studio Tools for Office]

we can easily work with a Word file, manipulate it, and even convert it internally into several formats such as .odt, .pdf, .docx, etc.

So, go to msdn.microsoft.com and read through the Office development tab thoroughly. It's the easiest way to do this, as all the functions we need to implement are already available in .NET!

But as you want to do your project in PHP, you can do it in Visual Studio and .NET, as PHP is also one of the .NET-compliant languages!

Materials answered 5/9, 2010 at 14:17 Comment(0)
B
0

I have the same case. I guess I am going to use a cheap 50 MB Windows-based hosting plan with a free domain to convert my files for the PHP server. Linking them is easy: all you need is an ASP.NET page that receives the .doc file via POST and returns the result over HTTP, so a simple cURL call would do it.
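The PHP side of that could look roughly like the sketch below. The endpoint URL and the form field name `file` are hypothetical and depend entirely on how the ASP.NET page is written; the function name `convertViaRemoteService` is also made up for the example.

```php
<?php
// Posts a .doc file to a remote conversion page and returns the response
// body, or null on any transport error or non-200 status.
function convertViaRemoteService(string $serviceUrl, string $docPath): ?string
{
    $ch = curl_init($serviceUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        // CURLFile sends the document as a multipart/form-data upload.
        // The field name 'file' is an assumption about the remote page.
        CURLOPT_POSTFIELDS     => [
            'file' => new CURLFile($docPath, 'application/msword', basename($docPath)),
        ],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 30,
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($body !== false && $status === 200) ? $body : null;
}

// Usage with a hypothetical endpoint:
// $text = convertViaRemoteService('http://example.com/convert.aspx', 'cv.doc');
```

Returning null on failure lets the caller decide whether to retry or fall back to another method.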

Bread answered 11/10, 2010 at 19:12 Comment(1)
Seems like this is the only way to do it after all. Can you provide more details? I mean, am I supposed to go and purchase Windows hosting and use it to run PHP code (that uses the COM library) to create the .doc/x file?Pe
