How to extract text from MS office documents in C#
M

10

42

I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.

Mabellemable answered 18/6, 2009 at 7:20 Comment(0)
F
27

Using P/Invoke you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

Fuji answered 18/6, 2009 at 8:28 Comment(7)
Interesting... a very sneaky solution :)Steelyard
Not really. It's the mechanism used by the indexing service on Windows and I think the desktop search also uses it. I've used it to index pdfs (by installing the Adobe IFilter - adobe.com/support/downloads/detail.jsp?ftpID=2611), all types of Office documents (the IFilters for these come installed with Windows) and several other file types. When it works, it works well. Occasionally though, you get no text back from the IFilter, and no reason as to why.Fuji
I used P/Invoke and found it excellent. To extract text from any document, all we have to do is make sure the appropriate IFilter is installed on the machine (or download and install it). I love this article and sample from CodeProject: codeproject.com/KB/cs/IFilter.aspx. For MS Office 2007, here is the MS Office 2007 filter pack: microsoft.com/downloads/…Mabellemable
Yes, as long as you install the PDF iFilter. You can do this by installing Acrobat Reader (the iFilter gets installed with it), or by installing the iFilter separately (adobe.com/support/downloads/detail.jsp?ftpID=4025). [Note: other PDF iFilters are available :)]Fuji
2 quick Qs - a) I am currently using the method outlined here - codeproject.com/KB/cs/PDFToText.aspx to extract text from PDF. In what way would using IFilters be any different? b) In the IFilter method you linked, the author does: TextReader reader = new FilterReader(fileName); I am using the FileUpload control in ASP.NET and I cannot get the path to the file, as this is not exposed on the server side for security. I can only do the following with the FileUpload control on the server side: Stream str = fileUpload1.FileContent; byte[] b = fileUpload1.FileBytes;Septi
@user102533: a) The only real difference is that using the IFilter gives you a generic method of extracting the text from any supported file type. Using PDFToText is specific to that library, and to PDF files. If you only need to do it for PDF files though, it doesn't make much difference (and might be better, as the Adobe IFilter is a bit temperamental). b) IFilters work by you passing them a filename. What I've done in the past is to save the byte[] to a temporary file and then pass its filename to the IFilter.Fuji
Please post a sample of invoking an iFilter using pInvoke.Acanthous
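For the temporary-file workaround mentioned in the comments above, here is a minimal sketch. The class and method names (TempFileExtraction, ExtractViaTempFile) are made up for illustration, and the extractor is passed in as a delegate so the helper does not depend on any particular IFilter wrapper. It keeps the original file extension on the temp file, on the assumption that the extractor chooses its filter by extension.

```csharp
using System;
using System.IO;

public static class TempFileExtraction
{
    // Hypothetical helper: writes the uploaded bytes to a uniquely named
    // temp file, runs a path-based extractor on it, and always deletes
    // the file afterwards, even if the extractor throws.
    public static string ExtractViaTempFile(
        byte[] content, string extension, Func<string, string> extractFromPath)
    {
        // Keep the extension (e.g. ".docx"); assumed to matter for filter lookup.
        string tempPath = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);

        File.WriteAllBytes(tempPath, content);
        try
        {
            return extractFromPath(tempPath);
        }
        finally
        {
            File.Delete(tempPath);
        }
    }
}
```

With the FilterReader class from the linked CodeProject sample, a call might look like ExtractViaTempFile(fileUpload1.FileBytes, ".docx", path => { using (var reader = new FilterReader(path)) return reader.ReadToEnd(); }).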
H
48

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.

See: http://msdn.microsoft.com/en-us/library/bb448854.aspx

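    // Needs: using System; using System.Text; using System.Xml;
    //        using DocumentFormat.OpenXml.Packaging;
    // SPFile comes from Microsoft.SharePoint; Open() also accepts any .docx stream or path.
    public static string TextFromWord(SPFile file)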
    {
        const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        StringBuilder textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
        {
            // Manage namespaces to perform XPath queries.  
            NameTable nt = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            // Get the document part from the package.  
            // Load the XML in the document part into an XmlDocument instance.  
            XmlDocument xdoc = new XmlDocument(nt);
            xdoc.Load(wdDoc.MainDocumentPart.GetStream());

            XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                foreach (System.Xml.XmlNode textNode in textNodes)
                {
                    textBuilder.Append(textNode.InnerText);
                }
                textBuilder.Append(Environment.NewLine);
            }

        }
        return textBuilder.ToString();
    }
Heppman answered 28/12, 2011 at 18:21 Comment(10)
@adrianbanks I feel that this answer is currently better than the accepted answer because the accepted answer will not work on certain versions of Windows and because IFilter is a deprecated interface. Of course at the time adrian's post was written that was not the case.Heppman
What about SPFile? The argument you are passing to the function is of this type, and all I could find about it is the Microsoft.SharePoint namespace in Microsoft.SharePoint.dll, and this DLL is not easy to find. What have you referenced to get SPFile?Regine
@user867703 You don't have to use SPFile. It was an example. You can use any .docx file (opened as a binary stream). Look at the WordprocessingDocument.Open method, that's the important method.Heppman
I simply changed SPFile to a path (string) and passed just the path to the Open method, and it works. The solution is very clear and simple.Regine
@KyleM This doesn't seem to work for me on a 64-bit system. I can't find the DocumentFormat.OpenXml DLL for 64-bit systems, and adding the 32-bit one doesn't work. Or am I doing something wrong?Templar
@Templar Well your application will have to be run in 32 bit mode. A 64 bit process cannot load a 32 bit .dll. All assemblies loaded by a particular process must conform to the "bit-ness" of that processHeppman
@KyleM Hey, thanks for the reply. It turns out I just had to change the framework from 2.0 to 3.5. And it does work in my 64-bit project, just to confirm. Thanks anyway :)Templar
@Templar Glad you got it working but what I said was correct. See #2265523Heppman
With the Open XML package you need to import DocumentFormat.OpenXml.Packaging and DocumentFormat.OpenXml.Wordprocessing, and you need to reference WindowsBase.dll for it to work. Other than that, nice solution.Glum
@KristianBarrett Thanks. If you reference the DLL I mentioned in the post, I think Visual Studio will tell you which packages to import. It's been a while though, so thanks for the exact imports for anyone who needs them.Heppman
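As the comments note, SPFile is not required. Here is a hedged sketch of the same idea that takes a plain file path and uses the SDK's typed classes (Paragraph, Text) instead of XPath. The class name OpenXmlWordText is invented for this example. It assumes the DocumentFormat.OpenXml assembly is referenced and, like the snippet above, only collects w:t text, one line per paragraph.

```csharp
using System;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class OpenXmlWordText
{
    // Sketch: open a .docx read-only by path and append the text of
    // every w:t element, with a line break after each paragraph.
    public static string TextFromWord(string path)
    {
        var textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(path, false))
        {
            Body body = wdDoc.MainDocumentPart.Document.Body;
            foreach (Paragraph paragraph in body.Descendants<Paragraph>())
            {
                // Text is the typed wrapper for w:t elements.
                foreach (Text text in paragraph.Descendants<Text>())
                {
                    textBuilder.Append(text.Text);
                }
                textBuilder.AppendLine();
            }
        }
        return textBuilder.ToString();
    }
}
```

The typed API avoids the namespace manager boilerplate, at the cost of loading the document into the SDK's object model.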
L
18

Tika is very helpful and makes it easy to extract text from different kinds of documents, including Microsoft Office files.

You can use this project, a nice piece of work by Kevin Miller: http://kevm.github.io/tikaondotnet/

Just simply add this NuGet package https://www.nuget.org/packages/TikaOnDotNet/

and then, this one line of code will do the magic:

var text = new TikaOnDotNet.TextExtractor().Extract("fileName.docx  / pdf  / .... ").Text;
Lanciform answered 23/11, 2015 at 2:5 Comment(2)
This is the package you need: nuget.org/packages/TikaOnDotnet.TextExtractorIncorporator
Worth noting here that this actually runs Apache Tika (Java) through IKVM, which is a .NET runtime for Java, so it's not a lightweight solution (40 MB of binaries, basically a whole Java runtime).Eskisehir
D
13

Let me slightly correct the answer given by KyleM. I added processing of two extra nodes that influence the result: one for horizontal tabs (w:tab, emitted as "\t"), the other for breaks (w:br, emitted as the vertical tab "\v"). Here is the code:

    public static string ReadAllTextFromDocx(FileInfo fileInfo)
    {
        StringBuilder stringBuilder;
        using(WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(fileInfo.FullName, false))
        {
            NameTable nameTable = new NameTable();
            XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nameTable);
            xmlNamespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            string wordprocessingDocumentText;
            using(StreamReader streamReader = new StreamReader(wordprocessingDocument.MainDocumentPart.GetStream()))
            {
                wordprocessingDocumentText = streamReader.ReadToEnd();
            }

            stringBuilder = new StringBuilder(wordprocessingDocumentText.Length);

            XmlDocument xmlDocument = new XmlDocument(nameTable);
            xmlDocument.LoadXml(wordprocessingDocumentText);

            XmlNodeList paragraphNodes = xmlDocument.SelectNodes("//w:p", xmlNamespaceManager);
            foreach(XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t | .//w:tab | .//w:br", xmlNamespaceManager);
                foreach(XmlNode textNode in textNodes)
                {
                    switch(textNode.Name)
                    {
                        case "w:t":
                            stringBuilder.Append(textNode.InnerText);
                            break;

                        case "w:tab":
                            stringBuilder.Append("\t");
                            break;

                        case "w:br":
                            stringBuilder.Append("\v");
                            break;
                    }
                }

                stringBuilder.Append(Environment.NewLine);
            }
        }

        return stringBuilder.ToString();
    }
Dolf answered 2/7, 2014 at 16:4 Comment(2)
How do you extract images if there is one inside the w:p?Strangles
Note: You will need to add a reference to DocumentFormat.OpenXml and add this: using DocumentFormat.OpenXml.Packaging;Zuzana
L
11

Use the Microsoft Office Interop. It's free and slick. Here is how I pulled all the words from a doc.

    using Microsoft.Office.Interop.Word;

    //Create Doc
    string docPath = @"C:\docLocation.doc";
    Application app = new Application();
    Document doc = app.Documents.Open(docPath);

    //Get all words
    string allWords = doc.Content.Text;
    doc.Close();
    app.Quit();

Then do whatever you want with the words.

Lecroy answered 19/10, 2016 at 2:57 Comment(4)
Ah, brilliant my friend. This should now be the accepted answer, the rest are outdated.Macbeth
This is a very easy, but also very slow, solution. Open XML is "thousands" of times faster.Octosyllabic
It's free - doesn't it require you to have Word installed?Maddalena
@Chris: And apart from Matt Burland's catch-22, how do I run this on a Linux server? ;)Escudo
E
7

A bit late to the party, but nevertheless - nowadays you don't need to download anything - all is already installed with .NET: (just make sure to add references to System.IO.Compression and System.IO.Compression.FileSystem)

using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml;
using System.Text;
using System.IO.Compression;

public static class DocxTextExtractor
{
    public static string Extract(string filename)
    {
        XmlNamespaceManager NsMgr = new XmlNamespaceManager(new NameTable());
        NsMgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        using (var archive = ZipFile.OpenRead(filename))
        {
            return XDocument
                .Load(archive.GetEntry(@"word/document.xml").Open())
                .XPathSelectElements("//w:p", NsMgr)
                .Aggregate(new StringBuilder(), (sb, p) => p
                    .XPathSelectElements(".//w:t|.//w:tab|.//w:br", NsMgr)
                    .Select(e => { switch (e.Name.LocalName) { case "br": return "\v"; case "tab": return "\t"; } return e.Value; })
                    .Aggregate(sb, (sb1, v) => sb1.Append(v)))
                .ToString();
        }
    }
}
Expiable answered 15/9, 2016 at 16:40 Comment(3)
This looks like a great solution, but I'm unable to make this work since I'm getting an error: Number of entries expected in End Of Central Directory does not correspond to number of entries in Central Directory.Macbeth
That message seems to be a ZipFile notion of a zip file (i.e. docx file in this case) being corrupt...Expiable
this doesn't work because it doesn't preserve line ends.Willemstad
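On the line-ends point raised in the last comment: a sketch of a variant that appends a line break after each w:p might look like the following. It selects the same elements as the answer above (w:t, w:tab, w:br) but uses plain loops. The names DocxLineExtractor and ExtractFromDocumentXml are invented here, and I have not run it against a wide range of real documents.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxLineExtractor
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens the .docx as a zip archive and reads word/document.xml.
    public static string Extract(string filename)
    {
        using (ZipArchive archive = ZipFile.OpenRead(filename))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            using (Stream stream = entry.Open())
            {
                return ExtractFromDocumentXml(XDocument.Load(stream));
            }
        }
    }

    // Emits text, tabs and breaks in document order, one line per paragraph.
    public static string ExtractFromDocumentXml(XDocument document)
    {
        var sb = new StringBuilder();
        foreach (XElement paragraph in document.Descendants(W + "p"))
        {
            foreach (XElement e in paragraph.Descendants())
            {
                if (e.Name == W + "t") sb.Append(e.Value);
                else if (e.Name == W + "tab") sb.Append('\t');
                else if (e.Name == W + "br") sb.Append('\v');
            }
            sb.Append(Environment.NewLine); // the paragraph boundary
        }
        return sb.ToString();
    }
}
```

Splitting the XML handling into its own method keeps it usable on a document.xml obtained some other way, for example from a stream.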
H
2

Simple!

These two steps will get you there:

1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX

The link for 1) has a very good explanation of how to do the conversion and even a code sample.

An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.

Edit: Ah yes, I forgot to point out as Skurmedel did below that you must have Office installed on the system on which you want to do the conversion.

Heelandtoe answered 18/6, 2009 at 7:38 Comment(4)
Only sad part with the Office interop library is that you need to have Office installed.Steelyard
Interop is usable, but should be avoided if possible.Hying
Microsoft Word 12.0 Object Library --> This is not in my Add Reference list on the Add Reference right click. Is there another way that Microsoft Word 12.0 Object Library has to be entered so that I can read in a word document.Revelry
Interop does not work on GoDaddy hosting. GoDaddy does not support Office.Savanna
S
1

I wrote a docx text extractor once, and it was very simple. Basically a docx file (and, I presume, the other new formats) is a zip file containing a bunch of XML files. The text can be extracted using an XmlReader and only .NET classes.

I don't have the code anymore, it seems :(, but I found a guy who has a similar solution.

Maybe this isn't viable for you if you need to read .doc and .xls files though, since they are binary formats and probably much harder to parse.

There is also the OpenXML SDK, still in CTP though, released by Microsoft.
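Since the original code is lost, here is a rough reconstruction of the idea, not the original implementation: open the .docx as a zip archive, stream word/document.xml through an XmlReader, and collect the contents of w:t elements, ending a line at each w:p. The class name DocxXmlReaderExtractor is invented for this sketch. It uses ZipFile from System.IO.Compression, which appeared in later .NET versions than this answer.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class DocxXmlReaderExtractor
{
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(string docxPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");
            using (Stream stream = entry.Open())
            {
                return ReadText(stream);
            }
        }
    }

    // Forward-only pass over the XML; nothing is loaded into a DOM.
    public static string ReadText(Stream documentXml)
    {
        var sb = new StringBuilder();
        bool insideText = false;
        using (XmlReader reader = XmlReader.Create(documentXml))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") insideText = !reader.IsEmptyElement;
                        // A self-closing <w:p/> has no end tag, so end its line here.
                        else if (reader.LocalName == "p" && reader.IsEmptyElement)
                            sb.Append(Environment.NewLine);
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") insideText = false;
                        else if (reader.LocalName == "p") sb.Append(Environment.NewLine);
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        // Whitespace-only runs are real content inside w:t.
                        if (insideText) sb.Append(reader.Value);
                        break;
                }
            }
        }
        return sb.ToString();
    }
}
```

The streaming reader keeps memory use flat for large documents, which is the main reason to prefer it over loading the whole part into an XmlDocument.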

Steelyard answered 18/6, 2009 at 7:25 Comment(3)
This is really great! I am done with docx; what about the rest?Mabellemable
You can "connect" to an xlsx file as if it were a database with ODBC, I think. Quite a cumbersome solution, though. I have no idea how to read .doc files or .xls files, so I can't help you there. Here is a reference for .xls files though: sc.openoffice.org/excelfileformat.pdfSteelyard
I couldn't find anything better on XLSX than the specification itself sadly: ecma-international.org/publications/files/ECMA-ST/…Steelyard
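For "the rest" of the new formats, the same zip-plus-XML approach can be sketched for .pptx and .xlsx. In a .pptx package the slide parts live under ppt/slides/ and text runs are a:t elements (DrawingML namespace). In an .xlsx package most cell strings are stored once in xl/sharedStrings.xml. The class and method names below are invented for illustration. Treat this as a starting point only: slides come back in package order, which is not necessarily presentation order, and numbers or inline strings stored in the worksheet parts are not covered.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class OoxmlTextExtractor
{
    private static readonly XNamespace Drawing =
        "http://schemas.openxmlformats.org/drawingml/2006/main";
    private static readonly XNamespace Sheet =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Collects the a:t text runs of every slide part in a .pptx package.
    public static List<string> PowerPointText(Stream pptx)
    {
        using (var archive = new ZipArchive(pptx, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Where(e => e.FullName.StartsWith("ppt/slides/slide", StringComparison.OrdinalIgnoreCase)
                         && e.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .SelectMany(e => ReadElements(e, Drawing + "t"))
                .ToList();
        }
    }

    // Collects the shared string table of an .xlsx package.
    public static List<string> ExcelSharedStrings(Stream xlsx)
    {
        using (var archive = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("xl/sharedStrings.xml");
            return entry == null
                ? new List<string>()
                : ReadElements(entry, Sheet + "t");
        }
    }

    // Loads one XML part and returns the values of all elements with the given name.
    private static List<string> ReadElements(ZipArchiveEntry entry, XName name)
    {
        using (Stream stream = entry.Open())
        {
            return XDocument.Load(stream).Descendants(name).Select(x => x.Value).ToList();
        }
    }
}
```

The old binary .doc, .xls and .ppt formats are not zip packages, so this does not apply to them.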
F
0

If you're looking for ASP.NET options, the interop won't work unless you install Office on the server. Even then, Microsoft says not to do it.

I used Spire.Doc (Spire.Doc download), and it worked beautifully. It even read documents that were really .txt but were saved as .doc. They have free and paid versions. You can also get a trial license that removes some warnings from documents that you create, but I didn't create any, just searched them, so the free version worked like a charm.

Faena answered 23/6, 2017 at 16:51 Comment(1)
Erik Felde, could you give an example for ASP.NET using Spire.Doc?Ambroid
S
0

One of the suitable options for extracting text from Office documents in C# is GroupDocs.Parser for .NET API. The following are the code samples for extracting simple as well as formatted text.

Extracting Text

// Create an instance of Parser class
using(Parser parser = new Parser("sample.docx"))
{
    // Extract a text into the reader
    using(TextReader reader = parser.GetText())
    {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}

Extracting Formatted Text

// Create an instance of Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract a formatted text into the reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print a formatted text from the document
        // If formatted text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Formatted text extraction isn't supported" : reader.ReadToEnd());
    }
}

Disclosure: I work as Developer Evangelist at GroupDocs.

Sialkot answered 9/10, 2019 at 10:18 Comment(0)
