How to extract text from MS office documents in C#

M

10

42

I was trying to extract text (as a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.

Mabellemable answered 18/6, 2009 at 7:20 Comment(0)

F

27

Using P/Invoke, you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can simply ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

Fuji answered 18/6, 2009 at 8:28 Comment(7)

Interesting... a very sneaky solution :) – Steelyard 18/6, 2009 at 9:5

Not really. It's the mechanism used by the indexing service on Windows and I think the desktop search also uses it. I've used it to index pdfs (by installing the Adobe IFilter - adobe.com/support/downloads/detail.jsp?ftpID=2611), all types of Office documents (the IFilters for these come installed with Windows) and several other file types. When it works, it works well. Occasionally though, you get no text back from the IFilter, and no reason as to why. – Fuji 18/6, 2009 at 11:3

I used P/Invoke and found it excellent. To extract text from any document, all we have to do is make sure the appropriate IFilter is installed on the machine (or download and install it). I love this article and sample from CodeProject: codeproject.com/KB/cs/IFilter.aspx. For MS Office 2007, here is the MS Office 2007 filter pack: microsoft.com/downloads/… – Mabellemable 19/6, 2009 at 8:25

Yes, as long as you install the PDF iFilter. You can do this by installing Acrobat Reader (the iFilter gets installed with it), or by installing the iFilter separately (adobe.com/support/downloads/detail.jsp?ftpID=4025). [Note: other PDF iFilters are available :)] – Fuji 22/2, 2010 at 17:15

2 quick Qs - a) I am currently using the method outlined here - codeproject.com/KB/cs/PDFToText.aspx to extract text from PDF. In what way would using IFilters be any different? b) In the IFilter method you linked, the author does: TextReader reader = new FilterReader(fileName); I am using the FileUpload control in ASP.NET and I cannot get the path to the file, as this is not exposed on the server side for security. I can only do the following with the FileUpload control on the server side: Stream str = fileUpload1.FileContent; byte[] b = fileUpload1.FileBytes; – Septi 22/2, 2010 at 17:31

@user102533: a) The only real difference is that using the IFilter gives you a generic method of extracting the text from any supported file type. Using PDFToText is specific to that library, and to PDF files. If you only need to do it for PDF files though, it doesn't make much difference (and might be better, as the Adobe IFilter is a bit temperamental). b) IFilters work by you passing them a filename. What I've done in the past is to save the byte[] to a temporary file and then pass its filename to the IFilter. – Fuji 22/2, 2010 at 21:59

Please post a sample of invoking an iFilter using pInvoke. – Acanthous 27/12, 2011 at 16:55
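The temporary-file workaround described in the comments above (for uploads where only a byte[] is available) could look roughly like this. It is only a sketch: the class and method names are made up for illustration, and extractFromPath is a hypothetical stand-in for whatever path-based reader you use, such as the FilterReader from the CodeProject sample. Keeping the original file extension on the temp file is an assumption based on filters normally being looked up by extension.

```csharp
using System;
using System.IO;

public static class TempFileExtraction
{
    // Writes the uploaded bytes to a temporary file, runs a path-based extractor
    // on it, and always deletes the file afterwards.
    // 'extractFromPath' is a hypothetical callback (e.g. a wrapper around an IFilter reader).
    public static string ExtractViaTempFile(byte[] content, string extension, Func<string, string> extractFromPath)
    {
        // Assumption: the extractor picks its filter by file extension, so keep it.
        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        try
        {
            File.WriteAllBytes(tempPath, content);
            return extractFromPath(tempPath);
        }
        finally
        {
            // File.Delete does not throw if the file was never created.
            File.Delete(tempPath);
        }
    }
}
```

With the FileUpload control from the comment, usage would be along the lines of TempFileExtraction.ExtractViaTempFile(fileUpload1.FileBytes, ".pdf", path => { using (var r = new FilterReader(path)) return r.ReadToEnd(); }), assuming FilterReader is the TextReader from that article.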

H

48

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.

See: http://msdn.microsoft.com/en-us/library/bb448854.aspx

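    // SPFile is a SharePoint type used here as an example; any readable .docx stream or path works with WordprocessingDocument.Open.
    public static string TextFromWord(SPFile file)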
    {
        const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        StringBuilder textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
        {
            // Manage namespaces to perform XPath queries.  
            NameTable nt = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            // Get the document part from the package.  
            // Load the XML in the document part into an XmlDocument instance.  
            XmlDocument xdoc = new XmlDocument(nt);
            xdoc.Load(wdDoc.MainDocumentPart.GetStream());

            XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                foreach (System.Xml.XmlNode textNode in textNodes)
                {
                    textBuilder.Append(textNode.InnerText);
                }
                textBuilder.Append(Environment.NewLine);
            }

        }
        return textBuilder.ToString();
    }

Heppman answered 28/12, 2011 at 18:21 Comment(10)

@adrianbanks I feel that this answer is currently better than the accepted answer because the accepted answer will not work on certain versions of Windows and because IFilter is a deprecated interface. Of course at the time adrian's post was written that was not the case. – Heppman 28/12, 2011 at 18:24

What about SPFile? The argument you are passing to the function is of this type, and all I could find about it is the Microsoft.Sharepoint namespace in Microsoft.Sharepoint.dll, and this DLL is not easy to find. What have you referenced to get SPFile? – Regine 30/9, 2013 at 10:33

@user867703 You don't have to use SPFile. It was an example. You can use any .docx file (opened as a binary stream). Look at the WordprocessingDocument.Open method, that's the important method. – Heppman 30/9, 2013 at 15:12

I simply changed SPFile to path (string) and in open method I've used just path -> it works. Solution is very clear and simple. – Regine 30/9, 2013 at 19:15

@KyleM This doesn't seem to work for me on a 64-bit system. I can't find the DocumentFormat.OpenXml DLL for 64-bit systems, and adding the 32-bit one doesn't work. Or am I doing something wrong? – Templar 18/10, 2013 at 13:58

@Templar Well your application will have to be run in 32 bit mode. A 64 bit process cannot load a 32 bit .dll. All assemblies loaded by a particular process must conform to the "bit-ness" of that process – Heppman 18/10, 2013 at 16:1

@KyleM Hey, thanks for the reply. Turns out I just had to change the framework from 2.0 to 3.5. And it does work in my 64-bit project, just to confirm. Thanks anyway :) – Templar 19/10, 2013 at 9:24

@Templar Glad you got it working but what I said was correct. See #2265523 – Heppman 20/10, 2013 at 6:13

In the OpenXML package you need to import DocumentFormat.OpenXml.Packaging and DocumentFormat.OpenXml.Wordprocessing, and you need to reference WindowsBase.dll for it to work. Other than that, nice solution. – Glum 9/12, 2014 at 14:57

@KristianBarrett Thanks. If you reference the DLL I mentioned in the post, I think Visual Studio will tell you which packages to import. It's been a while though, so thanks for the exact imports for anyone who needs them. – Heppman 10/12, 2014 at 16:45

L

18

Tika is very helpful and makes it easy to extract text from different kinds of documents, including Microsoft Office files.

You can use this project, a nice piece of work by Kevin Miller: http://kevm.github.io/tikaondotnet/

Simply add this NuGet package: https://www.nuget.org/packages/TikaOnDotNet/

and then, this one line of code will do the magic:

var text = new TikaOnDotNet.TextExtractor().Extract("fileName.docx  / pdf  / .... ").Text;

Lanciform answered 23/11, 2015 at 2:5 Comment(2)

This is the package you need: nuget.org/packages/TikaOnDotnet.TextExtractor – Incorporator 26/4, 2017 at 11:36

Worth noting here that this actually runs Apache Tika (java) through IKVM which is a .net runtime for java, so it's not a light-weight solution. (40MB of binaries, basically a whole java runtime) – Eskisehir 28/2, 2018 at 23:49

D

13

Let me correct the answer given by KyleM slightly. I added handling for two extra nodes that affect the result: w:tab, which is emitted as a horizontal tab ("\t"), and w:br (a line break inside a paragraph), which is emitted as a vertical tab ("\v"). Here is the code:

    public static string ReadAllTextFromDocx(FileInfo fileInfo)
    {
        StringBuilder stringBuilder;
        using(WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(fileInfo.FullName, false))
        {
            NameTable nameTable = new NameTable();
            XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nameTable);
            xmlNamespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

            string wordprocessingDocumentText;
            using(StreamReader streamReader = new StreamReader(wordprocessingDocument.MainDocumentPart.GetStream()))
            {
                wordprocessingDocumentText = streamReader.ReadToEnd();
            }

            stringBuilder = new StringBuilder(wordprocessingDocumentText.Length);

            XmlDocument xmlDocument = new XmlDocument(nameTable);
            xmlDocument.LoadXml(wordprocessingDocumentText);

            XmlNodeList paragraphNodes = xmlDocument.SelectNodes("//w:p", xmlNamespaceManager);
            foreach(XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t | .//w:tab | .//w:br", xmlNamespaceManager);
                foreach(XmlNode textNode in textNodes)
                {
                    switch(textNode.Name)
                    {
                        case "w:t":
                            stringBuilder.Append(textNode.InnerText);
                            break;

                        case "w:tab":
                            stringBuilder.Append("\t");
                            break;

                        case "w:br":
                            stringBuilder.Append("\v");
                            break;
                    }
                }

                stringBuilder.Append(Environment.NewLine);
            }
        }

        return stringBuilder.ToString();
    }

Dolf answered 2/7, 2014 at 16:4 Comment(2)

How do you extract images if there is one inside the w:p? – Strangles 11/5, 2017 at 12:41

Note: You will need to add a reference to DocumentFormat.OpenXml and add this: using DocumentFormat.OpenXml.Packaging; – Zuzana 13/2, 2023 at 20:0

L

11

Use the Microsoft Office Interop. It's free and slick. Here is how I pulled all the words from a doc.

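    using Microsoft.Office.Interop.Word;

    // Open the document (Word must be installed on the machine)
    string docPath = @"C:\docLocation.doc";
    Application app = new Application();
    string allWords;
    try
    {
        Document doc = app.Documents.Open(docPath);

        // Get all words
        allWords = doc.Content.Text;
        doc.Close();
    }
    finally
    {
        // Always shut Word down, even if opening or reading fails
        app.Quit();
    }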

Then do whatever you want with the words.

Lecroy answered 19/10, 2016 at 2:57 Comment(4)

Ah, brilliant my friend. This should now be the accepted answer, the rest are outdated. – Macbeth 27/10, 2016 at 11:9

This is a very easy, but also very slow, solution. Open XML is "thousands" of times faster. – Octosyllabic 4/11, 2016 at 16:25

It's free - doesn't it require you to have Word installed? – Maddalena 4/1, 2019 at 16:36

@Chris: And apart from Matt Burland's catch-22, how do I run this on a Linux server? ;) – Escudo 26/4, 2019 at 16:43

E

7

A bit late to the party, but nevertheless - nowadays you don't need to download anything - all is already installed with .NET: (just make sure to add references to System.IO.Compression and System.IO.Compression.FileSystem)

using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml;
using System.Text;
using System.IO.Compression;

public static class DocxTextExtractor
{
    public static string Extract(string filename)
    {
        XmlNamespaceManager NsMgr = new XmlNamespaceManager(new NameTable());
        NsMgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");

        using (var archive = ZipFile.OpenRead(filename))
        {
            return XDocument
                .Load(archive.GetEntry(@"word/document.xml").Open())
                .XPathSelectElements("//w:p", NsMgr)
                .Aggregate(new StringBuilder(), (sb, p) => p
                    .XPathSelectElements(".//w:t|.//w:tab|.//w:br", NsMgr)
                    .Select(e => { switch (e.Name.LocalName) { case "br": return "\v"; case "tab": return "\t"; } return e.Value; })
                    .Aggregate(sb, (sb1, v) => sb1.Append(v)))
                .ToString();
        }
    }
}

Expiable answered 15/9, 2016 at 16:40 Comment(3)

This looks like a great solution, but I'm unable to make this work since I'm getting an error: Number of entries expected in End Of Central Directory does not correspond to number of entries in Central Directory. – Macbeth 24/3, 2017 at 16:26

That message seems to be a ZipFile notion of a zip file (i.e. docx file in this case) being corrupt... – Expiable 26/3, 2017 at 19:18

this doesn't work because it doesn't preserve line ends. – Willemstad 12/3, 2020 at 17:2
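On the line-ending point raised in the last comment: a possible variant that uses the same built-in classes but appends a newline after every w:p paragraph. This is a sketch, not the answerer's code. The class name DocxLineExtractor is made up, it takes a Stream rather than a filename, and, like the XPath versions above, it would repeat text from paragraphs nested inside other paragraphs (text boxes, for instance).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxLineExtractor
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads word/document.xml from a .docx stream and returns one line per paragraph.
    public static string Extract(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            XDocument xml;
            using (Stream part = entry.Open())
            {
                xml = XDocument.Load(part);
            }

            var sb = new StringBuilder();
            foreach (XElement paragraph in xml.Descendants(W + "p"))
            {
                foreach (XElement e in paragraph.Descendants())
                {
                    if (e.Name == W + "t") sb.Append(e.Value);        // text run
                    else if (e.Name == W + "tab") sb.Append('\t');    // tab
                    else if (e.Name == W + "br") sb.Append('\v');     // break inside a paragraph
                }
                // End of paragraph: this is the line ending the original version omits.
                sb.Append(Environment.NewLine);
            }
            return sb.ToString();
        }
    }
}
```

Call it with something like using (var fs = File.OpenRead(filename)) { var text = DocxLineExtractor.Extract(fs); }.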

H

2

Simple!

These two steps will get you there:

1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX

The link for 1) has a very good explanation of how to do the conversion and even a code sample.

An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.

Edit: Ah yes, I forgot to point out as Skurmedel did below that you must have Office installed on the system on which you want to do the conversion.
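For the "unzip the DOCX and scan for the files you need" alternative, a small sketch that just lists the XML parts inside the package, so you can see what there is to scan. The class and method names are invented for illustration; only System.IO.Compression is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class PackageInspector
{
    // Returns the names of the XML parts inside a zip-based Office package (.docx, .xlsx, .pptx).
    public static List<string> ListXmlParts(Stream packageStream)
    {
        using (var archive = new ZipArchive(packageStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(e => e.FullName)
                .Where(name => name.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();
        }
    }
}
```

For a Word document, the body text lives in the word/document.xml part, which is the entry the zip-based answer in this thread reads.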

Heelandtoe answered 18/6, 2009 at 7:38 Comment(4)

Only sad part with the Office interop library is that you need to have Office installed. – Steelyard 18/6, 2009 at 7:40

Interop is usable, but should be avoided if possible. – Hying 21/10, 2011 at 3:16

Microsoft Word 12.0 Object Library --> This is not in my Add Reference list on the Add Reference right click. Is there another way that Microsoft Word 12.0 Object Library has to be entered so that I can read in a word document. – Revelry 19/12, 2013 at 21:50

Interop does not work on GoDaddy hosting. GoDaddy does not support Office. – Savanna 17/6, 2016 at 4:57

S

1

I wrote a docx text extractor once, and it was very simple. Basically docx, and the other (new) formats I presume, is a zip file containing a bunch of XML files. The text can be extracted using an XmlReader and only .NET classes.

I don't have the code anymore, it seems :(, but I found a guy who has a similar solution.

Maybe this isn't viable for you if you need to read .doc and .xls files though, since they are binary formats and probably much harder to parse.

There is also the OpenXML SDK, still in CTP though, released by Microsoft.

Steelyard answered 18/6, 2009 at 7:25 Comment(3)

This is really great! I am done with docx, but what about the rest? – Mabellemable 18/6, 2009 at 9:22

You can "connect" to an xlsx file as if it were a database with ODBC, I think. A quite cumbersome solution, I think. I have no idea how to read .doc or .xls files, so I can't help you there. Here is a reference for .xls files though: sc.openoffice.org/excelfileformat.pdf – Steelyard 18/6, 2009 at 10:32

I couldn't find anything better on XLSX than the specification itself sadly: ecma-international.org/publications/files/ECMA-ST/… – Steelyard 18/6, 2009 at 10:37
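As for "the rest": the same zip-plus-XmlReader idea should carry over to .pptx and .xlsx. Below is a rough sketch under some assumptions: slide text is taken from a:t elements in ppt/slides/slideN.xml, and Excel strings from t elements in xl/sharedStrings.xml. Numbers and inline strings stored in the sheet XML are not covered, slides are ordered by file name rather than presentation order, and all class and method names are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml;

public static class OoxmlTextSketch
{
    const string DrawingNs = "http://schemas.openxmlformats.org/drawingml/2006/main";
    const string SheetNs = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Collects the text content of every matching element in one zip entry.
    static List<string> ReadTextElements(ZipArchiveEntry entry, string localName, string ns)
    {
        var result = new List<string>();
        using (Stream s = entry.Open())
        using (XmlReader reader = XmlReader.Create(s))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == localName && reader.NamespaceURI == ns)
                {
                    // Already advances past the end tag, so do not call Read() again here.
                    result.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return result;
    }

    // PowerPoint: one output line per a:t text run, slide by slide.
    public static string ExtractPptx(Stream pptx)
    {
        using (var archive = new ZipArchive(pptx, ZipArchiveMode.Read, leaveOpen: true))
        {
            var runs = archive.Entries
                .Where(e => e.FullName.StartsWith("ppt/slides/slide", StringComparison.OrdinalIgnoreCase)
                         && e.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(e => e.FullName, StringComparer.Ordinal)
                .SelectMany(e => ReadTextElements(e, "t", DrawingNs));
            return string.Join(Environment.NewLine, runs);
        }
    }

    // Excel: one output line per string in the shared string table.
    public static string ExtractXlsxSharedStrings(Stream xlsx)
    {
        using (var archive = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("xl/sharedStrings.xml");
            if (entry == null) return string.Empty; // workbook without shared strings
            return string.Join(Environment.NewLine, ReadTextElements(entry, "t", SheetNs));
        }
    }
}
```

This does nothing for the old binary .doc, .xls and .ppt formats, which are not zip packages.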

F

0

If you're looking for asp.net options, the interop won't work unless you install office on the server. Even then, Microsoft says not to do it.

I used Spire.Doc, and it worked beautifully (Spire.Doc download). It even read documents that were really .txt but were saved as .doc. They have free and paid versions. You can also get a trial license that removes some warnings from documents that you create, but I didn't create any, just searched them, so the free version worked like a charm.

Faena answered 23/6, 2017 at 16:51 Comment(1)

Erik Felde, could you give an example of using Spire.Doc with ASP.NET? – Ambroid 25/9, 2018 at 3:52

S

0

One of the suitable options for extracting text from Office documents in C# is GroupDocs.Parser for .NET API. The following are the code samples for extracting simple as well as formatted text.

Extracting Text

// Create an instance of Parser class
using(Parser parser = new Parser("sample.docx"))
{
    // Extract a text into the reader
    using(TextReader reader = parser.GetText())
    {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}

Extracting Formatted Text

// Create an instance of Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract a formatted text into the reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print a formatted text from the document
        // If formatted text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Formatted text extraction isn't supported" : reader.ReadToEnd());
    }
}

Disclosure: I work as Developer Evangelist at GroupDocs.

Sialkot answered 9/10, 2019 at 10:18 Comment(0)
