I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI, but I couldn't find a sample showing how to use it.
Using P/Invoke, you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can simply ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).
For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.
See: http://msdn.microsoft.com/en-us/library/bb448854.aspx
using System;
using System.Text;
using System.Xml;
using DocumentFormat.OpenXml.Packaging;
using Microsoft.SharePoint; // SPFile

public static string TextFromWord(SPFile file)
{
    const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    StringBuilder textBuilder = new StringBuilder();
    using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
    {
        // Manage namespaces to perform XPath queries.
        NameTable nt = new NameTable();
        XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
        nsManager.AddNamespace("w", wordmlNamespace);
        // Get the main document part from the package and
        // load its XML into an XmlDocument instance.
        XmlDocument xdoc = new XmlDocument(nt);
        xdoc.Load(wdDoc.MainDocumentPart.GetStream());
        // Every w:p element is a paragraph; its w:t descendants hold the text.
        XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
        foreach (XmlNode paragraphNode in paragraphNodes)
        {
            XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
            foreach (XmlNode textNode in textNodes)
            {
                textBuilder.Append(textNode.InnerText);
            }
            // End each paragraph with a line break.
            textBuilder.Append(Environment.NewLine);
        }
    }
    return textBuilder.ToString();
}
The code needs the DocumentFormat.OpenXml.Packaging and DocumentFormat.OpenXml.Wordprocessing namespaces, and you need to reference WindowsBase.dll for it to work. Other than that, nice solution. – Glum
Tika is very helpful and makes it easy to extract text from different kinds of documents, including Microsoft Office files.
You can use this project, a nice piece of work by Kevin Miller: http://kevm.github.io/tikaondotnet/
Simply add this NuGet package: https://www.nuget.org/packages/TikaOnDotNet/
Then this one line of code will do the magic:
var text = new TikaOnDotNet.TextExtractor().Extract("fileName.docx / pdf / .... ").Text;
Let me slightly correct the answer given by KyleM. I added processing of two extra nodes that influence the result: w:tab, which is emitted as a horizontal tab ("\t"), and w:br, which is emitted as a vertical tab ("\v"). Here is the code:
using System;
using System.IO;
using System.Text;
using System.Xml;
using DocumentFormat.OpenXml.Packaging;

public static string ReadAllTextFromDocx(FileInfo fileInfo)
{
    StringBuilder stringBuilder;
    // Open the document read-only (the original referenced an undefined dataSourceFileInfo here).
    using (WordprocessingDocument wordprocessingDocument = WordprocessingDocument.Open(fileInfo.FullName, false))
    {
        NameTable nameTable = new NameTable();
        XmlNamespaceManager xmlNamespaceManager = new XmlNamespaceManager(nameTable);
        xmlNamespaceManager.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");
        // Read the raw XML of the main document part.
        string wordprocessingDocumentText;
        using (StreamReader streamReader = new StreamReader(wordprocessingDocument.MainDocumentPart.GetStream()))
        {
            wordprocessingDocumentText = streamReader.ReadToEnd();
        }
        stringBuilder = new StringBuilder(wordprocessingDocumentText.Length);
        XmlDocument xmlDocument = new XmlDocument(nameTable);
        xmlDocument.LoadXml(wordprocessingDocumentText);
        XmlNodeList paragraphNodes = xmlDocument.SelectNodes("//w:p", xmlNamespaceManager);
        foreach (XmlNode paragraphNode in paragraphNodes)
        {
            XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t | .//w:tab | .//w:br", xmlNamespaceManager);
            foreach (XmlNode textNode in textNodes)
            {
                // Compare the local name so the check does not depend on the prefix used in the file.
                switch (textNode.LocalName)
                {
                    case "t":
                        stringBuilder.Append(textNode.InnerText);
                        break;
                    case "tab":
                        stringBuilder.Append("\t");
                        break;
                    case "br":
                        stringBuilder.Append("\v");
                        break;
                }
            }
            stringBuilder.Append(Environment.NewLine);
        }
    }
    return stringBuilder.ToString();
}
Use the Microsoft Office Interop. It's free and slick. Here is how I pulled all the words from a doc.
using Microsoft.Office.Interop.Word;
//Create Doc
string docPath = @"C:\docLocation.doc";
Application app = new Application();
Document doc = app.Documents.Open(docPath);
//Get all words
string allWords = doc.Content.Text;
doc.Close();
app.Quit();
Then do whatever you want with the words.
A bit late to the party, but nevertheless: nowadays you don't need to download anything, since everything you need already ships with .NET (just make sure to add references to System.IO.Compression and System.IO.Compression.FileSystem).
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml;
using System.Text;
using System.IO.Compression;

public static class DocxTextExtractor
{
    public static string Extract(string filename)
    {
        XmlNamespaceManager NsMgr = new XmlNamespaceManager(new NameTable());
        NsMgr.AddNamespace("w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main");
        using (var archive = ZipFile.OpenRead(filename))
        {
            return XDocument
                .Load(archive.GetEntry(@"word/document.xml").Open())
                .XPathSelectElements("//w:p", NsMgr)
                .Aggregate(new StringBuilder(), (sb, p) => p
                    .XPathSelectElements(".//w:t|.//w:tab|.//w:br", NsMgr)
                    .Select(e =>
                    {
                        switch (e.Name.LocalName)
                        {
                            case "br": return "\v";
                            case "tab": return "\t";
                        }
                        return e.Value;
                    })
                    .Aggregate(sb, (sb1, v) => sb1.Append(v)))
                .ToString();
        }
    }
}
Number of entries expected in End Of Central Directory does not correspond to number of entries in Central Directory. – Macbeth
ZipFile notion of a zip file (i.e. docx file in this case) being corrupt... – Expiable
Simple!
These two steps will get you there:
1) Use the Office Interop library to convert DOC to DOCX
2) Use DOCX2TXT to extract the text from the new DOCX
The link for 1) has a very good explanation of how to do the conversion and even a code sample.
An alternative to 2) is to just unzip the DOCX file in C# and scan for the files you need. You can read about the structure of the ZIP file here.
Edit: Ah yes, I forgot to point out as Skurmedel did below that you must have Office installed on the system on which you want to do the conversion.
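For the "unzip and scan" alternative, here is a minimal sketch rather than a finished tool. It only assumes that a DOCX is a ZIP package, and the class and method names (DocxPartLister, ListWordXmlParts) are made up for illustration. It lists the XML parts stored under word/, which is where the main body (word/document.xml) normally lives.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class DocxPartLister
{
    // Hypothetical helper: returns the names of all XML parts under "word/" in the package.
    public static List<string> ListWordXmlParts(Stream docxStream)
    {
        // leaveOpen: true so the caller stays responsible for disposing the stream.
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries
                .Select(entry => entry.FullName)
                .Where(name => name.StartsWith("word/", StringComparison.OrdinalIgnoreCase)
                            && name.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();
        }
    }
}
```

You would call it with something like File.OpenRead on your converted file and then open whichever parts you care about with archive.GetEntry(name).Open().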
Interop is usable, but should be avoided if possible. – Hying
I did a docx text extractor once, and it was very simple. Basically docx, and the other (new) formats I presume, is a zip file with a bunch of XML files in it. The text can be extracted using an XmlReader and only .NET classes.
I don't have the code anymore, it seems :(, but I found a guy who has a similar solution.
Maybe this isn't viable for you if you need to read .doc and .xls files though, since they are binary formats and probably much harder to parse.
There is also the OpenXML SDK, still in CTP though, released by Microsoft.
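Since the original code is gone, here is a rough sketch of what the zip-plus-XmlReader approach could look like. It is not the original extractor: the class and method names are invented, it reads only word/document.xml, and it ignores tabs, headers, footers and footnotes.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

public static class DocxXmlReaderSketch
{
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: streams through word/document.xml and collects the text of w:t elements.
    public static string ExtractText(Stream docxStream)
    {
        var text = new StringBuilder();
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this really a .docx file?");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                bool insideText = false; // true while positioned between <w:t> and </w:t>
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.NamespaceURI != W) break;
                            if (reader.LocalName == "t")
                                insideText = !reader.IsEmptyElement;
                            else if (reader.LocalName == "p" && reader.IsEmptyElement)
                                text.Append(Environment.NewLine); // empty paragraph
                            break;
                        case XmlNodeType.Text:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (insideText) text.Append(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            if (reader.NamespaceURI != W) break;
                            if (reader.LocalName == "t")
                                insideText = false;
                            else if (reader.LocalName == "p")
                                text.Append(Environment.NewLine); // end of paragraph
                            break;
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Because XmlReader is forward-only, this never loads the whole document into memory, which is the main reason to pick it over XmlDocument or XDocument for large files.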
If you're looking for ASP.NET options, the Interop won't work unless you install Office on the server. Even then, Microsoft says not to do it.
I used Spire.Doc, and it worked beautifully (Spire.Doc download). It even read documents that were really .txt files but had been saved as .doc. They have free and paid versions. You can also get a trial license that removes some warnings from documents that you create, but I didn't create any; I just searched them, so the free version worked like a charm.
One suitable option for extracting text from Office documents in C# is the GroupDocs.Parser for .NET API. The following code samples extract plain as well as formatted text.
Extracting Text
// Create an instance of Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract a text into the reader
    using (TextReader reader = parser.GetText())
    {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
    }
}
Extracting Formatted Text
// Create an instance of Parser class
using (Parser parser = new Parser("sample.docx"))
{
    // Extract a formatted text into the reader
    using (TextReader reader = parser.GetFormattedText(new FormattedTextOptions(FormattedTextMode.Html)))
    {
        // Print a formatted text from the document
        // If formatted text extraction isn't supported, a reader is null
        Console.WriteLine(reader == null ? "Formatted text extraction isn't supported" : reader.ReadToEnd());
    }
}
Disclosure: I work as Developer Evangelist at GroupDocs.