Generating docx file from HTML file using OpenXML

I'm using this method for generating docx file:

public static void CreateDocument(string documentFileName, string text)
{
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();

        string docXml =
                    @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>
                 <w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
                 <w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>
                 </w:document>";

        docXml = docXml.Replace("#REPLACE#", text);

        using (Stream stream = mainPart.GetStream())
        {
            byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
            stream.Write(buf, 0, buf.Length);
        }
    }
}

It works like a charm:

CreateDocument("test.docx", "Hello");

But what if I want to pass HTML content instead of Hello? For example:

CreateDocument("test.docx", @"<html><head></head>
                              <body>
                                    <h1>Hello</h1>
                              </body>
                        </html>");

Or something like this:

CreateDocument("test.docx", @"Hello<BR>
                                    This is a simple text<BR>
                                    Third paragraph<BR>
                                    Sign
                        ");

Both cases create an invalid structure for document.xml. Any ideas? How can I generate a docx file from HTML content?

Escutcheon answered 11/5, 2016 at 6:20 Comment(0)

You cannot just insert HTML content into "document.xml". That part accepts only WordprocessingML content, so you'll have to convert the HTML into WordprocessingML first, see this.
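
For the simple <BR>-separated text in the question, a minimal sketch of such a conversion could build the WordprocessingML through the Open XML SDK object model instead of string replacement. The class name BrTextToDocx and the regex split on <BR> are illustrative assumptions; this is not a general HTML converter:

```csharp
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class BrTextToDocx
{
    // Hypothetical helper: turns "<BR>"-separated text into one paragraph per segment.
    public static void CreateDocument(string documentFileName, string text)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();
            Body body = new Body();

            // Assumption: the only markup present is <BR> (any case, optional slash).
            string[] segments = Regex.Split(text, @"<br\s*/?>", RegexOptions.IgnoreCase);
            foreach (string segment in segments)
            {
                // Each segment becomes <w:p><w:r><w:t>...</w:t></w:r></w:p>.
                body.AppendChild(new Paragraph(new Run(new Text(segment.Trim()))));
            }

            mainPart.Document = new Document(body);
            mainPart.Document.Save();
        }
    }
}
```

Because the text goes into Text elements rather than raw XML, characters such as < and & are escaped by the SDK when the part is saved.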

Another option is the altChunk element. With it you can place an HTML file inside your DOCX file and then reference that HTML content at a specific place in your document, see this.

Lastly, as an alternative, with the GemBox.Document library you could accomplish exactly what you want, see the following:

public static void CreateDocument(string documentFileName, string text)
{
    DocumentModel document = new DocumentModel();
    document.Content.LoadText(text, LoadOptions.HtmlDefault);
    document.Save(documentFileName);
}

Or you could convert the HTML content directly into a DOCX file:

public static void Convert(string documentFileName, string htmlText)
{
    HtmlLoadOptions options = LoadOptions.HtmlDefault;
    using (var htmlStream = new MemoryStream(options.Encoding.GetBytes(htmlText)))
        DocumentModel.Load(htmlStream, options)
                     .Save(documentFileName);
}
Tremml answered 11/5, 2016 at 6:44 Comment(1)
My post docx4java.org/blog/2014/09/… ends with a couple of other options - Educable

I realize I'm 7 years late to the game here. Still, for future people searching on how to convert from HTML to Word Doc, this blog posting on a Microsoft MSDN site gives most of the ingredients necessary to do this using OpenXML. I found the post itself to be confusing, but the source code that he included clarified it all for me.

The only piece that was missing was how to build a Docx file from scratch, instead of how to merge into an existing one as his example shows. I found that tidbit from here.

Unfortunately the project I used this in is written in vb.net. So I'm going to share the vb.net code first, then an automated c# conversion of it, that may or may not be accurate.

vb.net code:

Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports System.IO

Dim ms As IO.MemoryStream
Dim mainPart As MainDocumentPart
Dim b As Body
Dim d As Document
Dim chunk As AlternativeFormatImportPart
Dim altChunk As AltChunk

Const altChunkID As String = "AltChunkId1"

ms = New MemoryStream()

Using myDoc = WordprocessingDocument.Create(ms,WordprocessingDocumentType.Document)
    mainPart = myDoc.MainDocumentPart

    If mainPart Is Nothing Then
        mainPart = myDoc.AddMainDocumentPart()

        b = New Body()
        d = New Document(b)
        d.Save(mainPart)
    End If

    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID)

    Using chunkStream As Stream = chunk.GetStream(FileMode.Create, FileAccess.Write)
        Using stringStream As StreamWriter = New StreamWriter(chunkStream)
            stringStream.Write("YOUR HTML HERE")
        End Using
    End Using

    altChunk = New AltChunk()
    altChunk.Id = altChunkID
    mainPart.Document.Body.InsertAt(Of AltChunk)(altChunk, 0)
    mainPart.Document.Save()
End Using

c# code:

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;

MemoryStream ms;
MainDocumentPart mainPart;
Body b;
Document d;
AlternativeFormatImportPart chunk;
AltChunk altChunk;

string altChunkID = "AltChunkId1";

ms = new MemoryStream();

using (WordprocessingDocument myDoc = WordprocessingDocument.Create(ms, WordprocessingDocumentType.Document))
{
    mainPart = myDoc.MainDocumentPart;

    // A newly created package has no main part yet, so add an empty document.
    if (mainPart == null)
    {
        mainPart = myDoc.AddMainDocumentPart();
        b = new Body();
        d = new Document(b);
        d.Save(mainPart);
    }

    // Register the HTML as an alternative format import part under altChunkID.
    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID);

    using (Stream chunkStream = chunk.GetStream(FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter stringStream = new StreamWriter(chunkStream))
        {
            stringStream.Write("YOUR HTML HERE");
        }
    }

    // Reference the imported part from the start of the document body.
    altChunk = new AltChunk();
    altChunk.Id = altChunkID;
    mainPart.Document.Body.InsertAt(altChunk, 0);
    mainPart.Document.Save();
}

Note that I'm using the ms memory stream in another routine, which is where it's disposed of after use.

I hope this helps someone else!

Misreport answered 5/7, 2018 at 23:45 Comment(2)
One thing to note here is that the HTML you're inserting needs to be wrapped in an <html></html> tag for it to be rendered as HTML, i.e. stringStream.Write(@"<html><body><h2>Hi There</h2></body></html>") - Orgiastic
Conversion has removed all of the margins for me, so the output is all squished vertically. Anything to fix this? - Marek
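
Following the comment above about wrapping the HTML in an <html> element, one way to guard against bare fragments is a small helper applied before writing to the chunk stream. This is only a sketch: HtmlChunk.Wrap is a hypothetical name, not part of the Open XML SDK, and it only checks whether the input already starts with an <html tag:

```csharp
using System;

public static class HtmlChunk
{
    // Hypothetical helper: wraps a fragment in <html><body>...</body></html>
    // unless the input already starts with an <html> element.
    public static string Wrap(string fragment)
    {
        string trimmed = fragment.Trim();
        if (trimmed.StartsWith("<html", StringComparison.OrdinalIgnoreCase))
        {
            return trimmed;
        }
        return "<html><body>" + trimmed + "</body></html>";
    }
}
```

In the answer's code this would be used as stringStream.Write(HtmlChunk.Wrap("YOUR HTML HERE")).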

I was able to convert HTML content to a docx file using OpenXML in .NET Core with this code:

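using System.IO;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using HtmlToOpenXml; // provides HtmlConverter

string html = "<strong>Hello</strong> World";
using (MemoryStream generatedDocument = new MemoryStream())
{
    using (WordprocessingDocument package =
               WordprocessingDocument.Create(generatedDocument, WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = package.MainDocumentPart;
        if (mainPart == null)
        {
            mainPart = package.AddMainDocumentPart();
            new Document(new Body()).Save(mainPart);
        }

        HtmlConverter converter = new HtmlConverter(mainPart);
        converter.ParseHtml(html);
        mainPart.Document.Save();
    }

    // The package is closed here, so generatedDocument holds the finished .docx.
    // Save or return it inside this block, as shown below.
}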

To save to disk:

System.IO.File.WriteAllBytes("filename.docx", generatedDocument.ToArray());

To return the file for download in ASP.NET Core MVC, use:

return File(generatedDocument.ToArray(), 
          "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
          "filename.docx");
Pennywise answered 17/12, 2020 at 15:8 Comment(3)
What is the namespace of the HtmlConverter class? - Indicate
@AnandMurali It's HtmlToOpenXml.HtmlConverter and it's part of HtmlToOpenXml, available once you install the NuGet package and DLL. - Iluminadailwain
This seems to work for most cases. However, it doesn't appear to handle tables with merged cells very well. It also can't display a UL: all ULs are converted to, or displayed as, ordered lists in the XML output. - Neomineomycin
