Add HTML String to OpenXML (*.docx) Document

I am trying to use Microsoft's OpenXML 2.5 library to create an OpenXML document. Everything works great until I try to insert an HTML string into my document. I have scoured the web, and here is what I have come up with so far (snipped to just the portion I am having trouble with):

Paragraph paragraph = new Paragraph();
Run run = new Run();

string altChunkId = "id1";
AlternativeFormatImportPart chunk =
       document.MainDocumentPart.AddAlternativeFormatImportPart(
           AlternativeFormatImportPartType.Html, altChunkId);
chunk.FeedData(new MemoryStream(Encoding.UTF8.GetBytes(ioi.Text)));
AltChunk altChunk = new AltChunk { Id = altChunkId };

run.AppendChild(new Break());

paragraph.AppendChild(run);
body.AppendChild(paragraph);

Obviously, I haven't actually added the altChunk in this example, but I have tried appending it everywhere: to the run, the paragraph, the body, etc. In every case, I am unable to open the docx file in Word 2010.

This is making me a little nutty because it seems like it should be straightforward (I will admit that I'm not fully understanding the AltChunk "thing"). Would appreciate any help.

Side Note: One thing I did find that was interesting, and I don't know if it's actually a problem or not, is this response which says AltChunk corrupts the file when working from a MemoryStream. Can anybody confirm that this is/isn't true?

Lamprey answered 6/8, 2013 at 20:32 Comment(2)
Do you get an error message when you try opening the generated docx file in Word 2010? – Rodrich
I do. I get a "The file [filename] cannot be opened because there are problems with the contents." I look at the contents in the inspector, but I don't see anything obvious with respect to what is actually wrong. – Lamprey

I can reproduce the error "... there is a problem with the content" by using an incomplete HTML document as the content of the alternative format import part. For example, if you use the HTML snippet <h1>HELLO</h1> on its own, MS Word is unable to open the document.

The code below shows how to add an AlternativeFormatImportPart to a Word document. (I've tested the code with MS Word 2013.)

using (WordprocessingDocument doc = WordprocessingDocument.Open(@"test.docx", true))
{
  string altChunkId = "myId";
  MainDocumentPart mainDocPart = doc.MainDocumentPart;

  var run = new Run(new Text("test"));
  var p = new Paragraph(new ParagraphProperties(
       new Justification() { Val = JustificationValues.Center }),
                     run);

  var body = mainDocPart.Document.Body;
  body.Append(p);        

  MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("<html><head></head><body><h1>HELLO</h1></body></html>"));

  // Uncomment the following line to create an invalid word document.
  // MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("<h1>HELLO</h1>"));

  // Create alternative format import part.
  AlternativeFormatImportPart formatImportPart =
     mainDocPart.AddAlternativeFormatImportPart(
        AlternativeFormatImportPartType.Html, altChunkId);
  //ms.Seek(0, SeekOrigin.Begin);

  // Feed HTML data into format import part (chunk).
  formatImportPart.FeedData(ms);
  AltChunk altChunk = new AltChunk();
  altChunk.Id = altChunkId;

  mainDocPart.Document.Body.Append(altChunk);
}

According to the Office OpenXML specification, the valid parent elements for the w:altChunk element are body, comment, docPartBody, endnote, footnote, ftr, hdr and tc. That is why I've added the w:altChunk to the body element.
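
Since hdr and ftr are among the valid parents, the same approach should carry over to headers and footers. The following is only a minimal sketch, assuming the document already has at least one header part and that HeaderPart offers the same AddAlternativeFormatImportPart overload as MainDocumentPart. The class name HeaderHtml, the method name AppendHtmlToFirstHeader and the chunk ID are made up for illustration. Note that the import part is added to the header part, not the main document part: the altChunk ID is a relationship ID, and relationships are resolved relative to the part that contains the element.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class HeaderHtml
{
    // Hypothetical helper: imports a complete HTML document into the first header.
    public static void AppendHtmlToFirstHeader(WordprocessingDocument doc, string html)
    {
        HeaderPart headerPart = doc.MainDocumentPart.HeaderParts.FirstOrDefault();
        if (headerPart == null)
            throw new InvalidOperationException("The document has no header part.");

        const string chunkId = "headerHtml1"; // illustrative ID, must be unique within the header part

        // The relationship must originate from the part that will hold the w:altChunk.
        AlternativeFormatImportPart importPart =
            headerPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.Html, chunkId);

        using (var ms = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            importPart.FeedData(ms);
        }

        headerPart.Header.Append(new AltChunk { Id = chunkId });
        headerPart.Header.Save();
    }
}
```

For a footer, the same pattern would apply with FooterPart and its Footer root element. As with the body, the HTML passed in needs to be a complete document (html, head and body elements).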

For more information on the w:altChunk element see this MSDN link.

EDIT

As pointed out by @user2945722, to make sure that the OpenXml library correctly interprets the byte array as UTF-8, you should add the UTF-8 preamble (BOM). This can be done this way:

MemoryStream ms = new MemoryStream(new UTF8Encoding(true).GetPreamble().Concat(Encoding.UTF8.GetBytes(htmlEncodedString)).ToArray());

This will prevent your é's from being rendered as Ã©'s, your ä's as Ã¤'s, etc.
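
To make the effect of the preamble easier to see, here is a small sketch that uses only the .NET base class library. The class name HtmlBytes and the method name WithUtf8Bom are invented for this example. It returns the UTF-8 bytes of an HTML string prefixed with the three BOM bytes EF BB BF, ready to be wrapped in a MemoryStream and passed to FeedData.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class HtmlBytes
{
    // Hypothetical helper: UTF-8 encode the HTML and prepend the UTF-8 BOM.
    public static byte[] WithUtf8Bom(string html)
    {
        // UTF8Encoding(true) is configured to emit a BOM, so GetPreamble()
        // returns the three bytes EF BB BF.
        byte[] preamble = new UTF8Encoding(true).GetPreamble();

        // GetBytes never writes a BOM by itself, whatever the encoding settings.
        byte[] content = Encoding.UTF8.GetBytes(html);

        return preamble.Concat(content).ToArray();
    }
}
```

Usage would look like new MemoryStream(HtmlBytes.WithUtf8Bom(html)). An alternative worth trying, if you control the HTML, is declaring the charset in a meta tag inside the head element, though only the BOM approach is the one reported to work in this thread.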

Rodrich answered 9/8, 2013 at 17:26 Comment(5)
"...using an incomplete HTML document..." - That's exactly what the problem was. Such a simple thing, yet very non-obvious to me. Thanks for your help. – Lamprey
You should consider adding the UTF-8 BOM to the byte array before passing it to the MemoryStream. This helped in my scenario, where the docx file would not show some UTF-8 characters correctly. Something like this: byte[] utf8Bom = new UTF8Encoding(true).GetPreamble(); and then prepend that to the "GetBytes" result. – Wassyngton
@Wassyngton Thanks! This was the correct answer for my issue. It should be included in the answer. – Spastic
How can I insert HTML inside the header and footer? – Strep
Can anyone explain why the ID on the alt chunk is required? The doc isn't too helpful; it simply describes it as "Relationship to Part." – Politician

Had the same problem here, but a totally different cause. Worth a try if the accepted solution doesn't help. Try closing the file after saving. In my case, it happened to be the difference between a corrupt and a clean docx file. Oddly, most other operations work with only a Save() and program exit.

String cid = "chunkid";
WordprocessingDocument document = WordprocessingDocument.Open("somefile.docx", true);
Body body = document.MainDocumentPart.Document.Body;
MemoryStream ms = new MemoryStream(System.Text.Encoding.UTF8.GetBytes("<html><head></head><body>hi</body></html>"));
AlternativeFormatImportPart formatImportPart = document.MainDocumentPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Html, cid);
formatImportPart.FeedData(ms);
AltChunk altChunk = new AltChunk();
altChunk.Id = cid;
document.MainDocumentPart.Document.Body.Append(altChunk);
document.MainDocumentPart.Document.Save();
// here's the magic!
document.Close();
Buke answered 15/1, 2015 at 22:11 Comment(1)
I was trying to write to a MemoryStream (using WordprocessingDocument.Create instead of WordprocessingDocument.Open), and the "magic" of document.Close() was precisely what I needed to get a clean memory stream if I returned from within the using statement (or didn't use a using statement at all). Returning outside of the using statement did not require this magic. I suspect that the using statement effectively does the same thing as document.Close() when it disposes the object. – Translate
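
The pattern described in this comment can be sketched as follows. This is an illustrative example rather than tested production code: the class name HtmlDocx, the method name Build and the chunk ID are made up, and it assumes the HTML passed in is a complete document. The point is that the bytes are read from the MemoryStream only after the inner using block has disposed the WordprocessingDocument, so the package has been fully written by then.

```csharp
using System;
using System.IO;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class HtmlDocx
{
    // Hypothetical helper: builds a docx in memory whose body imports the given HTML.
    public static byte[] Build(string html)
    {
        using (var output = new MemoryStream())
        {
            using (WordprocessingDocument doc = WordprocessingDocument.Create(
                       output, WordprocessingDocumentType.Document))
            {
                MainDocumentPart main = doc.AddMainDocumentPart();
                main.Document = new Document(new Body());

                const string chunkId = "htmlChunk1"; // illustrative ID
                AlternativeFormatImportPart importPart =
                    main.AddAlternativeFormatImportPart(
                        AlternativeFormatImportPartType.Html, chunkId);

                using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
                {
                    importPart.FeedData(htmlStream);
                }

                main.Document.Body.Append(new AltChunk { Id = chunkId });
                main.Document.Save();
            } // Disposing the document here finishes writing the package to 'output'.

            // Read the bytes only after the document has been disposed.
            return output.ToArray();
        }
    }
}
```

Calling HtmlDocx.Build("<html><head></head><body><h1>HELLO</h1></body></html>") should give a byte array that can be written to disk or returned from a web endpoint. Reading output.ToArray() inside the inner using block instead is the situation where an explicit Close() was reported to be necessary.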
