How can I convert a Markdown string to a DocX in Python?

I am getting markdown text from my API like this:

{
    "name": "Onur",
    "surname": "Gule",
    "biography": "## Computers\nI like **computers** so much.\nI wanna *be* a computer.",
    "membership": 1
}

The biography field contains a Markdown string, which looks like this when printed:

## Computers
I like **computers** so much.
I wanna *be* a computer.

I want to take this Markdown text and convert it to docx content for my reports.

In my docx template:

{{markdownText|mark2html}}

{{simpleText}}

I am using the python3 docxtpl package to create the docx, and it works for simple text.

  • I tried BeautifulSoup to convert the Markdown to docx text, but it doesn't preserve styles (bold, italic, etc.).
  • I tried pandoc and it worked, but it only creates a new docx file; I want to add the rendered Markdown text to an existing docx while it is being created.

My current code:

import docx
from docxtpl import DocxTemplate, RichText
import markdown
import jinja2
import markupsafe
from bs4 import BeautifulSoup
import pypandoc

def safe_markdown(text):
    return markupsafe.Markup(markdown.markdown(text))

def mark2html(value):
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    output = pypandoc.convert_text(value,'rtf',format='md')
    return RichText(value) #tried soup and pandoc..

def from_template(template):
    template = DocxTemplate(template)
    context = {
        'simpleText':'Simple text test.',
        'markdownText':'Markdown **text** test.'
    } 
    jenv = jinja2.Environment()
    jenv.filters['markdown'] = safe_markdown
    jenv.filters["mark2html"] = mark2html
    template.render(context,jenv)
    template.save('new_report.docx')

So, how can I add rendered Markdown to an existing docx, or add it while the docx is being created, maybe with a jinja2 filter?

Lemniscate answered 15/12, 2021 at 11:54 Comment(1)
github.com/nihole/md2docx pandoc.org/demos.html - Hibben

I solved it without any shortcut. I convert the Markdown to HTML with the markdown package, parse the HTML with BeautifulSoup, and then process every paragraph by checking its tag name.

In my word template:

{% if markdownText != None %}
    {% for mt in markdownText|mark2html %} 
        {{mt}}
    {% endfor %}
{% endif %}

My template tag:

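import markdown
from bs4 import BeautifulSoup

def mark2html(value):
    """Jinja2 filter: turn a Markdown string into a list of docxtpl items."""
    if value is None:
        return '-'
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    paragraphs = []
    # Visit every tag, but only convert paragraphs and headings.
    for tag in soup.find_all(True):
        if tag.name in ('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
            paragraphs.extend(parseHtmlToDoc(tag))
    return paragraphs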

My code to insert docx:

from bs4.element import Tag
from django.conf import settings  # assumption: settings.MEDIA_ROOT comes from Django
from docxtpl import InlineImage, RichText

def parseHtmlToDoc(org_tag):
    """Convert the children of one <p>/<hN> tag into RichText / InlineImage items."""
    pars = []
    for con in org_tag.contents:
        # Reuse the previous RichText, if there is one, so adjacent runs stay together.
        last = pars[-1] if pars and isinstance(pars[-1], RichText) else None
        if isinstance(con, Tag):
            if con.name in ('strong', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                if last is not None:
                    last.add(con.get_text(), bold=True)
                else:
                    source = RichText("")
                    source.add(con.get_text(), bold=True)
                    pars.append(source)
            elif con.name == 'img':
                # `doc` is the module-level DocxTemplate that is being rendered.
                pars.append(InlineImage(doc, settings.MEDIA_ROOT + con['src']))
            elif con.name == 'em':
                # Italic text always starts a new RichText item.
                source = RichText("")
                source.add(con.get_text(), italic=True)
                pars.append(source)
        else:
            # Plain text node.
            if last is not None:
                last.add(str(con))
            else:
                source = RichText("")
                if org_tag.name == 'h2':
                    source.add(str(con), bold=True, size=40)
                else:
                    source.add(str(con))
                pars.append(source)  # always append?
    return pars

It processes the HTML tags that the Markdown produces, such as strong, em, img and the headers. You can add more tags to process. Solving it this way needs no additional file conversion such as html2docx.
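As a rough sketch of how more tags could be handled, the same idea can be written with only the standard library's html.parser. This is not the code I use above: RunCollector and html_to_runs are hypothetical names, and the sketch only flattens each paragraph or header into (text, bold, italic) runs, so that nested tags such as strong inside em keep both styles.

```python
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Flatten HTML into (block_tag, [(text, bold, italic), ...]) entries."""

    BLOCK = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}
    BOLD = {"strong", "b"}
    ITALIC = {"em", "i"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._open = False   # True while inside a block-level tag
        self._bold = 0       # nesting depth of bold tags
        self._italic = 0     # nesting depth of italic tags

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK:
            self.paragraphs.append((tag, []))
            self._open = True
        elif tag in self.BOLD:
            self._bold += 1
        elif tag in self.ITALIC:
            self._italic += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK:
            self._open = False
        elif tag in self.BOLD and self._bold:
            self._bold -= 1
        elif tag in self.ITALIC and self._italic:
            self._italic -= 1

    def handle_data(self, data):
        # Text outside a block (e.g. newlines between paragraphs) is ignored.
        if self._open:
            run = (data, self._bold > 0, self._italic > 0)
            self.paragraphs[-1][1].append(run)


def html_to_runs(html):
    """Return the list of block entries found in an HTML fragment."""
    collector = RunCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


if __name__ == "__main__":
    sample = "<h2>Computers</h2>\n<p>I like <strong>computers</strong> so much.</p>"
    for block_tag, runs in html_to_runs(sample):
        print(block_tag, runs)
```

Each run could then be passed to RichText.add(text, bold=bold, italic=italic), in the same way as in parseHtmlToDoc. Supporting another inline tag would mean adding it to BOLD or ITALIC, or adding a new counter for a new style.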

I used this process in my code like this:

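import jinja2
from docxtpl import DocxTemplate

# report_variables holds the values used in the template, including the Markdown text.
report_context = {'reportVariables': report_variables}
template = DocxTemplate('report_format.docx')
doc = template  # parseHtmlToDoc reads the global `doc` when it builds an InlineImage
jenv = jinja2.Environment()
jenv.filters["mark2html"] = mark2html
template.render(report_context, jenv)
template.save('exported_1.docx')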
Lemniscate answered 19/12, 2021 at 14:33 Comment(2)
How do I use these code snippets? They don't contain a main entry point. - Overhang
@Overhang I added a code block that shows how I use it. - Lemniscate

I followed a lazy strategy that is not the most efficient, yet useful. Since docx is less flexible to work with than HTML, I render an HTML template first, converting the Markdown to HTML inside it, and then convert the HTML file to docx with pandoc:

from jinja2 import FileSystemLoader, Environment
from pypandoc import convert_file, convert_text

def md2html(md):
  return convert_text(md, 'html', format='md')

def html2docx(file):
  return convert_file(f'{file}.html', 'docx', format='html', outputfile=f'{file}.docx')

def from_template(template_file, f_out):
  context = {
      'simpleText': 'Simple text test.',
      'markdownText': 'Markdown **text** test.'
  }
  ldr = FileSystemLoader(searchpath='./')
  jenv = Environment(loader=ldr)
  jenv.filters["md2html"] = md2html
  template = jenv.get_template(template_file)
  html = template.render(context)
  print(html)
  with open(f'{f_out}.html', 'w') as fout:
    fout.write(html)  # the with block closes the file on exit
  html2docx(f_out)

if __name__ == '__main__':
  from_template('template.html.jinja', 'new_report')

As for the template, it should be an HTML-based one like this:

<!DOCTYPE html>
<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
  <head></head>
  <body>
    {{markdownText|md2html}}
    {{simpleText}}
  </body>
</html>

I saved it as template.html.jinja.

I was tempted to look into the contribution by @Mahrkeenerh, but the API referred to there looks like quite a project to learn and understand.

Sattler answered 19/12, 2021 at 0:23 Comment(2)
That's nice, but my template is a docx. I am using a similar method: I convert to HTML and write it into the docx line by line, processing the HTML tags. - Lemniscate
You could save your docx template as HTML directly from Word, then edit the HTML and insert the Jinja fields described in the documentation (jinja.palletsprojects.com/en/3.0.x/templates/): {% ... %} for statements, {{ ... }} for expressions to print to the template output, {# ... #} for comments not included in the template output, and # ... ## for line statements. You can write conspicuous markers such as REPLACE_THIS_LATER in the Word file before exporting, so that you can find them quickly in a text editor after the export. You do not have to do it line by line! - Sattler
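The marker idea from the last comment could be sketched roughly as below, using only the standard library. markers_to_jinja and the marker names REPLACE_MARKDOWN and REPLACE_SIMPLE are hypothetical; the sketch also assumes that Word exports each marker as one unbroken piece of text, which may not hold if Word splits it across several spans.

```python
import re


def markers_to_jinja(html, fields):
    """Replace each marker word in html with the Jinja expression mapped to it."""
    if not fields:
        return html
    # Try longer markers first so one marker that is a prefix of another is safe.
    ordered = sorted(fields, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(marker) for marker in ordered))
    return pattern.sub(lambda m: "{{" + fields[m.group(0)] + "}}", html)


if __name__ == "__main__":
    exported = "<body><p>REPLACE_MARKDOWN</p><p>REPLACE_SIMPLE</p></body>"
    fields = {
        "REPLACE_MARKDOWN": "markdownText|md2html",
        "REPLACE_SIMPLE": "simpleText",
    }
    print(markers_to_jinja(exported, fields))
```

The result can be saved as the HTML template (template.html.jinja in the answer above) and rendered in the same way.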
