Convert between Markdown elements

J

5

10

What are the options for parsing a Markdown document and processing its elements to output another Markdown document?

Let's say that this input

```
# unaffected #
```

# H1 #

H1
==

## H2 ##

H2
--

### H3 ###

should be converted to

```
# unaffected #
```

## H1 ##

H1
--

### H2 ###

### H2 ###

#### H3 ####

in a Node environment. The target element may vary (e.g. #### may be converted to **).

The document may contain other markup elements that should remain unaffected.

How can this be achieved? Obviously not with regexps (using a regexp instead of a full-blown lexer will affect # unaffected #). I was hoping to use marked, but it seems to be capable only of HTML output, not Markdown.
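To make the concern about plain regexps concrete, here is a minimal sketch (the sample string and the naive pattern are illustrative assumptions, not taken from any library). It shows how a line-anchored replace also rewrites the heading-like line inside the fenced block:

```javascript
// Sample input: a fenced code block followed by a real heading.
var source = ['```', '# unaffected #', '```', '', '# H1 #'].join('\n');

// Naive approach: add one '#' to every line that starts with '#'.
var naive = source.replace(/^#/gm, '##');

// The second line sits inside the fence, yet it was changed as well.
console.log(naive.split('\n')[1]);
```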

Janerich answered 7/2, 2016 at 19:54 Comment(2)

You could use a Markdown parser such as pandoc with a custom filter like this one: gist.github.com/scoavoux/27e4ca56b4101a88313a (although it turns == into ## rather than --). There seem to be pandoc wrappers for node.js like this, but I don't know much about node.js or how convenient that solution would be – Blackguardly 10/2, 2016 at 17:39

@Blackguardly Thanks for pointing at Pandoc. I'm not too fond of using external dependencies in a Node environment; it complicates the setup. But this is the solution if there are no ready-to-use alternatives in the JS domain (still not sure about this). I will take a look at this option later, feel free to post this as an answer. – Janerich 11/2, 2016 at 10:39

W

3

Have you considered using HTML as an intermediate format? Once in HTML, the differences between the header types will be indistinguishable, so the Markdown -> HTML conversion will effectively normalize them for you. There are markdown -> HTML converters aplenty, and also a number of HTML -> markdown.

I put together an example using these two packages:

https://www.npmjs.com/package/markdown-it for Markdown -> HTML
https://www.npmjs.com/package/h2m for HTML -> Markdown

I don't know if you have any performance requirements here (read: this is slow...) but this is a very low investment solution. Take a look:

var md = require('markdown-it')(),
    h2m = require('h2m');

var mdContent = `
\`\`\`
# unaffected #
\`\`\`

# H1 #

H1
==

## H2 ##

H2
--

### H3 ###
`;

var htmlContent = md.render(mdContent);
var newMdContent = h2m(htmlContent, {converter: 'MarkdownExtra'});
console.log(newMdContent);

You may have to play with a mix of components to get the correct dialect support and whatnot. I tried a bunch and couldn't quite match your output. I think perhaps the -- is being interpreted differently? Here's the output, I'll let you decide if it is good enough:

```
# unaffected #

```

# H1 #

# H1 #

## H2 ##

## H2 ##

### H3 ###

Wivinah answered 13/2, 2016 at 5:31 Comment(2)

Thank you. I don't see slow speed as an issue; I'm OK with the html<->md round trip. Generally the output is fine. However, ```ruby fenced code highlighting is lost during conversion; I guess that can be handled with the 'overides' feature. – Janerich 16/2, 2016 at 13:54

You may be able to find an importer that supports syntax highlighting, but I doubt it. There is no standard way to render the HTML in this case. It is not a feature in markdown, though several dialects have added it. The overrides look promising, possibly a bit tricky though. – Wivinah 16/2, 2016 at 16:8

B

5

Here is a solution with an external Markdown parser, pandoc. It allows custom filters in Haskell or Python to modify the input (there is also a node.js port). Here is a Python filter that increases every header by one level. Let's save it as header_increase.py.

from pandocfilters import toJSONFilter, Header

def header_increase(key, value, format, meta):
    # value is [level, attributes, inline content]. There is no level 7,
    # so level-6 headers are left unchanged (returning None keeps the node).
    if key == 'Header' and value[0] < 6:
        return Header(value[0] + 1, value[1], value[2])

if __name__ == "__main__":
    toJSONFilter(header_increase)

It will not affect the code block. However, it might convert setext-style headers for h1 and h2 elements (using === or ---) into ATX-style headers (using #), and vice versa.

To use the script, one could call pandoc from the command line:

pandoc input.md --filter header_increase.py -o output.md -t markdown

With node.js, you could use pdc to call pandoc.

var pdc = require('pdc');
pdc(input_md, 'markdown', 'markdown', [ '--filter', './header_increase.py' ], function(err, result) {
  if (err)
    throw err;

  console.log(result);
});
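For those who would rather not write the filter in Python, the same transformation can be sketched in plain JavaScript. This is only a sketch under assumptions: it relies on pandoc's JSON AST encoding a heading as `{ t: 'Header', c: [level, attr, inlines] }`, and `bumpHeaders` is a hypothetical helper name, not part of pandoc or pdc.

```javascript
// Recursively visit every node of a pandoc-style JSON AST and deepen
// each Header by one level, leaving level-6 headers untouched.
function bumpHeaders(node) {
  if (Array.isArray(node)) {
    node.forEach(bumpHeaders);
  } else if (node !== null && typeof node === 'object') {
    if (node.t === 'Header' && Array.isArray(node.c) && node.c[0] < 6) {
      node.c[0] += 1;
    }
    Object.keys(node).forEach(function (key) {
      bumpHeaders(node[key]);
    });
  }
  return node;
}

// Hand-written sample document (assumed shape, for illustration only).
var sampleDoc = {
  blocks: [
    { t: 'CodeBlock', c: [['', [], []], '# unaffected #'] },
    { t: 'Header', c: [1, ['h1', [], []], [{ t: 'Str', c: 'H1' }]] },
    { t: 'Header', c: [6, ['deep', [], []], [{ t: 'Str', c: 'Deep' }]] }
  ]
};

console.log(JSON.stringify(bumpHeaders(sampleDoc).blocks[1]));
```

Wired up as a real filter, the script would read the JSON document from stdin and write the modified JSON to stdout, since that is how pandoc communicates with `--filter` programs.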

Blackguardly answered 11/2, 2016 at 14:6 Comment(0)

T

2

Despite its apparent simplicity, Markdown is actually somewhat complicated to parse. Each part builds upon the next, such that to cover all edge cases you need a complete parser even if you only want to process a portion of a document.

For example, various types of block level elements can be nested inside other block level elements (lists, blockquotes, etc). Most implementations rely on a very specific order of events within the parser to ensure that the entire document is parsed correctly. If you remove one of the earlier pieces, many of the later pieces will break. For example, Markdown markup inside code blocks is not parsed as Markdown because one of the first steps is to find and identify the code blocks so that later steps in the parsing never see the code blocks.

Therefore, to accomplish your goal and cover all possible edge cases, you need a complete Markdown parser. However, as you do not want to output HTML, your options are somewhat limited and you will need to do some work to get a working solution.

There are basically three styles of Markdown parsers (I'm generalizing here):

1. Use regex string substitution to swap out the Markdown markup for HTML markup within the source document.
2. Use a renderer which gets called by the parser (at each step) as it parses the document, outputting a new document.
3. Generate a tree object or list of tokens (specifics vary by implementation), which is rendered (converted to a string) to a new document in a later step.

The original reference implementation (markdown.pl) is of the first type and probably useless to you. I simply mention it for completeness.

Marked is of the second variety and while it could be used, you would need to write your own renderer and have the renderer modify the document at the same time as you render it. While generally a performant solution, it is not always the best method when you need to modify the document, especially if you need context from elsewhere within the document. However, you should be able to make it work.

For example, to adapt an example in the docs, you might do something like this (multiplyString borrowed from here):

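// Assumes marked has already been loaded, e.g. var marked = require('marked');
var renderer = new marked.Renderer();

// Repeat `str` `num` times (returns "" when num is 0).
function multiplyString (str, num) {
    return num ? Array(num + 1).join(str) : "";
}

// Emit an ATX heading one level deeper than the source heading.
renderer.heading = function (text, level) {
    return multiplyString("#", level + 1) + " " + text + "\n\n";
};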

Of course, you will also need to create renderers for all of the other block level renderer methods and inline level renderer methods which output Markdown syntax. See my comments below regarding renderers in general.

Markdown-JS is of the third variety (as it turns out Marked also provides a lower level API with access to the tokens so it could be used this way as well). As stated in its README:

Intermediate Representation

Internally the process to convert a chunk of Markdown into a chunk of HTML has three steps:

1. Parse the Markdown into a JsonML tree. Any references found in the parsing are stored in the attribute hash of the root node under the key references.

2. Convert the Markdown tree into an HTML tree. Rename any nodes that need it (bulletlist to ul for example) and lookup any references used by links or images. Remove the references attribute once done.

3. Stringify the HTML tree being careful not to wreck whitespace where whitespace is important (surrounding inline elements for example).

Each step of this process can be called individually if you need to do some processing or modification of the data at an intermediate stage.

You could take the tree object in either step 1 or step 2 and make your modifications. However, I would recommend step 1, as the JsonML tree will more closely match the actual Markdown document, whereas the HTML tree in step 2 is a representation of the HTML to be output. Note that the HTML will lose some information regarding the original Markdown in any implementation. For example, were asterisks or underscores used for emphasis (*foo* vs. _foo_), or was an asterisk, dash (hyphen) or plus sign used as a list bullet? I'm not sure how much detail the JsonML tree holds (haven't used it personally), but it should certainly be more than the HTML tree in step 2.

Once you have made your modifications to the JsonML tree (perhaps using one of the tools listed here), you will probably want to skip step 2 and implement your own step 3 which renders (stringifies) the JsonML tree back to a Markdown document.
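As a rough illustration of modifying the tree in step 1, here is a minimal sketch. It assumes that heading nodes in the JsonML tree look like `['header', { level: n }, ...children]`, which is my understanding of Markdown-JS but not something verified here; `bumpJsonMLHeaders` and the sample tree are hypothetical.

```javascript
// Walk a JsonML tree and deepen every header node by one level (max 6).
// Assumed node shape: [tagName, optionalAttributesObject, ...children].
function bumpJsonMLHeaders(node) {
  if (!Array.isArray(node)) {
    return node; // plain text leaf, nothing to do
  }
  var attrs = node[1];
  var hasAttrs = attrs !== null && typeof attrs === 'object' && !Array.isArray(attrs);
  if (node[0] === 'header' && hasAttrs && attrs.level < 6) {
    attrs.level += 1;
  }
  // Children start after the attributes object when one is present.
  for (var i = hasAttrs ? 2 : 1; i < node.length; i++) {
    bumpJsonMLHeaders(node[i]);
  }
  return node;
}

// Hand-written sample tree (assumed shape, for illustration only).
var sampleTree = ['markdown',
  ['code_block', '# unaffected #'],
  ['header', { level: 1 }, 'H1'],
  ['blockquote', ['header', { level: 2 }, 'H2']]
];

console.log(JSON.stringify(bumpJsonMLHeaders(sampleTree)));
```

The text of the code block is never inspected, so it passes through unchanged. Turning the modified tree back into Markdown is still the separate rendering problem.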

And therein lies the hard part. It is very rare for Markdown parsers to output Markdown. In fact, it is very rare for Markdown parsers to output anything except HTML. The most popular exception is Pandoc, which is a document converter for many input and output formats. But if you want to stay with a JavaScript solution, any library you choose will require you to write your own renderer which outputs Markdown (unless a search turns up a renderer built by some third party). Of course, once you do, if you make it available, others could benefit from it in the future. Unfortunately, building a Markdown renderer is beyond the scope of this answer.

One possible shortcut when building a renderer is that if the Markdown lib you use happens to store the position information in its list of tokens (or in some other way gives you access to the original raw Markdown on a per element basis), you could use that info in the renderer to simply copy and output the original Markdown text, except when you need to alter it. For example, the markdown-it lib offers that data on the Token.map and/or Token.markup properties. You still need to create your own renderer, but it should be easier to get the Markdown to look more like the original.
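The shortcut can be sketched as follows. This is an illustration under assumptions, not a tested integration: it expects tokens shaped like markdown-it's (`type`, `tag`, `markup`, `map`, `level`), handles only top-level headings, and `shiftHeadings` is a hypothetical helper. The tokens are passed in as an argument so the sketch itself has no dependencies.

```javascript
// Copy the source line by line and rewrite only the lines that the
// token list marks as headings. Everything else is left untouched.
function shiftHeadings(src, tokens) {
  var lines = src.split('\n');
  tokens.forEach(function (token) {
    // Only top-level headings with position info are handled here.
    if (token.type !== 'heading_open' || !token.map || token.level !== 0) return;
    if (Number(token.tag.slice(1)) >= 6) return; // cannot go deeper than h6
    var first = token.map[0];
    var underline = token.map[1] - 1; // last line of a setext heading
    if (token.markup === '=') {
      // Setext h1 becomes setext h2: swap the underline characters.
      lines[underline] = lines[underline].replace(/=/g, '-');
    } else if (token.markup === '-') {
      // Setext h2 has no setext h3 equivalent, so emit an ATX heading.
      var text = lines.slice(first, underline)
        .map(function (l) { return l.trim(); }).join(' ');
      lines[first] = '### ' + text + ' ###';
      for (var i = first + 1; i <= underline; i++) lines[i] = null;
    } else {
      // ATX heading: add one '#' to the opening and closing sequences.
      lines[first] = lines[first]
        .replace(/^( {0,3})(#{1,6})/, '$1#$2')
        .replace(/(\s)(#+)(\s*)$/, '$1#$2$3');
    }
  });
  return lines.filter(function (l) { return l !== null; }).join('\n');
}

// Hand-written tokens standing in for real parser output.
var demoSrc = ['```', '# unaffected #', '```', '', '# H1 #', '', 'H1', '=='].join('\n');
var demoTokens = [
  { type: 'heading_open', tag: 'h1', markup: '#', map: [4, 5], level: 0 },
  { type: 'heading_open', tag: 'h1', markup: '=', map: [6, 8], level: 0 }
];
console.log(shiftHeadings(demoSrc, demoTokens));
```

With markdown-it installed, the token list would presumably come from `require('markdown-it')().parse(src, {})`. Headings nested inside lists or blockquotes are skipped here because their source lines carry extra prefixes.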

Finally, I have not personally used, nor am I recommending any of the specific Markdown parsers mentioned above. They are simply popular examples of the various types of parsers to demonstrate how you could create a solution. You may find a different implementation which fits your needs better. A lengthy, albeit incomplete, list is here.

Tantalizing answered 11/2, 2016 at 17:2 Comment(0)

L

1

You must use regexps. marked itself uses regexps for parsing the document. Why don't you?

These are some of the regexps you need, from the marked.js source code on GitHub:

var block = {
  newline: /^\n+/,
  code: /^( {4}[^\n]+\n*)+/,
  fences: noop,
  hr: /^( *[-*_]){3,} *(?:\n+|$)/,
  heading: /^ *(#{1,6}) *([^\n]+?) *#* *(?:\n+|$)/,
  nptable: noop,
  lheading: /^([^\n]+)\n *(=|-){2,} *(?:\n+|$)/,
  blockquote: /^( *>[^\n]+(\n(?!def)[^\n]+)*\n*)+/,
  list: /^( *)(bull) [\s\S]+?(?:hr|def|\n{2,}(?! )(?!\1bull )\n*|\s*$)/,
  html: /^ *(?:comment *(?:\n|\s*$)|closed *(?:\n{2,}|\s*$)|closing *(?:\n{2,}|\s*$))/,
  def: /^ *\[([^\]]+)\]: *<?([^\s>]+)>?(?: +["(]([^\n]+)[")])? *(?:\n+|$)/,
  table: noop,
  paragraph: /^((?:[^\n]+\n?(?!hr|heading|lheading|blockquote|tag|def))+)\n*/,
  text: /^[^\n]+/
};

If you really, really don't want to use regexps, you can fork marked and override the Renderer object.

Marked on GitHub is split into two components: one for parsing and one for rendering. You can easily replace the renderer (compiler) with your own.

Example of one function in Render.js:

Renderer.prototype.blockquote = function(quote) {
  return '<blockquote>\n' + quote + '</blockquote>\n';
};

Lannielanning answered 7/2, 2016 at 21:37 Comment(2)

The question shows why the lexer should be used and not pure regexps. ``` # unaffected ``` code will be parsed by simple replace, while it definitely shouldn't. I've browsed through marked source code and still have very little idea on how to render it back to Markdown without breaking anything, especially after it was parsed by lexer like that. – Janerich 7/2, 2016 at 22:36

the OP could just "strip out" the parts of text between code blocks and then insert them in place later after the markdown parser has done its job – Haerle 10/2, 2016 at 17:55

I

0

Maybe this is an incomplete answer. Copy the unaffected parts into another file.

Then replace all

#space with ##space
space# with space##
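A minimal sketch of this idea, assuming the "unaffected" parts are fenced code blocks only: instead of copying them to another file, the loop below tracks whether it is inside a fence and skips those lines. The function name is hypothetical, and setext headings, indented code blocks and nested fences are deliberately not handled.

```javascript
// Add one '#' to ATX headings, but leave lines inside fenced code alone.
function bumpAtxOutsideFences(src) {
  var inFence = false;
  return src.split('\n').map(function (line) {
    // Simplification: any line starting with ``` or ~~~ toggles the fence state.
    if (/^ {0,3}(```|~~~)/.test(line)) {
      inFence = !inFence;
      return line;
    }
    if (inFence) return line;
    // Opening run of 1-5 '#', the text, and an optional closing run.
    var m = /^(#{1,5}) (.*?)( #+)?$/.exec(line);
    if (!m) return line;
    return '#' + m[1] + ' ' + m[2] + (m[3] ? m[3] + '#' : '');
  }).join('\n');
}

var sample = ['```', '# unaffected #', '```', '', '# H1 #', '', '### H3 ###'].join('\n');
console.log(bumpAtxOutsideFences(sample));
```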

Implicate answered 17/2, 2016 at 16:10 Comment(0)
