How to tokenize markdown using Node.js?
A

3

10

I'm building an iOS app with a view whose content will come from markdown.

My idea is to parse markdown stored in MongoDB into a JSON object that looks something like:

{
    "h1": "This is the heading",
    "p" : "Here's the first paragraph",
    "link": {
        "text": "Text for link",
        "url": "http://exampledomain.com"
    }
}

On the server I am running Node.js, and I was looking at the module marked, which seems to be the most popular one out there. It gives me access to the Lexer, which tokenizes the markdown into a custom object. But when I look at that object, the link has not been tokenized. If I go ahead and parse the markdown to HTML, the link is detected and the HTML looks correct.

After trying a few more modules without success, I thought that maybe I could do this on the client instead, and found MMMarkdown, which seemed promising. But again, it worked fine when parsing directly to HTML, while the intermediate MMDocument that it parses the markdown into did not contain any MMElement of type Link.

So, is there anything fundamental about markdown parsing that I am missing? Is the lexing of inline links supposed to be done in a second pass, or something? I can't get my head around it.

If nothing else works, I might just use a UIWebView filled with the HTML from the parsed markdown, but then we would have to design the whole thing again in CSS, and we are running out of time, so we can't really afford the double work.

Abiotic answered 26/2, 2014 at 12:43 Comment(0)
K
10

Did you look at https://github.com/evilstreak/markdown-js ?

It seems to give you access to the syntax tree.

For example:

var md = require( "markdown" ).markdown,
    text = "Header\n---------------\n\n" +
           "This is a paragraph\n\n" +
           "This is [an example](http://example.com/ \"Title\") inline link.";

// parse the markdown into a syntax tree (nested arrays)
var tree = md.parse( text );

console.log(JSON.stringify(tree));

produces

[
    "markdown",
    [
        "header",
        {
            "level": 2
        },
        "Header"
    ],
    [
        "para",
        "This is a paragraph"
    ],
    [
        "para",
        "This is ",
        [
            "link",
            {
                "href": "http://example.com/",
                "title": "Title"
            },
            "an example"
        ],
        " inline link."
    ]
]
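
As a rough sketch of how such a tree could be consumed (this is not part of markdown-js), the snippet below walks the nested arrays and collects every link in document order. The tree literal is copied by hand from the output above, and `splitNode`, `textOf` and `collectLinks` are hypothetical helper names. It assumes each node has the shape [tag, optional attributes object, ...children], which is what the output above suggests.

```javascript
// Copied by hand from the output above; in real use this would be md.parse(text).
const tree = [
  "markdown",
  ["header", { level: 2 }, "Header"],
  ["para", "This is a paragraph"],
  ["para", "This is ",
    ["link", { href: "http://example.com/", title: "Title" }, "an example"],
    " inline link."]
];

// Assumed node shape: [tag, optionalAttributes, ...children].
function splitNode(node) {
  const [tag, ...rest] = node;
  const first = rest[0];
  const hasAttrs = first !== null && typeof first === "object" && !Array.isArray(first);
  return { tag, attrs: hasAttrs ? first : {}, children: hasAttrs ? rest.slice(1) : rest };
}

// Join the text of a list of children, descending into nested nodes.
function textOf(children) {
  return children
    .map(child => (typeof child === "string" ? child : textOf(splitNode(child).children)))
    .join("");
}

// Collect { text, url } for every "link" node, in document order.
function collectLinks(node, found = []) {
  const { tag, attrs, children } = splitNode(node);
  if (tag === "link") {
    found.push({ text: textOf(children), url: attrs.href });
  }
  children.filter(Array.isArray).forEach(child => collectLinks(child, found));
  return found;
}

console.log(collectLinks(tree));
// → [ { text: 'an example', url: 'http://example.com/' } ]
```

Because the children stay in an array, the link keeps its position between "This is " and " inline link.", which a plain object keyed by tag name could not express.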
Kilovoltampere answered 26/2, 2014 at 13:3 Comment(3)
I actually did, and was not happy with how the tokenized data was formatted. I can't understand why it uses nested arrays instead of objects. Oh well. I guess your answer is good enough, but I think in my case it will be easier to just parse my data manually. I will only have to support h1, h2, p and link, so it's probably easier than trying to reformat the result from markdown-js. - Abiotic
Wow, reading this more than a year later, I realized that they use arrays to keep the tokens in the correct order :) (doh!) - Abiotic
Worth noting that there is a notice on evilstreak/markdown-js that the project is no longer maintained. It hasn't received any commits since March 2019. There are alternatives recommended in the README, though! - Phonetist
B
11

Although this question is already quite a few years old, I wanted to give a little update.

I found the combination of unified and remark-parse a good fit for my situation. After installing those packages (with npm, yarn, pnpm or your favourite JS package manager), I wrote a little test script as follows:

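// CommonJS require works with the package versions that were current in 2018.
// unified 10+ and remark-parse 10+ are ESM-only; there, import { unified } by name.
const unified = require('unified');
const markdown = require('remark-parse');

const tokens = unified()
  .use(markdown)
  .parse('# Hello world');

console.log(tokens);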

This of course generates a syntax tree rather than a flat list of tokens, so it needs further processing.
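
A minimal sketch of that further processing, assuming the tree follows the mdast shape that remark-parse produces (nodes with `type` and `children`, `depth` on headings, `url` on links, `value` on text). The sample tree below is written by hand as a stand-in for the parser output, with position data left out, and `plainText`, `findLinks` and `toBlocks` are hypothetical helper names.

```javascript
// Hand-written stand-in for the result of .parse(); not actual parser output.
const sampleTree = {
  type: "root",
  children: [
    { type: "heading", depth: 1, children: [{ type: "text", value: "Hello world" }] },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "See " },
        {
          type: "link",
          url: "http://exampledomain.com",
          title: null,
          children: [{ type: "text", value: "Text for link" }]
        }
      ]
    }
  ]
};

// Concatenate all text found below a node.
function plainText(node) {
  if (typeof node.value === "string") return node.value;
  return (node.children || []).map(plainText).join("");
}

// Collect { text, url } for every link below a node, in document order.
function findLinks(node, out = []) {
  if (node.type === "link") out.push({ text: plainText(node), url: node.url });
  (node.children || []).forEach(child => findLinks(child, out));
  return out;
}

// Turn the top-level nodes into an ordered list of simple objects.
function toBlocks(root) {
  return root.children.map(node => {
    if (node.type === "heading") return { ["h" + node.depth]: plainText(node) };
    if (node.type === "paragraph") return { p: plainText(node), links: findLinks(node) };
    return { [node.type]: plainText(node) }; // anything else: keep the type as the key
  });
}

console.log(JSON.stringify(toBlocks(sampleTree), null, 2));
```

Returning an array of blocks instead of a single object keeps the document order and allows more than one paragraph or heading.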

Maybe this is useful for someone else who stumbles upon this question.

Bulldozer answered 5/11, 2018 at 15:8 Comment(1)
Thank you, this worked for me after trying a bunch of different libraries. - Hubbs
A
1

Here's the code that I ended up using instead.

// Split on both Unix (\n) and Windows (\r\n) line endings
var nodes = markdownText.split(/\r?\n/);
var content = [];

nodes.forEach(function(node) {

    // Heading 2 (checked before Heading 1, since '##' also starts with '#')
    if (node.indexOf('##') === 0) {
        content.push({
            h2: node.replace('##', '').trim()
        });
    }

    // Heading 1
    else if (node.indexOf('#') === 0) {
        content.push({
            h1: node.replace('#', '').trim()
        });
    }

    // Link (Text + URL)
    else if (node.indexOf('[') === 0) {
        var matches = node.match(/\[(.*)\]\((.*)\)/);
        // A line that starts with '[' but is not a link is kept as a paragraph
        content.push(matches
            ? { link: { text: matches[1], url: matches[2] } }
            : { p: node });
    }

    // Paragraph
    else if (node.length > 0) {
        content.push({
            p: node
        });
    }

});

I know this matching is very unforgiving, but in our case it works fine.
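
One example of where it is unforgiving: the greedy `.*` groups assume a single link per line, so a line with two links would be captured as one. A slightly more tolerant sketch, assuming Node 12+ for `String.prototype.matchAll`, is shown below. `extractLinks` is a hypothetical helper name, and the pattern is still far from a full markdown parser (it would also match image syntax, and it ignores escaped brackets).

```javascript
// Find every [text](url) or [text](url "title") pair in one line.
function extractLinks(line) {
  // [^\]]*   : link text, stops at the first closing bracket
  // [^)\s]*  : URL, stops at whitespace or the closing parenthesis
  // [^)]*    : optional title, skipped
  const pattern = /\[([^\]]*)\]\(([^)\s]*)[^)]*\)/g;
  return Array.from(line.matchAll(pattern), m => ({ text: m[1], url: m[2] }));
}

console.log(extractLinks('[a](http://x.com) and [b](http://y.com)'));
// → [ { text: 'a', url: 'http://x.com' }, { text: 'b', url: 'http://y.com' } ]
```

Links found this way can sit anywhere in the line, so the surrounding text can still be stored as a paragraph next to them.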

Abiotic answered 26/2, 2014 at 19:18 Comment(0)
