Why isn't MarkdownSharp encoding my HTML?

In my mind, one of the bigger goals of Markdown is to prevent the user from typing potentially malformed HTML directly.

Well that isn't exactly working for me in MarkdownSharp.

This example works properly when you have the extra line break immediately after "abc"... (screenshot)

But when that line break isn't there, I think it should still be HtmlEncoded, but that isn't happening here... (screenshot)

Behind the scenes, the rendered markup is coming from an iframe. And this is the code behind it...

<% 
var md = new MarkdownSharp.Markdown();
%>
<%= md.Transform(Request.Form[0]) %>

Surely I must be missing something. Oh, and I am using v1.13 (the latest version as of this writing).


EDIT (this is a test for StackOverflow's implementation)

abc

this shouldn't be red
Pelligrini answered 1/2, 2011 at 15:12 Comment(7)
This could be posted on meta.stackoverflow.com if it were related to this website.Fassold
@LordCover -- Interesting... out of curiosity I tested StackOverflow's implementation just now and they actually strip the div tags completely in my example above. And that works for me, but I don't know how they did that. It sure doesn't look to be a feature included in MarkdownSharp.Pelligrini
@BoltClock, why do you think this is by design? It seems both counter-intuitive and a security vulnerability.Kaoliang
@BoltClock - If this is by design then it makes Markdown a poor choice for user comments. I mean, like I said, I could have just as easily not closed the div and that would make the rest of the page red. Or I could take it a step further and do some script injection with Javascript. It appears that StackOverflow got around all this somehow by stripping unwanted tags.Pelligrini
Oversight on my part then, sorry.Blodget
This is definitely irritating. I'm having the same issue. If I don't HTML Encode the user inputted value, then they can inject code into the page. If I do encode the user inputted value, then the code that shows up in the markdown code blocks is HTML Encoded and shows up in the code block as &lt;div&gt; instead of <div> like it should.Ramses
After scanning the MarkdownSharp source I realized how simple it was and I modified it to add my own option called EncodeCodeBlocks which by default is set to true (current behavior). Setting it to false will stop it from re-encoding. See my answer for more.Ramses
P
2

Since it became clear that the StackOverflow implementation contains quite a few customizations that could be time-consuming to test and figure out, I decided to go in another direction.

I created my own simplified markup language that's a subset of Markdown. The open-source project is at http://ultralight.codeplex.com/ and you can see a working example at http://www.bucketsoft.com/ultralight/

The project is a complete ASP.NET MVC solution with a Javascript editor. And unlike MarkdownSharp, safe HTML is guaranteed. The Javascript parser is used both client-side and server-side to guarantee consistent markup (special thanks to the Jurassic Javascript compiler). It's a beautiful thing to only have to maintain one codebase for that parser.

Although the project is still in beta, I'm using it on my own site already and it seems to be working well so far.

Pelligrini answered 16/2, 2011 at 21:4 Comment(4)
It's not open source if the source code for the UltralightMarkup.dll is not made available. By definition, that would be closed source. On codeplex, there's only the code that uses that DLL.Hakim
@Khnle - What you downloaded is just what you need to implement it in your own project. If you want to look at the full source code, click the "Source Code" tab... ultralight.codeplex.com/SourceControl/list/changesetsPelligrini
@Wortham - Great, very nice. UltraMarkup meets my requirements nicely.Hakim
@Khnle - Good, I'm glad it works well for you. It's something I threw together one weekend with the help of several existing open-source projects. I've been pleased with the results so far. And I like that even if you choose Ultralight Markup for your website, you can always "upgrade" to Markdown later as the syntax is a true subset of Markdown.Pelligrini
R
3

For those not wanting to use Steve Wortham's customized solution, I have submitted an issue and a proposed fix to the MarkdownSharp guys: http://code.google.com/p/markdownsharp/issues/detail?id=43

If you download my attached Markdown.cs file you will find a new option that you can set. It will stop MarkdownSharp from re-encoding text within the code blocks.

Just don't forget to HTML encode your input BEFORE you pass it into markdown, NOT after.
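To show why the re-encoding matters, here is a minimal sketch of the double-encoding effect. It does not call MarkdownSharp at all: it uses `System.Net.WebUtility.HtmlEncode` twice as a stand-in for "you encode the input, then the code block gets encoded again". The class and method names are made up for illustration.

```csharp
using System;
using System.Net;

// Illustration only: simulates encoding the same text twice.
public static class DoubleEncodingDemo
{
    // What you do yourself before handing the text to Markdown.
    public static string EncodeOnce(string raw)
    {
        return WebUtility.HtmlEncode(raw);
    }

    // Stand-in for a second encoding pass applied to already encoded text.
    public static string EncodeTwice(string raw)
    {
        return WebUtility.HtmlEncode(WebUtility.HtmlEncode(raw));
    }

    public static void Main()
    {
        Console.WriteLine(EncodeOnce("<div>"));  // &lt;div&gt;
        Console.WriteLine(EncodeTwice("<div>")); // &amp;lt;div&amp;gt;
    }
}
```

A browser renders the second string as the literal text `&lt;div&gt;` rather than `<div>`, which is the symptom inside code blocks that the new option is meant to avoid.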

Another solution is to white-list HTML tags like Stack Overflow does. You would do this AFTER you pass your content to markdown.
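As a rough idea of what such a white-list pass could look like, here is a minimal sketch. It is not Stack Overflow's sanitizer and not part of MarkdownSharp; the `HtmlWhitelist` class, the tag list, and the choice to escape unknown tags (rather than strip them) are all my own assumptions. Only attribute-free tags from the list survive, so links and images produced by Markdown would be escaped too unless you extend it.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical post-processing step: run it on the HTML that Markdown produced.
public static class HtmlWhitelist
{
    // Only these tags, with no attributes, are passed through unchanged.
    private static readonly Regex AllowedTag = new Regex(
        @"</?(?:p|br|hr|b|i|em|strong|code|pre|blockquote|ul|ol|li|h[1-6])\s*/?>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Sanitize(string html)
    {
        var result = new StringBuilder();
        int position = 0;

        foreach (Match tag in AllowedTag.Matches(html))
        {
            // Text between allowed tags: neutralize any remaining angle brackets.
            result.Append(EscapeAngleBrackets(html.Substring(position, tag.Index - position)));
            // The allowed tag itself is copied as-is.
            result.Append(tag.Value);
            position = tag.Index + tag.Length;
        }

        // Trailing text after the last allowed tag.
        result.Append(EscapeAngleBrackets(html.Substring(position)));
        return result.ToString();
    }

    // Entities such as &lt; that Markdown already emitted are left alone,
    // because only raw '<' and '>' are replaced here.
    private static string EscapeAngleBrackets(string text)
    {
        return text.Replace("<", "&lt;").Replace(">", "&gt;");
    }
}
```

Usage would be along the lines of `HtmlWhitelist.Sanitize(md.Transform(input))`. Because everything outside the white-list is escaped in a single pass, a nested trick like `<<script>script>` cannot reassemble into a live tag.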

See this for more information: http://www.CodeTunnel.com/blog/post/24/mardownsharp-and-encoded-html

Ramses answered 17/3, 2011 at 20:19 Comment(0)
B
1

Maybe I'm not understanding? If you are starting a new code block in Markdown, in all its varieties, you do need a double linebreak and four-space indentation -- a single newline won't do in any of the renderers I have to hand.

abc -- Here comes a code block:

    <div style="background-color: red"> This is code</div>

yielding:

abc -- Here comes a code block:

<div style="background-color: red"> This is code</div>

From what you are saying it seems that MarkdownSharp does fine with this rule, so with just one newline (but indentation):

 abc -- Here comes a code block:
     <div style="background-color: red"> This should be code</div>

we get a mess not a code block:

abc -- Here comes a code block: This should be code

I assume StackOverflow is stripping the <div> tags, because they think comments shouldn't have divisions and suchlike things. (?) (In general they have to do a lot of other processing don't they, e.g. to get syntax highlighting and so on?)

EDIT: I think people are expecting the wrong thing of a Markdown implementation. For example, as I say below, there is no such thing as 'invalid markdown'. It isn't a programming language or anything like one. I have verified that all three markdown implementations I have available from the command line indifferently 'convert' random .js and .c files, or those inserted into otherwise sensible markdown -- and also interpolated zip files and other nonsense -- into valid html that browsers don't mind displaying at all -- chicken scratches though it is. If you want to exclude something, e.g. in a wiki program, you do something further, of course, as most markdown-employing wiki programs do.

Beachcomber answered 1/2, 2011 at 20:29 Comment(6)
The problem is that a user can easily screw up and not insert that critical line break that allows the code block to work. Then, the text that they type is straight HTML. That is the problem. That is what opens the site up to script injection and other bad things. What I'm alluding to is that out of the box MarkdownSharp is unsafe, and although StackOverflow has obviously come up with a solution, I don't know how they did it.Pelligrini
I don't see this. It wouldn't be valid Markdown if it didn't permit the writer to use <div>...</div> for regular composition, not just in code blocks. The only place html markup isn't html markup is in backticks and codeblocks. I assume MarkdownSharp like most implementations has whitelisting procedures that can be called and so forth?Beachcomber
There are only two solutions for the problem I'm describing. 1.) HTMLEncode the text that's not valid Markdown. 2.) Strip out the invalid Markdown entirely.Pelligrini
I see the difficulty, but it's hard to believe the usual switches are not available in this markdown converter (I don't see anything in the source, admittedly, maybe they expect you to preprocess?) In fact there is no such thing as 'invalid markdown'. Every sequence of unicode letters should have a corresponding html representation.Beachcomber
@Applicative - Thanks for the effort. But nobody was able to give a practical solution to my problem. And that's really what I was after.Pelligrini
@Beachcomber The problem with pre-processing the input was that the already encoded input would be re-encoded within markdown generated code blocks. See my answer for a fix to the actual MarkdownSharp project.Ramses