In PHP, there is a function called htmlspecialchars() that performs the following substitutions on a string:
& (ampersand) is converted to &amp;
" (double quote) is converted to &quot;
' (single quote) is converted to &#039; (only if the flag ENT_QUOTES is set)
< (less than) is converted to &lt;
> (greater than) is converted to &gt;
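For reference, here is a minimal sketch of those substitutions in action. The sample string and variable names are my own, and I pass the flags and charset explicitly because, as far as I know, the defaults differ between PHP versions.

```php
<?php
// Sample input (made up for illustration) containing all five characters.
$input = 'Tom & "Jerry" <b>it\'s</b>';

// ENT_QUOTES converts both double and single quotes.
$withQuotes = htmlspecialchars($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// ENT_COMPAT converts double quotes only and leaves single quotes alone.
$noSingle = htmlspecialchars($input, ENT_COMPAT | ENT_HTML401, 'UTF-8');

echo $withQuotes, PHP_EOL; // Tom &amp; &quot;Jerry&quot; &lt;b&gt;it&#039;s&lt;/b&gt;
echo $noSingle, PHP_EOL;   // Tom &amp; &quot;Jerry&quot; &lt;b&gt;it's&lt;/b&gt;
```

Note that the single quote only changes in the first call, which matches the ENT_QUOTES remark in the list.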
Apparently, this is done on the grounds that these 5 specific characters are the unsafe HTML characters.
I can understand why the last two are considered unsafe: if they are simply echoed, arbitrary and potentially dangerous HTML could be delivered, including JavaScript via <script> tags.
Question 1. Why are the first three characters (ampersand, double quote, single quote) also considered 'unsafe'?
Also, I stumbled upon this library called "he" on GitHub (by Mathias Bynens), which is about encoding/decoding HTML entities. There, I found the following:
[...] characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. [...]
(source)
Question 2. Is there a good reason for considering the backtick another unsafe HTML character? If yes, does this mean that PHP's function mentioned above is outdated?
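To make the difference concrete, here is a small sketch, assuming PHP 7 or later. It shows that htmlspecialchars() passes the backtick through unchanged even with ENT_QUOTES. The function escapeWithBacktick() is a hypothetical helper of my own, not part of PHP or of "he"; it only illustrates what encoding a sixth character could look like, using the numeric character reference for U+0060.

```php
<?php
$input = 'a ` b';

// No htmlspecialchars() flag covers the backtick, so it is left as-is.
$encoded = htmlspecialchars($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo $encoded, PHP_EOL; // a ` b

// Hypothetical helper: escape the usual five characters, then the backtick.
function escapeWithBacktick(string $s): string
{
    $escaped = htmlspecialchars($s, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // htmlspecialchars() never emits a backtick itself, so this extra
    // replacement cannot double-encode anything.
    return str_replace('`', '&#x60;', $escaped);
}

$withBacktick = escapeWithBacktick($input);
echo $withBacktick, PHP_EOL; // a &#x60; b
```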
Finally, all this raises the question:
Question 3. Are there any other characters that should be considered 'unsafe', alongside those 5/6 characters mentioned above?