Xml Escaping/Encoding terminology

Asked 18/4, 2009 at 11:12 Answered 20/4, 2013 at 21:2

Solved xml encoding escaping html-encode xml-encoding

I'm confused as for the difference between the terms "escaping" and "encoding" in phrases like:

Xml Encoding

Xml Escaping

Encoded Html

Escaped Url

...

Can anyone explain it to me?

Burnham answered 18/4, 2009 at 11:12 Comment(0)

Encoding describes how the file's characters are physically written in binary (as in Unicode or ANSI).
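To make that concrete, here is a minimal Python sketch (not part of the original answer) showing one character written as different bytes under different character encodings. The choice of character and the variable names are only illustrative.

```python
# One character, three different byte representations.
text = "\u00e9"  # the character "é", code point U+00E9

utf8_bytes = text.encode("utf-8")          # two bytes: c3 a9
latin1_bytes = text.encode("iso-8859-1")   # one byte: e9
utf16_bytes = text.encode("utf-16-le")     # two bytes, little-endian, no BOM: e9 00

print(utf8_bytes.hex(), latin1_bytes.hex(), utf16_bytes.hex())
```

The text is the same in all three cases; only the bytes on disk or on the wire differ.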

Escaping refers to the process of replacing special characters (such as < and >) with their XML entity equivalents (such as &lt; and &gt;). For URLs, escaping refers to replacing characters with strings starting with %, such as %20 for a space.
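As a rough illustration (an addition, not from the original answer), both kinds of escaping can be tried with the Python standard library. The sample strings are made up; `xml.sax.saxutils.escape` and `urllib.parse.quote` are the standard-library helpers assumed here.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# XML escaping: &, < and > become entity references.
xml_text = escape("a < b & c > d")

# URL escaping: characters outside the unreserved set become %XX sequences.
url_text = quote("hello world")

print(xml_text)  # a &lt; b &amp; c &gt; d
print(url_text)  # hello%20world
```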

Escaping differs by language, but encodings are usually widely-accepted standards. Sometimes the terms are used ambiguously (particularly with encoding used to mean escaping), but they are well defined and distinct.

Kerf answered 18/4, 2009 at 12:4 Comment(3)

A pedantic clarification: "unicode" is not an encoding but a character set (UTF-8, ISO8859-1, CP850 are examples of encodings). Sadly, Unicode and UTF-8 are often used as synonymous while they are not. – Keikokeil 5/6, 2010 at 21:39

Agreed that "encoding" is the right term w/r/t "character encoding", but these terms are not "well defined and distinct" when it comes to the process of replacing characters to avoid special interpretation. See my answer. – Sunday 20/4, 2013 at 21:8

Regarding what Yaron has asked, note that in the .NET framework you have these two methods, which do almost the same thing: HttpUtility.UrlPathEncode and Uri.EscapeUriString. – Cabernet 25/1, 2018 at 14:13

In every web application, data passes through various layers such as the view layer, model layer, database layer, etc. Each layer is "supposed" to be developed independently to satisfy various scalability and maintainability requirements.

Now, basically, every layer needs to "talk" to every other layer, and they have to agree on a language in which to talk. This is called encoding. Various encodings exist, such as ASCII, UTF-8, UTF-16, etc. If the user writes in Chinese or Japanese, for instance, ASCII won't work, so they would use UTF-16 or another encoding that can represent those characters. From the web layer, Chinese characters pass through the business layer and then to the data layer, and the same "encoding" scheme has to be used everywhere.

Why?

Suppose your web layer sends data in UTF-16 to support Chinese, but the database layer accepts only ASCII. The database layer would then be confused about what you are saying: it understands only English characters and won't understand the rest. That was encoding.
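A small Python sketch of that mismatch, added here as an illustration only: the "web layer" and "database layer" are simulated with plain encode and decode calls, and the variable names are hypothetical.

```python
# "Web layer": Chinese text encoded as UTF-16 (Python prepends a byte order mark).
text = "\u4e2d\u6587"  # the two characters of "中文"
utf16_bytes = text.encode("utf-16")

# "Database layer" that only understands ASCII: decoding fails.
try:
    utf16_bytes.decode("ascii")
    ascii_layer_understood = True
except UnicodeDecodeError:
    ascii_layer_understood = False

# A layer that uses the same encoding recovers the text unchanged.
round_trip = utf16_bytes.decode("utf-16")

print(ascii_layer_understood, round_trip == text)
```

When both sides agree on the encoding, the text survives the trip; when they do not, the receiver either fails or produces garbage.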

Escaping:

There is a certain set of data, called "metadata", which has a special meaning from the browser's perspective. For example, <> are metadata from the browser's perspective. The browser's parser knows that whatever is contained inside these <> is to be interpreted as markup. Attackers use this to confuse the browser. For example:

<input type="text" value="${name}" />

If I replace the name with

name="/><script>alert(document.cookie)</script>

then the resulting code, as the browser sees it, will be

<input type="text" value=""/><script>alert(document.cookie)</script>" />

This means you need to tell the browser that whatever I put in name should be "escaped", that is, treated as data only. There are various functions that encode/escape <> as their HTML equivalents &lt;&gt;, so the browser knows to treat them as text rather than markup. Basically, escaping means making the characters escape their special meaning (roughly speaking).

<input type="text" value="${fn:escapeXml(name)}" />

using JSTL.
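The same idea can be sketched outside JSP. The following Python snippet is an illustration added here, not the answerer's code: it uses the standard-library `html.escape` in place of JSTL's function, and the `name`, `safe_value` and `page` variables are hypothetical.

```python
from html import escape

# Attacker-controlled input from the example above.
name = '"/><script>alert(document.cookie)</script>'

# Escape &, <, > and quotes so the value stays inside the attribute as data.
safe_value = escape(name, quote=True)
page = f'<input type="text" value="{safe_value}" />'

print(page)
```

With the value escaped, the browser receives no literal `<script>` tag and displays the input as text inside the field.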

Scorn answered 14/10, 2012 at 14:51 Comment(0)

TL;DR Both terms are interchangeable (if what you mean is to convert some characters so they will be interpreted as plain string data). This debate is old. From CWE-116: Improper Encoding or Escaping of Output:

The usage of the "encoding" and "escaping" terms varies widely. For example, in some programming languages, the terms are used interchangeably, while other languages provide APIs that use both terms for different tasks. This overlapping usage extends to the Web, such as the "escape" JavaScript function whose purpose is stated to be encoding. Of course, the concepts of encoding and escaping predate the Web by decades. Given such a context, it is difficult for CWE to adopt a consistent vocabulary that will not be misinterpreted by some constituency.

Comically enough, JavaScript also has encodeURIComponent(), and its specification avoids the debate entirely:

The encodeURIComponent function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.
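The "one, two, three, or four escape sequences" correspond to the number of bytes in the character's UTF-8 encoding. A rough Python approximation (an addition for illustration; `encode_uri_component` is a hypothetical helper, not a real library function) can be built on `urllib.parse.quote`, assuming the characters `!*'()` are left unescaped as JavaScript does:

```python
from urllib.parse import quote


def encode_uri_component(s: str) -> str:
    """Approximate JavaScript's encodeURIComponent using percent-escapes of UTF-8 bytes."""
    # quote() never escapes letters, digits, or "_.-~"; also keep !*'() unescaped.
    return quote(s, safe="!*'()", encoding="utf-8")


one = encode_uri_component(" ")             # 1 UTF-8 byte  -> %20
two = encode_uri_component("\u00e9")        # 2 UTF-8 bytes -> %C3%A9
three = encode_uri_component("\u20ac")      # 3 UTF-8 bytes -> %E2%82%AC
four = encode_uri_component("\U0001F600")   # 4 UTF-8 bytes -> %F0%9F%98%80

print(one, two, three, four)
```

Note how the function's name says "encode" while its output consists of "escape sequences" built from a character "encoding", which is exactly the overlap in terminology discussed here.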

Personally I believe it's more appropriate to refer to the general process as "encoding", as you're creating a code to be transmitted through a communications channel (a piece of markup/programming code) and interpreted by a receiver (the parser). I think it's silly to replace < with something completely different like &lt; and call that "escaping".

Sunday answered 20/4, 2013 at 21:2 Comment(1)

For example, in the .NET framework you have these two methods, which do almost the same thing: HttpUtility.UrlPathEncode and Uri.EscapeUriString. – Cabernet 25/1, 2018 at 14:12
