How to detect if a string is encoded with escape() or encodeURIComponent()

S

6

13

I have a web service that receives data from various clients. Some of them send the data encoded using escape(), while others use encodeURIComponent(). Is there a way to detect which encoding was used to escape the data?

Seersucker answered 14/8, 2009 at 3:48 Comment(3)

I don't have control over the data sent by our clients, and as I said before, some of them use escape() while others use encodeURIComponent(). Using unescape() on a string encoded with encodeURIComponent() generates bad characters, and I want to avoid that. Would it be a legitimate validation to check whether the string has its escape sequences in pairs, such as %xx%xx? – Seersucker 14/8, 2009 at 5:40

I've finally found the answer. decodeURIComponent will always decode the escaped characters, as it uses some conventions to detect, for each symbol, whether it is encoded in UTF-8 or ASCII. However, as Swingley comments, if a client sends data encoded using escape(), some data could be lost or garbled. So I give the point to him. – Seersucker 2/9, 2009 at 3:14

Since encodeURIComponent() uses UTF-8 encoding of characters >= 128, you can at the server side check for valid UTF-8 sequences. If the data contains invalid UTF-8 sequences the data has been produced with escape() and you probably have to assume it is ISO-8859-1 encoded. Octets of ISO-8859-1 data in practice never look like valid UTF-8 sequences. – Geek 9/8, 2016 at 11:9
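
A minimal sketch of the check described in the comment above, assuming a runtime where TextDecoder is available (modern browsers and recent Node.js). The names percentDecodeToBytes and guessEncoding are hypothetical. The extra %uXXXX test is an addition: escape() emits that form for code units above 0xFF, while encodeURIComponent() never does.

```javascript
// Hypothetical helpers; the names are illustrative and not from the thread.

// Turn a percent-encoded string into raw bytes without interpreting them.
function percentDecodeToBytes(input) {
  const bytes = [];
  for (let i = 0; i < input.length; i++) {
    const hex = input.slice(i + 1, i + 3);
    if (input[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      // Assumption: characters left unescaped are plain ASCII.
      bytes.push(input.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Returns 'escape', 'ascii', 'utf-8' or 'latin-1' as a best guess.
function guessEncoding(input) {
  // %uXXXX sequences can only come from escape().
  if (/%u[0-9A-Fa-f]{4}/.test(input)) return 'escape';
  const bytes = percentDecodeToBytes(input);
  // Pure ASCII decodes identically either way, so there is nothing to detect.
  if (!bytes.some((b) => b >= 0x80)) return 'ascii';
  try {
    // fatal: true makes decode() throw on invalid UTF-8 instead of substituting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return 'utf-8';
  } catch (e) {
    return 'latin-1';
  }
}
```

With this sketch, guessEncoding('%C3%A4') reports 'utf-8' and guessEncoding('%E4') reports 'latin-1'. It remains a heuristic: Latin-1 text whose bytes happen to form valid UTF-8 would be misclassified.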

H

8

Encourage your clients to use encodeURIComponent(). See this page for an explanation: Comparing escape(), encodeURI(), and encodeURIComponent(). If you really want to try to figure out exactly how something was encoded, you can try to look for some of the characters that escape() and encodeURI() do not encode.
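
One way to read "look for the characters that are not encoded" is sketched below. It assumes the comparison of interest is escape() versus encodeURIComponent() (the pair from the question), and the function name hintFromLiterals is hypothetical. escape() leaves @ + / as literals and encodes ! ~ ' ( ), while encodeURIComponent() does the opposite, so a literal from one group hints at the function that produced the string.

```javascript
// Hypothetical helper; a heuristic only, since many strings contain none of these characters.
function hintFromLiterals(encoded) {
  // Literals that escape() keeps but encodeURIComponent() would have encoded.
  const escapeOnly = /[@+\/]/.test(encoded);
  // Literals that encodeURIComponent() keeps but escape() would have encoded.
  const componentOnly = /[!~'()]/.test(encoded);

  if (escapeOnly && !componentOnly) return 'escape';
  if (componentOnly && !escapeOnly) return 'encodeURIComponent';
  return 'unknown'; // no telltale literal, or contradictory ones
}
```

The result is only a hint: a client may post-process its output, and a string without any of these characters gives no signal at all.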

Helldiver answered 14/8, 2009 at 3:54 Comment(3)

I agree with that, but unfortunately I can't force the clients to adopt an encoding standard. – Seersucker 15/8, 2009 at 21:35

also, maybe something like: function isEncoded(str){return decodeURIComponent(str) !== str;} – Parlay 29/4, 2012 at 0:59

@Parlay thanks for your idea, it worked for me. :) – Durand 16/3, 2018 at 6:5

L

16

This won't help on the server side, but on the client side I have used JavaScript exceptions to detect whether the URL encoding produced ISO Latin or UTF-8 encoding.

decodeURIComponent throws an exception on invalid UTF-8 sequences.

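try {
    // Succeeds when the escape sequences form valid UTF-8.
    result = decodeURIComponent(string);
} catch (e) {
    // Invalid UTF-8: assume the string came from escape().
    result = unescape(string);
}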

For example, the ISO Latin encoded umlaut 'ä' (%E4) will throw an exception in Firefox, but the UTF-8 encoded 'ä' (%C3%A4) will not.
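
The same idea wrapped in a function, as a small sketch; the name decodeEither is hypothetical, and the code assumes the runtime still provides the deprecated unescape().

```javascript
// Hypothetical wrapper around the try/catch fallback shown above.
function decodeEither(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (e) {
    // Only malformed-URI errors trigger the fallback; anything else is rethrown.
    if (!(e instanceof URIError)) throw e;
    return unescape(encoded);
  }
}

console.log(decodeEither('%C3%A4')); // 'ä' (valid UTF-8, decoded by decodeURIComponent)
console.log(decodeEither('%E4'));    // 'ä' (invalid UTF-8, decoded by unescape)
console.log(decodeEither('%u20AC')); // '€' (%uXXXX form, decoded by unescape)
```

One caveat: escape() output whose bytes happen to form valid UTF-8 is decoded as UTF-8. For instance, escape('Ã¤') also yields %C3%A4, which this function returns as 'ä'.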

C

3

Thanks to @mika for the great answer. Maybe just one improvement, since the unescape function is considered deprecated:

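// Ambient declaration so the call type-checks even where typings omit the deprecated unescape().
declare function unescape(s: string): string;

function decodeURItoString(str: string): string {
    let resp = str;

    try {
        resp = decodeURI(str);
    } catch (e) {
        console.log('ERROR: Can not decodeURI string!');

        // Fall back to unescape() only if the runtime still provides it.
        if (typeof unescape === 'function') {
            resp = unescape(str);
        }
    }

    return resp;
}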

Channing answered 17/8, 2017 at 10:27 Comment(0)

D

0

You don't have to differentiate them. escape() is so-called percent encoding; it differs from URI encoding only in how certain characters are encoded. For example, a space is encoded as %20 with escape() but as + with URI encoding. Once decoded, you always get the same value.

Defect answered 14/8, 2009 at 5:37 Comment(1)

They differ wildly in how non-ascii characters are encoded: encodeURIComponent() produces percent encoded UTF-8 sequences while escape() percent encodes the octets (as in ISO-8859-1 bytes). – Geek 9/8, 2016 at 11:15
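
To make the difference concrete, here is a short comparison that can be run in a browser console or Node.js. The helper name compare is hypothetical, and the expected results are shown as comments.

```javascript
// Hypothetical helper: encode one character with both functions.
function compare(ch) {
  return { escaped: escape(ch), component: encodeURIComponent(ch) };
}

console.log(compare('ä')); // escaped: '%E4',    component: '%C3%A4'
console.log(compare('€')); // escaped: '%u20AC', component: '%E2%82%AC'
console.log(compare(' ')); // escaped: '%20',    component: '%20'
console.log(compare('+')); // escaped: '+',      component: '%2B'
```

ASCII characters that both functions encode come out the same, which is why the two are easy to confuse. Anything at or above code point 128 does not, and neither function produces + for a space.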

D

0

Maybe not the most performant, but this function will recursively decode the encoded string until it cannot decode it anymore.

function decodeValue(str) {
    const decodedStr = decodeURIComponent(str);

    if (decodedStr === str) {
        return decodedStr; // Base case: no more decoding needed
    } else {
        return decodeValue(decodedStr); // String is encoded. Recur with the decoded value
    }
}

decodeValue("%253Ctable class='table-1'%253E%253Ctbody%253E%253Ctr%253E%253Ctd%253Esdfsd%253C/td%253E%253Ctd%253Esdfsd%253C/td%253E%253C/tr%253E%253Ctr%253E%253Ctd%253Esdfsd%253C/td%253E%253Ctd%253Esdfs%253C/td%253E%253C/tr%253E%253C/tbody%253E%253C/table%253E");

In this example the string was encoded twice, so decodeValue decodes it twice; a third and final call detects that nothing changed and returns the result.
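
One caveat with repeated decoding: decodeURIComponent throws a URIError when it meets a % that is not part of a valid escape, so decodeValue("100%25") fails on its second pass once the value has become "100%". A guarded, iterative variant is sketched below; the name decodeFully and the maxRounds limit are assumptions, not part of the answer above.

```javascript
// Hypothetical variant: decode repeatedly, but stop on malformed input or after maxRounds passes.
function decodeFully(str, maxRounds = 10) {
  let current = str;
  for (let i = 0; i < maxRounds; i++) {
    let next;
    try {
      next = decodeURIComponent(current);
    } catch (e) {
      // Malformed escape (for example a bare "%"): keep the last good value.
      return current;
    }
    if (next === current) return current; // nothing left to decode
    current = next;
  }
  return current;
}

console.log(decodeFully('%253Cb%253E')); // '<b>'
console.log(decodeFully('100%25'));      // '100%'
```

Decoding until nothing changes assumes that the original value never legitimately contained a percent escape; if it did, this would decode one level too many.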

Delorsedelos answered 21/7, 2023 at 19:15 Comment(0)
