Javascript unescape() vs. Python urllib.unquote() - McMap

Asked 18/4, 2014 at 17:13 Answered 18/4, 2014 at 17:15

Solved javascript python escaping urllib

From reading various posts, it seems that JavaScript's unescape() is equivalent to Python's urllib.unquote(). However, when I test both, I get different results:

In browser console:

unescape('%u003c%u0062%u0072%u003e');

output: <br>

In Python interpreter:

import urllib
urllib.unquote('%u003c%u0062%u0072%u003e')

output: %u003c%u0062%u0072%u003e

I would expect Python to also return <br>. Any ideas as to what I'm missing here?

Thanks!

Collayer answered 18/4, 2014 at 17:13 Comment(0)

%uxxxx is a non-standard URL encoding scheme that is not supported by urllib.parse.unquote() (Python 3) or urllib.unquote() (Python 2).

It was only ever part of ECMAScript (ECMA-262 3rd edition, Annex B); the format was rejected by the W3C and was never part of an RFC.
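
A minimal check of this behaviour, assuming Python 3 (where the function lives in urllib.parse): standard %XX escapes are decoded, while %uxxxx sequences pass through unchanged.

```python
from urllib.parse import unquote

# Standard percent-encoding (%XX) is decoded as expected.
standard = unquote('%3Cbr%3E')

# The non-standard %uXXXX form is not recognised and is returned untouched.
nonstandard = unquote('%u003c%u0062%u0072%u003e')

print(standard)
print(nonstandard)
```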

You could use a regular expression to convert such codepoints:

import re

try:
    unichr  # only in Python 2
except NameError:
    unichr = chr  # Python 3

# quoted is the string containing the %u escapes
re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)

This decodes both the %uxxxx and the %uxx forms that ECMAScript 3rd edition can decode.

Demo:

>>> import re
>>> quoted = '%u003c%u0062%u0072%u003e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), quoted)
'<br>'
>>> altquoted = '%u3c%u0062%u0072%u3e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), altquoted)
'<br>'
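
For repeated use, the substitution can be wrapped in a small helper. This is only a sketch: the name unquote_u is hypothetical, it targets Python 3, and it reuses the pattern above, so a bare %u with no hex digits after it is left as literal text.

```python
import re

# Same pattern as above: %u followed by four hex digits, or by two.
_U_ESCAPE = re.compile(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})')


def unquote_u(text):
    """Replace %uXXXX / %uXX escapes with the corresponding character."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unquote_u('%u003c%u0062%u0072%u003e'))
print(unquote_u('100%u'))
```

The helper does not touch ordinary %XX escapes; those would still need a separate pass through urllib.parse.unquote().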

However, you should avoid this encoding altogether if possible.

Sap answered 18/4, 2014 at 17:15 Comment(5)

isn't x.replace('%u',r'\u').decode('unicode-escape') simpler? – Wot 18/4, 2014 at 17:28

@roippi: that presumes that all %u characters are part of an escaped codepoint. I wanted to play it a little safer than that. – Sap 18/4, 2014 at 17:29

true, but since b is escaped as %u0062 I assume every single character is an escaped codepoint. – Wot 18/4, 2014 at 17:31

@roippi: I just checked the ECMA-262 3rd edition standard (check page 171, B.2.2); %u is also permitted (literal text), as is %uxx (equivalent to %u00xx). Using a simple \u replacement is not sufficient there. – Sap 18/4, 2014 at 17:39

@MartijnPieters diligent! You've got it right then. – Wot 18/4, 2014 at 17:41
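
The difference discussed in these comments can be reproduced with a short sketch. It assumes Python 3, where str has no .decode() method, so the replace-based idea is adapted to go through ASCII bytes first; the name naive_unquote is hypothetical.

```python
def naive_unquote(text):
    # The replace-based approach from the comments, adapted for Python 3:
    # turn every %u into \u, then let the unicode-escape codec decode it.
    return text.replace('%u', '\\u').encode('ascii').decode('unicode-escape')


# Works when every %u starts a full four-digit escape.
print(naive_unquote('%u003c%u0062%u0072%u003e'))

# A literal %u with no hex digits after it becomes a truncated \u escape,
# which the codec rejects instead of leaving the text alone.
try:
    naive_unquote('100%u')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```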
