We have a Node.js application which we recently moved from running on IIS 7 (via iisnode) to running on Linux (Elastic Beanstalk). Since we switched, we've been getting a lot of non-UTF-8 URLs sent to our application (mainly from crawlers), such as:

```
Bj%F6rk
```

which IIS was converting to `Björk`. This is now being passed through to our application unchanged, and our web framework (Express) eventually calls down to:
```
> decodeURIComponent('Bj%F6rk');
URIError: URI malformed
    at decodeURIComponent (native)
    at repl:1:1
    at REPLServer.self.eval (repl.js:110:21)
    at repl.js:249:20
    at REPLServer.self.eval (repl.js:122:7)
    at Interface.<anonymous> (repl.js:239:12)
    at Interface.emit (events.js:95:17)
    at Interface._onLine (readline.js:203:10)
    at Interface._line (readline.js:532:8)
    at Interface._ttyWrite (readline.js:761:14)
```
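For reference, the deprecated `unescape` global does decode these percent-escapes, treating each `%XX` as a raw Latin-1 code point rather than as UTF-8, which appears to match what IIS was doing (a quick check in the Node REPL):

```javascript
// decodeURIComponent expects valid UTF-8 percent-encoding and throws on %F6,
// but the legacy unescape() maps %XX directly to the character at code point 0xXX.
const decoded = unescape('Bj%F6rk');
console.log(decoded); // 'Björk'
```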
Is there a recommended, safe way we can perform the same conversion IIS did before the URL string reaches Express? Bearing in mind that:
- We are receiving real requests to these badly encoded URLs.
- There is a way to decode them using the deprecated `unescape` JavaScript function.
- The majority of the requests to these URLs come from Bingbot, and we want to minimise any adverse effect on our search rankings.

Specifically:

- Should we really be doing this for all incoming URLs?
- Are there any security or performance implications we should be concerned about?
- Should we be concerned about `unescape` being removed in the near future?
- Is there a better / safer way to solve this problem? (Yes, we did read the MDN article linked to above.)
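For context, the workaround we are currently considering is a small Express middleware that only rewrites the URL when `decodeURIComponent` rejects it. This is a sketch under our assumptions (the hypothetical `fixLegacyEncoding` name is ours, and it assumes legacy requests are Latin-1; note that `unescape` would also decode reserved characters like `%3F`, which `encodeURI` does not re-escape):

```javascript
// Sketch: if the raw URL is not valid UTF-8 percent-encoding, assume it is
// Latin-1 (as IIS did) and re-encode it as UTF-8 escapes so that Express's
// own decodeURIComponent call downstream succeeds.
function fixLegacyEncoding(req, res, next) {
  try {
    decodeURIComponent(req.url); // throws URIError on e.g. %F6
  } catch (e) {
    if (e instanceof URIError) {
      // unescape() decodes %XX as a raw Latin-1 code point;
      // encodeURI() re-encodes the result as valid UTF-8 escapes.
      req.url = encodeURI(unescape(req.url));
    }
  }
  next();
}
```

Mounted before the router (`app.use(fixLegacyEncoding)`), a request for `/artist/Bj%F6rk` would be rewritten to `/artist/Bj%C3%B6rk`, while already-valid URLs pass through untouched.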