Sanitize Foreign Characters / Accents From URL

Asked 20/1, 2012 at 20:1 Answered 20/1, 2012 at 20:45

I need to write a server side function to sanitize URL encoded strings.

Example querystring:

FirstName=John&LastName=B%F3th&Address=San+Endre+%FAt+12%2F14

When I pass that through HttpUtility.UrlDecode() I get:

FirstName=John&LastName=B�th&Address=San Endre �t 12/14

The function from this SO post looks perfect, but it expects decoded strings that already have accents:

RemoveDiacritics('Bóth') ==> 'Both';
RemoveDiacritics('San Endre út 12/14') ==> 'San Endre ut 12/14';

How can I decode the URL without getting all these � characters?

I cannot do anything client side or change the way they come into my function.

Norling answered 20/1, 2012 at 20:1 Comment(2)

This looks like a fail at the caller... do you control the client application? – Accordance 20/1, 2012 at 20:15

@MarcGravell no, unfortunately I don't have control over that – Norling 20/1, 2012 at 20:35

I agree with the arguments already put forth; however, if you’re always receiving your encoded strings from the same client, then you can match their encoding. In this case, they appear to be using ISO/IEC 8859-1, informally known as Latin-1, which is one of the most popular 8-bit character sets in use. You can decode ISO/IEC 8859-1 using the following code (which will correctly decode the sample string you provided):

HttpUtility.UrlDecode(encodedInput, Encoding.GetEncoding("iso-8859-1"));

MSDN guarantees that the above code page will be natively supported by the .NET Framework, regardless of the underlying platform; refer to the table of supported encodings for the Encoding Class.
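To tie this back to the question, here is a minimal sketch that decodes as Latin-1 and then strips the accents. The class and method names (QueryStringCleaner, DecodeLatin1) are made up for illustration, and RemoveDiacritics is a common normalization-based approach that I am assuming is close to the function in the linked post, not a copy of it.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Web;

// Hypothetical helper class; the names are illustrative only.
static class QueryStringCleaner
{
    // Decode percent-escapes as ISO-8859-1 bytes instead of the default UTF-8.
    public static string DecodeLatin1(string encoded) =>
        HttpUtility.UrlDecode(encoded, Encoding.GetEncoding("iso-8859-1"));

    // Split each accented letter into base letter + combining mark (FormD),
    // drop the combining marks, then recompose what is left (FormC).
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With the sample from the question, QueryStringCleaner.RemoveDiacritics(QueryStringCleaner.DecodeLatin1("LastName=B%F3th&Address=San+Endre+%FAt+12%2F14")) should give LastName=Both&Address=San Endre ut 12/14. Decoding the whole query string at once is only safe if no value contains an escaped & or =; otherwise HttpUtility.ParseQueryString(query, encoding) is probably the better entry point.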

Caracas answered 20/1, 2012 at 20:45 Comment(0)

UrlDecode expects UTF-8 for its input, where each character larger than \u007F is encoded as at least 2 bytes. So the correct string (if the character is \u00F3, ó) would have contained %C3%B3, not %F3.
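A small sketch to make the byte difference visible; EncodingComparison and HexBytes are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper for comparing how one character is encoded.
static class EncodingComparison
{
    // Returns the bytes of `text` in the given encoding as dash-separated hex.
    public static string HexBytes(string text, Encoding encoding) =>
        BitConverter.ToString(encoding.GetBytes(text));
}
```

EncodingComparison.HexBytes("\u00F3", Encoding.UTF8) yields C3-B3, while the same call with Encoding.GetEncoding("iso-8859-1") yields F3, which is the single byte seen in the question's query string.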

If the strings arrive the way you get them, I'm not sure there's much you can do. Not with the standard libraries, that is.

By the way, stripping accents from foreign characters is OK, but I wouldn't call it "sanitizing".

Alleman answered 20/1, 2012 at 20:10 Comment(2)

Do you know what type of encoding B%F3th is called? Maybe I could try to find a conversion function? – Norling 20/1, 2012 at 20:16

Well, it's more than likely Windows-1252, so you could try to find a decoding routine that uses that. – Alleman 20/1, 2012 at 20:20

%F3 and %FA are neither valid UTF-8 sequences nor ASCII. It looks like the client-side code encodes the string in the current page's locale.

Depending on your needs, you can either simply strip out all characters above 127, or figure out how to properly decode the incoming URL (I don't think a built-in function exists to handle it as is).
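The first option could be sketched like this; AsciiFilter and StripNonAscii are hypothetical names. Note that it drops the affected letters entirely instead of replacing them with their unaccented form.

```csharp
using System;
using System.Linq;

// Hypothetical helper: keeps only characters in the 7-bit ASCII range.
static class AsciiFilter
{
    public static string StripNonAscii(string text) =>
        new string(text.Where(c => c <= 127).ToArray());
}
```

For a decoded value such as "B\uFFFDth" (the � replacement character in place of ó), this returns "Bth".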

I would copy the characters into a byte array (manually decoding the %-encoded ones) and use the correct Encoding to convert it to a string (using Encoding.GetString - http://msdn.microsoft.com/en-us/library/system.text.encoding.getstring.aspx).
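A rough sketch of that manual approach, assuming the caller knows which encoding the client used. ManualUrlDecoder and Decode are hypothetical names, and malformed escapes (a % not followed by two hex digits) are simply passed through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: percent-decodes into raw bytes, then interprets
// those bytes with the encoding supplied by the caller.
static class ManualUrlDecoder
{
    public static string Decode(string input, Encoding encoding)
    {
        var bytes = new List<byte>(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '%' && i + 2 < input.Length
                && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                // "%F3" becomes the single byte 0xF3.
                bytes.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
            }
            else if (c == '+')
            {
                // In form-encoded query strings '+' stands for a space.
                bytes.Add((byte)' ');
            }
            else
            {
                // Literal characters are converted with the same encoding.
                bytes.AddRange(encoding.GetBytes(c.ToString()));
            }
        }
        return encoding.GetString(bytes.ToArray());
    }
}
```

For example, ManualUrlDecoder.Decode("B%F3th", Encoding.GetEncoding("iso-8859-1")) returns "Bóth".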

Paradox answered 20/1, 2012 at 20:11 Comment(2)

to be fair, the existing decode methods do "properly decode" an incoming url... as long as the incoming url meets the specification... – Accordance 20/1, 2012 at 20:14

@Mark Gravell, good point. I used "properly" as "the way it would work for this case", not "the way it should be done". – Paradox 20/1, 2012 at 20:17
