Using Objective-C/Cocoa to unescape Unicode characters, i.e. \u1234

Asked 20/1, 2010 at 6:0 Answered 21/11, 2014 at 0:55

Some sites that I am fetching data from return UTF-8 strings in which the non-ASCII characters are escaped, e.g. \u5404\u500b\u90fd

Is there a built-in Cocoa function that might assist with this, or will I have to write my own decoding algorithm?

Fibster answered 20/1, 2010 at 6:0 Comment(0)

There is no built-in function to do C unescaping.

You can cheat a little with NSPropertyListSerialization since an "old text style" plist supports C escaping via \Uxxxx:

NSString* input = @"ab\"cA\"BC\\u2345\\u0123";

// will cause trouble if you have "abc\\\\uvw"
NSString* esc1 = [input stringByReplacingOccurrencesOfString:@"\\u" withString:@"\\U"];
NSString* esc2 = [esc1 stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];
NSString* quoted = [[@"\"" stringByAppendingString:esc2] stringByAppendingString:@"\""];
NSData* data = [quoted dataUsingEncoding:NSUTF8StringEncoding];
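// propertyListFromData:mutabilityOption:format:errorDescription: is deprecated;
// propertyListWithData:options:format:error: is its replacement
NSString* unesc = [NSPropertyListSerialization propertyListWithData:data
                   options:NSPropertyListImmutable format:NULL
                   error:NULL];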
assert([unesc isKindOfClass:[NSString class]]);
NSLog(@"Output = %@", unesc);

Bear in mind that this isn't very efficient; writing your own parser is far better. (By the way, are you decoding JSON strings? If so, you could use one of the existing JSON parsers.)
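If the escaped text really is the body of a JSON string, the JSON route could look roughly like the sketch below. It assumes NSJSONSerialization is available (it was added to Foundation after this answer was written) and that the input contains no unescaped double quotes or control characters. UnescapeViaJSON is a hypothetical helper name, not a Foundation API.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: treats `escaped` as the body of a JSON string literal
// and lets NSJSONSerialization decode the \uXXXX escapes.
// Returns nil if the wrapped text is not valid JSON.
static NSString *UnescapeViaJSON(NSString *escaped)
{
    // Wrap the text in double quotes so it parses as a JSON string.
    NSString *quoted = [NSString stringWithFormat:@"\"%@\"", escaped];
    NSData *data = [quoted dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }

    // NSJSONReadingAllowFragments permits a bare string as the top-level value.
    id value = [NSJSONSerialization JSONObjectWithData:data
                                               options:NSJSONReadingAllowFragments
                                                 error:NULL];
    return [value isKindOfClass:[NSString class]] ? value : nil;
}
```

Because the parser is strict, badly formed input simply yields nil, so this only suits servers that escape their strings correctly.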

Used answered 20/1, 2010 at 6:40 Comment(1)

"There is no built-in function to do it" is what I was trying to find out. I ended up rolling my own; I just wanted to check I wasn't reinventing the wheel. The existing JSON parsers are nowhere near forgiving enough of the badly formed JSON output that is sometimes sent by dodgy web sites. – Fibster 20/1, 2010 at 23:16

~~It's correct that Cocoa does not offer a solution~~, yet Core Foundation does: CFStringTransform.

CFStringTransform lives in a dusty, remote corner of Mac OS (and iOS), so it's a little-known gem. It is the front end to Apple's ICU-compatible string transformation engine. It can perform real magic like transliteration between Greek and Latin (or just about any known scripts), but it can also be used for mundane tasks like unescaping strings from a crappy server:

NSString *input = @"\\u5404\\u500b\\u90fd";
NSMutableString *convertedString = [input mutableCopy]; // CFStringTransform modifies the string in place

CFStringRef transform = CFSTR("Any-Hex/Java");
CFStringTransform((__bridge CFMutableStringRef)convertedString, NULL, transform, YES);

NSLog(@"convertedString: %@", convertedString);

// prints: 各個都, tada!

As I said, CFStringTransform is really powerful. It supports a number of predefined transforms, such as case mappings, normalizations and Unicode character name conversion. You can even design your own transformations.
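To give a feel for the predefined transforms, here is a minimal sketch that applies two of the constants declared in CFString.h alongside the custom ICU identifier used earlier. ApplyTransform and DemoTransforms are hypothetical helper names introduced only for this illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: applies one transform identifier to a mutable copy of
// `input` and returns the result, or nil if the transform could not be applied.
static NSString *ApplyTransform(NSString *input, CFStringRef transform, BOOL reverse)
{
    NSMutableString *buffer = [input mutableCopy];
    Boolean ok = CFStringTransform((__bridge CFMutableStringRef)buffer,
                                   NULL,       // NULL range: transform the whole string
                                   transform,
                                   reverse);
    return ok ? [buffer copy] : nil;
}

static void DemoTransforms(void)
{
    // Predefined transform: strip accents and other combining marks.
    NSString *plain = ApplyTransform(@"café", kCFStringTransformStripCombiningMarks, NO);

    // Predefined transform: replace each character with its Unicode name,
    // written in \N{...} form.
    NSString *named = ApplyTransform(@"é", kCFStringTransformToUnicodeName, NO);

    // Custom ICU identifier, applied in reverse to unescape \uXXXX sequences.
    NSString *unescaped = ApplyTransform(@"\\u5404\\u500b\\u90fd", CFSTR("Any-Hex/Java"), YES);

    NSLog(@"%@ | %@ | %@", plain, named, unescaped);
}
```

Returning nil when CFStringTransform reports failure keeps a misspelled transform identifier from going unnoticed.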

~~I have no idea why Apple does not make it available from Cocoa.~~

Edit 2015:

OS X 10.11 and iOS 9 add the following method to Foundation:

- (nullable NSString *)stringByApplyingTransform:(NSString *)transform reverse:(BOOL)reverse;

So the example from above becomes...

NSString *input = @"\\u5404\\u500b\\u90fd";
NSString *convertedString = [input stringByApplyingTransform:@"Any-Hex/Java"
                                                     reverse:YES];

NSLog(@"convertedString: %@", convertedString);
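For code that still has to run on systems older than OS X 10.11 / iOS 9, the two variants can be combined. This is a sketch that assumes a compiler recent enough to support the @available check; UnescapeJavaHex is a hypothetical helper name.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: unescapes \uXXXX sequences, preferring the Foundation
// method where it exists and falling back to CFStringTransform otherwise.
static NSString *UnescapeJavaHex(NSString *input)
{
    if (@available(macOS 10.11, iOS 9.0, *)) {
        return [input stringByApplyingTransform:@"Any-Hex/Java" reverse:YES];
    }

    // Older systems: run the same transform in place on a mutable copy.
    NSMutableString *buffer = [input mutableCopy];
    Boolean ok = CFStringTransform((__bridge CFMutableStringRef)buffer,
                                   NULL,
                                   CFSTR("Any-Hex/Java"),
                                   YES);
    return ok ? [buffer copy] : nil;
}
```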

Thanks @nschmidt for the heads up.

Thetes answered 23/7, 2012 at 14:55 Comment(7)

This is a brilliant piece of functionality by Apple, and it goes far beyond this kind of transformation. – Knap 16/10, 2012 at 4:29

Let's say that I receive a string like convertedString from a source that I cannot change. Can you tell me how can I go about reversing the process so I get back the original string? – Mamey 3/12, 2012 at 15:27

its not working, I tried this. it prints à®à®²à®à®®à¯ – Punkah 29/10, 2014 at 6:48

@shiami see userguide.icu-project.org/transforms/… – Cloutier 14/5, 2015 at 15:13

perfect solution in iOS 9 since there is a foundation method to do this. I have been searching the solution for 2 days to decode unicode characters to NSString.. – Panpsychist 11/10, 2015 at 5:53

But when I convert the string to an NSURL using [NSURL URLWithString:string], it returns nil. Any solution for this? – Panpsychist 11/10, 2015 at 6:20

Note: Just use @"Any-Hex" for unescaping emojis. – Tommy 16/3, 2020 at 21:24

Here's what I ended up writing. Hopefully this will help some people along.

+ (NSString*) unescapeUnicodeString:(NSString*)string
{
// unescape quotes and backslashes
NSString* unescapedString = [string stringByReplacingOccurrencesOfString:@"\\\"" withString:@"\""];
unescapedString = [unescapedString stringByReplacingOccurrencesOfString:@"\\\\" withString:@"\\"];

// tokenize based on unicode escape char
NSMutableString* tokenizedString = [NSMutableString string];
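NSScanner* scanner = [NSScanner scannerWithString:unescapedString];
// NSScanner skips whitespace and newlines by default, which would
// silently drop spaces that directly follow an escape sequence
scanner.charactersToBeSkipped = nil;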
while ([scanner isAtEnd] == NO)
{
    // read up to the first unicode marker
    // if a string has been scanned, it's a token
    // and should be appended to the tokenized string
    NSString* token = @"";
    [scanner scanUpToString:@"\\u" intoString:&token];
    if (token != nil && token.length > 0)
    {
        [tokenizedString appendString:token];
        continue;
    }

    // skip two characters to get past the marker
    // check if the range of unicode characters is
    // beyond the end of the string (could be malformed)
    // and if it is, move the scanner to the end
    // and skip this token
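    NSUInteger location = [scanner scanLocation];
    NSUInteger remaining = scanner.string.length - location;
    if (remaining < 6)
    {
        // fewer than 6 characters left ("\u" plus 4 hex digits):
        // copy the malformed tail verbatim and finish
        // (the original range arithmetic overran the string when it ended in "\u")
        [tokenizedString appendString:[scanner.string substringFromIndex:location]];
        [scanner setScanLocation:scanner.string.length];
        continue;
    }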

    // move the location past the unicode marker
    // then read in the next 4 characters
    location += 2;
    NSRange range = {location, 4};
    token = [scanner.string substringWithRange:range];
    unichar codeValue = (unichar) strtol([token UTF8String], NULL, 16);
    [tokenizedString appendString:[NSString stringWithFormat:@"%C", codeValue]];

    // move the scanner past the 4 characters
    // then keep scanning
    location += 4;
    [scanner setScanLocation:location];
}

// done
return tokenizedString;
}

+ (NSString*) escapeUnicodeString:(NSString*)string
{
// first escape backslashes and quotes
// note that the backslash has to be escaped before the quote
// otherwise it will end up with an extra backslash
NSString* escapedString = [string stringByReplacingOccurrencesOfString:@"\\" withString:@"\\\\"];
escapedString = [escapedString stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];

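// convert to encoded unicode
// do this by getting the data for the string in UTF-16 little endian,
// so that on a little-endian CPU each pair of bytes reads directly as a uint16_t
// (note: little endian is not network byte order, which is big endian)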
NSData* data = [escapedString dataUsingEncoding:NSUTF16LittleEndianStringEncoding allowLossyConversion:YES];
size_t bytesRead = 0;
const char* bytes = data.bytes;
NSMutableString* encodedString = [NSMutableString string];

// loop through the byte array
// read two bytes at a time, if the bytes
// are above a certain value they are unicode
// otherwise the bytes are ASCII characters
// the %C format will write the character value of bytes
while (bytesRead < data.length)
{
    uint16_t code = *((uint16_t*) &bytes[bytesRead]);
    if (code > 0x007E)
    {
        [encodedString appendFormat:@"\\u%04X", code];
    }
    else
    {
        [encodedString appendFormat:@"%C", code];
    }
    bytesRead += sizeof(uint16_t);
}

// done
return encodedString;
}

As answered 28/10, 2011 at 22:44 Comment(1)

it must be legal to kill server-side developer, just for forcing me to use this solution. @As nice working code by the way. Cheers! – Grill 28/9, 2012 at 8:33

A simple approach is to decode the string with NSNonLossyASCIIStringEncoding, which understands \uXXXX escapes:

const char *cString = [unicodeStr cStringUsingEncoding:NSUTF8StringEncoding];
NSString *resultStr = [NSString stringWithCString:cString encoding:NSNonLossyASCIIStringEncoding];

from: https://stackoverflow.com/a/7861345
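The same idea can be written with NSData instead of a C string, which also makes the reverse direction easy to show. This is a minimal sketch: NSNonLossyASCIIStringEncoding is a real Foundation encoding, while the two helper names are hypothetical. It assumes the escaped input is pure ASCII; other backslash sequences in the input may be interpreted as escapes or make decoding fail.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes \uXXXX escapes in an ASCII-only string.
// Returns nil if the input is not ASCII or cannot be decoded.
static NSString *DecodeNonLossyASCII(NSString *escaped)
{
    NSData *data = [escaped dataUsingEncoding:NSASCIIStringEncoding];
    if (data == nil) {
        return nil;
    }
    return [[NSString alloc] initWithData:data
                                 encoding:NSNonLossyASCIIStringEncoding];
}

// Hypothetical helper: the reverse direction, producing an ASCII-only string
// in which non-ASCII characters are written as escapes.
static NSString *EncodeNonLossyASCII(NSString *plain)
{
    NSData *data = [plain dataUsingEncoding:NSNonLossyASCIIStringEncoding];
    if (data == nil) {
        return nil;
    }
    return [[NSString alloc] initWithData:data encoding:NSASCIIStringEncoding];
}
```

Both helpers return nil rather than a partially converted string, so callers can fall back to another method when the input does not fit these assumptions.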

Syndetic answered 21/11, 2014 at 0:55 Comment(1)

Hi all, I am facing a strange issue and I don't know why the suggestions made above are not working. Can anyone please parse this string for me? @"ElbowWristHand_DeQuervian\U00e2\U0080\U0099s Tenosynovitis"; It should actually be "ElbowWristHand_DeQuervian's". I have tried all the methods suggested above, but it is still not working. Please suggest a fix. Thanks – Sadfaced 7/1, 2015 at 12:17
