Using Objective-C/Cocoa to unescape Unicode characters, i.e. \u1234

Some sites that I am fetching data from return UTF-8 strings in which the non-ASCII characters are escaped, e.g. \u5404\u500b\u90fd

Is there a built-in Cocoa function that can help with this, or will I have to write my own decoding algorithm?

Fibster answered 20/1, 2010 at 6:0 Comment(0)
U
24

There is no built-in function to do C unescaping.

You can cheat a little with NSPropertyListSerialization since an "old text style" plist supports C escaping via \Uxxxx:

NSString* input = @"ab\"cA\"BC\\u2345\\u0123";

// will cause trouble if you have "abc\\\\uvw"
NSString* esc1 = [input stringByReplacingOccurrencesOfString:@"\\u" withString:@"\\U"];
NSString* esc2 = [esc1 stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];
NSString* quoted = [[@"\"" stringByAppendingString:esc2] stringByAppendingString:@"\""];
NSData* data = [quoted dataUsingEncoding:NSUTF8StringEncoding];
NSString* unesc = [NSPropertyListSerialization propertyListFromData:data
                   mutabilityOption:NSPropertyListImmutable format:NULL
                   errorDescription:NULL];
assert([unesc isKindOfClass:[NSString class]]);
NSLog(@"Output = %@", unesc);

Bear in mind that this isn't very efficient; it's far better to write your own parser. (By the way, are you decoding JSON strings? If so, you could use one of the existing JSON parsers.)
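For a rough idea of what such a hand-written parser could look like, here is a minimal sketch in plain C, which can also be compiled as part of an Objective-C file. The names `unescape_unicode`, `parse_escape` and `put_utf8` are made up for this illustration. Leaving malformed escapes untouched, and substituting U+FFFD for lone surrogates and for \u0000, are assumptions of the sketch rather than anything stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "\uXXXX" at p. Returns the 16-bit value, or -1 if p does not
   start a complete escape. Never reads past the terminating NUL. */
static long parse_escape(const char *p)
{
    long value = 0;
    if (p[0] != '\\' || p[1] != 'u') return -1;
    for (int i = 2; i < 6; i++) {
        int digit = hex_value(p[i]);
        if (digit < 0) return -1;
        value = value * 16 + digit;
    }
    return value;
}

/* Write code point cp to out as UTF-8; returns the number of bytes (1-4). */
static size_t put_utf8(char *out, unsigned long cp)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns a malloc'ed UTF-8 string (caller frees), or NULL if out of memory. */
char *unescape_unicode(const char *in)
{
    assert(in != NULL);
    /* The output never outgrows the input: a 6-byte escape yields at most
       3 bytes, a 12-byte surrogate pair yields 4. */
    char *out = malloc(strlen(in) + 1);
    size_t n = 0;
    if (out == NULL) return NULL;
    while (*in != '\0') {
        long hi = parse_escape(in);
        if (hi < 0) {                 /* ordinary character or malformed escape */
            out[n++] = *in++;
            continue;
        }
        in += 6;
        unsigned long cp = (unsigned long)hi;
        if (hi >= 0xD800 && hi <= 0xDBFF) {
            long lo = parse_escape(in);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {   /* valid surrogate pair */
                in += 6;
                cp = 0x10000UL + ((unsigned long)(hi - 0xD800) << 10)
                               + (unsigned long)(lo - 0xDC00);
            }
        }
        /* Assumption: lone surrogates and \u0000 become U+FFFD. */
        if (cp == 0 || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        n += put_utf8(out + n, cp);
    }
    out[n] = '\0';
    return out;
}
```

From Objective-C, the returned buffer could be wrapped with `[NSString stringWithUTF8String:]` and then released with `free`.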

Used answered 20/1, 2010 at 6:40 Comment(1)
"There is no built-in function to do it" is what I was trying to find out. I ended up rolling my own; I just wanted to check that I wasn't reinventing the wheel. The existing JSON parsers are nowhere near forgiving enough of the badly formed JSON that dodgy web sites sometimes send. - Fibster
T
94

It's correct that Cocoa does not offer a solution, yet Core Foundation does: CFStringTransform.

CFStringTransform lives in a dusty, remote corner of Mac OS (and iOS), so it's a little-known gem. It is the front end to Apple's ICU-compatible string transformation engine. It can perform real magic, like transliteration between Greek and Latin (or just about any known scripts), but it can also be used for mundane tasks like unescaping strings from a crappy server:

NSString *input = @"\\u5404\\u500b\\u90fd";
NSMutableString *convertedString = [input mutableCopy];

CFStringRef transform = CFSTR("Any-Hex/Java");
CFStringTransform((__bridge CFMutableStringRef)convertedString, NULL, transform, YES);

NSLog(@"convertedString: %@", convertedString);

// prints: 各個都, tada!

As I said, CFStringTransform is really powerful. It supports a number of predefined transforms, like case mappings, normalizations or Unicode character name conversion. You can even design your own transformations.

I have no idea why Apple does not make it available from Cocoa.

Edit 2015:

OS X 10.11 and iOS 9 add the following method to Foundation:

- (nullable NSString *)stringByApplyingTransform:(NSString *)transform reverse:(BOOL)reverse;

So the example from above becomes...

NSString *input = @"\\u5404\\u500b\\u90fd";
NSString *convertedString = [input stringByApplyingTransform:@"Any-Hex/Java"
                                                     reverse:YES];

NSLog(@"convertedString: %@", convertedString);

Thanks @nschmidt for the heads up.

Thetes answered 23/7, 2012 at 14:55 Comment(7)
This is a brilliant piece of functionality by Apple, and it goes far beyond this kind of transformation. - Knap
Let's say that I receive a string like convertedString from a source that I cannot change. Can you tell me how I can go about reversing the process so that I get back the original string? - Mamey
It's not working; I tried this. It prints à®à®²à®à®®à¯ - Punkah
@shiami see userguide.icu-project.org/transforms/… - Cloutier
Perfect solution on iOS 9, since there is a Foundation method to do this. I had been searching for two days for a way to decode Unicode escapes into an NSString. - Panpsychist
But when I convert the string to an NSURL using [NSURL URLWithString:string], it returns nil. Is there any solution for this? - Panpsychist
Note: Just use @"Any-Hex" for unescaping emojis. - Tommy
A
12

Here's what I ended up writing. Hopefully this will help some people along.

+ (NSString*) unescapeUnicodeString:(NSString*)string
{
// unescape quotes and backwards slash
NSString* unescapedString = [string stringByReplacingOccurrencesOfString:@"\\\"" withString:@"\""];
unescapedString = [unescapedString stringByReplacingOccurrencesOfString:@"\\\\" withString:@"\\"];

// tokenize based on unicode escape char
NSMutableString* tokenizedString = [NSMutableString string];
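NSScanner* scanner = [NSScanner scannerWithString:unescapedString];
// by default NSScanner skips whitespace before each scan, which would
// silently drop spaces that follow an escape sequence
[scanner setCharactersToBeSkipped:nil];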
while ([scanner isAtEnd] == NO)
{
    // read up to the first unicode marker
    // if a string has been scanned, it's a token
    // and should be appended to the tokenized string
    NSString* token = @"";
    [scanner scanUpToString:@"\\u" intoString:&token];
    if (token != nil && token.length > 0)
    {
        [tokenizedString appendString:token];
        continue;
    }

    // check whether a full escape (2 marker characters plus
    // 4 hex digits) still fits in the string; if it doesn't,
    // the input is malformed, so copy the remainder verbatim
    // and stop scanning (the old range arithmetic could run
    // past the end of a string that ends in a bare "\u")
    NSUInteger location = [scanner scanLocation];
    NSUInteger remaining = scanner.string.length - location;
    if (remaining < 6)
    {
        [tokenizedString appendString:[scanner.string substringFromIndex:location]];
        break;
    }

    // move the location past the unicode marker
    // then read in the next 4 characters
    location += 2;
    NSRange range = {location, 4};
    token = [scanner.string substringWithRange:range];
    unichar codeValue = (unichar) strtol([token UTF8String], NULL, 16);
    [tokenizedString appendString:[NSString stringWithFormat:@"%C", codeValue]];

    // move the scanner past the 4 characters
    // then keep scanning
    location += 4;
    [scanner setScanLocation:location];
}

// done
return tokenizedString;
}

+ (NSString*) escapeUnicodeString:(NSString*)string
{
// first escape backslashes and quotes
// note that the backslash has to be escaped before the quote
// otherwise it will end up with an extra backslash
NSString* escapedString = [string stringByReplacingOccurrencesOfString:@"\\" withString:@"\\\\"];
escapedString = [escapedString stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];

// convert to encoded unicode
// do this by getting the data for the string
// in UTF-16 little endian, which matches the host byte order
// on little-endian CPUs, so the bytes can be read as uint16_t
NSData* data = [escapedString dataUsingEncoding:NSUTF16LittleEndianStringEncoding allowLossyConversion:YES];
size_t bytesRead = 0;
const char* bytes = data.bytes;
NSMutableString* encodedString = [NSMutableString string];

// loop through the byte array two bytes
// (one UTF-16 code unit) at a time;
// code units above 0x7E are written as \uXXXX escapes,
// anything else is plain ASCII and is appended
// unchanged via the %C format
while (bytesRead < data.length)
{
    uint16_t code = *((uint16_t*) &bytes[bytesRead]);
    if (code > 0x007E)
    {
        [encodedString appendFormat:@"\\u%04X", code];
    }
    else
    {
        [encodedString appendFormat:@"%C", code];
    }
    bytesRead += sizeof(uint16_t);
}

// done
return encodedString;
}
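
For comparison, the escaping direction can also be sketched in plain C working directly on UTF-8 bytes rather than on UTF-16 data. The name `escape_unicode` is hypothetical, the input is assumed to be valid UTF-8, and the 0x7E cut-off and the backslash/quote escaping are simply copied from the method above. Characters outside the Basic Multilingual Plane are written as a surrogate pair.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'ed ASCII string (caller frees), or NULL if out of memory.
   Assumes `in` is valid UTF-8; invalid input is handled safely but the
   resulting escapes are then meaningless. */
char *escape_unicode(const char *in)
{
    assert(in != NULL);
    const unsigned char *p = (const unsigned char *)in;
    /* Worst case is 6 output characters per input byte (e.g. 0x7F). */
    char *out = malloc(strlen(in) * 6 + 1);
    size_t n = 0;
    if (out == NULL) return NULL;

    while (*p != 0) {
        unsigned long cp;
        int extra;                      /* expected continuation bytes */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        /* Stop early on a truncated sequence so we never pass the NUL. */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);

        if (cp == '\\' || cp == '"') {  /* escape backslash and quote */
            out[n++] = '\\';
            out[n++] = (char)cp;
        } else if (cp < 0x7F) {         /* plain ASCII is copied as is */
            out[n++] = (char)cp;
        } else if (cp < 0x10000) {      /* one \uXXXX escape */
            n += (size_t)sprintf(out + n, "\\u%04lX", cp);
        } else {                        /* surrogate pair */
            cp -= 0x10000;
            n += (size_t)sprintf(out + n, "\\u%04lX\\u%04lX",
                                 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
    }
    out[n] = '\0';
    return out;
}
```

Working on UTF-8 avoids the byte-order question entirely, at the cost of a small hand-written decoder.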
As answered 28/10, 2011 at 22:44 Comment(1)
It ought to be legal to kill a server-side developer just for forcing me to use this solution. @As nice working code, by the way. Cheers! - Grill
S
3

simple code:

const char *cString = [unicodeStr cStringUsingEncoding:NSUTF8StringEncoding];
NSString *resultStr = [NSString stringWithCString:cString encoding:NSNonLossyASCIIStringEncoding];

from: https://stackoverflow.com/a/7861345

Syndetic answered 21/11, 2014 at 0:55 Comment(1)
Hi all, I am facing a strange issue and I don't know why none of the suggestions above work. Can anyone please parse this string for me? @"ElbowWristHand_DeQuervian\U00e2\U0080\U0099s Tenosynovitis"; It should actually be "ElbowWristHand_DeQuervian's", and I have tried all the methods suggested above, but it still doesn't work. Please suggest. Thanks - Sadfaced
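
One possible reading of the string in the comment above is that the server escaped each UTF-8 byte separately: E2 80 99 is the UTF-8 encoding of U+2019, a typographic apostrophe. Under that assumption, which the thread does not confirm, the escapes have to be turned back into raw bytes rather than into characters. A minimal sketch in plain C, with the made-up name `unescape_as_bytes`:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: turns \uXXXX or \UXXXX escapes whose value is in the
   range 01-FF into that single raw byte; everything else is copied unchanged.
   Returns a malloc'ed string (caller frees), or NULL if out of memory. */
char *unescape_as_bytes(const char *in)
{
    assert(in != NULL);
    char *out = malloc(strlen(in) + 1);  /* output is never longer than input */
    size_t n = 0;
    if (out == NULL) return NULL;

    while (*in != '\0') {
        long value = -1;
        if (in[0] == '\\' && (in[1] == 'u' || in[1] == 'U')) {
            value = 0;
            for (int i = 2; i < 6; i++) {
                int d = hex_digit(in[i]);
                if (d < 0) { value = -1; break; }  /* also stops at the NUL */
                value = value * 16 + d;
            }
        }
        if (value > 0 && value <= 0xFF) {
            out[n++] = (char)(unsigned char)value; /* one escape, one byte */
            in += 6;
        } else {
            out[n++] = *in++;                      /* not a byte escape */
        }
    }
    out[n] = '\0';
    return out;
}
```

If the guess is right, the result is valid UTF-8 and could be handed to `[NSString stringWithUTF8String:]`; if the bytes do not form valid UTF-8, that call returns nil, which is a cheap way to check the assumption.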