Fastest way to get array of NSRange objects for all uppercase letters in an NSString?

Asked 11/1, 2011 at 1:6 Answered 11/1, 2011 at 5:42

Solved ios iphone objective-c nsstring nsrange

I need NSRange objects for the position of each uppercase letter in a given NSString for input into a method for a custom attributed string class.

There are of course quite a few ways to accomplish this such as rangeOfString:options: with NSRegularExpressionSearch or using RegexKitLite to get each match separately while walking the string.

What would be the fastest performing approach to accomplish this task?

Bellda answered 11/1, 2011 at 1:6 Comment(0)

The simplest way is probably to use -rangeOfCharacterFromSet:options:range: with [NSCharacterSet uppercaseLetterCharacterSet]. By modifying the range to search over with each call, you can find all of the uppercase letters pretty easily. Something like the following will work to give you an NSArray of all ranges (encoded as NSValues):

- (NSArray *)rangesOfUppercaseLettersInString:(NSString *)str {
    NSCharacterSet *cs = [NSCharacterSet uppercaseLetterCharacterSet];
    NSMutableArray *results = [NSMutableArray array];
    NSRange searchRange = NSMakeRange(0, [str length]);
    NSRange range;
    while ((range = [str rangeOfCharacterFromSet:cs options:0 range:searchRange]).location != NSNotFound) {
        [results addObject:[NSValue valueWithRange:range]];
        searchRange = NSMakeRange(NSMaxRange(range), [str length] - NSMaxRange(range));
    }
    return results;
}

Note, this will not coalesce adjacent ranges into a single range, but that's easy enough to add.

Here's an alternative solution based on NSScanner:

- (NSArray *)rangesOfUppercaseLettersInString:(NSString *)str {
    NSCharacterSet *cs = [NSCharacterSet uppercaseLetterCharacterSet];
    NSMutableArray *results = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:str];
    while (![scanner isAtEnd]) {
        [scanner scanUpToCharactersFromSet:cs intoString:NULL]; // skip non-uppercase characters
        NSString *temp;
        NSUInteger location = [scanner scanLocation];
        if ([scanner scanCharactersFromSet:cs intoString:&temp]) {
            // found one (or more) uppercase characters
            NSRange range = NSMakeRange(location, [temp length]);
            [results addObject:[NSValue valueWithRange:range]];
        }
    }
    return results;
}

Unlike the last, this one does coalesce adjacent uppercase characters into a single range.

Edit: If you're looking for absolute speed, this one will likely be the fastest of the 3 presented here, while still preserving correct unicode support (note, I have not tried compiling this):

// returns a pointer to an array of NSRanges, and fills in count with the number of ranges
// the buffer is autoreleased
- (NSRange *)rangesOfUppercaseLettersInString:(NSString *)string count:(NSUInteger *)count {
    NSMutableData *data = [NSMutableData data];
    NSUInteger numRanges = 0;
    NSUInteger length = [string length];
    unichar *buffer = malloc(sizeof(unichar) * length);
    [string getCharacters:buffer range:NSMakeRange(0, length)];
    NSCharacterSet *cs = [NSCharacterSet uppercaseLetterCharacterSet];
    NSRange range = {NSNotFound, 0};
    for (NSUInteger i = 0; i < length; i++) {
        if ([cs characterIsMember:buffer[i]]) {
            if (range.location == NSNotFound) {
                range = (NSRange){i, 0};
            }
            range.length++;
        } else if (range.location != NSNotFound) {
            [data appendBytes:&range length:sizeof(range)];
            numRanges++;
            range = (NSRange){NSNotFound, 0};
        }
    }
    if (range.location != NSNotFound) {
        [data appendBytes:&range length:sizeof(range)];
        numRanges++;
    }
    if (count) *count = numRanges;
    return [data bytes];
}

Purificator answered 11/1, 2011 at 1:15 Comment(11)

Thanks. I got the first one to work, but it was a little slower than the regex solution below. – Bellda 12/1, 2011 at 3:45

Did you try the second solution? – Purificator 12/1, 2011 at 3:59

I'll try your second solution when I get some time... probably later tonight. – Bellda 12/1, 2011 at 13:44

With the @"A simple String to TEST for Upper Case Letters." test string, the results have been revised as follows: 1. @Kevin Ballard's NSScanner solution (~3-4 ms) 2. @johne's enumerateStringsMatchedByRegex solution (~4-10 ms) 3. @Kevin Ballard's rangeOfCharacterFromSet solution (~13-14 ms) The jury is still out on @Catfish_Man's rangebuffer solution. – Bellda 12/1, 2011 at 22:28

If you're looking for speed, I've attached another solution – Purificator 13/1, 2011 at 3:37

@Bellda The times you reported seemed a little odd to me. On my 2.66GHz MacBook Pro, I got 0.0055ms for RegexKitLite and 0.0214ms for @Kevin Ballards NSScanner (the 2nd option in this answer) code (used a loop w/ 5000 repetitions, timed via getrusage()). – Fraze 14/1, 2011 at 12:58

@Fraze When doing performance testing of any kind on iOS, the numbers you get on the desktop aren't very informative. You should try the same test on an iPhone/iPad. – Purificator 14/1, 2011 at 23:21

In particular, memory management will be much slower on iOS. One of several reasons for my obsessive avoidance of malloc() and objects wherever possible for my (admittedly somewhat impractical) solution. (note for people trying solutions: there is a test project there now! Bet you can't beat it! ;) ) – Brahmanism 15/1, 2011 at 1:48

NSRangeMake should be NSMakeRange. – Rhpositive 13/6, 2011 at 23:8

@jrdioko: It already says NSMakeRange. The only instance of NSRangeMake I can find on this page is in your comment. – Purificator 14/6, 2011 at 1:5

@Kevin: Sorry, I also edited your answer to make the correction. – Rhpositive 14/6, 2011 at 1:45

Using RegexKitLite 4.0+ with a runtime that supports Blocks, this can be quite zippy:

NSString *string = @"A simple String to TEST for Upper Case Letters.";
NSString *regex = @"\\p{Lu}";

[string enumerateStringsMatchedByRegex:regex options:RKLNoOptions inRange:NSMakeRange(0UL, [string length]) error:NULL enumerationOptions:RKLRegexEnumerationCapturedStringsNotRequired usingBlock:^(NSInteger captureCount, NSString * const capturedStrings[captureCount], const NSRange capturedRanges[captureCount], volatile BOOL * const stop) {
  NSLog(@"Range: %@", NSStringFromRange(capturedRanges[0]));
}];

The regex \p{Lu} says "Match all characters with the Unicode property of 'Letter' that are also 'Upper Case'".

The option RKLRegexEnumerationCapturedStringsNotRequired tells RegexKitLite that it shouldn't create NSString objects and pass them via capturedStrings[]. This saves quite a bit of time and memory. The only thing that gets passed to the block is the NSRange values for the match via capturedRanges[].

There are two main parts to this, the first is the RegexKitLite method:

[string enumerateStringsMatchedByRegex:regex
                               options:RKLNoOptions
                               inRange:NSMakeRange(0UL, [string length])
                                 error:NULL
                    enumerationOptions:RKLRegexEnumerationCapturedStringsNotRequired
                            usingBlock:/* ... */
];

... and the second is the Block that is passed as an argument to that method:

^(NSInteger captureCount,
  NSString * const capturedStrings[captureCount],
  const NSRange capturedRanges[captureCount],
  volatile BOOL * const stop) { /* ... */ }

Fraze answered 11/1, 2011 at 5:42 Comment(1)

I tried this RegexKitLite enumerateStringsMatchedByRegex solution and version 1 of Kevin Ballard's (using rangeOfCharacterFromSet) and this was consistently the fastest at ~4 to 10 ms vs. ~13 to 14 ms respectively. I tweaked the regular expression string to return adjacent characters in a single NSRange object. Works like champ. Thanks! – Bellda 12/1, 2011 at 3:44

It somewhat depends on the size of the string, but the absolute fastest way I can think of (note: internationalization safety not guaranteed, or even expected! Does the concept of uppercase even apply in say, Japanese?) is:

1) Get a pointer to a raw C string of the string, preferably in a stack buffer if it's small enough. CFString has functions for this. Read the comments in CFString.h.

2) malloc() a buffer big enough to hold one NSRange per character in the string.

3) Something like this (completely untested, written into this text field, pardon mistakes and typos)

NSRange *bufferCursor = rangeBuffer; 
NSRange range = {NSNotFound, 0}; 
for (int idx = 0; idx < numBytes; ++idx) { 
    if (isupper(buffer[idx])) { 
        if (range.length > 0) { //extend a range, we found more than one uppercase letter in a row
            range.length++;
        } else { //begin a range
            range.location = idx; 
            range.length = 1;
        }
    }
    else if (range.location != NSNotFound) { //end a range, we hit a lowercase letter
        *bufferCursor = range; 
        bufferCursor++;
        range.location = NSNotFound;
    }
}

4) realloc() the range buffer back down to the size you actually used (might need to keep a count of ranges begun to do that)

Brahmanism answered 11/1, 2011 at 3:0 Comment(9)

I tried this one, but couldn't quite piece it together. Possibly, it would have beat out the regex solution. – Bellda 12/1, 2011 at 3:46

If you can post your test project somewhere I'd be somewhat interested in seeing how my approach compares; I could probably whip it up to try out. – Brahmanism 12/1, 2011 at 4:22

Ooh also I should use CFStringInitInline buffer for step 1. Forgot about that. – Brahmanism 12/1, 2011 at 4:25

@Catfish_Man: For my timing numbers, I used @johne's sample code exactly as-is and added RegexKitLite to my project. Since I am using this code to display attributed text on-the-fly as the user types for potentially tens of thousands of words (retrieved from a DAWG structure), shaving even a few milliseconds will help. – Bellda 12/1, 2011 at 14:16

To clarify, I'm only processing hundreds of words at a time in separate threads (not thousands) with the remainder shown via "Show More", but it will affect how fast the words are displayed. – Bellda 12/1, 2011 at 14:36

Ah, you were testing in your own project then. I was hoping for a timing harness project I could grab. Lazy and interested are conflicting ;) – Brahmanism 12/1, 2011 at 17:16

np... By "my project", I meant I pasted the above code into an existing project (I happened to already have the RegExKitLite references), put in a break and simply timed how long it took to parse his input string "A simple String to TEST for Upper Case Letters" for each approach... so I didn't do anything special. My above comments were generally about why I care to shave milliseconds off this code. – Bellda 12/1, 2011 at 17:54

Ok, my curiosity got the better of me. Test project at dscoder.com/UppercaseSearcher.zip According to my measurements (warning: 2:30AM arithmetic ;) ), I'm currently clocking in at 4.37472111E-7 seconds per call. To reiterate: I don't recommend actually using this code, but it was a fun experiment, and might have techniques worth borrowing. – Brahmanism 13/1, 2011 at 10:45

Hehe, indeed, I warned about that. There are various things that could make it more applicable (for example using CFStringInitInlineBuffer instead of CFStringGetCStringPtr). I'm doing this partly so people can see what things go really fast (letting the compiler "see" everything in the loop, avoiding malloc, etc...), partly for fun, and partly to annoy Kevin Ballard up in the other answer ;) – Brahmanism 18/1, 2011 at 23:21

a function such as isupper* in conjunction with -[NSString characterAtIndex:] will be plenty fast.

*isupper is an example - it may or may not be appropriate for your input.

Expropriate answered 11/1, 2011 at 1:38 Comment(2)

A potential problem with this approach is characterAtIndex returns a UTF16 character, whereas isupper should "probably" be iswupper. Personally, I don't much care for the is* API when I know I have to be "correct internationally", but that could just be me. – Fraze 11/1, 2011 at 6:2

@Fraze yes - i mentioned the issue in my response, but the clarification helps since not everybody is familiar with libc. cheers – Expropriate 11/1, 2011 at 6:6

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags