In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic
I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search runs over the contents of the large file, loaded as an NSString. I need to match any word that appears between a newline and a tab character, and then grab the whole line. For example, if I have the word "monday" in my NSArray, I want to match this line within the dictionary file:
monday M AH N D IY
This line is preceded by a newline; the string "monday" is followed by a tab character, and then the pronunciation follows. The regex needs to match the entire line, because the whole line is the output I need. I also need to find alternate pronunciations of the words, which are listed as follows:
monday(2) M AH N D EY
The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for occurrences of the word followed by parentheses containing a single digit, again between a newline and a tab character.
I have a 100% working NSRegularExpression method as follows:
    // This array could contain any arbitrary words, but they will always be in alphabetical order by the time they get here.
    NSArray *array = [NSArray arrayWithObjects:@"change", @"friday", @"monday", @"saturday", @"sunday", @"thursday", @"tuesday", @"wednesday", nil];

    // Use this string to build up the pattern.
    NSMutableString *mutablePatternString = [[NSMutableString alloc] initWithString:@"^("];
    int firstRound = 0;
    for (NSString *word in array) {
        if (firstRound == 0) { // This is the first round.
            firstRound++;
        } else { // After the first iteration we need an OR operator first.
            [mutablePatternString appendString:@"|"];
        }
        // Each word may optionally be followed by a one-character marker in parentheses, e.g. (2).
        [mutablePatternString appendFormat:@"(%@(\\(.\\)|))", word];
    }
    [mutablePatternString appendString:@")\\t.*$"];

    // This results in this regex pattern:
    // ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

    NSRegularExpression *regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                       options:NSRegularExpressionAnchorsMatchLines
                                                                                         error:nil];
    NSUInteger rangeLocation = 0;
    NSUInteger rangeLength = [string length]; // string holds the contents of the dictionary file.
    NSMutableArray *matches = [NSMutableArray array];
    [regularExpression enumerateMatchesInString:string
                                        options:0
                                          range:NSMakeRange(rangeLocation, rangeLength)
                                     usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
        // Collect the full text of each matching line.
        [matches addObject:[string substringWithRange:result.range]];
    }];
    [mutablePatternString release];
    // matches array is returned to the caller.
My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:
• The 3.2 MB text file has the words to match listed in alphabetical order
• The array of arbitrary words to look up is always in alphabetical order when it gets to this method
• Alternate pronunciations start with (2) in parens after the word, not (1)
• If there is no (2) there won't be a (3), (4) or more
• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.
Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.
****EDIT****
Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:
dasblinkenlight's create NSDictionary approach at https://mcmap.net/q/1669968/-regex-pattern-and-or-nsregularexpression-a-bit-too-slow-searching-over-very-large-file-can-it-be-optimized: 5.259676 seconds
Ωmega's fastest regex at https://mcmap.net/q/1669968/-regex-pattern-and-or-nsregularexpression-a-bit-too-slow-searching-over-very-large-file-can-it-be-optimized: 0.609593 seconds
dasblinkenlight's multiple NSRegularExpression approach at https://mcmap.net/q/1669968/-regex-pattern-and-or-nsregularexpression-a-bit-too-slow-searching-over-very-large-file-can-it-be-optimized: 1.255130 seconds
my first hybrid approach at https://mcmap.net/q/1669968/-regex-pattern-and-or-nsregularexpression-a-bit-too-slow-searching-over-very-large-file-can-it-be-optimized: 0.372215 seconds
my second hybrid approach at https://mcmap.net/q/1669968/-regex-pattern-and-or-nsregularexpression-a-bit-too-slow-searching-over-very-large-file-can-it-be-optimized: 0.337549 seconds
The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up a quarter of the original time so this was enormously helpful, thank you dasblinkenlight and Ωmega for talking it through with me.
Consider ^(sunday|monday|tuesday|...)(\t|\\().*$: you know that whatever comes in parentheses is a single character followed by a closing parenthesis, so you can skip that portion of the match. Bringing all your strings into a single OR block might help as well, but I am not sure if it's going to help much. – Maidenly
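For illustration, the pattern suggested in the comment above could be built along these lines. This is only a sketch, not one of the timed answers: the helper names PronunciationPattern and MatchingLines are made up here, and the escaping via escapedPatternForString: and the non-capturing (?:...) groups are assumptions added on top of the suggestion. No timing is claimed for it.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: builds ^(?:w1|w2|...)(?:\t|\().*$ from the word list.
static NSString *PronunciationPattern(NSArray *words) {
    NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[words count]];
    for (NSString *word in words) {
        // Assumption: escape regex metacharacters in case a word contains any.
        [escaped addObject:[NSRegularExpression escapedPatternForString:word]];
    }
    // A single OR block, then either a tab or an opening parenthesis.
    // The "(2)" marker itself is not matched explicitly; .* consumes it.
    return [NSString stringWithFormat:@"^(?:%@)(?:\\t|\\().*$",
            [escaped componentsJoinedByString:@"|"]];
}

// Hypothetical helper: returns every full line of contents that starts with one of the words.
static NSArray *MatchingLines(NSString *contents, NSArray *words) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:PronunciationPattern(words)
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Could not compile pattern: %@", error);
        return [NSArray array];
    }
    NSMutableArray *lines = [NSMutableArray array];
    [regex enumerateMatchesInString:contents
                            options:0
                              range:NSMakeRange(0, [contents length])
                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
        [lines addObject:[contents substringWithRange:result.range]];
    }];
    return lines;
}
```

Calling MatchingLines(string, array) with the file contents and the word array from the question should return the same kind of line list as the original method. One difference to keep in mind: this pattern accepts anything after the opening parenthesis, so it relies on the dictionary file being well-formed.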