Optimize String Parsing
Asked Answered
M

2

9

I have a requirement to parse data files in "txf" format. The files may contain more than 1000 entries. Since the format is well defined like JSON, I wanted to make a generic parser like JSON, which can serialise and deserialise txf files.

On contrary to JSON, the mark up doesn't have a way to identify an object or an array. If an entry with same tag occurs, we need to consider it as an array.

  1. # Marks the start of an object.
  2. $ Marks the members of an object
  3. / Marks the end of an object

Following is a sample "txf" file

#Employees
$LastUpdated=2015-02-01 14:01:00
#Employee
$Id=1
$Name=Employee 01
#Departments
$LastUpdated=2015-02-01 14:01:00
#Department
$Id=1
$Name=Department Name
/Department
/Departments
/Employee
#Employee
/Employee
/Employees

I was able to create a generic TXF Parser using NSScanner. But with more entries the performance needs more tweaking.

I wrote the foundation object obtained as plist and compared its performance again the parser I wrote. My parser is around 10 times slower than plist parser.

While plist file size is 5 times more than txf and has more markup characters, I feel that there is a lot of room for optimization.

Any help in that direction is highly appreciated.

EDIT : Including the parsing code

static NSString *const kArray    = @"TXFArray";
static NSString *const kBodyText = @"TXFText";

@interface TXFParser ()

/*Temporary variable to hold values of an object*/
@property (nonatomic, strong) NSMutableDictionary *dict;

/*An array to hold the hierarchial data of all nodes encountered while parsing*/
@property (nonatomic, strong) NSMutableArray *stack;

@end

@implementation TXFParser

#pragma mark - Getters

- (NSMutableArray *)stack{
    if (!_stack) {
        _stack = [NSMutableArray new];
    }return _stack;
}

#pragma mark -

- (id)objectFromString:(NSString *)txfString{
    [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
        if ([string hasPrefix:@"#"]) {
            [self didStartParsingTag:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"$"]){
            [self didFindKeyValuePair:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"/"]){
            [self didEndParsingTag:[string substringFromIndex:1]];
        }else{
            //[self didFindBodyValue:string];
        }
    }]; return self.dict;
}

#pragma mark -

- (void)didStartParsingTag:(NSString *)tag{
    [self parserFoundObjectStartForKey:tag];
}

- (void)didFindKeyValuePair:(NSString *)tag{
    NSArray *components = [tag componentsSeparatedByString:@"="];
    NSString *key = [components firstObject];
    NSString *value = [components lastObject];

    if (key.length) {
        self.dict[key] = value?:@"";
    }
}

- (void)didFindBodyValue:(NSString *)bodyString{
    if (!bodyString.length) return;
    bodyString = [bodyString stringByTrimmingCharactersInSet:[NSCharacterSet illegalCharacterSet]];
    if (!bodyString.length) return;

    self.dict[kBodyText] = bodyString;
}

- (void)didEndParsingTag:(NSString *)tag{
    [self parserFoundObjectEndForKey:tag];
}

#pragma mark -

- (void)parserFoundObjectStartForKey:(NSString *)key{
    self.dict = [NSMutableDictionary new];
    [self.stack addObject:self.dict];
}

- (void)parserFoundObjectEndForKey:(NSString *)key{
    NSDictionary *dict = self.dict;

    //Remove the last value of stack
    [self.stack removeLastObject];

    //Load the previous object as dict
    self.dict = [self.stack lastObject];

    //The stack has contents, then we need to append objects
    if ([self.stack count]) {
        [self addObject:dict forKey:key];
    }else{
        //This is root object,wrap with key and assign output
        self.dict = (NSMutableDictionary *)[self wrapObject:dict withKey:key];
    }
}

#pragma mark - Add Objects after finding end tag

- (void)addObject:(id)dict forKey:(NSString *)key{
    //If there is no value, bailout
    if (!dict) return;

    //Check if the dict already has a value for key array.
    NSMutableArray *array =  self.dict[kArray];

    //If array key is not found look for another object with same key
    if (array) {
        //Array found add current object after wrapping with key
        NSDictionary *currentDict = [self wrapObject:dict withKey:key];
        [array addObject:currentDict];
    }else{
        id prevObj = self.dict[key];
        if (prevObj) {
            /*
             There is a prev value for the same key. That means we need to wrap that object in a collection.
             1. Remove the object from dictionary,
             2. Wrap it with its key
             3. Add the prev and current value to array
             4. Save the array back to dict
             */
            [self.dict removeObjectForKey:key];
            NSDictionary *prevDict = [self wrapObject:prevObj withKey:key];
            NSDictionary *currentDict = [self wrapObject:dict withKey:key];
            self.dict[kArray] = [@[prevDict,currentDict] mutableCopy];

        }else{
            //Simply add object to dict
            self.dict[key] = dict;
        }
    }
}

/*Wraps Object with a key for the serializer to generate txf tag*/
- (NSDictionary *)wrapObject:(id)obj withKey:(NSString *)key{
    if (!key ||!obj) {
        return @{};
    }
    return @{key:obj};
}

EDIT 2:

A sample TXF file with more than 1000 entries.

Mcginty answered 1/2, 2015 at 8:9 Comment(6)
It's going to be hard to help you tweak your parser if you don't show us the code, or at least more hints on where you seem to be losing time...Insolvency
@Insolvency I have shared the whole project link, reposting the same link github.com/Anupdas/AMTXFParserMcginty
Run your code with Instruments and check where it's spending most of its time. In general using Core Foundation C objects can help you speed up your stuff a lot. E.g. try using CFDictionary and CFStringRef instead of NSDisctionary and NSString in performance sensitive places.Evangelista
Could you share a large TXF file with more sample data?Spirant
@Evangelista Thanks for your suggestion. From the benchmarks I could see that there is a significant different in access and creation speeds between CF and foundation variants. I'm working on them nowMcginty
@AlexNauda Thanks, I have added a TXF file which i use for benchmarking.Mcginty
P
8

Have you considered using pull-style reads & recursive processing? That would eliminate reading the whole file into memory and also eliminate managing some own stack to keep track how deep you're parsing.

Below an example in Swift. The example works with your sample "txf", but not with the dropbox version; some of your "members" span over multiple lines. If this is a requirement, it can easily be implemented into switch/case "$" section. However, I don't see your own code handling this either. Also, the example doesn't follow the correct Swift error handling yet (the parse method would need an additional NSError parameter)

import Foundation

extension String
{
    public func indexOfCharacter(char: Character) -> Int? {
        if let idx = find(self, char) {
            return distance(self.startIndex, idx)
        }
        return nil
    }

    func substringToIndex(index:Int) -> String {
        return self.substringToIndex(advance(self.startIndex, index))
    }
    func substringFromIndex(index:Int) -> String {
        return self.substringFromIndex(advance(self.startIndex, index))
    }
}


func parse(aStreamReader:StreamReader, parentTagName:String) -> Dictionary<String,AnyObject> {
    var dict = Dictionary<String,AnyObject>()

    while let line = aStreamReader.nextLine() {

        let firstChar = first(line)
        let theRest = dropFirst(line)

        switch firstChar! {
        case "$":
            if let idx = theRest.indexOfCharacter("=") {
                let key = theRest.substringToIndex(idx)
                let value = theRest.substringFromIndex(idx+1)

                dict[key] = value
            } else {
                println("no = sign")
            }
        case "#":
            let subDict = parse(aStreamReader,theRest)

            var list = dict[theRest] as? [Dictionary<String,AnyObject>]
            if list == nil {
                dict[theRest] = [subDict]
            } else {
                list!.append(subDict)
            }
        case "/":
            if theRest != parentTagName {
                println("mismatch... [\(theRest)] != [\(parentTagName)]")
            } else {
                return dict
            }
        default:
            println("mismatch... [\(line)]")
        }
    }

    println("shouldn't be here...")
    return dict
}


var data : Dictionary<String,AnyObject>?

if let aStreamReader = StreamReader(path: "/Users/taoufik/Desktop/QuickParser/QuickParser/file.txf") {

    if var line = aStreamReader.nextLine() {
        let tagName = line.substringFromIndex(advance(line.startIndex, 1))
        data = parse(aStreamReader, tagName)
    }

    aStreamReader.close()
}

println(JSON(data!))

And the StreamReader was borrowed from https://mcmap.net/q/88889/-read-a-file-url-line-by-line-in-swift

Edit

Edit 2

I rewrote the above in C++11 and got it to run in less than 0.05 seconds (release mode) on a 2012 MBA I5 using the updated file on dropbox. I suspect NSDictionary and NSArray must have some penalty. The code below can be compiled into an objective-c project (file needs have extension .mm):

#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <map>
#include <vector>

using namespace std;


class benchmark {

private:
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point start;

public:
    benchmark(bool startCounting = true) {
        if(startCounting)
            start = clock::now();
    }

    void reset() {
        start = clock::now();
    }

    double elapsed() {
        milliseconds ms = std::chrono::duration_cast<milliseconds>(clock::now() - start);
        double elapsed_secs = ms.count() / 1000.0;
        return elapsed_secs;
    }
};

struct obj {
    map<string,string> properties;
    map<string,vector<obj>> subObjects;
};

obj parse(ifstream& stream, string& parentTagName) {
    obj obj;
    string line;
    while (getline(stream, line))
    {
        auto firstChar = line[0];
        auto rest = line.substr(1);

        switch (firstChar) {
            case '$': {
                auto idx = rest.find_first_of('=');

                if (idx == -1) {
                    ostringstream o;
                    o << "no = sign: " << line;
                    throw o.str();
                }
                auto key = rest.substr(0,idx);
                auto value = rest.substr(idx+1);
                obj.properties[key] = value;
                break;
            }
            case '#': {
                auto subObj = parse(stream, rest);
                obj.subObjects[rest].push_back(subObj);
                break;
            }
            case '/':
                if(rest != parentTagName) {
                    ostringstream o;
                    o << "mismatch end of object " << rest << " != " << parentTagName;
                    throw o.str();
                } else {
                    return obj;
                }
                break;
            default:
                ostringstream o;
                o << "mismatch line " << line;
                throw o.str();
                break;
        }

    }

    throw "I don't know why I'm here. Probably because the file is missing an end of object marker";
}


void visualise(obj& obj, int indent = 0) {
    for(auto& property : obj.properties) {
        cout << string(indent, '\t') << property.first << " = " << property.second << endl;
    }

    for(auto& subObjects : obj.subObjects) {
        for(auto& subObject : subObjects.second) {
            cout << string(indent, '\t') << subObjects.first << ": " << endl;
            visualise(subObject, indent + 1);
        }
    }
}

int main(int argc, const char * argv[]) {
    try {
        obj result;

        benchmark b;
        ifstream stream("/Users/taoufik/Desktop/QuickParser/QuickParser/Members.txf");
        string line;
        if (getline(stream, line))
        {
            string tagName = line.substr(1);
            result = parse(stream, tagName);
        }

        cout << "elapsed " << b.elapsed() <<  " ms" << endl;

        visualise(result);

    }catch(string s) {
        cout << "error " << s;
    }

    return 0;
}

Edit 3

See link for full code C++: https://github.com/tofi9/TxfParser

Papp answered 4/2, 2015 at 7:44 Comment(8)
Thanks, code looks so much promising with this method. I'm not yet comfortable in Swift. Could you please share the source code, so that I can run some bench mark tests?Mcginty
See github.com/tofi9/QuickParser. It should be fairly easy to implement something similar in objective-c, see link above for objective-c class similar to StreamReaderPapp
Thanks for the source code. I used some big txf files, surprisingly this method did not provide any optimization over my original work. I have updated the dropbox file, which took around 7 seconds to parse on my i7 Macbook Pro. I removed the stray characters so that your code would run without exception.Mcginty
I have updated the same dropbox file. I tried your recommended parsing method, but its slower than my original method.Mcginty
does it need to be in objective-c? rewrote the above in c++ (compilable in your objective-c project) and it runs in less 0.05 seconds on my 2012 MBA i5Papp
I don't mind using C++, infact many have suggested to use C or C++ to improve performance. But I couldn't find a way to include your code and do some testing. If I'm not asking too much can you include this code in an iOS project and share the code ?Mcginty
Thanks for sharing the source, I could see your cpp based parser is 40 % faster than the optimized parser that I have now. But is it possible to convert the parsed object to JSON so that I can use JSOM mappers to form models.Mcginty
Let us continue this discussion in chat.Papp
P
3

I did some work on your github source - with following 2 changes I got overal improvement of 30% though the major improvement is from "Optimisation 1"

Optimisation 1 - based on your data came with with following work.

+ (int)locate:(NSString*)inString check:(unichar) identifier
{
    int ret = -1;
    for (int i = 0 ; i < inString.length; i++){
        if (identifier == [inString characterAtIndex:i]) {
            ret = i;
            break;
        }

    }

    return ret;
}

- (void)didFindKeyValuePair:(NSString *)tag{
#if 0
    NSArray *components = [tag componentsSeparatedByString:@"="];
    NSString *key = [components firstObject];
    NSString *value = [components lastObject];
#else

    int locate = [TXFParser locate:tag check:'='];

    NSString *key = [tag substringToIndex:locate];
    NSString *value = [tag substringFromIndex:locate+1];

#endif
    if (key.length) {
        self.dict[key] = value?:@"";
    }
}

Optimisation 2:

- (id)objectFromString:(NSString *)txfString{
    [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
#if 0
        if ([string hasPrefix:@"#"]) {
            [self didStartParsingTag:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"$"]){
            [self didFindKeyValuePair:[string substringFromIndex:1]];
        }else if([string hasPrefix:@"/"]){
            [self didEndParsingTag:[string substringFromIndex:1]];
        }else{
            //[self didFindBodyValue:string];
        }
#else
        unichar identifier = ([string length]>0)?[string characterAtIndex:0]:0;
        if (identifier == '#') {
            [self didStartParsingTag:[string substringFromIndex:1]];
        }else if(identifier == '$'){
            [self didFindKeyValuePair:[string substringFromIndex:1]];
        }else if(identifier == '/'){
            [self didEndParsingTag:[string substringFromIndex:1]];
        }else{
            //[self didFindBodyValue:string];
        }

#endif
    }]; return self.dict;
}

Hope it helps you.

Pinnule answered 4/2, 2015 at 8:6 Comment(2)
Thanks for suggestions. By combining both options the parsing time has reduced by around 35 percent. Your method to find the index of character can be optimized by assigning the length of string to a variable instead of calculating it every loop. I'm evaluating a way to speed up by modifying the datastructure used. Do you have any suggestions on them ?Mcginty
datastructure looks good but I may try by doing without stack by representing data tree in NSMutableDictionary - not sure how better it will be with its performance, according to me it will avoid you creating Dictionary from stack.Pinnule

© 2022 - 2024 — McMap. All rights reserved.