PLY: quickly parsing long lists of items?

Asked 20/6, 2011 at 20:1 Answered 9/8, 2019 at 14:31

I'm working with a fairly simple parser in PLY, and one of my rules takes on the following form:

def p_things(p):
    '''
    things : thing things
    things : thing
    '''
    p[0] = [p[1]]
    if len(p) == 3:
        p[0] += p[2]

Input files are generally simple lists of things, and so the parsing itself is not complex. Some of my input files are very large, however (exceeding 100,000 lines fairly regularly, and over 1,000,000 in extreme cases). In profiling (via cProfile and pstats), the bulk of the runtime is taken up by repeated calls to p_things - presumably, one call for each item in a things list.

Is there a way to reduce this time, or a more efficient way to structure this rule? Most answers I've seen so far (and the canonical compilers info I've found) have listed this method as the generally accepted way to construct a list of parseable items, no matter the length.

Slake answered 20/6, 2011 at 20:1 Comment(0)

Turns out I'm forgetting some of my basic compilers theory. PLY is a LALR(1) parser, and so it's better to write the rule as:

def p_things(p):
    '''
    things : things thing
    things : thing
    '''
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1]
        p[0].append(p[2])

Though it may look more verbose, there's actually a significant improvement - somewhere in either PLY or Python, the parser was able to apply some optimization on the left-recursive form. I've seen performance drop from exponential to linear on my larger input files; one sample, with over a million items in the things list, ran in under 20% of the time.

Slake answered 21/6, 2011 at 18:21 Comment(3)

This isn't a parsing issue per se -- in your original method p[0] += p[2] runs linearly to the length of the list every time it adds a new thing, so parsing a list takes quadratic time, whereas in your new method, adding a new element p[0].append(p[2]) runs amortized constant time, so overall, parsing the list takes linear time. – Lambent 26/4, 2014 at 18:50

Secondary: split your rule into two, say one suffixed with _iter and the other with _end. As PLY's docs already recommend, the selection statement just duplicates the work already done by the parser itself to distinguish between the two cases in the production rule. This will save some computation, but is unrelated to observing an improvement in efficiency. – Epizootic 27/3, 2015 at 9:52

If you have a data file with a complex header requiring PLY, and a huge tail of very simple data, then consider splitting the input to two pieces. Pass the first (small) to PLY, and handle the second in a simple iteration per line, that splits each line lexically (e.g., if its a tuple of 3 coordinates). – Epizootic 27/3, 2015 at 9:54

Without changing your code, you could try using the "PyPy" version of python with just-in-time compilation -- it might run your code way faster than normal CPython.

Antecedent answered 20/6, 2011 at 20:39 Comment(2)

If "the bulk of time" is spent in that function, it's propably more of a complexity issue, which no language implementation can solve (exceptions for very smart analysers looking at very simple cases). – Adali 20/6, 2011 at 20:56

good suggestion, but unfortunately I'm running into some compatibility problems with PyPy (PLY is, in this case, embedded pretty deep into some other packages that like CPython perhaps more than they should). Also, @delnan probably has a point - this strikes me as more of a runtime complexity issue than a compilation one. – Slake 20/6, 2011 at 20:58

So to summarize:

The performance issue was not related to the parser per se (but the usage of +=)
Make things easier for LALR(1) parsers by making RHSs unambiguous.
It's better to avoid unnecessary selection statements (ifs) within a rule by splitting it up if possible.

To better understand Ioannis Filippidis' comment it's easier to actually visualize it. This is what I think was meant and approximately what I ended up with as well.

def p_things_iter(p):
    '''
    things_iter : things thing
    '''
        p[0] = p[1]
        p[0].append(p[2])

def p_things_end(p):
    '''
    things_iter : thing
    '''
    p[0] = [p[1]]

def p_things(p):
    '''
    things : things_iter
    things : things_end
    '''
    p[0] = p[1]

Monster answered 9/8, 2019 at 14:31 Comment(0)

Recommended topics

Hot tags