Stack overflow in Prolog DCG grammar rule: how to handle large lists efficiently or lazily

Asked 17/10, 2012 at 17:19 Answered 17/10, 2012 at 20:10

I'm parsing a fairly simple file format consisting of a series of lines, each line having some space separated fields, that looks like this:

l 0x9823 1
s 0x1111 3
l 0x1111 12
⋮

I'm using SWI-Prolog. This is the DCG I have so far:

:- consult(library(pure_input)).

load_trace(Filename, Traces) :-
    phrase_from_file(trace_file_phrase(Traces), Filename).

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) --> trace_phrase(T), trace_file_phrase(Ts).

trace_phrase(access(Type, Address, SinceLast)) -->
    access_type(Type), space,
    address(Address),  space,
    nat(SinceLast),    newline.

access_type(load)  --> "l".
access_type(store) --> "s".

address(Number) --> "0x", hexnum(Number).

hexdigit(N)  --> digit(N).
hexdigit(10) --> "a". hexdigit(11) --> "b". hexdigit(12) --> "c".
hexdigit(13) --> "d". hexdigit(14) --> "e". hexdigit(15) --> "f".
hexnum(N) --> hexdigit(D), hexnum(D, N).
hexnum(N, N) --> [].
hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N).

newline --> "\n".
space --> " ".

%% the following two productions are courtesy of Lars Mans at
%% https://mcmap.net/q/183811/-parsing-numbers-with-multiple-digits-in-prolog
digit(0) --> "0". digit(1) --> "1". digit(2) --> "2".
digit(3) --> "3". digit(4) --> "4". digit(5) --> "5".
digit(6) --> "6". digit(7) --> "7". digit(8) --> "8".
digit(9) --> "9".

nat(N)   --> digit(D), nat(D,N).
nat(N,N) --> [].
nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).

As mentioned in the comment, I cribbed the number handling from Parsing numbers with multiple digits in Prolog.

The problem I'm running into is some of these files are large, like, on the order of 5-10 MB. The default stack in SWI-Prolog is insufficient for this, and parsing these files is taking substantial time, on the order of 5-15 seconds. I have several questions about this situation:

Where is the efficiency problem in this code? I think it's either in trace_file_phrase//1 or nat//1 but these are just hunches.
If the problem is lists, is there a better way to handle lists with DCGs than this?
How does one, in general, diagnose and treat performance problems with DCGs such as this?

Censure answered 17/10, 2012 at 17:19 Comment(0)

As a general remark you will find more on SO about it under the name library(pio). Also the way to use it cleanly is rather:

:- use_module(library(pio)).

Your example is way too complex, so I will only consider a slightly simpler case, a newline separated list of numbers:

nats([]) --> [].
nats([N|Ns]) --> nat(N), newline, nats(Ns).

So, how can you test this effectively? (That's your Question 3) The basic point of library(pio) is that you can use regular DCGs for file processing. But for testing in the small you can still use the simple phrase/2. So I do:

?- phrase(nats(Ns),"1\n").
   Ns = [1]
;  false.

Did you see the ; prompt? That means: Prolog was not able to decide whether or not further answers might be computed - so it leaves one or more choice-points open. And that only for a single digit You can imagine how things will pile up.

Let's dig deeper:

?- phrase(digit(D),"1").
   D = 1
;  false.

Again the evil ; false! In order to make this work, everything would have to be determinate. There are three ways to this:

Use cuts (and lose your soul)

I wish you luck - the best seems to be just after the repeating element:

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   trace_phrase(T),
   !, % ugly, but...
   trace_file_phrase(Ts).

(This should answer Question 1)

But, hold on a minute! What is so bad about this !? As long, as there is exactly one answer to trace_phrase//1 things are perfect. It is only, if there are more answers (or actually solutions), that the cut might remove precious answers. How do you know, if there are more solutions? Well, you don't. And you won't see them as they have been cut away already.

`call_semidet/1`

Here is a way to ensure that this does not happen. This works only for side-effect free goals that can be called twice without any effect:

call_semidet(Goal) :-
   (  call_nth(Goal, 2)
   -> throw(error(mode_error(semidet,Goal),_))
   ;  once(Goal)
   ).

This uses call_nth/2, as defined in another post. (As an optimization, the implementation could avoid calling Goal twice when there is no choice-point open...) Just to make clear, how it works:

?- phrase(nat(N),"1234").
   N = 1234
;  false.
?- call_semidet(phrase(nat(N),"1234")).
   N = 1234.
?- call_semidet((X=1;X=2)).
   error(mode_error(semidet, (2=1;2=2)), _).

So it makes your little grammar effectively determinate! There is thus no need to reformulate anything!

What is lacking now is some integration of this into the grammar. You can do this very low-level, or rather cleanly using library(lambda).

phrase_semidet(NT) -->
   call(S0^S^call_semidet(phrase(NT,S0,S))).

Note that in this very particular case we do not use the \ for renaming.

trace_file_phrase([]) --> [].
trace_file_phrase([T|Ts]) -->
   phrase_semidet(trace_phrase(T)),
   trace_file_phrase(Ts).

Exploit indexing

Finally, a very laborious but clean way would be to rewrite everything to profit better from indexing (and maybe help to improve indexing in general...) But this is a long road. Just to start with:

digit(D) --> [C],
   {c_digit(C,D)}.

c_digit(0'0,0).
c_digit(0'1,1).
c_digit(0'2,2).
c_digit(0'3,3).
c_digit(0'4,4).
c_digit(0'5,5).
c_digit(0'6,6).
c_digit(0'7,7).
c_digit(0'8,8).
c_digit(0'9,9).

This gives you now:

?- phrase(digit(D),"1").
   D = 1.

But you have another source of nondeterminism which is rather due to the way you define the grammar. In nat//2 you see it:

nat(N,N) --> [].
nat(A,N) --> digit(D), ... .

The first rule always applies, that is, "1234\n" will be parsed as "1" "12" "123" "1234" only the following newline//0 realizes that it would suffice go for the last — and then stick to it.

You can rewrite things for that, but then, the code is no longer the pure little spec you liked, isn't it? Well, maybe things might improve in the future.

E.g. indexing is much better in SWI than it used to be, maybe also things evolve here too....

The intention of library(pio) was to get this process started. Compare this to Haskell — we are far away from interact efficiency-wise! But there is no inherent cost:

... --> [] | [_], ... .

?- phrase_from_file((...,"searchstring",...),fichier).

is just as efficient as grep — space-wise. That is, it runs in constant space. So hopefully more code will run better in the future.

Edit: BTW, library(pio) did already have an impact efficiency-wise: GC phases were improved significantly, very much in the same manner as Wadler's Fixing some space leak – paper a quarter century ago. Things evolve ...

Edit almost 10 years later: a related answer.

Furlough answered 17/10, 2012 at 20:10 Comment(5)

I thought nat/2 looked a little obtuse. Can you expand a little on what I need to do to fix the problem there? I'm less worried about that rule becoming obtuse since it can be sequestered in a module elsewhere and not interfere with my "pure little spec"--if the infection won't spread. – Censure 17/10, 2012 at 20:22

@DanielLyons: Could you share the actual numbers? Then this would be also useful for other people. – Furlough 20/10, 2012 at 16:25

@DanielLyons: You wrote in your question 5-10MB took 5-15s and much more than default space. How is the situation now? – Furlough 21/10, 2012 at 10:26

@DanielLyons: you never answered this question, or did you? – Furlough 15/11, 2014 at 16:58

No, I didn't. I will try to find some time during my upcoming vacation to perform meaningful benchmarks. Thanks! – Censure 17/11, 2014 at 15:33

I've verified the stackoverflow on a 2Mb file. Then I rewrote the grammar using library(dcg/basics), and now is working.

:- [library(dcg/basics)].

load_trace_0(Filename, Ls) :-
    phrase_from_file(lines(Ls), Filename).

lines([s(H,I)|R]) -->
    "s 0x", xinteger(H), " ",
    integer(I), blanks,
    !, lines(R).
lines([l(H,I)|R]) -->
    "l 0x", xinteger(H), " ",
    integer(I), blanks,
    !, lines(R).
lines([]) --> [].

But then I tried to put the cut on your grammar, and is working as well. So the answer from @gusbro (+1) solves your problem.

Lizettelizotte answered 17/10, 2012 at 20:5 Comment(3)

Is dcg/basics contributing integer/3 and xinteger/3 in your code? I haven't heard about this before. – Censure 17/10, 2012 at 20:25

yes, it's handy. I routinely use it to parse mySQL backup files of about 10 MB – Lizettelizotte 17/10, 2012 at 20:33

Thanks for this, it has helped tremendously. I wish I could split accepting the answer three ways! – Censure 19/10, 2012 at 3:26

About the efficiency problem:

If your input is usually well formed, then I think you should swap the clauses of nat/4 and hexnum/4, so they would read:

nat(A,N) --> digit(D), { A1 is A*10 + D }, nat(A1, N).
nat(N,N) --> [].

hexnum(A, N) --> hexdigit(D), { A1 is A*16 + D }, hexnum(A1, N).
hexnum(N, N) --> [].

because you only want to stop parsing a number when there are no more digits to consume.

If used wisely, the cut (!) can help you performance-wise and also regarding the stack overflow because it prunes the prolog evaluation tree. For example, you may commit (!) at the end of trace_file_phrase/3 (that is, after the newline) because you don't need to reparse that part of the input again to find other solutions.

Holly answered 17/10, 2012 at 18:22 Comment(6)

I guess I'm just not seeing where I should place a cut to improve things. – Censure 17/10, 2012 at 18:29

@DanielLyons: for example, you can commit (!) at the end of trace_file_phrase/3 (that is, after the newline) because you don't need to reparse that part of the input again to find other solutions. – Holly 17/10, 2012 at 18:33

This will only work if you add a cut after digit//1, otherwise it is less efficient, because a choice-point is left for each recursion. – Furlough 17/10, 2012 at 20:24

@false: I already said he should use cuts, and even if he doesn't use cuts we should measure time to see which is more efficient. – Holly 18/10, 2012 at 1:0

@gusbro: Without a cut after digit//1 your definition will use one choice-point for each digit. That is forbidding. Daniel Lyons original definition was much better in this respect. Only one superficial choice-point for each number. – Furlough 18/10, 2012 at 13:45

Thanks for this, using the cut after the newline alone turned out to solve the basic problem. If I could I'd accept all three answers! – Censure 19/10, 2012 at 3:27

Use cuts (and lose your soul)

`call_semidet/1`

Exploit indexing

Recommended topics

Hot tags