How to speed up parallel parsing using Raku grammars?

Parsing a few hundred files with my grammar using a plain

for @files -> $file {
    my $input = $file.IO.slurp;
    my $output = parse-and-convert($input);
    $out-dir.IO.add($file ~ '.out').spurt: $output;
}

loop is relatively slow and takes ~20 seconds on my machine, so I've decided to speed this up by doing this instead:

my @promises;
for @files -> $file {
    my $input = $file.IO.slurp;
    @promises.push: start parse-and-convert($input);
}

# await returns the results in the order the promises were created,
# so zipping with @files pairs each output with its source file
# ($file from the first loop is out of scope here).
my @outputs = await @promises;
for @files Z @outputs -> ($file, $output) {
    $out-dir.IO.add($file ~ '.out').spurt: $output;
}

This works (at least in my real code, i.e. modulo any typos in this illustrative example), but the speedup is much less than I hoped for: it now takes ~11s, i.e. I've gained a factor of only two. This is appreciable, of course, but it looks like there is a lot of contention, because the program uses fewer than 6 CPUs (on a system with 16 of them), and quite a bit of overhead (because I don't get a factor of 6 speedup either).

I've confirmed (by inserting some say now - INIT now statements) that almost all the running time is really spent inside await, as expected, but I have no idea how I could debug/profile it further. I'm doing this under Linux, so I could use perf, but I'm not sure how it would help me at the Raku level.
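
For reference, a minimal sketch of the kind of timing instrumentation meant above, applied to the promise-based version (the variable names here are made up):

my $program-start = INIT now;   # captured once, at program startup
my $before = now;
my @outputs = await @promises;
say "await took {now - $before}s of {now - $program-start}s total";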

Would there be some simple way to improve the degree of parallelism here?

Edit: Just to make it clear, I can live with a 20s (well, 30s by now, as I've added more things) running time; I'm really mostly curious whether the degree of parallelism could somehow be improved here without rewriting the grammar (unless there is something very specific, e.g. the use of dynamic variables, that should be avoided when using multiple threads).
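
As an aside on that last point, here is a hedged sketch of how per-parse dynamic state can be kept thread-safe (MyGrammar and $*DEPTH are made-up names): dynamic variables are resolved through the dynamic scope, so declaring them inside the routine that each start block runs gives every task its own copy.

sub parse-and-convert($input) {
    # Fresh on every call, so concurrent parses cannot interfere;
    # a $*DEPTH shared at file scope would not have this property.
    my $*DEPTH = 0;
    MyGrammar.parse($input);   # MyGrammar is a stand-in name
}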

Castiron answered 13/8, 2022 at 14:56 Comment(2)
Answers might be delayed today: conf.raku.org. – Imparisyllabic
I don't know the answer, but I liked the clean Raku style 😀 – Alic

A question, and a suggestion:

Does your Grammar parse entire documents, or only portions of those documents (sections, paragraphs, lines, etc.)?

If your Grammar only parses at the paragraph or line level, then you might be spending a lot of time slurping your files in. The advantage of Raku's lines routine is that it reads lazily. To replicate (and replace) slurp in the second line of your code, you could try something like:

my $input = $file.IO.lines.join("\n");
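
And if line-by-line parsing really is an option, you could skip building the full string entirely and stay lazy throughout (MyLineGrammar is a made-up name):

for $file.IO.lines -> $line {
    MyLineGrammar.parse($line);   # one small parse per line
}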

Otherwise, if your Grammar parses at the paragraph level, you can use the power of arrays in Raku (note below the assignment to @input instead of $input). You can also use >> (hyper) to process array elements, which can provide a speedup because, as the docs say, "...all hyper operators are candidates for parallelism...":

my @input = $file.IO.slurp.split("\n\n");
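
A hedged sketch of the hyper approach, assuming a hypothetical parse-paragraph routine that converts a single paragraph:

my @input  = $file.IO.slurp.split("\n\n");
my @output = @input>>.&parse-paragraph;   # may be parallelized by Rakudo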

If you have complex paragraph (pre-)parsing to do, check out the Text::Paragraphs submodule of @Codesections' _ ("lowbar") module:

https://github.com/codesections/_/blob/main/lib/Text/Paragraphs/README.md

In any case, it seems your best opportunity for speedup is reducing the 'impedance mismatch', i.e. making sure that the chunk size you're feeding to your Grammar matches the size the Grammar expects (rather than creating a file-read bottleneck before the Grammar even executes).

HTH.

Imparisyllabic answered 13/8, 2022 at 16:48 Comment(5)
Thanks, but it does parse entire documents, and slurping them is not a problem -- it takes ~0.05s for all files. As I wrote, the time is really spent inside parsing, not on IO (output is quite a bit slower than input, but still <1s). – Castiron
I see now that the time your code spends in await is due to Grammar processing. I think we'd need to see some of your Grammar to assist. FYI, I recall a member of our Raku Meetup having very serious parsing delays with a Grammar that used (possibly as few as two?) regex atoms, each with the * zero-or-more quantifier. So you might want to re-order your parse tree to eliminate this potential bottleneck. HTH. – Imparisyllabic
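
For illustration, the general shape of that problem is nested zero-or-more quantifiers in a backtracking (non-ratcheting) regex; this toy example is not from that Meetup:

# A `regex` (unlike `token`) backtracks, so nested * quantifiers can
# explore exponentially many splits of the input before failing:
my regex slow { ^ ['a'*]* 'b' $ }
say 'a' x 25 ~~ /<slow>/;   # no match, but can take a very long time
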
Yes, I'm pretty sure the grammar could be optimised, but it's difficult to do it (because it's very simple to break parsing of one of the myriads of special cases while changing it) and 20s is not bad enough to spend hours on this. I hoped to get some performance win for "free" -- and I did, but just not as much as I hoped for. – Castiron
Maybe you can break down a problematic token/rule to a simpler one, or abstract it into a sub-Grammar. Below is the commit from the Raku Meetup I referred to (problematic Grammar, and resolution): github.com/doomvox/raku-study/commit/… – Imparisyllabic
Thanks, I already try to avoid tokens that can match nothing; I know it's a bad idea (which doesn't mean I didn't miss some of them accidentally, of course, as my grammar is ~500 lines long). This question is more about Raku's parallelism mechanisms than about grammars, however. – Castiron

If you don't care about the order in which things occur, you can use race on any Iterable (in this case, your @files). By default this will create work for (CPU cores - 1) threads, handing each thread a batch of 64 items to process at a time.

Since grammar parsing is a notoriously expensive process, it's probably wise to let each thread handle one file at a time. You can specify that with the batch argument.

Relatedly, Intel processors with hyper-threading typically report twice as many CPUs as there are physical cores, and the extra hardware threads rarely help with compute-heavy workloads such as grammar parsing. So you might want to play with the degree argument (which indicates the maximum number of threads to be used) as well.
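
To check what CPU count Raku sees (on a hyper-threaded machine this counts hardware threads, not physical cores):

say $*KERNEL.cpu-cores;   # e.g. 16 on an 8-core / 16-thread CPU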

So your code becomes:

for @files.race(batch => 1, degree => 8) -> $file {
    my $input = $file.IO.slurp;
    my $output = parse-and-convert($input);
    $out-dir.IO.add($file ~ '.out').spurt: $output;
}

Note that the only thing you needed to add to your original code was: .race(batch => 1, degree => 8)

Grison answered 14/8, 2022 at 11:34 Comment(7)
Thanks, I wanted to use race() initially (exactly because it keeps the code structure simple), but I couldn't get any speedup with it. I thought I was just using it wrong, but trying exactly the same code you show (with degree 16, 8 or 4 -- it doesn't change anything), I still get only 100% CPU use and a slightly higher time than with a simple non-parallel for. Any idea why? This is with Raku v2022.04 under Debian Bookworm. – Castiron
More testing using ./race.raku & repeat 5 { ps -o thcount $!; sleep 1 } (a zsh-ism) shows that my Raku never creates more than 4 threads in total, and never actually uses more than one for running my code, whatever parameters I pass to race(). The test script is just sleep 0.01 for (1..512).race(batch => 1, degree => 16), and it always takes ~5.2s to run here. – Castiron
Final comment: with my original version, using start/await, I get the best performance when using my $*SCHEDULER = ThreadPoolScheduler.new(max_threads => N); with N roughly between 4 and 16, as by default too many threads are created, resulting in too much contention. I still get "only" ~800% CPU usage with N=16 -- and roughly the same time with N=4 at ~350% usage. – Castiron
Did you also play with the batch parameter? It could be that the default of 64 is actually too low :-) – Grison
I can't make it use more than a single thread (I think Raku creates 3 helper threads internally as soon as any threads are used). Could you please check if raku -e 'sleep 0.01 for (1..512).race(batch => 16, degree => 16)' either finishes in less than 5s or at least creates more than 4 threads for you? – Castiron
That's not adding any load, as sleep relinquishes the thread, I think – Grison
I can't make the version with race() work faster than the version without it no matter what I do inside the loop; e.g. I also tried $x += [*] 1..1000, and it still takes (slightly) longer with race. Either something is badly broken in my Raku version or I'm totally missing something here. – Castiron
