Why my simple Ragel grammar use all memory and crash
Asked Answered
D

1

1

I am trying to convert a set of regular expression from Adblock Plus rules into an optimized function I could call from C++.

I was expecting to be able to use a lexer generator such as Ragel to do this but when I try with a very small set of Regex the memory usage go very high > 30 GB and Ragel quit without error message and without producing the output file.

I included the toy grammar bellow, I am trying to understand if I am doing anything stupid that could be optimized to solve the issue.

#include <string.h>
namespace xab{
 %%{
machine lexer;
WILDCARD = /[A-Za-z0-9;\/\?:@=&$_\.\+!\*'~#^,%:\-]/*;
SUBDOMAIN = /([A-Za-z]([A-Za-z0-9\-]*[A-Za-z0-9])?\.)+/;
SEPERATOR = /[:\/\?=&]/;
main := 
(WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD) |
(WILDCARD '.a3s?n=' WILDCARD '&zone_id=' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD ';adtech;' WILDCARD) |
(WILDCARD '/addyn|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adiframe|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/adserv|' WILDCARD '|adtech;' WILDCARD) |
(WILDCARD '/affiliates.' WILDCARD '.aspx?' WILDCARD) |
(WILDCARD '/affiliates/' WILDCARD '/show_banner.' WILDCARD) |
(WILDCARD '/banner_js.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannerframe.' WILDCARD '?' WILDCARD) |
(WILDCARD '/banners.' WILDCARD '&iframe=' WILDCARD) |
(WILDCARD '/bannerview.' WILDCARD '?' WILDCARD) |
(WILDCARD '/bannery/' WILDCARD '?banner=' WILDCARD) |
(WILDCARD '/cdn-cgi/pe/bag?r[]=' WILDCARD 'cpalead.com' WILDCARD) |
(WILDCARD '/delivery/' WILDCARD '?advplaces=' WILDCARD) |
(WILDCARD '/eas?camp=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';cre=' WILDCARD) |
(WILDCARD '/eas?cu=' WILDCARD ';ord=' WILDCARD) |
(WILDCARD '/ireel/ad' WILDCARD '.jpg' WILDCARD) |
(WILDCARD '/is.php?ipua_id=' WILDCARD '&search_id=' WILDCARD);
write data; 
}%%
bool matchBlacklist(const char *data) {
const char *p = data;

const char *pe = data + strlen(data);
int cs;
//write init
%% write init; 
// write exec
%% write exec;
if (cs >= lexer_first_final)
return true;
return false;
}
}
Deboer answered 7/4, 2014 at 23:59 Comment(0)
D
1

AFAIK, you're facing the "DFA space explosion".

DFA has to match all your rules in one pass over the string. To do that every state needs a transition 1) to the beginning of every rule and 2) to the middle of every interleaving rule.

Further, the WILDCARD might be creating "nondeterministic behaviour" because, for example, in the WILDCARD '&prvtof=' WILDCARD '&poru=' WILDCARD rule the WILDCARD will match &prvtof=. This, and the sheer amount of options in the WILDCARD might further explode the DFA.

In the Ragel 6.8 manual there are guidelines on how to simplify the DFA in the sections "2.5.5 Concatenation" and "4. Controlling nondeterminism".

To avoid the "DFA space explosion" you might want to kind of "deoptimize" the Ragel machine by using scanners, thus selectively switching from a "stateless" DFA to backtracking. And you might want to reduce nondeterminism by using the strong difference operator. And you might want to simplify the WILDCARD, replacing it with any.

action matched {return true;}
main := |*
  '&prvtof=' (any* -- '&poru=') '&poru=' => matched;
  '.a3s?n=' (any* -- '&zone_id=') '&zone_id=' => matched;
  any;
*|
Delp answered 28/4, 2014 at 15:16 Comment(2)
Thanks a lot, I will give this a try! How much flower should I expect Scanner with backtracking to be?Deboer
IMO it will be faster, because the rules will have much fewer transitions meaning less branches for CPU to execute and better cache utilisation. And the rules are still DFA, it's only the scanner that backtracs.Delp

© 2022 - 2024 — McMap. All rights reserved.