Code Golf: Regex parser
Asked Answered
W

7

53

The goal

Today's Code Golf challenge is to create a regex parser in as few characters as possible.

The syntax

No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)

Here's all you need to know about regex syntax for this challenge:

  • A term is defined as a single literal character, or a regular expression within grouping parentheses ().
  • The * (asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together.
  • The + (plus) character represents a convenient shortcut: a+ is equivalent to aa*, meaning one or more of the previous term.
  • The ? (question mark) character represents zero or one of the previous term.
  • The | (pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match.
  • All other characters are assumed to be literal. You may assume that all other characters are within [0-9A-Za-z] (i.e., all English alphanumerics).

Or, put another way: */+/? have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. * and + and ?, on the other hand, would just apply to the immediately preceding term.

The challenge

Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.

I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.

Output should be true or false, one per line, to reflect whether or not the regex matches.

Notes:

  • I shouldn't need to say this... but don't use any regex libraries in your language! You need to compile or interpret the pattern yourself. (Edit: You may use regex if it's required for splitting or joining strings. You just can't use it to directly solve the problem, e.g., converting the input regex into a language regex and using that.)
  • The regular expression must COMPLETELY match the input string for this challenge. (Equivalently, if you're familiar with Perl-like regex, assume that start- and end-of-string anchoring is in place for all matches)
  • For this challenge, all of the special characters ()*+?| are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question.
  • Input strings to test should be evaluated in a case-sensitive manner.

The examples

For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex here represents your invocation of the program.

> myregex easy easy Easy hard
true
false
false

> myregex ab*a aa abba abab b
true
true
false
false

> myregex 0*1|10 1 10 0110 00001
true
true
false
true

> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true

> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true

NOTE: Sorry, forgot to make community wiki! :-(

Witchcraft answered 19/8, 2010 at 15:22 Comment(26)
This is rather an interpreter than just a parser.Doublure
Are we allowed to use our language's built-in regex libraries?Erivan
Yes... I guess I meant a parser for the input, using the regular expression. Yes, you could argue that's more of an interpreter... You can edit the title and question if it's that big a deal to you. I can see your point. :-)Witchcraft
This is a pretty well thought out golf; I'll see about giving it a go with parser combinators after work ;)Bunsen
Platinum Azure: I know we can't use the regex libraries to execute the input regexes, but in some languages it's hard to avoid invoking regex libraries when parsing input strings. For example, Perl and Java only have Split and Replace implemented for regexes.Erivan
If it were up to me, I'd say using built in regex (or whatever) facilities to parse the input is allowed, but passing the regex through to it for "execution" is not.Amytal
@Gabe, @Jerry Coffin: Okay, very good clarification. Yes, that's more than fair, I agree. As long as they're not used for actually testing the input strings, then it's okay.Witchcraft
Voters for close... Any suggestions on how I can make it seem more like a "real question" to you? Yes, you need to read the whole thing, but this is most definitely a question: Who can write a regex parser/interpreter in the shortest number of keystrokes? Pretty simple, really, even if it's hard to answer. :-)Witchcraft
To make it a good code golf, there should be no solution available in any other language built in. Even if the rules "forbid" it, it's still bad. Questions like "math solver" or "regex parser" and the like just encourage 'cheating' (which is what code golf is mostly about).Hogg
I'm voting to close (and I normally don't vote to close Code-Golf questions) because this is a bit of a 'too localized' question. You're going to get an answer from the subset of people that know Code-golf and want to spend the time writing a regex machine in code-golf. Besides painful, it's not very educational. It'd be better to write one *not in code-golf-ese, so others can learn from it.Fritter
@LiraNuna: It's not particularly bad from what I can see, it's just quite a hefty amount of work. From a theoretical perspective it's trivial as transforming above subset of regular expressions into a DFA isn't exactly hard. The challenge is probably coming up with a compact algorithm for doing so as well as a clever way of representing that automaton. Still, it takes quite a while to write that and even more to golf it, that's why I won't pursue it myself either.Canadian
Tough [code-golf], but well specified and not beyond reason.Aloes
@Platinum: Could you please explain which rule in your regexp makes a?b+|(a+b|b+a?)+ not match aabba? (and please join your own game!)Bellamy
@George Stocker: i beg to differ - writing regex matcher is interesting and educational. even if one writes one in 1000+ chars, that will still be a win in codegolf... for there is no working one posted yet :)Hospitable
@mvds: Cripes! I'm sorry, I changed the regex subtly right before posting and I guess that slipped through the cracks. Also, I'm going to post one eventually, don't worry! :-)Witchcraft
@dmckee, @Joey, @Nas Banov: Thanks for your support! I guess this one really is tough... The parentheses are the worst, now that I think on it. It's hard to golf when you pretty much NEED a stack, so I'm tempted to lighten the specifications to avoid the parentheses issue. Any thoughts?Witchcraft
@Platinum: I think most languages come with a stack built-in, so this makes for a nice excercise in recursion. The parentheses, in conjunction with the alternatives of | are by far the hardest part. @others: I agree that this is not the level of the average golf but I think it is way more interesting than rewriting "cat" or "hello world" in hacky ways.Bellamy
@Platinum - keep the parenthesis, that makes it fun. One can either use recursion or stack (trivial to do with arrays, lists etc)Hospitable
aargh - I'm officially quitting without an entry. I have a ruby version which reduces to ~900 chars without serious golfing, but I can't get it to pass on the form (a|b)*c, and I've run out of any ideas other than a complete rewrite.Telson
@Ashelly: I can understand your frustration. I'm sorry for making you frustrated with this challenge. I'm having similar trouble myself right now, actually. :-( Thanks for trying though!Witchcraft
guys... where's the spirit? (yes it's terrible but so rewarding when it finally works!)Bellamy
I learnt a lot from even just attempting to write a regex parser for this.Dim
I couldn't get the problem out of my head... I'm now officially unquitting: see entry below.Telson
@Platinum dang, I was looking forward to some competition, because I know my approach wasn't the best. But you bowed out, and mobrule never showed up :)Extremist
this is my attempt lowcoupling.com/post/71143658461/…Magneton
Since codegolf.stackexchange.com exists now, this should be moved there.Taxidermy
G
15

GolfScript - 254 chars

n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%

Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.

Sun Aug 22 00:58:24 EST 2010 (271→266) changed variable names to remove spaces
Sun Aug 22 01:07:11 EST 2010 (266→265) made [] a variable
Sun Aug 22 07:05:50 EST 2010 (265→259) made null state transitions inline
Sun Aug 22 07:19:21 EST 2010 (259→256) final state made implicit
Mon Feb 7 19:24:19 EST 2011 (256→254) using "()""str"*

$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false

$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true

$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true

$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true

$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true
Goldeneye answered 19/8, 2010 at 15:22 Comment(1)
good lord! My programming skills suddenly look pitifully small and worthless! =S and you got it to 2^8 chars!Hypnotist
B
20

C (interpreting), 630 622 617 615 582 576 573 572 558 554 544 538 529 504 502 500 499 chars

This understands parentheses, and works on the regexp live (i.e. not parsed first)

#define C char
#define M m(s,o
m(C*s,C*o,C*S,C*p,C*P,C T){C*n=P-1,*q=s,h=*P==41,c=1;for(;h*c;c-=*n--==40)c+=*n==41;
c=*P-42;c=p-P?c-82?T&&c&~1&&c-21?h?2:*S==*P&s<S?M,S-1,p,n,2)||(T&4&&M,S-1,p,P,T|1)):
4:M,T?S:o,p,P-1,T?c&~1?3:7-c:0):T&&s==S||M,o,p,P-1,2):T&&s==S;if(c==2)for(c=4;q<=S;q
++)c|=m(q,S,S,n+h,P-h,2)?M,q,p,n,2)||q<S&T/4&&M,q,p,P,T&6):0;return
c&4?c&1?:T&1&&M,S,p,n,2)||M,o,p,n,0):c;}main(C*w,C**v){C*u;for(w=*++v;*++v;)puts(m(*v
-1,u,u=index(*v,0)-1,w-1,index(w,0)-1,2)?"true":"false");}

compiling with -Wall complains about argc being char. Will crash on illegal patterns.

This parses regexp and string from right to left, saving a few chars.

input on argv, output in reverse normal order:

$ ./a.out ab*a aa abba abab b
true
true
false
false

$ ./a.out "0*1|10" 1 10 0110 00001
true
true
false
true

$ ./a.out "0*(1|1+0)" 1 10 0110 00001
true
true
true
true

$ ./a.out "a?b+|(a+b|b+a?)+" abb babab aaa aabba a b
true
true
false
true
false
true

$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbb
false
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbbb
true
Bellamy answered 19/8, 2010 at 15:22 Comment(4)
Nice effort so far! I'd love to see the output order corrected-- it's only a few more characters to do that, no? But it looks very nice, and I like the alternative example you added. Also: (a) Don't worry about illegal patterns; you may assume that all test cases are valid. (b) You raise a good question about argv... I wouldn't trust it, personally :-)Witchcraft
I managed to even win a few bytes by freeing up argc which is now a pointer, thanks ;-)Bellamy
What did you use to compile this? GCC crashes for me, complains about an undefined reference to index twice.Dim
@Callum man index says: CONFORMING TO 4.3BSD; marked as LEGACY in POSIX.1-2001. but you may replace it with strchr.Bellamy
G
15

GolfScript - 254 chars

n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%

Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.

Sun Aug 22 00:58:24 EST 2010 (271→266) changed variable names to remove spaces
Sun Aug 22 01:07:11 EST 2010 (266→265) made [] a variable
Sun Aug 22 07:05:50 EST 2010 (265→259) made null state transitions inline
Sun Aug 22 07:19:21 EST 2010 (259→256) final state made implicit
Mon Feb 7 19:24:19 EST 2011 (256→254) using "()""str"*

$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false

$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true

$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true

$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true

$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true
Goldeneye answered 19/8, 2010 at 15:22 Comment(1)
good lord! My programming skills suddenly look pitifully small and worthless! =S and you got it to 2^8 chars!Hypnotist
C
7

Haskell: 610 612 627

main=getLine>>=f.words
d=reverse
u=0<1
j=[]
f(r:s)=mapM_(print.any null.c(d$b$'(':r++")"))s
c%(x,y)=(c:x,y)
s _ _ _[]=(j,j)
s n a b (x:y)|n<1&&x==a=(j,x:y)|x==a=f(-1)|x==b=f 1|u=f 0where f k=x%s(n+k)a b y
b r|m==j=r|u=b$d(o++"(("++x)++")("++z++")/)"++w where(c,m)=s 0'|''!'r;_:v=m;(o,_:x)=s 0'('')'$d c;(z,_:w)=s 0')''('v
(!)g f x=f x>>=g
c[]=(:j)
c r=f!c s where(s,f)=i r
p q@(r:s)|r=='('=(s,(:j))|u=(a,f!g)where(w,f)=i q;(a,g)=p w
_?[]=j
c?(h:r)|c==h=[r]|u=j
i(r:q)=maybe(q,(r?))id$lookup r$zip")/*+?"$p q:zip[e,w,w,w][\s->f s++g s,\s->s:l s,l,\s->s:f s]where(w,f)=i q;(e,g)=i w;l s|f s==j=j|u=f s++(f s>>=l)

Ungolfed:

import Control.Monad
import Data.List

-- (aa|bb|cc)   -->  )|)cc()|)bb()aa((( 

testRegex = "a?b+|(a+b|b+a?)+"

interpret regex = any null . interpret' regex

interpret' regex = compile (rewrite regex)

mapFst f (x, y) = (f x, y)

splitWhileBalanced _ _ _ "" = ("", "")
splitWhileBalanced n b1 b2 (x:xs)
  | n < 1 && x == b1 = ("", x:xs)
  | x == b1   = f (-1)
  | x == b2   = f 1
  | otherwise = f 0
  where
    f k = mapFst (x:) $ splitWhileBalanced (n+k) b1 b2 xs

rewrite regex = reverse $ moveBars $ '(' : regex ++ ")"

moveBars regex
  | mtBar == "" = regex
  | otherwise = moveBars $ reverse (hOpen ++ "((" ++ tOpen) ++ ")(" ++ hClose ++ ")/)" ++ tClose
  where
    (hBar, mtBar) = splitWhileBalanced 0 '|' '!' regex -- '!' is a dummy character
    b:tBar = mtBar
    (hOpen, o:tOpen) = splitWhileBalanced 0 '(' ')' $ reverse hBar
    (hClose, c:tClose) = splitWhileBalanced 0 ')' '(' $ tBar

compile "" = \x -> [x]
compile rs = f <=< compile rs'
  where 
    (rs', f) = compile' rs

paren regex@(r:rs0)
  | r == '('  = (rs0, \x -> [x])
  | otherwise = (rs2, f <=< g)
  where
    (rs1, f) = compile' regex
    (rs2, g) = paren rs1

compile' (r:rs0) = case r of
  '/' -> (rs2, bar)
  '*' -> (rs1, star)
  '+' -> (rs1, plus)
  '?' -> (rs1, mark)
  ')' -> paren rs0
  _   -> (rs0, lit)
  where
    (rs1, f) = compile' rs0
    (rs2, g) = compile' rs1
    bar str = f str ++ g str
    plus str
      | null (f str) = []
      | otherwise = f str ++ (f str >>= plus)
    star str = str : plus str
    mark str = str : f str
    lit = maybe [] (\x -> [x]) . stripPrefix [r]

Memo:

Modifies | to a suffix form and then reverses the entire regex. Now the operators are in prefix form, making it easy to parse. The program parses the regex like this. The list monad is used for nondeterminism.

Usage:

> runghc Regex.hs
a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
True
True
False
True
False
True
Cartagena answered 19/8, 2010 at 15:22 Comment(7)
+1 for reversing the problem, was thinking about that as well! (but can't you keep the strings as they are and reverse the matching only?)Bellamy
+1, very nice. In my Perl solution that I have in the works, I've moved [+*?|] into prefix form. I'm still working on getting [+*?] to work though, hence why I haven't posted it yet. :-)Witchcraft
@Bellamy good call... at first I thought it would be a bit more work, but all it was was changing >=> to <=< :D (-2 chars)Cartagena
@trinithis: I'm not into Haskell but why is there still a reverse keyword in there? For C the benefit of reversing the whole thing was (at least) 34 on 538.Bellamy
It's a function, not a keyword. In any case one reverse is for reversing the regex to make the symbols appear in prefix. The reverses in moveBars (ungolfed) are used to split a string from its end, rather than its beginning (one to match from the end, the second to undo the first reversal).Cartagena
I understand, but what I mean is: if you start comparing the string and regexp from the end, and work towards the front, you don't need the long word reverse. But that may cost a little more somewhere else (like in C where I needed to subtract one from all pointers, which ended up as a net benefit since reversing in C is very expensive golf-wise)Bellamy
If I understand what you are saying, it doesn't work in Haskell because lists (strings) are singly linked. Thus one doesn't have access to the end of the list.Cartagena
B
7

C (parsing+matching) 727 670 chars

This is not fully golfed to the bare minimum yet, but I wanted to try and see if parsing the regexp first would make a difference to interpreting it live. It does, because it costs more, although both parsing and matching are easier to write/understand.

#define Q struct q
#define C char
#define R return
Q{Q*u,*n,*a;C c,m};Q*P(C*p,C*e){Q*r=calloc(99,1);C*n=p+1,c=1,w;if(p==e)R
r;if(*p==40){for(;c;)c+=(*n==40)-(*n++==41);r->u=P(p+1,n-1);}else
if(*p=='|'){r->a=P(p+1,e);R r;}else r->c=*p;if(n<e){if(*n==43)*n=42,r->n=P(p,e);else
w=*n==42|*n==63,r->n=P(n+w,e),r->m=w?*n:0;r->a=r->n->a;}R r;}M(Q*r,C*s,C*o,C*z){C*p,
e;e=r?r->m==63|r->m==42&&M(r->n,s,o,z)?:*s&&r->c==*s?M(r->m==42?r:r->n,s+1,o,z):2:s
==z;if(e-2)R e;for(p=s,e=0;!r->c&p<=z;p++)e|=M(r->u,s,s,p)&(r->m!=42|p>s)&&M(r->m==
42?r:r->n,p,p,z);R e||r->a&&M(r->a,o,o,z);}main(C
c,C**v){for(;--c>1;)puts(M(P(v[1],index(v[1],0)),v[c],v[c],index(v[c],0))?"true":"false");}

it parses a regexp into a struct, where each struct has:

  • the character to be matched or a pointer to a struct of a subpattern to be matched
  • a pointer to the next symbol struct (if any)
  • a pointer to the next alternative pattern (if any)
  • a multiplier which is \0, * or ? -- (pat)+ is parsed into (pat)(pat)* right away which makes matching far easier
Bellamy answered 19/8, 2010 at 15:22 Comment(0)
T
5

Ruby 1.9: 709 chars

R=gets.chop;s='';k=[];n=a=0
G={?(=>(A="(a-=1;s<<0)if a>1;")+"k<<[n,a];n=a=0",
Y=?|=>(B="s<<0while 0<a-=1;")+"n+=1",
?)=>B+(C="s<<?|while 0<=n-=1;")+"n,a=k.pop"+F=";a+=1",
?*=>D="s<<c",?+=>D,??=>D}
R.each_char{|c|eval G[c]||A+D+F};eval B+C
def P l,s;l.map{|a|a<<s};end
J={??=>(K="a=k.pop;")+"k<<[{Y=>n=[a[0]]},a[1]<<n]",
?*=>K+(H="P a[1],s={Y=>n=[a[0]]};")+"k<<[s,[n]]",
?+=>K+H+"k<<[a[0],[n]]",
Y=>(I=K+"b=k.pop;")+"k<<[{Y=>[a[0],b[0]]},a[1]+b[1]]",
?\0=>I+"P b[1],a[0];k<<[b[0],a[1]]"}
k=[];s.each_char{|c|eval J[c]||"k<<[{c=>a=[]},[a]]"}
e=k[0];P e[1],R;
p $<.map{|l|s=l.chop;*a=e[0]
s.each_char{|c|eval@f="n=a;a=a.map{|h|h[Y]||[h]}.flatten"while a!=n
a=a.inject([]){|a,h|a+(h[c]||[])}}
eval@f;a.include? R}

(This will also works in Ruby 1.8 with 45 more chars by adding the alias below)

Test with type testcase.txt | ruby regex.rb

An implementation of Russ Cox's NFA parser in Ruby. A state is represented as a hash with a single key which holds an array of next states.

Ungolfed:

#needed to run on ruby 1.8
class String
  alias :each_char :each_byte 
end  

## Infix to Postfix

R=gets.chop  
p "regexp = #{R}"
k=[] 
s=''  #will hold postfix string
n=a=0 #count braNches and Atoms
postfix = R.each_char{|c|
  case c
    when ?(
        (a-=1;s<<0)if a>1
        k<<[n,a]
        n=a=0
    when ?|
      s<<0while 0<a-=1
      n+=1
    when ?)
      s<<0while 0<a-=1
      s<<?|while 0<=n-=1
      n,a=k.pop;a+=1
    when ?*,?+,??
      s<< c
    else
      (a-=1;s<<0)if a>1
      s<< c
      a+=1
  end
}
s<<0while 0<a-=1
s<<?|while 0<=n-=1

## Postfix to NFA
# State is {char=>s0=[state,state]]
# Frag is [state, [s0,...]]

Y=?|   #choice indicator
X={:true=>[R]} #matcstate

 def patch l,s   #l is list of arrays, s is state.  Put s in each array
   l.map{|a|   a<< s   }
end

k=[]
s.each_char{|c|
 case c
    when ??
      a=k.pop
      s={Y=>n=[a[0]]}
      k<<[s,a[1]<<n]
    when ?*
      a=k.pop
      s={Y=>n=[a[0]]}
      patch(a[1],s)
      k<<[s,[n]]
    when ?+
      a=k.pop
      s={Y=>n=[a[0]]}
      patch(a[1],s)
      k<<[a[0],[n]]
    when ?|
     b=k.pop
     a=k.pop
      s={Y=>[a[0],b[0]]}
      k<<[s,a[1]+b[1]]
    when 0
     b=k.pop
     a=k.pop
     patch(a[1],b[0])
     k<<[a[0],b[1]]
   else
     k<<[{c=>a=[]},[a]]
   end
}
e=k.pop
patch(e[1],X)

#Running the NFA

E=[e[0]] #Evaluator
p $<.map{|l|s=l.chop
  p "evaluating: #{s}" 
  a=E
  s.each_char{|c|
    begin     #skip past split nodes
      s=a.size
      a=a.map{|h|h[?|]||[h]}.flatten  
    end while a.size!=s
    a=a.inject([]){|a,h|
    a+(h[c]||[])}  #add next state or null
  }
  a=a.map{|h|h[?|]||[h]}.flatten
  r = a.include? X  #check for end of pattern
  p r
  r
}
Telson answered 19/8, 2010 at 15:22 Comment(0)
E
5

Perl, 596 chars

Semi-explanation:

  • This is based off of the continuation-passing-style parser combinators found in Chapter 8 of Higher Order Perl.
  • Credit for helping me remove about 70 chars goes to Vincent Pit (VPIT).
  • S { block } is the same as sub { block } only it's 2 chars shorter every time.
  • $, is nil (a coderef containing a matcher that always matches, consuming nothing)
  • c is n-ary concatenation (take any number of matchers and return a matcher that succeeds if they all succeed in sequence).
  • a is n-ary alternation (take any number of matchers and return a matcher that succeeds if any of them succeed).
  • A is a helper for the regex builder — it takes a structure of alternations-of-concatenations and passes to C and a as necessary, returning a matcher.
  • k is star (take a matcher and return a matcher that matches it 0 or more times in sequence. k for Kleene, because s() is parsed as the s/// operator :)
  • The do block parses the regex a char at a time. @$r is the current concatenation list, @a is the current alternation list, @p is the paren-group stack. a+ is treated as aa*, and a? is treated as (a|) inline in b (there are no functions for plus or maybe).
  • The S{!length pop} in the last line is a matcher that succeeds at end-of-input. It's passed as the continuation for the regex matcher, which means that the regex only succeeds if it can match the entire string.

For mostly-degolfed and more-commented code, see this Gist.

Run it as perl regexer.pl 'a?b+|(a+b|b+a?)+' abb babab aaa aabba a b. No mandatory linebreaks in the code.

use feature':5.12';
sub S(&){pop}
$,=S{goto pop};
sub p{push@{+shift},@_}
sub c{my$l=$,;for my$r(@_){my$L=$l;
$l=S{my($i,$c)=@_;&$L($i,S{&$r(shift,$c)})}}$l}
sub a{my@A=@_;S{my($i,$c,$o)=@_;$o=&$_($i,$c)and return$o for@A;0}}
sub A{$#_?a(map c(@$_),@_):c(@{+pop})}
sub k{my($I,$k)=@_;$k=a c($I,S{&$k}),$,}
$_=shift;$P=do{@a=$r=[];for(/./g){
when('('){p\@p,[@a];@a=$r=[]}
when(')'){$p=A@a;@a=@{pop@p};p$r=$a[-1],$p}
p\@a,$r=[]when'|';
p$r,k pop@$r when'*';
p$r,c $$r[-1],k pop@$r when'+';
p$r,a pop@$r,$,when '?';
my$s=$_;p$r,S{my($_,$c)=@_;s/^\Q$s//&&$_->$c}}A@a};
say&$P($_,S{!length pop})?"true":"false"for@ARGV
Extremist answered 19/8, 2010 at 15:22 Comment(0)
S
4

JavaScript, 658 chars

// All whitespace is optional
function c(f,p){
    var x=f[0],w=p[0],h="substr",s=f[h](2),b=p[h](1),m=0,t=0,r,g,a=0,l,u="length",y="*";
    switch(f[1]){
        case"+":if(x!=w)return;f=x+y+s;
        case y:return x==w&&c(f,b)||c(s,p);
        case"?":return x==w&&c(s,b)||c(s,p)
    }
    if(x=="("){
        o:for(l=[""];t<f[u];t++){
            switch(f[t]){
                case"|":if(a==1){m=l.push("")-1;continue}break;
                case"(":if(++a==1)continue;break;
                case")":if(!--a)break o
            }
            l[m]+=f[t]
        }
        var v=0,e=t+1;
        return l.some(function(g){
            switch(f[t+1]){
                case y:v=1;
                case"+":e=t+2;
                    for(var q=0;q<f[u];q++)
                        if(c(g+Array(q).join(f[h](0,e))+f[h](e),p))
                            return 1;
                    return;
                case"?":v=1;e=t+2;default:if(c(g+f[h](e),p))return 1;
            }
        })||(v&&c(f[h](e),p))
    }
    return p[u]?(x==w&&c(f[h](1),b)):!f[u]
}

// Make it look nicer
function test(regex, string) { return !!c('(' + regex + ')', string); }

test('a?b+|(a+b|b+a?)+', 'abb') // true
test('a?b+|(a+b|b+a?)+', 'babab') // true

Ungolfed, ~1500 chars

function test(reg, str) {
    console.log('Testing ' + reg + ' against ' + str);
    var c = reg[0], d = str[0];


    switch (reg[1]) {
        case '+': 
            if (c != d) 
                return false;   
            reg = c + '*' + reg.substr(2);
        case '*': 
            return (c == d && test(reg, str.substr(1))) || test(reg.substr(2), str);
        case '?': 
            return (c == d && test(reg.substr(2), str.substr(1))) || test(reg.substr(2), str);
    }

    if (c == '(') {
        var regs = [''];
        o: for (var level = n = i = 0; i < reg.length; i++) {
            //console.log(level + ': ' + n + ': ' + reg[i]);
            switch (reg[i]) {
                case '|': 
                    if (level == 1) { n = regs.push('') - 1; continue; } 
                    break;
                case '(': 
                    if (++level == 1) continue; 
                    break;
                case ')': 
                    if (!--level) break o; 
                    break;
            };
            regs[n] += reg[i];
        }
        //console.log(regs); // An array of alternates (hello|hi) => ['hello', 'hi']        

        var optional = false, end = i+1;
        return regs.some(function(jitem) {
            switch (reg[i+1]) {
                case '*': optional = true; // Fall through
                case '+': 
                    end = i+2;
                    for (var k = 0; k < reg.length; k++)
                        if (test(jitem + Array(k).join(reg.substr(0, i+2)) + reg.substr(i+2), str))
                            return true;
                    return false;

                case '?': optional = true; end = i+2; // Fall through
                default: if (test(jitem + reg.substr(end), str)) return true;       
            }

        }) || (optional && test(reg.substr(end), str));
    }

    if (str == '')
        return reg == '';

    return c == d ? test(reg.substr(1), str.substr(1)) : false;

}

This works by recursing, by chopping off the front of regex and the front of the string, and calling itself. For example, test("hello", "hello") => test("ello", "ello") => test("llo", "llo") => test("lo", "lo") => test("o", "o") => test("", "") returns true. Note: the bare c function will not support implied alternation. In other words, hello|hi will not work; it has to be wrapped in parentheses: (hello|hi).

Stagey answered 19/8, 2010 at 15:22 Comment(2)
+1 for the 6th solution for this one! Could you wrap the lines in the golfed version, for readability? Just wondering, I didn't test, but does this match (a+)(aaa) correctly to aaaa?Bellamy
it seems to work 99%, but crashes firefox on ((ab)+)+ac with ababBellamy

© 2022 - 2024 — McMap. All rights reserved.