Regex - get elements to render if statement

I'm writing a script in PHP and trying to implement an if construct without using eval.

It is still incomplete, but I'm pushing ahead. This is the "if" part of a templating engine. No assignment operators are allowed, and I need to test values without allowing PHP code injection, which is precisely why I'm not using eval. It will need to perform the individual operations between variables itself to prevent injection attacks.

The regex must capture:

[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output
[elseif:('b'+'atman'='batman')]
    output2
[elseif:('b'+'atman'='batman')]
    output3
[elseif:('b'+'atman'='batman')]
    output4
[else]
    output5
[endif]

[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output6
[else]
    output7
[endif]

The following works to get the if, elseif, else and endif blocks along with the condition statements:

$regex = '~^\h*\[if:(.*)\]\R(?<if>(?:(?!\[elseif)[\s\S])+)\R^\h*\[elseif:(.*)\]\R(?<elseif>(?:(?!\[else)[\s\S])+)\R^\h*\[else.*\]\R(?<else>(?:(?!\[endif)[\s\S])+)\R^\[endif\]~xm';

Please help me make the elseif and else parts optional.

Then, within the condition statement, I can get the operations with:

$regex = '~([^\^=<>+\-%/!&|()*]+)([\^+\-%/!|&*])([^\^=<>+\-%/!&|()*]*)~';

However, it only pairs them up, missing each 3rd operator...

Thanks for your help.

Avenge answered 12/6, 2016 at 8:47 Comment(7)
The construct [\S \t]+ matches any character besides [\r\n\f], and your [\s\S]+ won't stop consuming if there is more than one block to match (greed). Please give better input/expected output samples. You can try (?s)(\[if:([^\]]+))\](.*?)\[endif\] - Sheikdom
Please see bounty comments - Avenge
PCRE regular expressions are easily put together, even complex recursion. The problem stems from divining the intent from your code. If you think you know what you need, just say it in English in simple pseudo code without the hosting language code details. It has to be separate. Regex is a language by itself. One thing at a time. And one more thing: using regex recursion requires a recursive use of the hosting language as well. - Ablebodied
I'm fine with code using the hosting language, forgive me if it seemed otherwise. What I meant was that I didn't want eval nor PHP injections, hence testing each part of the code. - Avenge
Yeah, it looks like you're trying to get the operator and its immediate surrounding non-operator/paren/equals characters. So basically, all those similar regexes and code can be condensed into a single preg_match_all() using this ~([^\^=<>+\-%/!&|()*]+)([\^+\-%/!|&*])([^\^=<>+\-%/!&|()*]*)~, where capture group 2 contains the operator, which you can test with if-then-else logic. I wish I could help further but I don't quite get what you're doing. - Ablebodied
Btw, if you just need recursion on the if/elseif/else/endif stuff, it's fairly simple. - Ablebodied
Please see revised question. - Avenge

(Edit: added a simple if/elseif body parsing regex at the bottom.)

Using PCRE, I think this recursive regex should handle nested
if/elseif/else/endif constructs.

In its current form it is a loose parse, in that it doesn't define
the form of the [if/elseif: body ] very well.
For instance, is [if: the beginning delimiter and ] the end? Should an
error occur on malformed input? It could be done that way if a strict
parse is needed.
Right now it basically uses [if: body ] as the beginning delimiter
and [endif] as the end delimiter to find nested constructs.

Also, it loosely defines body as [^\]]* which, in a serious parsing
situation, would have to be fleshed out to account for quotes and the like.
Like I said, breaking it apart like that is doable, but is much more
involved. I've done this on a language level, and it's not trivial.

There is a host-language pseudocode sample at the bottom.
The host-language recursion demonstrates how to extract nested content
correctly.

The regex matches the current outer shell of a core, where the core
is the inner nested content.

Each call to ParseCore() is initiated inside ParseCore() itself
(except for the initial call from main()).

Since scoping seems unspecified, I've made assumptions, which are noted
in the comments.

There is a placeholder for the captured if/elseif body, which can then
be parsed for the (operations) portion. That is really part 2
of this exercise, which I haven't gotten around to doing yet.
Note - I will try to do this, but I don't have the time today.

Let me know if you have any questions.

(?s)(?:(?<Content>(?&_content))|\[elseif:(?<ElseIf_Body>(?&_ifbody)?)\]|(?<Else>(?&_else))|(?<Begin>\[if:(?<If_Body>(?&_ifbody)?)\])(?<Core>(?&_core)|)(?<End>\[endif\])|(?<Error>(?&_keyword)))(?(DEFINE)(?<_ifbody>(?>[^\]])+)(?<_core>(?>(?<_content>(?>(?!(?&_keyword)).)+)|(?(<_else>)(?!))(?<_else>(?>\[else\]))|(?(<_else>)(?!))(?>\[elseif:(?&_ifbody)?\])|(?>\[if:(?&_ifbody)?\])(?:(?=.)(?&_core)|)\[endif\])+)(?<_keyword>(?>\[(?:(?:if|elseif):(?&_ifbody)?|endif|else)\])))

Formatted and tested:

 (?s)                               # Dot-all modifier

 # =====================
 # Outer Scope
 # ---------------

 (?:
      (?<Content>                        # (1), Non-keyword CONTENT
           (?&_content) 
      )
   |                                   # OR,
      # --------------
      \[ elseif:                         # ELSE IF
      (?<ElseIf_Body>                    # (2), else if body
           (?&_ifbody)? 
      )
      \]
   |                                   # OR
      # --------------
      (?<Else>                           # (3), ELSE
           (?&_else) 
      )
   |                                   # OR
      # --------------
      (?<Begin>                          # (4), IF
           \[ if: 
           (?<If_Body>                        # (5), if body
                (?&_ifbody)? 
           )
           \]
      )
      (?<Core>                           # (6), The CORE
           (?&_core) 
        |  
      )
      (?<End>                            # (7)
           \[ endif \]                        # END IF
      )
   |                                   # OR
      # --------------
      (?<Error>                          # (8), Unbalanced If, ElseIf, Else, or End
           (?&_keyword) 
      )
 )

 # =====================
 #  Subroutines
 # ---------------

 (?(DEFINE)

      # __ If Body ----------------------
      (?<_ifbody>                        # (9)
           (?> [^\]] )+
      )

      # __ Core -------------------------
      (?<_core>                          # (10)
           (?>
                #
                # __ Content ( non-keywords )
                (?<_content>                       # (11)
                     (?>
                          (?! (?&_keyword) )
                          . 
                     )+
                )
             |  
                #
                # __ Else
                # Guard:  Only 1 'else'
                # allowed in this core !!

                (?(<_else>)
                     (?!)
                )
                (?<_else>                          # (12)
                     (?> \[ else \] )
                )
             |  
                #
                # __ ElseIf
                # Guard:  Not Else before ElseIf
                # allowed in this core !!

                (?(<_else>)
                     (?!)
                )
                (?>
                     \[ elseif:
                     (?&_ifbody)? 
                     \]
                )
             |  
                #
                # IF  (block start)
                (?>
                     \[ if: 
                     (?&_ifbody)? 
                     \]
                )
                # Recurse core
                (?:
                     (?= . )
                     (?&_core) 
                  |  
                )
                # END IF  (block end)
                \[ endif \] 
           )+
      )

      # __ Keyword ----------------------
      (?<_keyword>                       # (13)
           (?>
                \[ 
                (?:
                     (?: if | elseif )
                     : (?&_ifbody)? 
                  |  endif
                  |  else
                )
                \]
           )
      )
 )

Host language pseudo-code

 bool bStopOnError = false;
 regex RxCore("....."); // Above regex ..

 bool ParseCore( string sCore, int nLevel )
 {
     // Locals
     bool bFoundError = false;
     bool bBeforeElse = true;
     match _matcher;

     while ( search ( sCore, RxCore, _matcher ) )
     {
       // Content
         if ( _matcher["Content"].matched == true )
           // Print non-keyword content
           print ( _matcher["Content"].str() );

           // OR, Analyze content.
           // If this 'content' has errors and we wish to return:
           // if ( bStopOnError )
           //   bFoundError = true;

         else

       // ElseIf
         if ( _matcher["ElseIf_Body"].matched == true )
         {
             // Check if we are not in a recursion
             if ( nLevel <= 0 )
             {
                // Report error, this 'elseif' is outside an 'if/endif' block
                // ( note - will only occur when nLevel == 0 )
                print ("\n>> Error, 'elseif' not in block, body = " + _matcher["ElseIf_Body"].str() + "\n");

                // If this 'elseif' error will stop the process.
                if ( bStopOnError == true )
                   bFoundError = true;
             }
             else
             {
                 // Here, we are inside a core recursion.
                 // That means we have not hit an 'else' yet
                 // because all elseif's precede it.
                 // Print 'elseif'.
                 print ( "ElseIf: " );

                 // TBD - Body regex below
                 // Analyze the 'elseif' body.
                 // This is where its body is parsed.
                 // Use body parsing (operations) regex on it.
                 string sElIfBody = _matcher["ElseIf_Body"].str();

                // If this 'elseif' body error will stop the process.
                if ( bStopOnError == true )
                   bFoundError = true;
             }
         }

         else

       // Else
         if ( _matcher["Else"].matched == true )
         {
             // Check if we are not in a recursion
             if ( nLevel <= 0 )
             {
                // Report error, this 'else' is outside an 'if/endif' block
                // ( note - will only occur when nLevel == 0 )
                print ("\n>> Error, 'else' not in block\n");

                // If this 'else' error will stop the process.
                if ( bStopOnError == true )
                   bFoundError = true;
             }
             else
             {
                 // Here, we are inside a core recursion.
                 // That means there can only be 1 'else' within
                 // the relative scope of a single core.
                 // Print 'else'.
                 print ( _matcher["Else"].str() );

                 // Set the state of 'else'.
                 bBeforeElse = false;
             }
         }

         else

       // Error ( will only occur when nLevel == 0 )
         if ( _matcher["Error"].matched == true )
         {
             // Report error
             print ("\n>> Error, unbalanced " + _matcher["Error"].str() + "\n");
             // If this unbalanced 'if/endif' error will stop the process.
             if ( bStopOnError == true )
                 bFoundError = true;
         }

         else

       // If/EndIf block
         if ( _matcher["Begin"].matched == true )
         {
             // Print 'If'
             print ( "If:" );

             // Analyze 'if body' for error and wish to return.

             // TBD - Body regex below.
             // Analyze the 'if' body.
             // This is where its body is parsed.
             // Use body parsing (operations) regex on it.
             string sIfBody = _matcher["If_Body"].str();

             // If this 'if' body error will stop the process.
              if ( bStopOnError == true )
                  bFoundError = true;
              else
              {

                  //////////////////////////////
                  // Recurse a new 'core'
                  bool bResult = ParseCore( _matcher["Core"].str(), nLevel+1 );
                  //////////////////////////////

                  // Check recursion result. See if we should unwind.
                  if ( bResult == false && bStopOnError == true )
                      bFoundError = true;
                  else
                      // Print 'end'
                      print ( "EndIf" );
              }
         }

         else
         {
            // Reserved placeholder, won't get here at this time.
         }

       // Error-Return Check
         if ( bFoundError == true && bStopOnError == true )
              return false;
     }

     // Finished this core!! Return true.
     return true;
 }

 ///////////////////////////////
 // Main

 string strInitial = "...";

 bool bResult = ParseCore( strInitial, 0 );
 if ( bResult == false )
    print ( "Parse terminated abnormally, check messages!\n" );
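
For PHP specifically, the pseudo-code above could be sketched roughly as follows. This is only an illustration under some assumptions: the names RX_CORE and parseCore() are made up here, the regex is the raw one from above wrapped in ~ delimiters, and instead of printing, the walker collects indented event strings. PREG_UNMATCHED_AS_NULL (PHP 7.2+) is used so that an unmatched group can be told apart from an empty body.

```php
<?php
// Raw recursive pattern from above, wrapped in ~ delimiters (it contains no ~ itself).
const RX_CORE = '~(?s)(?:(?<Content>(?&_content))|\[elseif:(?<ElseIf_Body>(?&_ifbody)?)\]|(?<Else>(?&_else))|(?<Begin>\[if:(?<If_Body>(?&_ifbody)?)\])(?<Core>(?&_core)|)(?<End>\[endif\])|(?<Error>(?&_keyword)))(?(DEFINE)(?<_ifbody>(?>[^\]])+)(?<_core>(?>(?<_content>(?>(?!(?&_keyword)).)+)|(?(<_else>)(?!))(?<_else>(?>\[else\]))|(?(<_else>)(?!))(?>\[elseif:(?&_ifbody)?\])|(?>\[if:(?&_ifbody)?\])(?:(?=.)(?&_core)|)\[endif\])+)(?<_keyword>(?>\[(?:(?:if|elseif):(?&_ifbody)?|endif|else)\])))~';

/**
 * Hypothetical PHP counterpart of ParseCore(): walks one core and recurses
 * into each nested [if]...[endif] block. Appends indented event strings to
 * $events and returns false if an error was seen.
 */
function parseCore(string $core, int $level, array &$events): bool
{
    $ok  = true;
    $pad = str_repeat('  ', $level);
    $flags = PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL;
    if (preg_match_all(RX_CORE, $core, $matches, $flags) === false) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }

    foreach ($matches as $m) {
        if (isset($m['Content'])) {
            // Non-keyword text; whitespace-only runs are skipped in this sketch.
            $text = trim($m['Content']);
            if ($text !== '') {
                $events[] = $pad . 'text: ' . $text;
            }
        } elseif (isset($m['ElseIf_Body']) || isset($m['Else'])) {
            $label = isset($m['Else']) ? 'else' : 'elseif: ' . $m['ElseIf_Body'];
            if ($level <= 0) {
                // At the top level this branch is outside any if/endif block.
                $events[] = 'error: stray ' . $label;
                $ok = false;
            } else {
                // Indent one level out so the branch lines up with its 'if'.
                $events[] = str_repeat('  ', $level - 1) . $label;
            }
        } elseif (isset($m['Error'])) {
            $events[] = $pad . 'error: unbalanced ' . $m['Error'];
            $ok = false;
        } elseif (isset($m['Begin'])) {
            $events[] = $pad . 'if: ' . $m['If_Body'];
            // Recurse into the inner content of this block.
            $ok = parseCore($m['Core'], $level + 1, $events) && $ok;
            $events[] = $pad . 'endif';
        }
    }
    return $ok;
}
```

Calling parseCore($template, 0, $events) on a template string fills $events with one line per keyword or text run, which is where the if/elseif body regex would be applied to each condition.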

Output sample of the outer core matches
Note there will be many more matches when the inner cores are matched.

 **  Grp 0               -  ( pos 0 , len 211 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output
[elseif:('b'+'atman'='batman')]
    output2
[elseif:('b'+'atman'='batman')]
    output3
[elseif:('b'+'atman'='batman')]
    output4
[else]
    output5
[endif]  
 **  Grp 1 [Content]     -  NULL 
 **  Grp 2 [ElseIf_Body] -  NULL 
 **  Grp 3 [Else]        -  NULL 
 **  Grp 4 [Begin]       -  ( pos 0 , len 31 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]  
 **  Grp 5 [If_Body]     -  ( pos 4 , len 26 ) 
(a+b-c/d*e)|(x-y)&!(z%3=0)  
 **  Grp 6 [Core]        -  ( pos 31 , len 173 ) 

    output
[elseif:('b'+'atman'='batman')]
    output2
[elseif:('b'+'atman'='batman')]
    output3
[elseif:('b'+'atman'='batman')]
    output4
[else]
    output5

 **  Grp 7 [End]         -  ( pos 204 , len 7 ) 
[endif]  
 **  Grp 8 [Error]       -  NULL 
 **  Grp 9 [_ifbody]     -  NULL 
 **  Grp 10 [_core]       -  NULL 
 **  Grp 11 [_content]    -  NULL 
 **  Grp 12 [_else]       -  NULL 
 **  Grp 13 [_keyword]    -  NULL 

-----------------------------

 **  Grp 0               -  ( pos 211 , len 4 ) 



 **  Grp 1 [Content]     -  ( pos 211 , len 4 ) 



 **  Grp 2 [ElseIf_Body] -  NULL 
 **  Grp 3 [Else]        -  NULL 
 **  Grp 4 [Begin]       -  NULL 
 **  Grp 5 [If_Body]     -  NULL 
 **  Grp 6 [Core]        -  NULL 
 **  Grp 7 [End]         -  NULL 
 **  Grp 8 [Error]       -  NULL 
 **  Grp 9 [_ifbody]     -  NULL 
 **  Grp 10 [_core]       -  NULL 
 **  Grp 11 [_content]    -  NULL 
 **  Grp 12 [_else]       -  NULL 
 **  Grp 13 [_keyword]    -  NULL 

-----------------------------

 **  Grp 0               -  ( pos 215 , len 74 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output6
[else]
    output7
[endif]  
 **  Grp 1 [Content]     -  NULL 
 **  Grp 2 [ElseIf_Body] -  NULL 
 **  Grp 3 [Else]        -  NULL 
 **  Grp 4 [Begin]       -  ( pos 215 , len 31 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]  
 **  Grp 5 [If_Body]     -  ( pos 219 , len 26 ) 
(a+b-c/d*e)|(x-y)&!(z%3=0)  
 **  Grp 6 [Core]        -  ( pos 246 , len 36 ) 

    output6
[else]
    output7

 **  Grp 7 [End]         -  ( pos 282 , len 7 ) 
[endif]  
 **  Grp 8 [Error]       -  NULL 
 **  Grp 9 [_ifbody]     -  NULL 
 **  Grp 10 [_core]       -  NULL 
 **  Grp 11 [_content]    -  NULL 
 **  Grp 12 [_else]       -  NULL 
 **  Grp 13 [_keyword]    -  NULL 

This is the If/ElseIf Body regex

Raw

(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())

Stringed

'~(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())~'

Expanded

 (?|                                           # Branch Reset
      (                                             # (1 start), Operand
           (?: \s* [^\^=<>+\-%/!&|()\[\]*\s] \s* )+
      )                                             # (1 end)
      ( [\^+\-%/*=]+ )                              # (2), Forward Operator
      (?= \s* [^\^=<>+\-%/!&|()\[\]*\s] )
   |  
      \G 
      (?! ^ )
      (?<= [\^+\-%/*=] )
      (                                             # (1 start), Last Operand
           (?: \s* [^\^=<>+\-%/!&|()\[\]*\s] \s* )+
      )                                             # (1 end)
      ( )                                           # (2), Last-Empty Forward Operator
 )

Here is how this operates:
It assumes very simple constructs.
It will just parse the math operand/operator sequences.
It won't parse any enclosing parenthesis blocks, nor any logic or math
operators in between them.

If needed, parse any parenthesis blocks ahead of time, i.e. \( [^)]* \) or
similar. Or split on the logic operators like |.

The body regex uses a branch reset to get the operand/operator sequence.
It always matches two things.
Group 1 contains the operand, group 2 the operator.

If group 2 is empty, group 1 is the last operand in the sequence.

Valid operators are ^ + - % / * =.
The equals = is included because it separates clusters of operations
and can just be noted as a separator.

The conclusion about this body regex is that it is very simple and
only suited for simple usage. If anything more complex is involved,
this won't be the way to go.
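
As a rough PHP illustration of the group 1 / group 2 convention described above, the stringed regex can be run through preg_match_all(). The names RX_BODY and tokenizeBody() are made up for this sketch, and trimming the operand is an added assumption, since group 1 can carry surrounding whitespace.

```php
<?php
// "Stringed" body regex from above, unchanged.
const RX_BODY = '~(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())~';

/**
 * Hypothetical helper: returns a list of [operand, operator] pairs.
 * The last pair of each operand/operator run has an empty operator.
 */
function tokenizeBody(string $expr): array
{
    preg_match_all(RX_BODY, $expr, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        // Group 1 is the operand (trimmed here), group 2 the operator.
        $pairs[] = [trim($m[1]), $m[2] ?? ''];
    }
    return $pairs;
}
```

For example, tokenizeBody("(a+b-c/d*e)") gives the pairs a +, b -, c /, d * and finally e with an empty operator.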

Input/Output Sample 1:

(a+b-c/d*e)

 **  Grp 1 -  ( pos 1 , len 1 ) 
a  
 **  Grp 2 -  ( pos 2 , len 1 ) 
+  
------------
 **  Grp 1 -  ( pos 3 , len 1 ) 
b  
 **  Grp 2 -  ( pos 4 , len 1 ) 
-  
------------
 **  Grp 1 -  ( pos 5 , len 1 ) 
c  
 **  Grp 2 -  ( pos 6 , len 1 ) 
/  
------------
 **  Grp 1 -  ( pos 7 , len 1 ) 
d  
 **  Grp 2 -  ( pos 8 , len 1 ) 
*  
------------
 **  Grp 1 -  ( pos 9 , len 1 ) 
e  
 **  Grp 2 -  ( pos 10 , len 0 )  EMPTY 

Input/Output Sample 2:

('b'+'atman'='batman')

 **  Grp 1 -  ( pos 1 , len 3 ) 
'b'  
 **  Grp 2 -  ( pos 4 , len 1 ) 
+  
------------
 **  Grp 1 -  ( pos 5 , len 7 ) 
'atman'  
 **  Grp 2 -  ( pos 12 , len 1 ) 
=  
------------
 **  Grp 1 -  ( pos 13 , len 8 ) 
'batman'  
 **  Grp 2 -  ( pos 21 , len 0 )  EMPTY 
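
Operand/operator pairs like the ones in these samples could then be folded into values without eval. The following is only a sketch, not part of the regex above, and it makes several assumptions: every name in it is hypothetical, operands must already be numeric (variables and quoted strings are not handled), ^ is treated as a power operator applied left to right, and = only separates clusters whose values the caller compares.

```php
<?php
/** Applies the operators listed in $table from left to right; others pass through. */
function foldPass(array $values, array $ops, array $table): array
{
    $outValues = [array_shift($values)];
    $outOps    = [];
    foreach ($ops as $i => $op) {
        if (isset($table[$op])) {
            // Combine the previous result with the next operand.
            $lhs = array_pop($outValues);
            $outValues[] = $table[$op]($lhs, $values[$i]);
        } else {
            // Not handled in this pass: keep operator and operand for later.
            $outOps[]    = $op;
            $outValues[] = $values[$i];
        }
    }
    return [$outValues, $outOps];
}

/** Evaluates one flat cluster (no parentheses) in precedence order. */
function evalCluster(array $values, array $ops): float
{
    $passes = [
        ['^' => fn($a, $b) => $a ** $b],
        ['*' => fn($a, $b) => $a * $b, '/' => fn($a, $b) => $a / $b, '%' => fn($a, $b) => fmod($a, $b)],
        ['+' => fn($a, $b) => $a + $b, '-' => fn($a, $b) => $a - $b],
    ];
    foreach ($passes as $table) {
        [$values, $ops] = foldPass($values, $ops, $table);
    }
    if ($ops !== []) {
        throw new InvalidArgumentException('Unsupported operator: ' . $ops[0]);
    }
    return $values[0];
}

/** Takes [operand, operator] pairs and returns one value per '='-separated cluster. */
function evalPairs(array $pairs): array
{
    $results = [];
    $values  = $ops = [];
    foreach ($pairs as [$operand, $op]) {
        if (!is_numeric($operand)) {
            throw new InvalidArgumentException("Not a number: $operand");
        }
        $values[] = (float) $operand;
        if ($op === '=' || $op === '') {
            // '=' (or the empty last operator) closes the current cluster.
            $results[] = evalCluster($values, $ops);
            $values = $ops = [];
        } else {
            $ops[] = $op;
        }
    }
    return $results;
}
```

With the pairs for 1.5+2.5=8/2, evalPairs() returns two cluster values that compare equal, so the condition would be treated as true.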
Ablebodied answered 17/6, 2016 at 19:8 Comment(0)

You have different possibilities here.

The regex version

^\h*\[if.*\]\R                        # if in the first line
(?<if>(?:(?!\[elseif)[\s\S])+)\R      # output
^\h*\[elseif.*\]\R                    # elseif
(?<elseif>(?:(?!\[else)[\s\S])+)\R    # output
^\h*\[else.*\]\R                      # else
(?<else>(?:(?!\[endif)[\s\S])+)\R     # output
^\[endif\]

Afterwards, you have three named capture groups (if, elseif and else). Note that this pattern expects exactly one elseif and one else branch.
See a demo for this one on regex101.com.

In PHP, this would be:

<?php
$code = <<<EOF
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
output
[elseif:('b'+'atman'='batman')]
output2
out as well
[else]
output3
some other output here
[endif]
EOF;

$regex = '~
            ^\h*\[if.*\]\R                        # if in the first line
            (?<if>(?:(?!\[elseif)[\s\S])+)\R      # output
            ^\h*\[elseif.*\]\R                    # elseif
            (?<elseif>(?:(?!\[else)[\s\S])+)\R    # output
            ^\h*\[else.*\]\R                      # else
            (?<else>(?:(?!\[endif)[\s\S])+)\R     # output
            ^\[endif\]
          ~xm';

preg_match_all($regex, $code, $parts);
print_r($parts);
?>


Programming logic

Perhaps it would be better to scan the lines and look for [if...], capture everything up to [elseif...] in a string, and glue the pieces together afterwards.

<?php

$code = <<<EOF
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
output
[elseif:('b'+'atman'='batman')]
output2
out as well
[else]
output3
some other output here
[endif]
EOF;

// functions, shamelessly copied from https://mcmap.net/q/45126/-startswith-and-endswith-functions-in-php
function startsWith($haystack, $needle) {
    // search backwards starting from haystack length characters from the end
    return $needle === "" || strrpos($haystack, $needle, -strlen($haystack)) !== false;
}

function endsWith($haystack, $needle) {
    // search forward starting from end minus needle length characters
    return $needle === "" || (($temp = strlen($haystack) - strlen($needle)) >= 0 && strpos($haystack, $needle, $temp) !== false);
}

$code = explode("\n", $code);
$buffer = array("if" => null, "elseif" => null, "else" => null);

$pointer = false;
for ($i=0;$i<count($code);$i++) {
	$save = true;
	if (startsWith($code[$i], "[if")) {$pointer = "if"; $save = false;}
	elseif (startsWith($code[$i], "[elseif")) {$pointer = "elseif"; $save = false; }
	elseif (startsWith($code[$i], "[else")) {$pointer = "else"; $save = false; }
	elseif (startsWith($code[$i], "[endif")) {$pointer = false; $save = false; }

	if ($pointer && $save) $buffer[$pointer] .= $code[$i] . "\n";

}
print_r($buffer);

?>
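
A possible variation on the line-scanning idea keeps every branch separately, together with its condition, so that elseif and else become optional and elseif may repeat. This is a sketch under stated assumptions: parseBranches() is a made-up name, each tag sits on its own line, and nested if blocks are not handled.

```php
<?php
/**
 * Hypothetical helper: splits one non-nested [if]...[endif] block into
 * branches of the form ['type' => ..., 'cond' => ..., 'body' => ...].
 */
function parseBranches(string $code): array
{
    $branches = [];
    foreach (preg_split('~\R~', $code) as $line) {
        if (preg_match('~^\h*\[(if|elseif):(.*)\]\h*$~', $line, $m)) {
            // A new conditional branch starts; remember its condition.
            $branches[] = ['type' => $m[1], 'cond' => $m[2], 'body' => ''];
        } elseif (preg_match('~^\h*\[else\]\h*$~', $line)) {
            $branches[] = ['type' => 'else', 'cond' => null, 'body' => ''];
        } elseif (preg_match('~^\h*\[endif\]\h*$~', $line)) {
            break; // end of the block
        } elseif ($branches) {
            // Ordinary output line: append it to the current branch.
            $branches[count($branches) - 1]['body'] .= $line . "\n";
        }
    }
    return $branches;
}
```

Each returned branch carries its own condition string, which can then be handed to whatever evaluates the operators.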
Semicentennial answered 15/6, 2016 at 6:58 Comment(7)
Thanks, that's awesome, but it's half the battle. There also needs to be a regex to get the operators/values/variables, separate them, do a PEDMAS and execute the tests. I'm asking for quite a bit and you put effort into this; bonus points if you give me a complete answer :) Edit: can't boost the bounty :( - Avenge
And does it account for recursion (that there can be an if within an if)? - Avenge
What do you mean by PEDMAS? - Semicentennial
PEMDAS, sorry: Parentheses, Exponents, Multiplication and Division, and Addition and Subtraction, i.e. the priority of operations. - Avenge
Obtain an array of all values (strings or numbers)/variables/operators, and resolve by testing the conditions. (Variables are part of my template, so do not worry about them; only test values.) For example: [if:c=c] is true, [if:1=1] is true, [if:1.5+2.5=8/2] is true, [if:'string1' = 'string1'] is true, [if:'string1' != 'string1'] - Avenge
Please see revised question - Avenge
@LucLaverdure: Use two regex approaches, one for the statements and one for the operators. - Semicentennial
