(edit Added a simple if/elseif body parsing regex at the bottom)
Using PCRE, I think this regex recursion should handle nested
if/elseif/else/endif
constructs.
In it's current form it is a loose parse in that it doesn't define
very well the form of the [if/elseif: body ]
.
For instance, is [if:
the beginning delimiter construct and ]
the end?
And should an error occur, etc.. It could be done this way if needing a strict parse.
Right now it basically is using [if: body ]
as a the beginning delimiter
and [endif]
as the end delimiter to find nesting constructs.
Also, it loosely defines body
as [^\]]*
which, in a serious parsing
situation, would have to be fleshed out to account for quotes and stuff.
Like I said, breaking it apart like that is doable, but is much more
involved. I've done this on a language level, and it's not trivial.
There is a host language usage pseudocode sample on the bottom.
The language recursion demonstrates how to extract nested content
correctly.
The regex matches the current outter shell of a core. Where the core
is the inner nested content.
Each call to ParseCore() is initiated inside ParseCore() itself
(except for the initial call from main().
Since scoping seems unspecified, I've made assumptions that can be seen
littered in comments.
There is a placeholder for the if/elseif
body that is captured that
can then be parsed for the (operations)
portion which is really part 2
of this exercise I haven't gotten around to doing yet.
Note - I will try to do this, but I don't have the time today.
Let me know if you have any questions..
(?s)(?:(?<Content>(?&_content))|\[elseif:(?<ElseIf_Body>(?&_ifbody)?)\]|(?<Else>(?&_else))|(?<Begin>\[if:(?<If_Body>(?&_ifbody)?)\])(?<Core>(?&_core)|)(?<End>\[endif\])|(?<Error>(?&_keyword)))(?(DEFINE)(?<_ifbody>(?>[^\]])+)(?<_core>(?>(?<_content>(?>(?!(?&_keyword)).)+)|(?(<_else>)(?!))(?<_else>(?>\[else\]))|(?(<_else>)(?!))(?>\[elseif:(?&_ifbody)?\])|(?>\[if:(?&_ifbody)?\])(?:(?=.)(?&_core)|)\[endif\])+)(?<_keyword>(?>\[(?:(?:if|elseif):(?&_ifbody)?|endif|else)\])))
Formatted and tested:
(?s) # Dot-all modifier
# =====================
# Outter Scope
# ---------------
(?:
(?<Content> # (1), Non-keyword CONTENT
(?&_content)
)
| # OR,
# --------------
\[ elseif: # ELSE IF
(?<ElseIf_Body> # (2), else if body
(?&_ifbody)?
)
\]
| # OR
# --------------
(?<Else> # (3), ELSE
(?&_else)
)
| # OR
# --------------
(?<Begin> # (4), IF
\[ if:
(?<If_Body> # (5), if body
(?&_ifbody)?
)
\]
)
(?<Core> # (6), The CORE
(?&_core)
|
)
(?<End> # (7)
\[ endif \] # END IF
)
| # OR
# --------------
(?<Error> # (8), Unbalanced If, ElseIf, Else, or End
(?&_keyword)
)
)
# =====================
# Subroutines
# ---------------
(?(DEFINE)
# __ If Body ----------------------
(?<_ifbody> # (9)
(?> [^\]] )+
)
# __ Core -------------------------
(?<_core> # (10)
(?>
#
# __ Content ( non-keywords )
(?<_content> # (11)
(?>
(?! (?&_keyword) )
.
)+
)
|
#
# __ Else
# Guard: Only 1 'else'
# allowed in this core !!
(?(<_else>)
(?!)
)
(?<_else> # (12)
(?> \[ else \] )
)
|
#
# __ ElseIf
# Guard: Not Else before ElseIf
# allowed in this core !!
(?(<_else>)
(?!)
)
(?>
\[ elseif:
(?&_ifbody)?
\]
)
|
#
# IF (block start)
(?>
\[ if:
(?&_ifbody)?
\]
)
# Recurse core
(?:
(?= . )
(?&_core)
|
)
# END IF (block end)
\[ endif \]
)+
)
# __ Keyword ----------------------
(?<_keyword> # (13)
(?>
\[
(?:
(?: if | elseif )
: (?&_ifbody)?
| endif
| else
)
\]
)
)
)
Host language pseudo-code
bool bStopOnError = false;
regex RxCore("....."); // Above regex ..
bool ParseCore( string sCore, int nLevel )
{
// Locals
bool bFoundError = false;
bool bBeforeElse = true;
match _matcher;
while ( search ( core, RxCore, _matcher ) )
{
// Content
if ( _matcher["Content"].matched == true )
// Print non-keyword content
print ( _matcher["Content"].str() );
// OR, Analyze content.
// If this 'content' has error's and wish to return.
// if ( bStopOnError )
// bFoundError = true;
else
// ElseIf
if ( _matcher["ElseIf_Body"].matched == true )
{
// Check if we are not in a recursion
if ( nLevel <= 0 )
{
// Report error, this 'elseif' is outside an 'if/endif' block
// ( note - will only occur when nLevel == 0 )
print ("\n>> Error, 'elseif' not in block, body = " + _matcher["ElseIf_Body"].str() + "\n";
// If this 'else' error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
}
else
{
// Here, we are inside a core recursion.
// That means we have not hit an 'else' yet
// because all elseif's precede it.
// Print 'elseif'.
print ( "ElseIf: " );
// TBD - Body regex below
// Analyze the 'elseif' body.
// This is where it's body is parsed.
// Use body parsing (operations) regex on it.
string sElIfBody = _matcher["ElseIf_Body"].str() );
// If this 'elseif' body error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
}
}
// Else
if ( _matcher["Else"].matched == true )
{
// Check if we are not in a recursion
if ( nLevel <= 0 )
{
// Report error, this 'else' is outside an 'if/endif' block
// ( note - will only occur when nLevel == 0 )
print ("\n>> Error, 'else' not in block\n";
// If this 'else' error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
}
else
{
// Here, we are inside a core recursion.
// That means there can only be 1 'else' within
// the relative scope of a single core.
// Print 'else'.
print ( _matcher["Else"].str() );
// Set the state of 'else'.
bBeforeElse == false;
}
}
else
// Error ( will only occur when nLevel == 0 )
if ( _matcher["Error"].matched == true )
{
// Report error
print ("\n>> Error, unbalanced " + _matcher["Error"].str() + "\n";
// // If this unbalanced 'if/endif' error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
}
else
// If/EndIf block
if ( _matcher["Begin"].matched == true )
{
// Print 'If'
print ( "If:" );
// Analyze 'if body' for error and wish to return.
// TBD - Body regex below.
// Analyze the 'if' body.
// This is where it's body is parsed.
// Use body parsing (operations) regex on it.
string sIfBody = _matcher["If_Body"].str() );
// If this 'if' body error will stop the process.
if ( bStopOnError == true )
bFoundError = true;
else
{
//////////////////////////////
// Recurse a new 'core'
bool bResult = ParseCore( _matcher["Core"].str(), nLevel+1 );
//////////////////////////////
// Check recursion result. See if we should unwind.
if ( bResult == false && bStopOnError == true )
bFoundError = true;
else
// Print 'end'
print ( "EndIf" );
}
}
else
{
// Reserved placeholder, won't get here at this time.
}
// Error-Return Check
if ( bFoundError == true && bStopOnError == true )
return false;
}
// Finished this core!! Return true.
return true;
}
///////////////////////////////
// Main
string strInitial = "...";
bool bResult = ParseCore( strInitial, 0 );
if ( bResult == false )
print ( "Parse terminated abnormally, check messages!\n" );
Output sample of the outter core matches
Note there will be many more matches when the inner core's are matched.
** Grp 0 - ( pos 0 , len 211 )
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
output
[elseif:('b'+'atman'='batman')]
output2
[elseif:('b'+'atman'='batman')]
output3
[elseif:('b'+'atman'='batman')]
output4
[else]
output5
[endif]
** Grp 1 [Content] - NULL
** Grp 2 [ElseIf_Body] - NULL
** Grp 3 [Else] - NULL
** Grp 4 [Begin] - ( pos 0 , len 31 )
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
** Grp 5 [If_Body] - ( pos 4 , len 26 )
(a+b-c/d*e)|(x-y)&!(z%3=0)
** Grp 6 [Core] - ( pos 31 , len 173 )
output
[elseif:('b'+'atman'='batman')]
output2
[elseif:('b'+'atman'='batman')]
output3
[elseif:('b'+'atman'='batman')]
output4
[else]
output5
** Grp 7 [End] - ( pos 204 , len 7 )
[endif]
** Grp 8 [Error] - NULL
** Grp 9 [_ifbody] - NULL
** Grp 10 [_core] - NULL
** Grp 11 [_content] - NULL
** Grp 12 [_else] - NULL
** Grp 13 [_keyword] - NULL
-----------------------------
** Grp 0 - ( pos 211 , len 4 )
** Grp 1 [Content] - ( pos 211 , len 4 )
** Grp 2 [ElseIf_Body] - NULL
** Grp 3 [Else] - NULL
** Grp 4 [Begin] - NULL
** Grp 5 [If_Body] - NULL
** Grp 6 [Core] - NULL
** Grp 7 [End] - NULL
** Grp 8 [Error] - NULL
** Grp 9 [_ifbody] - NULL
** Grp 10 [_core] - NULL
** Grp 11 [_content] - NULL
** Grp 12 [_else] - NULL
** Grp 13 [_keyword] - NULL
-----------------------------
** Grp 0 - ( pos 215 , len 74 )
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
output6
[else]
output7
[endif]
** Grp 1 [Content] - NULL
** Grp 2 [ElseIf_Body] - NULL
** Grp 3 [Else] - NULL
** Grp 4 [Begin] - ( pos 215 , len 31 )
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
** Grp 5 [If_Body] - ( pos 219 , len 26 )
(a+b-c/d*e)|(x-y)&!(z%3=0)
** Grp 6 [Core] - ( pos 246 , len 36 )
output6
[else]
output7
** Grp 7 [End] - ( pos 282 , len 7 )
[endif]
** Grp 8 [Error] - NULL
** Grp 9 [_ifbody] - NULL
** Grp 10 [_core] - NULL
** Grp 11 [_content] - NULL
** Grp 12 [_else] - NULL
** Grp 13 [_keyword] - NULL
This is the If/ElseIf Body regex
Raw
(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())
Stringed
'~(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())~'
Expanded
(?| # Branch Reset
( # (1 start), Operand
(?: \s* [^\^=<>+\-%/!&|()\[\]*\s] \s* )+
) # (1 end)
( [\^+\-%/*=]+ ) # (2), Forward Operator
(?= \s* [^\^=<>+\-%/!&|()\[\]*\s] )
|
\G
(?! ^ )
(?<= [\^+\-%/*=] )
( # (1 start), Last Operand
(?: \s* [^\^=<>+\-%/!&|()\[\]*\s] \s* )+
) # (1 end)
( ) # (2), Last-Empty Forward Operator
)
Here is how this operates:
Assumes very simple constructs.
This will just parse the math operand/operator stuff.
It won't parse any enclosing parenthesis blocks, nor any logic or math
operators in between.
If needed, parse any parenthesis blocks ahead of time, i.e. \( [^)* \)
or
similar. Or split on the logic operators like |
.
The body regex uses a branch reset to get the operand/operator sequence.
It always matches two things.
Group 1 contains the operand, group 2 the operator.
If group 2 is empty, group 1 is the last operand in the sequence.
Valid operators are ^ + - % / * =
.
The equals =
is included because it separates cluster of operations
and can just be noted as a separation.
The conclusion about this body regex is that it is very simple and
only suited for simple usage. Anything more of a complexity is involved
this won't be the way to go.
Input/Output Sample 1:
(a+b-c/d*e)
** Grp 1 - ( pos 1 , len 1 )
a
** Grp 2 - ( pos 2 , len 1 )
+
------------
** Grp 1 - ( pos 3 , len 1 )
b
** Grp 2 - ( pos 4 , len 1 )
-
------------
** Grp 1 - ( pos 5 , len 1 )
c
** Grp 2 - ( pos 6 , len 1 )
/
------------
** Grp 1 - ( pos 7 , len 1 )
d
** Grp 2 - ( pos 8 , len 1 )
*
------------
** Grp 1 - ( pos 9 , len 1 )
e
** Grp 2 - ( pos 10 , len 0 ) EMPTY
Input/Output Sample 2:
('b'+'atman'='batman')
** Grp 1 - ( pos 1 , len 3 )
'b'
** Grp 2 - ( pos 4 , len 1 )
+
------------
** Grp 1 - ( pos 5 , len 7 )
'atman'
** Grp 2 - ( pos 12 , len 1 )
=
------------
** Grp 1 - ( pos 13 , len 8 )
'batman'
** Grp 2 - ( pos 21 , len 0 ) EMPTY
[\S \t]+
matches any character besides[\r\n\f]
and your[\s\S]+
won't stop cosuming if there is more than 1 block to match (greed). Please give better input/expected output samples. You can try(?s)(\[if:([^\]]+))\](.*?)\[endif\]
– Sheikdom~([^\^=<>+\-%/!&|()*]+)([\^+\-%/!|&*])([^\^=<>+\-%/!&|()*]*)~
. Where capture group 2 contains the operator, which you can test with if-then-else logic. I wish I could help further but I don't quite get what you're doing. – Ablebodiedif/elseif/else/endif
stuff, its fairly simple. – Ablebodied