PHP string console parameters to array
B

7

17

I would like to know how I could transform the given string into the specified array:

String

all ("hi there \(option\)", (this, that), other) another

Result wanted (Array)

[0] => all,
[1] => Array(
    [0] => "hi there \(option\)",
    [1] => Array(
        [0] => this,
        [1] => that
    ),
    [2] => other
),
[2] => another

This is used for a kind of console that I'm making in PHP. I tried to use preg_match_all, but I don't know how to find parentheses inside parentheses in order to "make arrays inside arrays".

EDIT

All other characters that are not specified in the example should be treated as String.

EDIT 2

I forgot to mention that all parameters outside the parentheses should be delimited by the space character.

Bussard answered 4/2, 2013 at 10:26 Comment(13)
You are trying to build a syntax tree, or parse tree. I think regex is not a proper tool for that. - Nieves
Then, what should I do? - Bussard
@CristianoSantos Write your own parser. - Glossography
@CristianoSantos You should loop through the input string, adding words to an array until a closing bracket is reached or the input ends. On reaching an opening bracket, the method must call itself (a recursive call) and use the returned array. - Nieves
Why not simply split with [\s,()]+ ? - Tidwell
@PLB But how? This is the first time I'm trying to do that, so I have no experience with writing my own parser. - Bussard
@Some1.Kill.The.DJ The last time I tried something like that, my result with preg_match_all on the parentheses was this string: ("hi there", (this, that). I haven't tried your suggestion yet, but from what I see, I think I will get the same behavior. - Bussard
@CristianoSantos If you think about it globally, there will be {}, [], <>, special characters, etc. What kind of priority do you want to use then? - Salesman
@ripa All characters other than those shown above will be treated as String. - Bussard
@CristianoSantos Yes, I got it, but there are sub-arrays. My question is about those. - Salesman
@CristianoSantos If you're trying to build your own programming language, you can use already existing syntax generators. - Glossography
@ripa All sub-arrays should be detected the same way as the primary array. In other words, all arrays should be found with "(" and ")" and, if I want to include a parenthesis in one of the parameters, I should add a backslash before it in the string. Example: (hi, (first, "second \(other\)")) - Bussard
@PLB Could you give me a link for that? - Bussard
G
4

There's no question that you should write a parser if you are building a syntax tree. But if you just need to parse this sample input, regex might still be a usable tool:

<?php
$str = 'all, ("hi there", (these, that) , other), another';

$str = preg_replace('/\, /', ',', $str); // get rid of extra spaces
/*
 * avoid undefined-constant errors by surrounding bare words with quotes
*/
$str = preg_replace('/(\w+),/', '\'$1\',', $str);
$str = preg_replace('/(\w+)\)/', '\'$1\')', $str);
$str = preg_replace('/,(\w+)/', ',\'$1\'', $str);

$str = str_replace('(', 'array(', $str);

$str = 'array('.$str.');';

echo '<pre>';
eval('$res = '.$str); //eval is evil.
print_r($res); //print the result

Demo.

Note: If the input is malformed, the regex will definitely fail. I am writing this solution just in case you need a quick script. Writing a lexer and parser is time-consuming work that needs a lot of research.

Glossography answered 4/2, 2013 at 11:8 Comment(7)
Thanks, I really just need this to work fast. In my case, there's no problem at all if the regex fails. I really just need to throw a general error and not a specific one. =) - Bussard
@CristianoSantos In that case I'd use this script and start reading more about syntax parsers for educational purposes. - Glossography
It is so going to mess up this string: 'all, ("hi, there, I am from SO", (these, that) , other), another' - Cultivated
@Cultivated Yes, it will, because of the commas. As I noted, regex is only a tool for cases where the strings are formed as they are in the sample. - Glossography
@CristianoSantos Oh, there were no special characters in the question when I was writing this answer. If there are, you need to escape them and improve this script to handle malformed strings better. For a quick and dirty job it's OK; for long-term use, a big NO. - Glossography
@PLB Ah, sorry. I really forgot to mention them at the beginning. My bad =S - Bussard
Your answer and @palindrom's answer were the ones that helped me most. So, as I can't give both a "correct answer", I will accept yours because it helped me much more, and add my final code as an answer. - Bussard
Y
14

The 10,000ft overview

You need to do this with a small custom parser: code that takes input of this form and transforms it to the form you want.

In practice I find it useful to group parsing problems like this in one of three categories based on their complexity:

  1. Trivial: Problems that can be solved with a few loops and humane regular expressions. This category is seductive: if you are even a little unsure if the problem can be solved this way, a good rule of thumb is to decide that it cannot.
  2. Easy: Problems that require building a small parser yourself, but are still simple enough that it doesn't quite make sense to bring out the big guns. If you need to write more than ~100 lines of code then consider escalating to the next category.
  3. Involved: Problems for which it makes sense to go formal and use an already existing, proven parser generator¹.

I classify this particular problem as belonging to the second category, which means that you can approach it like this:

Writing a small parser

Defining the grammar

To do this, you must first define -- at least informally, with a few quick notes -- the grammar that you want to parse. Keep in mind that most grammars are defined recursively at some point. So let's say our grammar is:

  • The input is a sequence
  • A sequence is a series of zero or more tokens
  • A token is either a word, a string or an array
  • Tokens are separated by one or more whitespace characters
  • A word is a sequence of alphabetic characters (a-z)
  • A string is an arbitrary sequence of characters enclosed within double quotes
  • An array is a series of one or more tokens separated by commas, enclosed in parentheses

You can see that we have recursion in one place: a sequence can contain arrays, and an array is also defined in terms of a sequence (so it can contain more arrays etc).

Treating the matter informally as above is easier as an introduction, but reasoning about grammars is easier if you do it formally.

Building a lexer

With the grammar in hand you now need to break the input down into tokens so that it can be processed. The component that takes user input and converts it to individual pieces defined by the grammar is called a lexer. Lexers are dumb; they are only concerned with the "outside appearance" of the input and do not attempt to check that it actually makes sense.

Here's a simple lexer I wrote to parse the above grammar (don't use this for anything important; may contain bugs):

$input = 'all ("hi there", (this, that) , other) another';

$tokens = array();
$input = trim($input);
while($input) {
    switch (substr($input, 0, 1)) {
        case '"':
            if (!preg_match('/^"([^"]*)"(.*)$/', $input, $matches)) {
                die; // TODO: error: unterminated string
            }

            $tokens[] = array('string', $matches[1]);
            $input = $matches[2];
            break;
        case '(':
            $tokens[] = array('open', null);
            $input = substr($input, 1);
            break;
        case ')':
            $tokens[] = array('close', null);
            $input = substr($input, 1);
            break;
        case ',':
            $tokens[] = array('comma', null);
            $input = substr($input, 1);
            break;
        default:
            list($word, $input) = array_pad(
                preg_split('/(?=[^a-zA-Z])/', $input, 2),
                2,
                null);
            $tokens[] = array('word', $word);
            break;
    }
    $input = trim($input);
}

print_r($tokens);

Building a parser

Having done this, the next step is to build a parser: a component that inspects the lexed input and converts it to the desired format. A parser is smart; in the process of converting the input it also makes sure that the input is well-formed by the grammar's rules.

Parsers are commonly implemented as state machines (also known as finite state machines or finite automata) and work like this:

  • The parser has a state; this is usually a number in an appropriate range, but each state is also described with a more human-friendly name.
  • There is a loop that reads lexed tokens one at a time. Based on the current state and the value of the token, the parser may decide to do one or more of the following:
    1. take some action that affects its output
    2. change its state to some other value
    3. decide that the input is badly formed and produce an error
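The loop described above could be sketched roughly as follows. This is only an illustration, not the answer's own parser: the function name parseTokens, the $expectItem flag and the stack-based handling of nested arrays are assumptions made for the example. It expects tokens in the array(type, value) shape produced by the lexer above, and the sample token list is written out by hand so that the snippet runs on its own (PHP 7.1 or later is assumed).

```php
<?php
// Hypothetical helper (not from the answer): turns a list of lexer tokens,
// each of the form array(type, value), into nested PHP arrays.
function parseTokens(array $tokens): array
{
    $stack = [];         // enclosing arrays that are still open
    $current = [];       // the array currently being filled
    $expectItem = true;  // state inside "(...)": true = item expected, false = ',' or ')' expected

    foreach ($tokens as [$type, $value]) {
        $topLevel = count($stack) === 0;
        switch ($type) {
            case 'word':
            case 'string':
                if (!$topLevel && !$expectItem) {
                    throw new InvalidArgumentException('Missing comma inside array');
                }
                $current[] = $value;          // action: append to the output
                $expectItem = false;          // state change
                break;
            case 'open':
                if (!$topLevel && !$expectItem) {
                    throw new InvalidArgumentException('Missing comma inside array');
                }
                $stack[] = $current;          // remember the enclosing array
                $current = [];
                $expectItem = true;
                break;
            case 'comma':
                if ($topLevel || $expectItem) {
                    throw new InvalidArgumentException('Unexpected comma');
                }
                $expectItem = true;
                break;
            case 'close':
                if ($topLevel || $expectItem) {
                    throw new InvalidArgumentException('Unexpected closing parenthesis');
                }
                $finished = $current;
                $current = array_pop($stack); // back to the enclosing array
                $current[] = $finished;
                $expectItem = false;
                break;
            default:
                throw new InvalidArgumentException("Unknown token type: $type");
        }
    }

    if (count($stack) > 0) {
        throw new InvalidArgumentException('Unclosed parenthesis');
    }
    return $current;
}

// Tokens for: all ("hi there", (this, that), other) another
$tokens = [
    ['word', 'all'], ['open', null], ['string', 'hi there'], ['comma', null],
    ['open', null], ['word', 'this'], ['comma', null], ['word', 'that'], ['close', null],
    ['comma', null], ['word', 'other'], ['close', null], ['word', 'another'],
];

print_r(parseTokens($tokens));
```

Keeping the open arrays on an explicit stack means the nesting depth is limited only by memory, and every malformed input ends in an exception that the console can report as a general error.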

¹ Parser generators are programs whose input is a formal grammar and whose output is a lexer and a parser you can "just add water" to: just extend the code to perform "take some action" depending on the type of token; everything else is already taken care of. A quick search on this subject led to PHP Lexer and Parser Generator?

Yee answered 4/2, 2013 at 10:55 Comment(8)
Respect, @Jon. Can you link any nice articles on this topic? - Murielmurielle
There is no need to define a language, since an extended regular expression can solve this problem, I bet in one expression. I think if the asker has no clue about formal languages, this answer is even more misleading for him/her. P.S.: Wow, this solution gets even more complicated for a PHP application; we are not constructing a language here. :D - Anaesthesiology
@Dyin: It depends on how you define "need". If you want a parser that is maintainable, then you most definitely need a grammar. If you want a regex that works for me (tm) but is totally incomprehensible, subject to breaking down at the slightest provocation and ultimately impossible to extend in the future, then you don't necessarily need a grammar. If you disagree, please try to prove me wrong by writing such a regex. - Yee
How is a regular expression impossible to extend? True, a regular expression can't tell you where the syntax error is, but the answer should not be to implement an LALR or SLR parser for this. :D Sadly I'm not an expert in extended, recursive regular expressions, but I believe someone will implement a pattern for this that solves the problem in one step and is faster. This is a parentheses problem; why would a PHP developer write a lexical and syntactical analyzer for this? - Anaesthesiology
@d.raev: I added some Wikipedia links which are good as a landing page. - Yee
@Dyin: I'm not an expert in regex either, but I know enough to understand that I would never, ever want to do this with regex because a) I consider it impossible to prove that a regex works correctly in all cases (while it is certainly possible to prove that a parser correctly processes a given formal grammar) and b) it is much easier to reason about how a FSM works, so it's much easier to extend and maintain. YMMV. - Yee
@Yee Your answer is really great. But in a scenario where I was asked to parse this kind of data and just save the information in a database or somewhere else that would be easier to use, creating a lexer would be overkill, IMO. - Glossography
@PLB: Might be. But since it's not very clear if you would need to "cross the line" (either now or in the future), I'd prefer to err on the safe side. - Yee
A
3

As far as I know, the parentheses problem belongs to Chomsky language class 2 (context-free), while regular expressions are equivalent to Chomsky language class 3 (regular), so there should be no true regular expression that solves this problem.

But I read something not long ago:

This PCRE pattern solves the parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored): \( ( (?>[^()]+) | (?R) )* \)

With delimiters and without spaces: /\(((?>[^()]+)|(?R))*\)/.

This is from Recursive Patterns (PCRE) - PHP manual.

There is an example on that manual, which solves nearly the same problem you specified! You, or others might find it and proceed with this idea.

I think the best solution is to write a sick recursive pattern with preg_match_all. Sadly, I'm not able to pull off such madness myself!
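As a small illustration of the recursive pattern quoted from the manual, the sketch below extracts the first balanced parenthesised group from a string. The function name firstBalancedGroup is made up for this example, and the inner group is written as non-capturing, which is a minor variation on the manual's pattern. Note the assumption that every parenthesis counts: escaped parentheses such as \( and parentheses inside quoted strings are not treated specially, so this alone does not cover the question's full requirements.

```php
<?php
// Hypothetical helper: returns the first balanced "(...)" group in $input,
// or null when there is none.
function firstBalancedGroup(string $input): ?string
{
    // (?R) recurses into the whole pattern, so a nested "(...)" is consumed
    // as one unit. (?>...) is an atomic group that avoids needless backtracking.
    $pattern = '/\((?:(?>[^()]+)|(?R))*\)/';

    if (preg_match($pattern, $input, $matches) === 1) {
        return $matches[0];
    }
    return null;
}

$input = 'all ("hi there", (this, that), other) another';
var_dump(firstBalancedGroup($input));
```

To build the nested array, the same function would have to be applied again to the contents of each group that is found, which is where a hand-written parser tends to become easier to follow.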

Anaesthesiology answered 4/2, 2013 at 11:7 Comment(1)
The regexes you see in modern languages are not strictly regular, so they can do things beyond what theoretical regular expressions can do. - Cultivated
B
3

First, I want to thank everyone that helped me on this.

Unfortunately, I can't accept multiple answers; if I could, I would accept them all, because all the answers are correct for different variants of this problem.

In my case, I just needed something simple and dirty and, following @palindrom's and @PLB's answers, I got the following working for me:

$str=transformEnd(transformStart($string));
$str = preg_replace('/([^\\\])\(/', '$1array(', $str);
$str = 'array('.$str.');';
eval('$res = '.$str);
print_r($res); //print the result

function transformStart($str){
    $match=preg_match('/(^\(|[^\\\]\()/', $str, $positions, PREG_OFFSET_CAPTURE);
    if (count($positions[0]))
        $first=($positions[0][1]+1);
    if ($first>1){
        $start=substr($str, 0,$first);
        preg_match_all("/(?:(?:\"(?:\\\\\"|[^\"])+\")|(?:'(?:\\\'|[^'])+')|(?:(?:[^\s^\,^\"^\']+)))/is",$start,$results);
        if (count($results[0])){
            $start=implode(",", $results[0]).",";
        } else {
            $start="";
        }
        $temp=substr($str, $first);
        $str=$start.$temp;
    }
    return $str;
}

function transformEnd($str){
    $match=preg_match('/(^\)|[^\\\]\))/', $str, $positions, PREG_OFFSET_CAPTURE);
    if (($total=count($positions)) && count($positions[$total-1]))
        $last=($positions[$total-1][1]+1);
    if ($last==null)
        $last=-1;
    if ($last<strlen($str)-1){
        $end=substr($str,$last+1);
        preg_match_all("/(?:(?:\"(?:\\\\\"|[^\"])+\")|(?:'(?:\\\'|[^'])+')|(?:(?:[^\s^\,^\"^\']+)))/is",$end,$results);
        if (count($results[0])){
            $end=",".implode(",", $results[0]);
        } else {
            $end="";
        }
        $temp=substr($str, 0,$last+1);
        $str=$temp.$end;
    }
    if ($last==-1){
        $str=substr($str, 1);
    }
    return $str;
}

The other answers are also helpful for anyone looking for a better way to do this.

Again, thank you all =D.

Bussard answered 4/2, 2013 at 12:34 Comment(0)
D
2

I want to know if this works:

  1. replace ( with Array(
  2. Use regex to put comma after words or parentheses without comma

    preg_replace( '/[^,]\s+/', ',', $string )

  3. eval( "\$result = Array( $string )" )

Dishcloth answered 4/2, 2013 at 10:32 Comment(7)
If you evaluate these and that you'll get undefined constant errors. - Antimagnetic
@h2ooooooo: The resulting array has those constants too, so I guess they aren't constants in the first place. - Dishcloth
But my string starts with "all". Wouldn't it give an error on eval? - Bussard
This will trigger an error in all ("hi there", (this, that) , other) another because there's a missing , between ) and another... - Naval
@Jeffrey Regex can fix that too, I guess. - Dishcloth
@Dishcloth I'm a little bad at regex... How could I replace only the space characters between words outside the arrays with commas? - Bussard
This is not working when I have a string like "first, second and third" as a parameter. - Bussard
N
2

Here is the algorithm as pseudocode. Hopefully you can work out how to implement it in PHP:

function Parser([receives] input:string, [in/out] i:integer) returns Array

define Array returnValue;

while i is less than the length of the input string do
    character = ith character from input string.

    if character is '('
        increment i
        returnValue.Add(Parser(input, i)); // recursive call; it advances i past the matching ')'

    else if character is ')'
        increment i
        return returnValue // end of the current sub-array

    else if character is '"'
        returnValue.Add(substring of input from i to the next '"')
        set i to the index after that '"'

    else if character is whitespace or ','
        increment i

    else
        returnValue.Add(substring of input from i to the next space, ',', ')' or end of input)
        set i to the index after that substring

return returnValue
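The pseudocode above could be turned into PHP roughly as follows. This is a sketch under stated assumptions rather than a finished implementation: the names parseConsole and parseList are invented for the example, commas and whitespace are both treated as plain separators everywhere, quoted strings keep their surrounding quotes (as in the result the question asks for), and a backslash inside a quoted string escapes the next character. Escaped parentheses outside of quotes are not handled.

```php
<?php
// Hypothetical entry point: parses the whole console string.
function parseConsole(string $input): array
{
    $pos = 0;
    return parseList($input, $pos, false);
}

// Hypothetical recursive worker. $pos is passed by reference so that the
// caller knows how much of the input the recursive call has consumed.
function parseList(string $input, int &$pos, bool $nested): array
{
    $result = [];
    $length = strlen($input);

    while ($pos < $length) {
        $char = $input[$pos];

        if ($char === '(') {
            $pos++;
            $result[] = parseList($input, $pos, true); // recursive call
        } elseif ($char === ')') {
            if (!$nested) {
                throw new InvalidArgumentException("Unexpected ')' at offset $pos");
            }
            $pos++;
            return $result; // end of the current sub-array
        } elseif ($char === '"') {
            // Scan to the closing quote, skipping backslash-escaped characters.
            $end = $pos + 1;
            while ($end < $length && $input[$end] !== '"') {
                $end += ($input[$end] === '\\') ? 2 : 1;
            }
            if ($end >= $length) {
                throw new InvalidArgumentException("Unterminated string at offset $pos");
            }
            $result[] = substr($input, $pos, $end - $pos + 1);
            $pos = $end + 1;
        } elseif ($char === ',' || ctype_space($char)) {
            $pos++; // separators carry no value
        } else {
            // Bare word: everything up to the next separator or parenthesis.
            $wordLength = strcspn($input, " \t\r\n,()", $pos);
            $result[] = substr($input, $pos, $wordLength);
            $pos += $wordLength;
        }
    }

    if ($nested) {
        throw new InvalidArgumentException("Missing ')' at end of input");
    }
    return $result;
}

print_r(parseConsole('all ("hi there \(option\)", (this, that), other) another'));
```

Because each recursive call returns as soon as it sees its own closing parenthesis, the nesting of the returned arrays mirrors the nesting of the parentheses in the input.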
Nieves answered 4/2, 2013 at 10:52 Comment(2)
So, I really need to parse the string char by char? - Bussard
Well, you can probably extract words first. But you should be cautious about the quotes. I recommend doing it char by char. - Nieves
B
1

If the string values are fixed, it can be done somehow like this:

$ar = explode('("', $st);

$ar[1] = explode('",', $ar[1]);

$ar[1][1] = explode(',', $ar[1][1]);

$ar[1][2] = explode(')',$ar[1][1][2]);

unset($ar[1][1][2]);

$ar[2] =$ar[1][2][1];

unset($ar[1][2][1]);
Battled answered 4/2, 2013 at 11:10 Comment(1)
Sorry, but they are dynamic =S - Bussard
