Regex - Ignore some parts of string in match

Here's my string:

address='St Marks Church',notes='The North East\'s premier...'

The regex I'm using to grab the various parts using match_all is

'/(address|notes)='(.+?)'/i'

The results are:

address => St Marks Church
notes => The North East\

How can I get it to ignore the \' character for the notes?

Snowclad answered 6/6, 2013 at 19:44 Comment(2)
Would you want to only consider alphanumeric characters in your expression? - Archaean
No, basically anything between ' and the second ', excluding \'. I'm a bit of a regex newbie, I'm afraid, so I probably got the first bit wrong too? - Snowclad
S
2

Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.

The following patterns will match the substrings required to build your desired associative array:

Patterns that generate a fullstring match and 1 capture group:

  1. /(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
  2. /(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)

Patterns that generate 2 capture groups:

  1. /(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
  2. /(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)

Code: (Demo)

$string = "address='St Marks Church',notes='The North East\'s premier...'";

preg_match_all(
    "/(address|notes)='\K(?:\\\'|[^'])*/",
    $string,
    $out
);
var_export(array_combine($out[1], $out[0]));

echo "\n---\n";

preg_match_all(
    "/(address|notes)='((?:\\\'|[^'])*)/",
    $string,
    $out,
    PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));

Output:

array (
  'address' => 'St Marks Church',
  'notes' => 'The North East\\\'s premier...',
)
---
array (
  'address' => 'St Marks Church',
  'notes' => 'The North East\\\'s premier...',
)

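Patterns #1 and #3 (the first pattern in each list above) use alternation to allow either an apostrophe preceded by a backslash or any non-apostrophe character.

Patterns #2 and #4 (the second pattern in each list) use lookarounds to ensure that an apostrophe preceded by a backslash doesn't end the match. They require at least one additional backslash when written inside a PHP string literal, because PHP consumes one level of escaping before the regex engine sees the pattern.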
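To illustrate the extra escaping, here is a minimal sketch of pattern #2 written as a PHP double-quoted string. The variable names are illustrative, and four backslashes are used (rather than the minimum) because that is the form that survives both single- and double-quoted PHP literals.

```php
<?php
$string = "address='St Marks Church',notes='The North East\'s premier...'";

// Four backslashes in the PHP literal reach the regex engine as two (\\),
// which the engine reads as one literal backslash inside the lookbehind.
$pattern = "/(address|notes)='\K.*?(?=(?<!\\\\)')/";

preg_match_all($pattern, $string, $out);

// $out[1] holds the keys, $out[0] holds the text matched after \K.
$result = array_combine($out[1], $out[0]);
var_export($result);
```

With only two backslashes in the literal, the engine would receive (?<!\) and the pattern would fail to compile.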

Some notes:

  • Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.

  • Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.
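The point about negated character classes can be sketched with an "unrolled" pattern. This is not one of the four patterns above; it is an alternative that assumes a backslash escapes whichever single character follows it. The variable names and the optional stripslashes() step are illustrative.

```php
<?php
$string = "address='St Marks Church',notes='The North East\'s premier...'";

// As seen by the regex engine, the value part is:
//   [^'\\]*            a run of characters that are neither apostrophe nor backslash
//   (?:\\.[^'\\]*)*    a backslash plus the character it escapes, then another run
// Each backslash is doubled again below because this is a PHP string literal.
$pattern = "/(address|notes)='([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'/";

preg_match_all($pattern, $string, $out, PREG_SET_ORDER);

// Key each captured value (group 2) by its field name (group 1).
$result = array_column($out, 2, 1);

// Optional: drop the escaping backslashes from the captured values.
$unescaped = array_map('stripslashes', $result);

var_export($result);
echo "\n";
var_export($unescaped);
```

Because every branch begins with a different character, the engine never has to choose between alternatives at the same position, which is why this shape tends to take fewer steps than a lazy quantifier with a lookaround.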

Schweitzer answered 18/11, 2017 at 12:59 Comment(5)
@PaulPhillips over 4 years later, you may no longer be a newbie at regex. Please review all of the answers on this page. Sadly, the other answers on this page are inaccurate/incorrect and have gathered upvotes over time (which means they have been misinforming readers for years). If you have any questions about my answer or why the other answers are not correct, I will be happy to explain. - Schweitzer
Hey Mick, you trolling everybody's past answers or just mine? - Evidently
I happened upon this page while researching for another question on another StackExchange site. There is nothing trollish about my conduct. If I wanted to be a troll, I would call you names or more simply not leave a comment. No, what I have done is identified a page that contained 3 incorrect answers (now 2 after anubhava deleted his), justifiably downvoted bad answers that misinform, left explanatory comments (with demo links), edited the question, and provided a comprehensive and thoughtful answer. What I have done should only be considered "content improvement". - Schweitzer
I'm guessing it used to work (though I'm not sure how); otherwise people just glanced and thought it worked. It was marked as the answer, though, so it likely helped the OP to figure out their issue. Whatever. - Evidently
It never worked as intended. The OP blindly trusted the answers. The snowball grew as the blind trusted the blind for years. - Schweitzer
E
5

Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:

$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);

Output

Array
(
    [0] => Array
        (
            [0] => address="St Marks Church"
            [1] => notes="The North East's premier..."
        )

    [1] => Array
        (
            [0] => address
            [1] => notes
        )

    [2] => Array
        (
            [0] => St Marks Church
            [1] => The North East's premier...
        )

)

Another method with preg_split:

//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
    //split the values at the = sign
    $parts[$key]=preg_split('!=!',$value);
    foreach($parts[$key] as $k2=>$v2){
        //trim the quotes out and remove the slashes
        $parts[$key][$k2]=stripslashes(trim($v2,"'"));
    }
}

Output looks like:

Array
(
    [0] => Array
        (
            [0] => address
            [1] => St Marks Church
        )

    [1] => Array
        (
            [0] => notes
            [1] => The North East's premier...
        )

)

Super slow old-skool method:

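$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while ($pos < $len) {
    switch ($string[$pos]) {
        case '=':
            //everything after the = belongs to the value
            $mode = 'value';
            break;
        case ',':
            //a comma ends the pair: store it and start a new key
            $store[$key] = trim($value, "'");
            $key = $value = '';
            $mode = 'key';
            break;
        default:
            //variable variable: appends to $key or $value depending on $mode
            $$mode .= $string[$pos];
    }

    $pos++;
}
//store the final pair, which has no trailing comma
$store[$key] = trim($value, "'");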
Ease answered 6/6, 2013 at 19:58 Comment(3)
Your first method adjusts the input string to suit the method; it should be removed. The second uses preg_split() where explode() is the sensible function call. Furthermore, if \' is possible in the string, then it is fair to assume , and = are possible as well. The third one I haven't tested yet, but it either has a typo or is employing variable variables, which should be avoided whenever possible. - Schweitzer
I removed my downvote because I appreciate that you are trying to fix your answer. Sadly, I feel I had to re-downvote because this answer is suggesting poor and/or unreliable methods. - Schweitzer
Making concessions for bad data storage methods is never advisable. This text stream should be stored in JSON, XML, or even CSV and ideally processed with industry-standard methods. Appreciate your opinion though. - Evidently
M
1

You should match up to an end quote that isn't preceded by a backslash thus:

(address|notes)='(.*?)[^\\]'

This [^\\] forces the character immediately preceding the ' character to be anything but a backslash.

Marrin answered 6/6, 2013 at 19:49 Comment(2)
Will that work if input is: "address='.',notes='The North East\'s premier...'" ? - Bondie
As @Bondie alluded to, this answer is incorrect and will mangle the expected return values. regex101.com/r/90fBSr/1 (downvoted as misleading) - Schweitzer
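
The problem raised in these comments can be reproduced with a short sketch. It assumes the pattern from this answer is wrapped in / delimiters and run through preg_match_all() against the input from the first comment; the variable names are illustrative.

```php
<?php
$string = "address='.',notes='The North East\'s premier...'";

// [^\\] consumes the character just before the closing apostrophe,
// and that character sits outside capture group 2.
$pattern = "/(address|notes)='(.*?)[^\\\\]'/";

preg_match_all($pattern, $string, $out, PREG_SET_ORDER);

// Key each captured value (group 2) by its field name (group 1).
$result = array_column($out, 2, 1);
var_export($result);
```

Every captured value comes back one character short: the address is an empty string and the notes lose their final dot.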