PHP Regex Check if two strings share two common characters
Asked Answered
C

7

12

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.

Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).

Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.

This is a regex that defines "valid" lines:

"/^[AB][CD][EF][GH]$/m" 

In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.

The below example assumes the following:

  1. $line is always a valid format
  2. BigFileOfLines.txt contains only valid lines

Example:

// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
    $regex = "magic regex I'm looking for here";
    $matchingLines = array();
    preg_match_all($regex, $subject, $matchingLines);
    return $matchingLines;
}

// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);

/*
 * Desired return value (Note: this is an example set, there 
 * could be more or less than this)
 * 
 * BCEG
 * ADFG
 * BCFG
 * BDFG
*/

One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG":

"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"

This works alright, performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terrible well if later the code is modified to match say, 3 or more characters, or if the size of each line grows from 4 to 16.

It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.

Thanks in advance!

Update:

It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."

I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:

  • A. Works
  • B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).

Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)

Update 2:

Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.

The reasons I favor this answer:

  1. The regular expression given provides excellent scalability for longer lines
  2. The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.

However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Contaminant answered 22/4, 2012 at 22:16 Comment(6)
I definitely agree with the point you make in your update. Having asked regex questions before, I rarely find a "this one works" answer to be the best or most helpful. This applies to other questions too, of course.Undershrub
When you say two common characters, do they have to be in the same position? For example, do you count 'FBGA' as having two common characters with 'ACFG'? (it does, but they are in different positions).Civilization
@Civilization yes, they need to be in the same position.Contaminant
Just a follow-up comment on your accepted answer. It allows a lot of positives that I don't think the rest of your respondents assumed you wanted. In the case of ACFG, this would match FG12 which wasn't what I think the rest of us assumed, since we thought that the positioning of F and G would have to be in the 3rd and 4th positions.Glimp
@MikeRyan Yes but if you read the bolded part of the question carefully The below example assumes the following... BigFileOfLines.txt contains only valid lines and looked at his valid lines regex "/^[AB][CD][EF][GH]$/m" you would note that FG12 is not a valid line and therefore would not be included in his set of valid lines.Apposition
@MikeRyan is correct, FG12 is not an edge case I need to worry about. If it was however, then yes my accepted answer would be a poor choice.Contaminant
A
4

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?

For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";

Apposition answered 22/4, 2012 at 23:38 Comment(7)
I'm not sure I understand why this would work (and I'm not able to test it right now). Could you explain what this accomplishes? To me it seems a bit too greedy, we can do better than using ".*" I would think. But then again, I'm still learning this stuff :)Contaminant
rubular.com/r/qx2vZ00U7k Is this your intended result? It matches 0 or more at first followed by one of ACFG then 0 or more then ACFG followed by 0 or more. I assume your biglineoftext contains all valid strings of 4 characters each?Apposition
However, this will also match the line 'AAAA' which only shares one letter in common with 'ACFG'.Civilization
AAAA does not exist. BigFileOfLines.txt contains only valid lines is what he said in his question.Apposition
This is excellent. Scales nicely (I think). If I wanted to scale my file to have only valid lines of "m" length, and match "n" number of characters, would I simply input the <code>$line</code> value where you currently have <code>ACFG</code>, and repeat the part <code>.*[ACFG].*</code> "m" number of times? (have no fear I'll mark one of these answers as correct, just spending time to read all of them first)Contaminant
So the regex is to your liking then? I believe that repeating .*[ACFG].* should produce the desired effect for longer lines. You'd have to test it or give me some examples of longer valid lines to check.Apposition
Works great. Tested on 10,000 randomly generated files with 1000 lines, this solution had an average runtime that was on average .01 ms slower or faster than the other solutions provided. I favor this solution for both its clarity and its simplicity. Thanks @AppositionContaminant
M
2

This is a regex that defines "valid" lines:

/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m

In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.

Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:

/^[AB][CD][EF][GH]$/

or, alternatively:

/^(A|B)(C|D)(E|F)(G|H)$/

That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:

/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/

But the character-class version is by far the usual way of writing this.

As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.

I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.E|..G))|(?:.C(?:E|.G))|(?:..EG)$/ - but that's still the same solution, just in a more efficiently-processed form.)

Mamie answered 22/4, 2012 at 23:34 Comment(4)
I appreciate your clarification, can't believe I overlooked the issue with the "valid lines" regex.Contaminant
...hit enter too fast First off, thank you for the correction, you are 100% right (I'll fix it). Secondly, I already stated that despite being able to accomplish this easier without regex, I need to use regex. Thirdly, the solution you posted in your last paragraph is identical to the one I provided (except that you missed a "." infront of "CE"). The crux of my question is how to accomplish this in a more simple way that scales nicely. If you can prove to me that this is the best way to match the lines with regex with scalability in mind as mentioned in my question, I'll mark it as correct.Contaminant
I don't see a missing '.' in front of CE? There's one ., and C is the second character? I can't prove that this is the best way of doing it with regex, but neither can I think of any other way of doing it with regex. But that could be simply because I still don't understand why you would ever want to use regex for this problem. :)Mamie
@Mark_Reed, Ah, yes you are correct, only one "." is needed. Academic curiosity is the reason I'm using Regex. Which is a god awful reason I know. It is related to a school assignment where use of Regex is required. The question I posted is a far cry from the actual problem I'm solving, so I'm not just getting other people to do my homework :)Contaminant
S
1

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:

function findMatchingLines($line) {
    static $file = null;
    if( !$file) $file = file("BigFileOfLines.txt");

    $search = str_split($line);
    foreach($file as $l) {
        $test = str_split($l);
        $matches = count(array_intersect($search,$test));
        if( $matches > 2) // define number of matches required here - optionally make it an argument
            return true;
    }
    // no matches
    return false;
}
Spotless answered 22/4, 2012 at 23:39 Comment(4)
You're right, no point flogging a dead horse haha :) However, the reason I'm trying to do this with regex is more academic than application. Were I getting paid to write this as good code, I would certainly use your approach. (Read: This is related to a school assignment (but only related, this isn't close to a direct homework problem or anything))Contaminant
I really don't think implementing this with a regex is a good idea. It's doable, certainly, but it's going to be a nightmare...Spotless
You are 100% correct, implementing with regex is a bad idea. The reason I'm doing it is because it's part of the spec for my homework assignment (the question I posted doesn't come close to the real homework problem, so no worries, I'm not getting other people to do my homework...)Contaminant
Nightmare indeed, but the homework spec is the homework spec, and it's not a class that really allows for much flexibility to do things the "right" way. Frustrating I know, but addressing nightmares helps me learn more about regex, and more importantly know when and why NOT to use regex.Contaminant
G
1

People may be confused by your first regex. You give:

"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m" 

And then say:

In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

But that's not what that regex means at all.

This is because the | operator has the highest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D in the first position, OR E or | or F in the first position, OR G or '|orH` in the first position.

This is because [A|B] means a character class with one of the three given characters (including the |. And because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternate between everything around it. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)

To get your English description as a regex, you would want:

/^[AB][CD][EF][GH]$/

The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.

--

EDIT:

You want to test for only two of these four characters matching.

Very Strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:

/^(A(C|.E|..G))|(.C(E)|(.G))|(..EG)$/

as compared to:

/^(AC|A.E|A..G|.CE|.C.G|..EG)$/ 

This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (or which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test, `EG in the 3rd and 4th positions.

This regex is specifically created to fail as fast as possible. Listing each case out separately, means to fail, you would have test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.

(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)

--

EDIT: On additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.

Glimp answered 22/4, 2012 at 23:40 Comment(3)
Good notes on "fastest" I do mean fastest to fail. When speaking of regex, is it usually assumed that "fastest" means "fastest to find all matches"? As in, I need to find all the matches in subject as fast as possible? (I've been assuming that's true, but I would love clarification on the terminology)Contaminant
Also good notes on improving the efficiency of the algorithm I currently use for finding matches.Contaminant
Fastest to fail is usually what people want. Oh, and for most of these solutions, if you want to get absolute fastest, you should use non-grouping parentheses. I didn't put those in, since it would make it substantially more unreadable, and they're usually the last step to go in.Glimp
S
1

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:

$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match

$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
    // error checking here if necessary - $line and $match must be same length
    return (levenshtein($line, $match) <= (strlen($line) - $common));
});

var_dump($matchingLines);
Swiercz answered 22/4, 2012 at 23:42 Comment(2)
Again, looking for a regex. Although this is cool and clever, it's not what the question is asking for.Contaminant
And thank you for the edit! Was just about to fix that haha :) btw, don't mean to be brash, I really do appreciate the clever solution, it just doesn't answer my question.Contaminant
C
1

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).

So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:

"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"

This, of course, is the conclusion you're already at--so good so far.

The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.

function findMatchingLines($line, $subject) {
  $regex = "/(?<=^([AB])([CD])([EF])([GH])[.\n]+)"
      + "(\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)/m";
  $matchingLines = array();
  preg_match_all($regex, $line + "\n" + $subject, $matchingLines);
  return $matchingLines;
}

What this function does is pre-pend your input string with the line you want to match against, then uses a pattern that compares each line after the first line (that's the + after [.\n] working) back to the first line's 4 characters.

If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).

Clip answered 22/4, 2012 at 23:51 Comment(1)
This is great. I think this may be the answer, but I'm still reading through everything everyone else posted. The one minor aspect of my question that this does not address (and I'm curious if you would have input) is the issue of scalability. Lets say I need a function that will scale to match lines in a different file where each line is 16 characters long, and/or instead of matching just two of the characters, I need to match 5. Would you suggest using loops to simply expand/extend/create the regex based on a given line of length "n" and a need to match "m" number of characters?Contaminant
S
1

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:

/^[^ACFG]*+(?:[ACFG][^ACFG]*+){2}$/m

It looks for two occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.

Can be generated using:

function getRegexMatchingNCharactersOfLine($line, $num) {
    return "/^[^$line]*+(?:[$line][^$line]*+){$num}$/m";
}
Stanstance answered 23/4, 2012 at 17:47 Comment(1)
This is pretty damn clever actually. The answer I selected is going to remain the answer for previously stated reasons, but this is really helpful for me to be learning regex. Amazing how many different ways the same thing can be accomplished. Is this solution faster than my currently selected answer? (and if so, why?)Contaminant

© 2022 - 2024 — McMap. All rights reserved.