Regex Pattern to Match, Excluding when... / Except between

6

118

--Edit-- The current answers have some useful ideas but I want something more complete that I can 100% understand and reuse; that's why I set a bounty. Also ideas that work everywhere are better for me than not standard syntax like \K

This question is about how I can match a pattern except some situations s1 s2 s3. I give a specific example to show my meaning but prefer a general answer I can 100% understand so I can reuse it in other situations.

Example

I want to match five digits using \b\d{5}\b but not in three situations s1 s2 s3:

s1: Not on a line that ends with a period like this sentence.

s2: Not anywhere inside parens.

s3: Not inside a block that starts with if( and ends with //endif

I know how to solve any one of s1 s2 s3 with a lookahead and lookbehind, especially in C# lookbehind or \K in PHP.

For instance

s1 (?m)(?!\d+.*?\.$)\d+

s3 with C# lookbehind (?<!if\(\D*(?=\d+.*?//endif))\b\d+\b

s3 with PHP \K (?:(?:if\(.*?//endif)\D*)*\K\d+

But the mix of conditions together makes my head explode. Even more bad news is that I may need to add other conditions s4 s5 at another time.

The good news is, I don't care if I process the files using most common languages like PHP, C#, Python or my neighbor's washing machine. :) I'm pretty much a beginner in Python & Java but interested to learn if it has a solution.

So I came here to see if someone can think of a flexible recipe.

Hints are okay: you don't need to give me full code. :)

Thank you.

Almshouse answered 11/5, 2014 at 5:12 Comment(10)
\K is no special php syntax. Please elaborate and clarify what you want to say. If you aim for telling us that you don't need a "complicated" solution you have to say what is complicated for you and why.Nephridium
@Nephridium You mean because ruby now using it and it started in perl?Almshouse
No, because it's PCRE that is not PHP (nor Ruby). Perl is different however PCRE aims to be Perl Regex compatible.Nephridium
Your s2 and s3 requirements appear to be contradictory. s2 implies that parentheses are always matched and may be nested, but s3 requires that the: "if(" open paren be closed, not with a ")", but rather with a: "//endif"? And if for s3 you really meant that the if clause should be closed with: "//endif)", then the s3 requirement is a subset of s2.Llama
@Nephridium Yes I know PCRE but to explain, question is about programming language... it says especially in C# lookbehind or \K in PHP... But C# lookbehind not just C# it's .NET so you can complain too I say C# not .NET :) And in reply I say Ruby not Onigurama that's bad too... Is there another language that use PCRE? Not talking about Notepad++ or server tools this is question about using feature in language I hope the explain and sorry if it looks wrongAlmshouse
no not at all. I thought you had a concrete problem you plan to solve just to get it done, it just wasn't clear to me you're looking for a programming solution to the problem.Nephridium
@Llama Interesting sorry let me explain. The s2 is something like (ab 123 cd) the s3 can be if(true) {hello 111 } //endif You can not guess this it is because of comment in code Anyway example not so important Thank You :)Almshouse
@Nephridium you are 100% right! for instance shell command on LAMP can answer question and use \K I changed from special php syntax to not standard syntax Thank You!Almshouse
A correct answer to this problem is highly dependent upon the target text you are matching against. e.g. If it is C code, then you'll need to fully parse the text to filter out "quoted strings" and /* multi-lined comments */ which may themselves contain the tokens you are searching for; For example: should any of the five-digit numbers be matched from the following string: '/* if( */ 12345 ") { 67890; }" '? You need to clearly define the type of file the regex will be applied to, because a solution for one type of file, (say, JavaScript), will likely fail for another (say, SQL).Llama
This is not rocket science. Anytime you have a mix of complex structures that you want to move the match position past, especially when lookbehinds are variable in length, you have to match the bad conditions. Otherwise, the match position will never advance; there is no other way. Basically, it's a simple (?:s1|s2|s3)|(\b\d{5}\b) with just a check for group 1 on each match. My question is why @Unihedro opened a new bounty on this?Andee
D
226

Hans, I'll take the bait and flesh out my earlier answer. You said you want "something more complete" so I hope you won't mind the long answer—just trying to please. Let's start with some background.

First off, this is an excellent question. There are often questions about matching certain patterns except in certain contexts (for instance, within a code block or inside parentheses). These questions often give rise to fairly awkward solutions. So your question about multiple contexts is a special challenge.

Surprise

Surprisingly, there is at least one efficient solution that is general, easy to implement and a pleasure to maintain. It works with all regex flavors that allow you to inspect capture groups in your code. And it happens to answer a number of common questions that may at first sound different from yours: "match everything except Donuts", "replace all but...", "match all words except those on my mom's black list", "ignore tags", "match temperature unless italicized"...

Sadly, the technique is not well known: I estimate that in twenty SO questions that could use it, only one has one answer that mentions it—which means maybe one in fifty or sixty answers. See my exchange with Kobi in the comments. The technique is described in some depth in this article which calls it (optimistically) the "best regex trick ever". Without going into as much detail, I'll try to give you a firm grasp of how the technique works. For more detail and code samples in various languages I encourage you to consult that resource.

A Better-Known Variation

There is a variation using syntax specific to Perl and PHP that accomplishes the same. You'll see it on SO in the hands of regex masters such as CasimiretHippolyte and HamZa. I'll tell you more about this below, but my focus here is on the general solution that works with all regex flavors (as long as you can inspect capture groups in your code).

Thanks for all the background, zx81... But what's the recipe?

Key Fact

The method returns the match in Group 1 capture. It does not care at all about the overall match.

In fact, the trick is to match the various contexts we don't want (chaining these contexts using the | OR / alternation) so as to "neutralize them". After matching all the unwanted contexts, the final part of the alternation matches what we do want and captures it to Group 1.

The general recipe is

Not_this_context|Not_this_either|StayAway|(WhatYouWant)

This will match Not_this_context, but in a sense that match goes into a garbage bin, because we won't look at the overall matches: we only look at Group 1 captures.

In your case, with your digits and your three contexts to ignore, we can do:

s1|s2|s3|(\b\d+\b)

Note that because we actually match s1, s2 and s3 instead of trying to avoid them with lookarounds, the individual expressions for s1, s2 and s3 can remain clear as day. (They are the subexpressions on each side of a | )

The whole expression can be written like this:

(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)

See this demo (but focus on the capture groups in the lower right pane.)

If you mentally try to split this regex at each | delimiter, it is actually only a series of four very simple expressions.

For flavors that support free-spacing, this reads particularly well.

(?mx)
      ### s1: Match line that ends with a period ###
^.*\.$  
|     ### OR s2: Match anything between parentheses ###
\([^\)]*\)  
|     ### OR s3: Match any if(...//endif block ###
if\(.*?//endif  
|     ### OR capture digits to Group 1 ###
(\b\d+\b)

This is exceptionally easy to read and maintain.
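For readers working in Python, here is a minimal sketch of the same free-spacing pattern. It assumes Python's standard re module, where re.VERBOSE enables free-spacing and comments start with a single #. The name DIGITS_OUTSIDE_CONTEXTS and the sample text are made up for illustration.

```python
import re

# Free-spacing version of the pattern; whitespace and # comments are ignored.
DIGITS_OUTSIDE_CONTEXTS = re.compile(r"""
    ^.*\.$            # s1: a line that ends with a period
  | \([^)]*\)         # s2: anything between parentheses
  | if\(.*?//endif    # s3: an if block closed by //endif
  | (\b\d+\b)         # what we want: digits, captured in Group 1
""", re.VERBOSE | re.MULTILINE)

text = "a 123 (456) b\nline 789 ends.\n42 if(9) //endif 7"

# With exactly one capture group, findall returns the Group 1 text for each
# match, and an empty string when Group 1 did not take part in the match.
hits = [g for g in DIGITS_OUTSIDE_CONTEXTS.findall(text) if g]
print(hits)
```

Filtering out the empty strings is the "garbage bin" step: the unwanted contexts did match, but they left Group 1 empty.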

Extending the regex

When you want to ignore more situations s4 and s5, you add them in more alternations on the left:

s4|s5|s1|s2|s3|(\b\d+\b)
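If the list of contexts is likely to grow, one option is to assemble the alternation in code. The following Python sketch is only an illustration: build_pattern, find_wanted, the sample text and the quoted-string context standing in for s4 are all hypothetical. It uses a named group instead of Group 1, which is my own choice so that a context containing its own parentheses cannot shift the group number.

```python
import re

def build_pattern(wanted, contexts):
    """Join the unwanted contexts and the wanted pattern into one alternation."""
    # Each context is wrapped in a non-capturing group so its own | stays local.
    alternatives = [f"(?:{c})" for c in contexts]
    # The wanted pattern goes last, in a named capture group.
    alternatives.append(f"(?P<wanted>{wanted})")
    return re.compile("|".join(alternatives), re.MULTILINE)

def find_wanted(pattern, text):
    """Return only the matches where the 'wanted' group took part."""
    return [m.group("wanted") for m in pattern.finditer(text)
            if m.group("wanted") is not None]

WANTED = r"\b\d{5}\b"
CONTEXTS = [r"^.*\.$", r"\([^)]*\)", r"if\(.*?//endif"]   # s1, s2, s3
QUOTED = r'"[^"]*"'                                       # hypothetical s4

SAMPLE = 'keep 10001 "skip 20002" (30003) 40004'

print(find_wanted(build_pattern(WANTED, CONTEXTS), SAMPLE))
print(find_wanted(build_pattern(WANTED, CONTEXTS + [QUOTED]), SAMPLE))
```

Adding a context is then a one-line change to the list, and the regex itself never needs to be edited by hand.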

How does this work?

The contexts you don't want are added to a list of alternations on the left: they will match, but these overall matches are never examined, so matching them is a way to put them in a "garbage bin".

The content you do want, however, is captured to Group 1. You then have to check programmatically that Group 1 is set and not empty. This is a trivial programming task (and we'll later talk about how it's done), especially considering that it leaves you with a simple regex that you can understand at a glance and revise or extend as required.

I'm not always a fan of visualizations, but this one does a good job of showing how simple the method is. Each "line" corresponds to a potential match, but only the bottom line is captured into Group 1.

Regular expression visualization

Debuggex Demo

Perl/PCRE Variation

In contrast to the general solution above, there exists a variation for Perl and PCRE that is often seen on SO, at least in the hands of regex Gods such as @CasimiretHippolyte and @HamZa. It is:

(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant

In your case:

(?m)(?:^.*\.$|\([^()]*\)|if\(.*?//endif)(*SKIP)(*F)|\b\d+\b

This variation is a bit easier to use because the content matched in contexts s1, s2 and s3 is simply skipped, so you don't need to inspect Group 1 captures (notice the parentheses are gone). The matches only contain whatYouWant

Note that (*F), (*FAIL) and (?!) are all the same thing. If you wanted to be more obscure, you could use (*SKIP)(?!)

demo for this version

Applications

Here are some common problems that this technique can often easily solve. You'll notice that the word choice can make some of these problems sound different while in fact they are virtually identical.

  1. How can I match foo except anywhere in a tag like <a stuff...>...</a>?
  2. How can I match foo except in an <i> tag or a javascript snippet (more conditions)?
  3. How can I match all words that are not on this black list?
  4. How can I ignore anything inside a SUB... END SUB block?
  5. How can I match everything except... s1 s2 s3?

How to Program the Group 1 Captures

You didn't ask for code, but, for completeness... The code to inspect Group 1 will obviously depend on your language of choice. At any rate it shouldn't add more than a couple of lines to the code you would use to inspect matches.

If in doubt, I recommend you look at the code samples section of the article mentioned earlier, which presents code for quite a few languages.
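As one illustration, here is a minimal Python sketch of the Group 1 check, using the standard re module. The function name wanted_numbers and the sample lines are invented for the example.

```python
import re

# s1 | s2 | s3 | (what we want), exactly as in the one-line regex above.
PATTERN = re.compile(r"(?m)^.*\.$|\([^\)]*\)|if\(.*?//endif|(\b\d+\b)")

def wanted_numbers(text):
    """Keep a match only when Group 1 actually captured something."""
    return [m.group(1) for m in PATTERN.finditer(text)
            if m.group(1) is not None]

sample = "\n".join([
    "keep 11111 here",
    "skip 22222 on a line ending with a period.",
    "skip (33333) but keep 44444",
    "if(x) { 55555 } //endif then keep 66666",
])

print(wanted_numbers(sample))
```

In Python, m.group(1) is None whenever one of the unwanted contexts produced the match, so the whole check is a single condition.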

Alternatives

Depending on the complexity of the question, and on the regex engine used, there are several alternatives. Here are the two that can apply to most situations, including multiple conditions. In my view, neither is nearly as attractive as the s1|s2|s3|(whatYouWant) recipe, if only because clarity always wins out.

1. Replace then Match.

A good solution that sounds hacky but works well in many environments is to work in two steps. A first regex neutralizes the context you want to ignore by replacing potentially conflicting strings. If you only want to match, then you can replace with an empty string, then run your match in the second step. If you want to replace, you can first replace the strings to be ignored with something distinctive, for instance surrounding your digits with a fixed-width chain of @@@. After this replacement, you are free to replace what you really wanted, then you'll have to revert your distinctive @@@ strings.
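A rough Python sketch of the match-only case follows. The names CONTEXTS, NUMBER and match_after_neutralizing are hypothetical. Note one deliberate deviation: it replaces each context with a single space rather than an empty string, which is my assumption to keep digits on either side of a removed context from being glued together.

```python
import re

# Step 1 pattern: the three contexts to neutralize.
CONTEXTS = re.compile(r"(?m)^.*\.$|\([^)]*\)|if\(.*?//endif")
# Step 2 pattern: what we actually want.
NUMBER = re.compile(r"\b\d+\b")

def match_after_neutralizing(text):
    # A space (not "") so that "12(x)34" does not collapse into "1234".
    cleaned = CONTEXTS.sub(" ", text)
    return NUMBER.findall(cleaned)

text = "keep 11 (22) 33\nskip 44.\nif(55)//endif 66"
print(match_after_neutralizing(text))
```

The price of the two-step approach is that match positions refer to the cleaned text, not to the original.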

2. Lookarounds.

Your original post showed that you understand how to exclude a single condition using lookarounds. You said that C# is great for this, and you are right, but it is not the only option. The .NET regex flavors found in C#, VB.NET and Visual C++ for example, as well as the still-experimental regex module to replace re in Python, are the only two engines I know that support infinite-width lookbehind. With these tools, one condition in one lookbehind can take care of looking not only behind but also at the match and beyond the match, avoiding the need to coordinate with a lookahead. More conditions? More lookarounds.

Recycling the regex you had for s3 in C#, the whole pattern would look like this.

(?!.*\.)(?<!\([^()]*(?=\d+[^)]*\)))(?<!if\(\D*(?=\d+.*?//endif))\b\d+\b

But by now you know I'm not recommending this, right?

Deletions

@HamZa and @Jerry have suggested I mention an additional trick for cases when you seek to just delete WhatYouWant. You remember that the recipe to match WhatYouWant (capturing it into Group 1) was s1|s2|s3|(WhatYouWant), right? To delete all instances of WhatYouWant, you change the regex to

(s1|s2|s3)|WhatYouWant

For the replacement string, you use $1. What happens here is that for each instance of s1|s2|s3 that is matched, the replacement $1 replaces that instance with itself (referenced by $1). On the other hand, when WhatYouWant is matched, it is replaced by an empty group and nothing else — and therefore deleted. See this demo, thank you @HamZa and @Jerry for suggesting this wonderful addition.
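Here is how the deletion trick might look in Python, as a sketch only. The names DELETE and delete_numbers are made up. Instead of a $1 replacement string it uses a small replacement function, an assumption on my part to sidestep differences in how versions and flavors treat a group that did not participate in the match.

```python
import re

# (s1|s2|s3)|WhatYouWant: the contexts are captured, the digits are not.
DELETE = re.compile(r"(?m)(^.*\.$|\([^)]*\)|if\(.*?//endif)|\b\d+\b")

def delete_numbers(text):
    # A matched context is put back unchanged; a matched number leaves
    # Group 1 unset, so it is replaced with nothing.
    return DELETE.sub(lambda m: m.group(1) or "", text)

print(repr(delete_numbers("a 1 (2) b 3")))
```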

Replacements

This brings us to replacements, on which I'll touch briefly.

  1. When replacing with nothing, see the "Deletions" trick above.
  2. When replacing, if using Perl or PCRE, use the (*SKIP)(*F) variation mentioned above to match exactly what you want, and do a straight replacement.
  3. In other flavors, within the replacement function call, inspect the match using a callback or lambda, and replace if Group 1 is set. If you need help with this, the article already referenced will give you code in various languages.
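For the third case, a Python sketch could look like the following. It relies on re.sub accepting a function as the replacement; replace_wanted and the angle-bracket transform are hypothetical names chosen for the example.

```python
import re

PATTERN = re.compile(r"(?m)^.*\.$|\([^)]*\)|if\(.*?//endif|(\b\d+\b)")

def replace_wanted(text, transform):
    """Apply transform to Group 1 captures and leave the contexts untouched."""
    def repl(m):
        if m.group(1) is not None:
            return transform(m.group(1))   # what we want: replace it
        return m.group(0)                  # an unwanted context: put it back
    return PATTERN.sub(repl, text)

print(replace_wanted("x 5 (6) 7", lambda d: f"<{d}>"))
```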

Have fun!

No, wait, there's more!

Ah, nah, I'll save that for my memoirs in twenty volumes, to be released next Spring.

Deserve answered 11/5, 2014 at 5:16 Comment(15)
I let you know I set bounty on this question. In case you want to explain more as not 100% clear yet.Almshouse
If I may - the answer is long and detailed, which is great. The long buildup before the Key Fact title is a little too much - does it really add to the answer? The trick is pretty basic, like you said, and has been used in many occasions by many people (just a couple of my answers for example: 1, 2, and many other cases like skipping quotes texts or escaped characters). Again, great answer.Boldfaced
@Boldfaced Two-part reply. Yes, got carried away writing last night and wrote at the bottom that I'd sleep on it and tidy up later. :) Yes the trick is simple but I don't share your perception that it is "basic" because it doesn't seem to be part of the common tools people use to solve exclusion problems. When I googled for "except" or "unless" or "not inside" problems on SO, only one answer (with no votes) suggested it, none of the others did. I hadn't seen your answers, by the way, which are terrific. :)Deserve
@Boldfaced Part 2. For instance (searching again) 1,2, 3, 4, 5, 6, 7, 8, 9, 10 (there's going to have to be a Part 3)Deserve
@Boldfaced Part 3 11, 12, 13, 14, 15, 16, 17Deserve
This is kind of like DeMorgan's law in that "(I don't want this) and (I don't want that)" is the same as "I don't want (this or that)".Hollister
Learn something new every day (if you look for it). Amazing that I've been using regex (as a not infrequent tool, rather than a daily requirement) for 15 years and never run across this one stated quite so simply. The one in the linked article about using a regex to test for primes is as horribly clever as it is inefficient, but this one is great. I've used poorer variations of it to solve more strictly defined scenarios, graduating to the general case here is awesome. I'd vote you up twice if I could :)Carraway
Sorry, but Rex's "Best trick" simply does not work (reliably). Say you want to match Tarzan, but not when anywhere inside double quotes. The: /no|no|(yes)/ trick regex would be something like: /"[^"]*"|Tarzan/ (ignoring escaped chars). This will work for many cases, but fails completely when applied to the following valid JavaScript text: var bug1 = 'One " quote here. Should match this Tarzan'; var bug2 = "Should not match this Tarzan";. Rex's trick only works when ALL possible structures are matched - in other words - you need to fully parse the text to guarantee 100% accuracy.Llama
Note that a similar bug manifestation appears in the popular Syntax Highlighter script. I wrote an article about that here: Fixing the SyntaxHighlighter 3.0.83 Parser Bug if anyone is interested.Llama
@Llama To me, your example is saying that when input is unreliable, matching is unreliable (your example has an unbalanced quote)... but is that a surprise? It's like saying "Wait, this car doesn't behave like in the ad—when I dump a ton of nails on the road it doesn't steer properly." IMO the technique does exactly what it says it does. Edge cases are edge cases, and when we worry about them we try to think of ways to work around them. To my mind, the word bug is way out of place, as well as your phrase simply does not work (reliably). The bug is in the carrot, not the blender.Deserve
@Llama Completing this comment, is any regex ever sold with this promise: This will deal with any screwed up input? You know this, so to me your words sound needlessly harsh.Deserve
Sorry if I sounded harsh - that was certainly not my intent. My point (as in my second comment to the original question above) is that a correct solution is highly dependent upon the target text being searched. My example has JavaScript source code as the target text which has one double quote enclosed within a single quoted string. It could have just as easily been a literal RegExp such as: var bug1 = /"[^"]*"|(Tarzan)/gi; and had the same effect (and this second example is certainly not an edge case). There are many more examples I could cite where this technique fails to work reliably.Llama
@Llama I always enjoy hearing from you, it just sounds unjustifiably harsh to me. When we know that our strings can contain "false alerts", we all adjust our patterns. For instance, to match a string that may contain escaped quotes that might throw a string matcher off, you might use (?<!\\)"(?:\\"|[^"\r\n])*+" You don't pull the big guns unless you have a reason. The principle of the solution is still valid. If we're not able to express a pattern to put on the left side, that's a different story, we need a different solution. But the solution does what it advertises.Deserve
This answer has been added to the Stack Overflow Regular Expressions FAQ by user @funkwurm.Kanara
Because this is matching but not capturing undesirables, it only works when your regex is returning the captures. It does not work in the search field of a text editor.Eponymous
T
11

Do three different matches and handle the combination of the three situations using in-program conditional logic. You don't need to handle everything in one giant regex.

EDIT: let me expand a bit because the question just became more interesting :-)

The general idea you are trying to capture here is to match against a certain regex pattern, but not when there are certain other (could be any number) patterns present in the test string. Fortunately, you can take advantage of your programming language: keep the regexes simple and just use a compound conditional. A best practice would be to capture this idea in a reusable component, so let's create a class and a method that implement it:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class MatcherWithExceptions {
  private string m_searchStr;
  private Regex m_searchRegex;
  private IEnumerable<Regex> m_exceptionRegexes;

  public string SearchString {
    get { return m_searchStr; }
    set {
      m_searchStr = value;
      m_searchRegex = new Regex(value);
    }
  }

  public string[] ExceptionStrings {
    set { m_exceptionRegexes = from es in value select new Regex(es); }
  }

  public bool IsMatch(string testStr) {
    return (
      m_searchRegex.IsMatch(testStr)
      && !m_exceptionRegexes.Any(er => er.IsMatch(testStr))
    );
  }
}

public class App {
  public static void Main() {
    var mwe = new MatcherWithExceptions();

    // Set up the matcher object.
    mwe.SearchString = @"\b\d{5}\b";
    mwe.ExceptionStrings = new string[] {
      @"\.$"
    , @"\(.*" + mwe.SearchString + @".*\)"
    , @"if\(.*" + mwe.SearchString + @".*//endif"
    };

    var testStrs = new string[] {
      "1." // False
    , "11111." // False
    , "(11111)" // False
    , "if(11111//endif" // False
    , "if(11111" // True
    , "11111" // True
    };

    // Perform the tests.
    foreach (var ts in testStrs) {
      System.Console.WriteLine(mwe.IsMatch(ts));
    }
  }
}

So above, we set up the search string (the five digits), multiple exception strings (your s1, s2 and s3), and then try to match against several test strings. The printed results should be as shown in the comments next to each test string.

Tawny answered 11/5, 2014 at 5:20 Comment(2)
You mean maybe like match three regex in a row? Regex 1 eliminate situation 1 (maybe just delete bad digit), r2 remove s2, r3 remove s3 and matches digits left? That's interesting idea.Almshouse
Ha, sure, that's why I upvoted you. :) Don't get me wrong, I still think that in this particular case my answer is more efficient and maintainable. Have you seen the free-spacing version I added yesterday? That's one-pass and exceptionally easy to read and maintain. But I do like your work and your expanded answer. Sorry I can't upvote again, otherwise I would. :)Deserve
C
2

Your requirement that it's not inside parens is impossible to satisfy for all cases. Namely, if you can somehow find a ( to the left and ) to the right, it doesn't always mean you are inside parens. E.g.

(....) + 55555 + (.....) - not inside parens yet there are ( and ) to left and right

Now you might think yourself clever and look for ( to the left only if you don't encounter ) before and vice versa to the right. This won't work for this case:

((.....) + 55555 + (.....)) - inside parens even though there are closing ) and ( to left and to right.

It is impossible to find out if you are inside parens using regex, as regex can't count how many parens have been opened and how many closed.

Consider this easier task: using regex, find out if all (possibly nested) parens in a string are closed, that is, for every ( you need to find a ). You will find that it's impossible to solve, and if you can't solve that with regex then you can't figure out whether a word is inside parens for all cases, since you can't tell at a given position in the string whether all preceding ( have a corresponding ).
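If the counting is done in code rather than inside the regex, the nested case becomes manageable. The Python sketch below is only an illustration under that assumption: a simple regex finds parens and numbers, and a counter tracks the nesting depth. The names TOKEN and numbers_outside_parens are made up, and unbalanced closing parens are simply ignored.

```python
import re

# Tokens of interest: a single paren, or a whole number.
TOKEN = re.compile(r"[()]|\b\d+\b")

def numbers_outside_parens(text):
    depth = 0          # how many parens are currently open
    found = []
    for m in TOKEN.finditer(text):
        tok = m.group(0)
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth = max(depth - 1, 0)   # tolerate a stray closing paren
        elif depth == 0:
            found.append(tok)           # a number at nesting level zero
    return found

print(numbers_outside_parens("(1) + 55555 + (2)"))
print(numbers_outside_parens("((1) + 55555 + (2))"))
```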

Concertante answered 15/5, 2014 at 13:17 Comment(4)
Nobody said anything about nested parenthesis, and your case #1 is handled just fine by zx81's answer.Semela
Thank you for nice thoughts :) but nested parenthesis does not worry me for this question it's more about the idea of bad situations s1 s2 s3Almshouse
Of course it isn't impossible! This is exactly why you'd need to keep track of the level of parens in which you are currently parsing.Pastoralist
Well if you're parsing some kind of CFG like OP seems to be doing, you are better served by generating a LALR or similar parser which doesn't have problems with this.Concertante
C
2

Hans if you don't mind I used your neighbor's washing machine called perl :)

Edited: pseudocode below:

  set skip=false
  loop through input lines
      if line contains 'if(' set skip=true
      if skip is false
          if line matches '\b\d{5}\b' set s0=true
          if line does not match s1 condition set s1=true
          if line does not match s2 condition set s2=true
          if s0, s1 and s2 are all true print line
      if line contains '//endif' set skip=false

Given the file input.txt:

tiago@dell:~$ cat input.txt 
this is a text
it should match 12345
if(
it should not match 12345
//endif 
it should match 12345
it should not match 12345.
it should not match ( blabla 12345  blablabla )
it should not match ( 12345 )
it should match 12345

And the script validator.pl:

tiago@dell:~$ cat validator.pl 
#! /usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

sub validate_s0 {
    my $line = $_[0];
    if ( $line =~ /\d{5}/ ){
        return "true";
    }
    return "false";
}

sub validate_s1 {
    my $line = $_[0];
    if ( $line =~ /\.$/ ){
        return "false";
    }
    return "true";
}

sub validate_s2 {
    my $line = $_[0];
    if ( $line =~ /.*?\(.*\d{5}.*?\).*/ ){
        return "false";
    }
    return "true";
}

my $skip = "false";
while (<>){
    my $line = $_; 

    if( $line =~ /if\(/ ){
       $skip = "true";  
    }

    if ( $skip eq "false" ) {
        my $s0_status = validate_s0 "$line"; 
        my $s1_status = validate_s1 "$line";
        my $s2_status = validate_s2 "$line";

        if ( $s0_status eq "true"){
            if ( $s1_status eq "true"){
                if ( $s2_status eq "true"){
                    print "$line";
                }
            }
        }
    } 

    if ( $line =~ /\/\/endif/) {
        $skip="false";
    }
}

Execution:

tiago@dell:~$ cat input.txt | perl validator.pl 
it should match 12345
it should match 12345
it should match 12345
Castellano answered 16/5, 2014 at 0:32 Comment(0)
A
2

Not sure if this would help you or not, but I am providing a solution considering the following assumptions -

  1. You need an elegant solution to check all the conditions
  2. Conditions can change at any time in the future.
  3. One condition should not depend on others.

However I considered also the following -

  1. The file given has minimal errors in it. If it does, then my code might need some modifications to cope with that.
  2. I used Stack to keep track of if( blocks.

Ok here is the solution -

I used C# with MEF (the Managed Extensibility Framework) to implement configurable parsers. The idea is to use a single parser to read the file and a list of configurable validator classes that each check a line and return true or false. You can then add or remove validators at any time. So far I have implemented validators for the s1, s2 and s3 you mentioned (see the classes at point 3). You will have to add classes for s4 and s5 if you need them in the future.

  1. First, Create the Interfaces -

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace FileParserDemo.Contracts
    {
        public interface IParser
        {
            String[] GetMatchedLines(String filename);
        }
    
        public interface IPatternMatcher
        {
            Boolean IsMatched(String line, Stack<string> stack);
        }
    }
    
  2. Then comes the file reader and checker -

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using FileParserDemo.Contracts;
    using System.ComponentModel.Composition.Hosting;
    using System.ComponentModel.Composition;
    using System.IO;
    using System.Collections;
    
    namespace FileParserDemo.Parsers
    {
        public class Parser : IParser
        {
            [ImportMany]
            IEnumerable<Lazy<IPatternMatcher>> parsers;
            private CompositionContainer _container;
    
            public void ComposeParts()
            {
                var catalog = new AggregateCatalog();
                catalog.Catalogs.Add(new AssemblyCatalog(typeof(IParser).Assembly));
                _container = new CompositionContainer(catalog);
                try
                {
                    this._container.ComposeParts(this);
                }
                catch
                {
    
                }
            }
    
            public String[] GetMatchedLines(String filename)
            {
                var matched = new List<String>();
                var stack = new Stack<string>();
                using (StreamReader sr = File.OpenText(filename))
                {
                    String line = "";
                    while (!sr.EndOfStream)
                    {
                        line = sr.ReadLine();
                        var m = true;
                        foreach(var matcher in this.parsers){
                            m = m && matcher.Value.IsMatched(line, stack);
                        }
                        if (m)
                        {
                            matched.Add(line);
                        }
                     }
                }
                return matched.ToArray();
            }
        }
    }
    
  3. Then comes the implementation of individual checkers, the class names are self explanatory, so I don't think they need more descriptions.

    using FileParserDemo.Contracts;
    using System;
    using System.Collections.Generic;
    using System.ComponentModel.Composition;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    
    namespace FileParserDemo.PatternMatchers
    {
        [Export(typeof(IPatternMatcher))]
        public class MatchAllNumbers : IPatternMatcher
        {
            public Boolean IsMatched(String line, Stack<string> stack)
            {
                var regex = new Regex("\\d+");
                return regex.IsMatch(line);
            }
        }
    
        [Export(typeof(IPatternMatcher))]
        public class RemoveIfBlock : IPatternMatcher
        {
            public Boolean IsMatched(String line, Stack<string> stack)
            {
                var regex = new Regex("if\\(");
                if (regex.IsMatch(line))
                {
                    foreach (var m in regex.Matches(line))
                    {
                        //push the if
                        stack.Push(m.ToString());
                    }
                    //ignore current line, and will validate on next line with stack
                    return true;
                }
                regex = new Regex("//endif");
                if (regex.IsMatch(line))
                {
                    foreach (var m in regex.Matches(line))
                    {
                        stack.Pop();
                    }
                }
                return stack.Count == 0; //if stack has an item then ignoring this block
            }
        }
    
        [Export(typeof(IPatternMatcher))]
        public class RemoveWithEndPeriod : IPatternMatcher
        {
            public Boolean IsMatched(String line, Stack<string> stack)
            {
                var regex = new Regex("(?m)(?!\\d+.*?\\.$)\\d+");
                return regex.IsMatch(line);
            }
        }
    
    
        [Export(typeof(IPatternMatcher))]
        public class RemoveWithInParenthesis : IPatternMatcher
        {
            public Boolean IsMatched(String line, Stack<string> stack)
            {
                var regex = new Regex("\\(.*\\d+.*\\)");
                return !regex.IsMatch(line);
            }
        }
    }
    
  4. The program -

    using FileParserDemo.Contracts;
    using FileParserDemo.Parsers;
    using System;
    using System.Collections.Generic;
    using System.ComponentModel.Composition;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace FileParserDemo
    {
        class Program
        {
            static void Main(string[] args)
            {
                var parser = new Parser();
                parser.ComposeParts();
                var matches = parser.GetMatchedLines(Path.GetFullPath("test.txt"));
                foreach (var s in matches)
                {
                    Console.WriteLine(s);
                }
                Console.ReadLine();
            }
        }
    }
    

For testing I took @Tiago's sample file as Test.txt which had the following lines -

this is a text
it should match 12345
if(
it should not match 12345
//endif 
it should match 12345
it should not match 12345.
it should not match ( blabla 12345  blablabla )
it should not match ( 12345 )
it should match 12345

Gives the output -

it should match 12345
it should match 12345
it should match 12345

Don't know if this would help you or not, but I did have a fun time playing with it.... :)

The best part with it is that, for adding a new condition all you have to do is provide an implementation of IPatternMatcher, it will automatically get called and thus will validate.

Altaf answered 19/5, 2014 at 19:44 Comment(0)
C
2

Same as @zx81's (*SKIP)(*F) but using a negative lookahead assertion for s1.

(?m)(?:if\(.*?\/\/endif|\([^()]*\))(*SKIP)(*F)|\b\d+\b(?!.*\.$)

DEMO

In Python, I would simply do it like this:

import re
string = """cat 123 sat.
I like 000 not (456) though 111 is fine
222 if(  //endif if(cat==789 stuff  //endif   333"""
for line in string.split('\n'):                                  # Iterate over the input line by line.
    if not line.endswith('.'):                                   # s1: skip lines that end with a period.
        for i in re.split(r'\([^()]*\)|if\(.*?//endif', line):   # s2/s3: split on parenthesized parts and if(...//endif blocks, keeping only the text outside them.
            for j in re.findall(r'\b\d+\b', i):                  # Find every number in each remaining part.
                print(j)                                         # Print the numbers one by one.

Output:

000
111
222
333
Cynde answered 28/12, 2014 at 5:26 Comment(0)
