Replace/Remove characters that do not match the Regular Expression (.NET)
I have a regular expression to validate a string. But now I want to remove all the characters that do not match my regular expression.

E.g.

regExpression = @"^([\w\'\-\+])"

text = "This is a sample text with some invalid characters -+%&()=?";

//Remove characters that do not match regExp.

result = "This is a sample text with some invalid characters -+";

Any ideas on how I can use the regular expression to determine the valid characters and remove all the others?

Many thanks

Brownell answered 27/5, 2011 at 15:28 Comment(0)
T
23

I believe you can do this (whitelist characters and replace everything else) in one line:

var result = Regex.Replace(text, @"[^\w\s\-\+]", "");

For the sample text in the question, this produces "This is a sample text with some invalid characters -+", which matches your expected result. Note that \s is included in the character class so that the spaces between words are kept.
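
As a minimal sketch of this whitelist approach, the negated character class can be wrapped in a small helper. The class and method names are illustrative only, and the apostrophe is added to the allowed set on the assumption that it should be kept, since the question's pattern lists it.

```csharp
using System.Text.RegularExpressions;

public static class WhitelistExample
{
    // Hypothetical helper: removes every character that is not a word
    // character, whitespace, an apostrophe, '-' or '+'.
    public static string KeepAllowed(string text)
    {
        // [^...] negates the class, so each disallowed character is matched
        // individually and replaced with the empty string.
        return Regex.Replace(text, @"[^\w\s'\-+]", "");
    }
}
```

For example, WhitelistExample.KeepAllowed("a-b+c %&d") returns "a-b+c d". Widening or narrowing the allowed set only requires changing the characters inside the brackets.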

Trifling answered 27/5, 2011 at 15:40 Comment(2)
This will not work if the regex to match the text is more complicated. You can't negate every regular expression that easily. - Ally
True, but the poster has said he/she needs removal on a character-by-character basis, for which this should suffice. Further, if you need greater precision, consider: var result = Regex.Replace(text, @"[^\w]", m => "%&=?()".Contains(m.Value) ? "" : m.Value); You can replace my MatchEvaluator with any code to determine whether or not to keep a character. - Trifling
A
17

Simple as that:

var match = Regex.Match(text, regExpression);
string result = "";
if(match.Success)
    result = match.Value;

Removing the non-matched characters is the same as keeping the matched ones. That's what we are doing here.

If it is possible that the expression matches multiple times in your text, you can use this:

// Requires: using System.Linq; and using System.Text.RegularExpressions;
var result = Regex.Matches(text, regExpression).Cast<Match>()
                  .Aggregate("", (s, e) => s + e.Value, s => s);
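
To illustrate the "keep the matched parts" idea, here is a hedged sketch. It assumes an unanchored pattern such as [\w\s'\-+]+, which is not the pattern from the question: the question's pattern is anchored with ^ and matches a single character, so it yields only the first character of the text. KeepMatches is a hypothetical name.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class KeepMatchesExample
{
    // Hypothetical helper: concatenates every match of the pattern,
    // which drops everything the pattern does not match.
    public static string KeepMatches(string text, string pattern)
    {
        // Cast<Match>() keeps this compiling on older frameworks where
        // MatchCollection only implements the non-generic IEnumerable.
        return string.Concat(
            Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value));
    }
}
```

With the assumed pattern, KeepMatchesExample.KeepMatches("abc%&def -+()", @"[\w\s'\-+]+") returns "abcdef -+", whereas the question's anchored pattern applied to "This is" returns just "T".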
Ally answered 27/5, 2011 at 15:29 Comment(10)
Hi Daniel, I tried your solution, but as you mentioned, my regular expression will match more than once, because I need it to remove just the invalid characters and keep all the valid ones. I could not use the second piece of code: I get an error at Cast<Match>(). Am I supposed to replace that part with something else, or should I use your code as you typed it? Thanks - Brownell
(1) The regex you provided is not doing what you expect it to do. (2) What is the error you get? I actually tested that code and it works. - Ally
(1) Why is the regex wrong, or how should it be? I use the same regex for a similar method that just validates whether a string is valid, but this new method, instead of returning true if the string matches, removes/replaces the invalid characters. I guess I need two different regexes, as one will not work in both cases, right? (2) I forgot to add the using directive for System.Linq. - Brownell
The regex matches a single word character or one of the following characters: ' - + at the beginning of the string. - Ally
What's the difference (advantages/disadvantages) between your approach and @Trifling's approach? Anything I should take into consideration? - Brownell
I already wrote about the problems with emfurry's approach in a comment on his answer. - Ally
This one line is beautiful. - Pahl
What does .Cast<Match>() buy you? Don't you already have a MatchCollection? - Chancellor
@ruffin: MatchCollection only implements IEnumerable but not IEnumerable<Match>, so you can't use it directly in a LINQ expression. - Ally
When I F12 in, it looks like MatchCollection may have been updated to implement the latter, at least in .NET Core 2.0. So perhaps no longer necessary, depending on your target platform. public class MatchCollection : ICollection, IEnumerable, ICollection<Match>, IEnumerable<Match>, IList<Match>... - Chancellor
Y
3

Thanks to the "Replace chars if not match" answer, I've created a helper method that strips unaccepted characters.

The allowed pattern should be in Regex format and is expected to be wrapped in square brackets. The function inserts a caret (^) after the opening square bracket. I anticipate that it will not work for every regex describing a valid character set, but it works for the relatively simple sets that we are using.

/// <summary>
/// Replaces characters that are not expected.
/// </summary>
/// <param name="text">The text.</param>
/// <param name="allowedPattern">The allowed characters in Regex character-class format, expected to be wrapped in square brackets.</param>
/// <param name="replacement">The replacement.</param>
/// <returns>The text with every character outside the allowed set replaced.</returns>
// See https://mcmap.net/q/345888/-replace-chars-if-not-match
// and https://mcmap.net/q/354899/-replace-remove-characters-that-do-not-match-the-regular-expression-net
public static string ReplaceNotExpectedCharacters(this string text, string allowedPattern, string replacement)
{
    allowedPattern = allowedPattern.StripBrackets("[", "]");
    // [^ ] at the start of a character class negates it: it matches characters not in the class.
    var result = Regex.Replace(text, "[^" + allowedPattern + "]", replacement);
    return result;
}

public static string RemoveNonAlphanumericCharacters(this string text)
{
    var result = text.ReplaceNotExpectedCharacters(NonAlphaNumericCharacters, "");
    return result;
}

// Despite its name, this constant holds the allowed (alphanumeric) set; everything else is removed.
public const string NonAlphaNumericCharacters = "[a-zA-Z0-9]";

There are a couple of functions from my StringHelper class http://geekswithblogs.net/mnf/archive/2006/07/13/84942.aspx , that are used here.

/// <summary>
/// StripBrackets checks that the string starts with sStart and ends with sEnd (case sensitive).
/// If yes, then removes sStart and sEnd.
/// Otherwise returns the full string unchanged.
/// See also MidBetween.
/// </summary>
public static string StripBrackets(this string str, string sStart, string sEnd)
{
    if (CheckBrackets(str, sStart, sEnd))
    {
        str = str.Substring(sStart.Length, (str.Length - sStart.Length) - sEnd.Length);
    }
    return str;
}

public static bool CheckBrackets(string str, string sStart, string sEnd)
{
    bool flag1 = (str != null) && (str.StartsWith(sStart) && str.EndsWith(sEnd));
    return flag1;
}
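
Concatenating a pattern into [^...] only works for simple character sets. One hedged alternative, for the case where the allowed set is a list of literal characters rather than a regex fragment, is to escape the characters that are special inside a character class before building the negated class. BuildNegatedClass and RemoveDisallowed are hypothetical names, and this sketch does not support ranges such as a-z (the hyphen is treated as a literal).

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class NegatedClassExample
{
    // Builds "[^...]" from literal characters, escaping the ones that can
    // have special meaning inside a character class: \ ] [ ^ -
    public static string BuildNegatedClass(string allowedChars)
    {
        var sb = new StringBuilder("[^");
        foreach (char c in allowedChars)
        {
            if (c == '\\' || c == ']' || c == '[' || c == '^' || c == '-')
                sb.Append('\\');
            sb.Append(c);
        }
        return sb.Append(']').ToString();
    }

    public static string RemoveDisallowed(string text, string allowedChars)
    {
        // With no allowed characters, everything is removed; this also
        // avoids building the invalid pattern "[^]".
        if (string.IsNullOrEmpty(allowedChars)) return string.Empty;
        return Regex.Replace(text, BuildNegatedClass(allowedChars), "");
    }
}
```

For instance, NegatedClassExample.BuildNegatedClass("abc-]^") yields the pattern [^abc\-\]\^], so characters such as ] and - can be whitelisted without breaking the class.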
Youngran answered 28/10, 2012 at 2:50 Comment(3)
It doesn't answer how to replace/remove characters not in matching groups. - Spohr
NOTE: The function StripBrackets is not provided. Also, @"[^" + allowedPattern + "]" will not work for arbitrary patterns; however, for simple cases this is a nice solution. - Rainstorm
@shelbypereira, StripBrackets was in a linked article; I've now added it to the answer. - Youngran
