How can I remove text within parentheses with a regex?
P

9

109

I'm trying to handle a bunch of files, and I need to alter them to remove extraneous information in the filenames; notably, I'm trying to remove text inside parentheses. For example:

filename = "Example_file_(extra_descriptor).ext"

I want to apply a regex to a whole bunch of files where the parenthetical expression might be in the middle or at the end, and of variable length.

What would the regex look like? Perl or Python syntax would be preferred.

Ptah answered 12/3, 2009 at 18:56 Comment(3)
Are you sure that the "extra_descriptor" cannot include a ")"? If it can the problem becomes much harder...Horney
@dmckee: It is harder if the parens can be nested, though if you just want to get rid of everything between the first '(' and the last ')' it's not much harder: just use a greedy '.*' instead of '.*?'.Consideration
@Consideration You're correct, it's hell of a lot harder since nested parentheses can't be recognized with a FSM (you have to keep track of the nesting level which is unlimited) and therefore not by a regex. For it to be possible you have to restrict yourself to a limited level of nesting.Everrs
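To illustrate the last comment, here is a minimal Python sketch of a pattern that tolerates exactly one level of nesting. The name ONE_LEVEL and the sample strings are made up for illustration; anything nested deeper than one level is only partially removed.

```python
import re

# Matches "(" ... ")" where the contents may include parenthesized groups,
# but only one level deep (an assumption of this sketch).
ONE_LEVEL = re.compile(r'\((?:[^()]|\([^()]*\))*\)')

one_deep = ONE_LEVEL.sub('', 'filename_abc(text(TM)).ext')
print(one_deep)    # filename_abc.ext

# Three levels deep: the pattern cannot match the outermost pair,
# so a single pass leaves part of the text behind.
too_deep = ONE_LEVEL.sub('', 'a(b(c(d)))e')
print(too_deep)    # a(b)e
```

Each extra level of nesting needs another copy of the inner alternative, which is why the pattern only works up to a fixed depth.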
H
176
s/\([^)]*\)//

So in Python, you'd do:

re.sub(r'\([^)]*\)', '', filename)
Herisau answered 12/3, 2009 at 18:59 Comment(13)
is there any reason to prefer .*? over [^)]*Echinus
@Kip: nope. I don't know why, but .* is always the first thing that comes to mind.Troudeloup
@Kip: .*? is not handled by all regex parsers, whereas your [^)]* is handled by almost all of them.Calistacalisthenics
@Kip: Another reason is backtracking.Argentous
.* gets everything between the first left paren and last right paren: 'a(b)c(d)e' will become 'ae'. [^)]* only removes between the first left paren and the first right paren: 'ac(d)e'. You'll also get different behaviors for nested parens.Shimberg
Oops, I was wrong in the last comment. The '?' in the .* example makes it behave like the negated character class. But then, so was Gumbo, since there won't be any backtracking with the non-greedy .*? construct.Shimberg
@J.F.Sebastian The negated class [^)]* is much faster than the minimal quantifier .*?. The only exception is Perl, which optimizes on the symbol following the minimal quantifier (at least, this optimization was implemented only in Perl at the time the book I read this in was written).Coeternity
@J.F.Sebastian Oh. My comment shouldn't have been addressed to you. I confused something. I commented because I think it's a good piece of information to supplement the answer.Coeternity
@ovgolovin: ok. btw, the regexes are not equivalent in general unless the re.DOTALL flag is set. Also, have you tried to measure it in Python?Worktable
@J.F.Sebastian No, I didn't try to measure it in Python. I only remembered this maxim from the book. But the author had the Python regex engine in mind while writing it (he usually refers to the Python flavor in it).Coeternity
I was looking to do this inside Visual Studio. The regex for Visual Studio is \([^)]*\)Tameka
When there are multiple parentheses, it is better to use https://mcmap.net/q/195191/-how-can-i-remove-text-within-parentheses-with-a-regexGame
This answer will not work for filenames with nested brackets like this: "filename_abc(text(TM))" as the result will be "filename_abc)"Zymosis
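The differences discussed in these comments can be checked with a short Python sketch. The sample strings come from the comments above, except the multi-line one, which is made up here to show the re.DOTALL point.

```python
import re

s = 'a(b)c(d)e'
greedy = re.sub(r'\(.*\)', '', s)               # first "(" to last ")"
lazy = re.sub(r'\(.*?\)', '', s)                # each "(" to the nearest ")"
negated = re.sub(r'\([^)]*\)', '', s)           # same result as lazy here
first_only = re.sub(r'\([^)]*\)', '', s, count=1)
print(greedy, lazy, negated, first_only)        # ae ace ace ac(d)e

# Nested parentheses: lazy and negated both stop at the first ")".
nested = 'filename_abc(text(TM))'
nested_negated = re.sub(r'\([^)]*\)', '', nested)
nested_greedy = re.sub(r'\(.*\)', '', nested)
print(nested_negated, nested_greedy)            # filename_abc) filename_abc

# "." does not match a newline unless re.DOTALL is set; [^)] does.
multiline = 'a(b\nc)d'
lazy_plain = re.sub(r'\(.*?\)', '', multiline)
lazy_dotall = re.sub(r'\(.*?\)', '', multiline, flags=re.DOTALL)
negated_ml = re.sub(r'\([^)]*\)', '', multiline)
print(repr(lazy_plain), lazy_dotall, negated_ml)  # 'a(b\nc)d' ad ad
```

Note that re.sub replaces every match by default; count=1 reproduces the single-substitution behavior of s/...// without the g flag.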
F
133

The pattern that matches substrings in parentheses having no other ( and ) characters in between (like (xyz 123) in Text (abc(xyz 123)) is

\([^()]*\)

Details:

  • \( - an opening round bracket (note that in POSIX BRE, ( should be used, see sed example below)
  • [^()]* - zero or more (due to the * Kleene star quantifier) characters other than those defined in the negated character class/POSIX bracket expression, that is, any chars other than ( and )
  • \) - a closing round bracket (no escaping in POSIX BRE allowed)

Code snippets for removing the matches:

  • JavaScript: string.replace(/\([^()]*\)/g, '')
  • PHP: preg_replace('~\([^()]*\)~', '', $string)
  • Perl: $s =~ s/\([^()]*\)//g
  • Python: re.sub(r'\([^()]*\)', '', s)
  • C#: Regex.Replace(str, @"\([^()]*\)", string.Empty)
  • VB.NET: Regex.Replace(str, "\([^()]*\)", "")
  • Java: s.replaceAll("\\([^()]*\\)", "")
  • Ruby: s.gsub(/\([^()]*\)/, '')
  • R: gsub("\\([^()]*\\)", "", x)
  • Lua: string.gsub(s, "%([^()]*%)", "")
  • sed: sed 's/([^()]*)//g'
  • Tcl: regsub -all {\([^()]*\)} $s "" result
  • C++ std::regex: std::regex_replace(s, std::regex(R"(\([^()]*\))"), "")
  • Objective-C:
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\\([^()]*\\)" options:NSRegularExpressionCaseInsensitive error:&error];
    NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0 range:NSMakeRange(0, [string length]) withTemplate:@""];
  • Swift: s.replacingOccurrences(of: "\\([^()]*\\)", with: "", options: [.regularExpression])
  • Google BigQuery: REGEXP_REPLACE(col, "\\([^()]*\\)" , "")
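Because this pattern only ever matches an innermost pair, one way to deal with nested parentheses is to apply it repeatedly until nothing changes. A minimal Python sketch of that idea follows; the helper name strip_parens is hypothetical and not part of the answer above.

```python
import re

INNERMOST = re.compile(r'\([^()]*\)')

def strip_parens(text):
    """Remove innermost (...) groups repeatedly until the text stops changing."""
    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub('', text)
    return text

print(strip_parens('Example_file_(extra_descriptor).ext'))  # Example_file_.ext
print(repr(strip_parens('Text (abc(xyz 123))')))            # 'Text '
print(strip_parens('filename_abc(text(TM)'))                # filename_abc(text
```

Unbalanced parentheses are left in place, as the last call shows, since they never form a complete innermost pair.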
Ferriferous answered 15/11, 2016 at 23:7 Comment(2)
Dear Wiktor I only have one question. If we were to exclude bracket [ instead of parentheses, do we have to escape them within [^ ] structure like [^\\[\\]] or it was not necessary as other characters?Backandforth
@AnoushiravanR It depends on the regex flavor. See this answer of mine.Leaky
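For Python's re module specifically, escaping both square brackets inside the negated class works. This is a small sketch for that one flavor only; other flavors may have different rules, as the reply says. The name SQUARE and the sample filename are made up.

```python
import re

# Both brackets are escaped inside the character class; this is accepted
# by Python's re module (other regex flavors may differ).
SQUARE = re.compile(r'\[[^\[\]]*\]')

cleaned = SQUARE.sub('', 'Example_file_[extra_descriptor].ext')
print(cleaned)  # Example_file_.ext
```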
A
24

I would use:

\([^)]*\)
Argentous answered 12/3, 2009 at 19:8 Comment(1)
This answer will not work for filenames with nested brackets like this: "filename_abc(text(TM))" as the result will be "filename_abc)"Zymosis
S
7

If you don't absolutely need to use a regex, consider using Perl's Text::Balanced to remove the parentheses.

use Text::Balanced qw(extract_bracketed);

my ($extracted, $remainder, $prefix) = extract_bracketed( $filename, '()', '[^(]*' );

{   no warnings 'uninitialized';

    $filename = (defined $prefix or defined $remainder)
                ? $prefix . $remainder
                : $extracted;
}

You may be thinking, "Why do all this when a regex does the trick in one line?"

$filename =~ s/\([^)]*\)//;

Text::Balanced handles nested parentheses. So $filename = 'foo_(bar(baz)buz)).foo' will be extracted properly. The regex-based solutions offered here will fail on this string: one will stop at the first closing paren, and the other will eat them all.

   $filename =~ s/\([^)]*\)//;
   # returns 'foo_buz)).foo'

   $filename =~ s/\(.*\)//;
   # returns 'foo_.foo'

   # Text::Balanced example returns 'foo_).foo'

If either of the regex behaviors is acceptable, use a regex--but document the limitations and the assumptions being made.

Shimberg answered 12/3, 2009 at 22:55 Comment(2)
While I know you can't parse nested parentheses with (classic) regexes, if you know you're never going to encounter nested parentheses, you can simplify the problem to one that CAN be done with regexes, and fairly easily. It's overkill to use a parser tool when we don't need it.Quickie
@Chris Lutz - I should have said "consider" rather than "use" in the first sentence. In many cases a regex will do the job, which is why I said to use a regex if the behavior is acceptable.Shimberg
W
3

If a path may contain parentheses then the r'\(.*?\)' regex is not enough:

import os, re

def remove_parenthesized_chunks(path, safeext=True, safedir=True):
    dirpath, basename = os.path.split(path) if safedir else ('', path)
    name, ext = os.path.splitext(basename) if safeext else (basename, '')
    name = re.sub(r'\(.*?\)', '', name)
    return os.path.join(dirpath, name+ext)

By default, the function preserves parenthesized chunks in the directory and extension parts of the path.

Example:

>>> f = remove_parenthesized_chunks
>>> f("Example_file_(extra_descriptor).ext")
'Example_file_.ext'
>>> path = r"c:\dir_(important)\example(extra).ext(untouchable)"
>>> f(path)
'c:\\dir_(important)\\example.ext(untouchable)'
>>> f(path, safeext=False)
'c:\\dir_(important)\\example.ext'
>>> f(path, safedir=False)
'c:\\dir_\\example.ext(untouchable)'
>>> f(path, False, False)
'c:\\dir_\\example.ext'
>>> f(r"c:\(extra)\example(extra).ext", safedir=False)
'c:\\\\example.ext'
Worktable answered 12/3, 2009 at 20:3 Comment(0)
W
3

For those who want to use Python, here's a simple routine that removes parenthesized substrings, including those with nested parentheses. Okay, it's not a regex, but it'll do the job!

def remove_nested_parens(input_str):
    """Returns a copy of 'input_str' with any parenthesized text removed. Nested parentheses are handled."""
    result = ''
    paren_level = 0
    for ch in input_str:
        if ch == '(':
            paren_level += 1
        elif (ch == ')') and paren_level:
            paren_level -= 1
        elif not paren_level:
            result += ch
    return result

remove_nested_parens('example_(extra(qualifier)_text)_test(more_parens).ext')
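To see what the routine does on a few inputs, here is a self-contained sketch. The function is repeated so the snippet runs on its own; collecting characters in a list and joining them at the end is a stylistic choice made here, not a change in behavior. The extra sample strings are borrowed from elsewhere in this thread or made up.

```python
def remove_nested_parens(input_str):
    """Return a copy of input_str with parenthesized text removed (nesting handled)."""
    kept = []
    depth = 0
    for ch in input_str:
        if ch == '(':
            depth += 1
        elif ch == ')' and depth:
            depth -= 1
        elif not depth:
            kept.append(ch)  # also keeps a stray ")" seen at depth 0
    return ''.join(kept)

print(remove_nested_parens('example_(extra(qualifier)_text)_test(more_parens).ext'))
# example__test.ext
print(remove_nested_parens('foo_(bar(baz)buz)).foo'))  # foo_).foo
print(remove_nested_parens('abc(def'))                 # abc
```

An unmatched closing parenthesis is kept, while an unmatched opening parenthesis swallows everything after it.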
Wes answered 14/12, 2017 at 22:30 Comment(0)
T
1

If you can stand to use sed (possibly executed from within your program), it'd be as simple as:

sed 's/(.*)//g'
Taranto answered 12/3, 2009 at 19:3 Comment(2)
You are just grouping the expression .*.Argentous
@Gumbo: No, he's not. In sed, "\(...\)" groups.Hugely
M
0
>>> import re
>>> filename = "Example_file_(extra_descriptor).ext"
>>> p = re.compile(r'\([^)]*\)')
>>> re.sub(p, '', filename)
'Example_file_.ext'
Muirhead answered 12/3, 2009 at 21:48 Comment(0)
W
0

Java code:

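// Requires java.util.regex.Pattern and java.util.regex.Matcher.
Pattern pattern1 = Pattern.compile("(\\_\\(.*?\\))");
Matcher matcher1 = pattern1.matcher(fileName);
if (matcher1.find()) System.out.println(fileName.replace(matcher1.group(1), ""));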
Windy answered 3/8, 2012 at 9:30 Comment(0)
