Python regex not to match http://

I am having trouble matching and replacing certain words only when they are not part of an http:// URL.

Present Regex:

 http://.*?\s+

This matches the URLs http://www.egg1.com and http://www.egg2.com

I need a regex that matches certain words only when they appear outside an http:// URL.

Example:

"This is a sample. http://www.egg1.com and http://egg2.com. This regex will only match 
 this egg1 and egg2 and not the others contained inside http:// "

 Match: egg1 egg2

 Replaced: replaced1 replaced2

Final Output :

 "This is a sample. http://www.egg1.com and http://egg2.com. This regex will only 
  match this replaced1 and replaced2 and not the others contained inside http:// "

QUESTION: I need to match certain patterns (egg1 and egg2 in the example) unless they are part of an http:// URL. egg1 and egg2 must not be matched when they appear inside a URL.

Air answered 28/7, 2011 at 13:31 Comment(6)
In other words: You want to match certain patterns (in your example egg1 and egg2) unless they are part of a URL?Meridethmeridian
The way you state the question, it doesn't really matter, whether the match comes from the URL or not. What is it you actually want to match?Figwort
@Ferdinand Yes, you are right. If egg1 and egg2 are inside an http:// URL, do not match them.Air
So, in “http://foo.co.uk” you want the co?Beedon
He wants to find "Google" in "Google can be found at http://www.google.com", skipping the "google" within the URL.Meridethmeridian
Text after http:// isn't "contained inside" it.Infamy
M
6

One solution I can think of is to form a combined pattern for HTTP-URLs and your pattern, then filter the matches accordingly:

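import re

t = "http://www.egg1.com http://egg2.com egg3 egg4"

# Group 1 consumes whole URLs; group 2 captures eggs outside them.
p = re.compile(r'(http://\S+)|(egg\d)')
for url, egg in p.findall(t):
    if egg:  # empty string when the match was a URL
        print(egg)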

prints:

egg3
egg4

UPDATE: To use this idiom with re.sub(), just supply a filter function:

p = re.compile(r'(http://\S+)|(egg(\d+))')

def repl(match):
    if match.group(2):  # an egg outside a URL
        return 'spam{0}'.format(match.group(3))
    return match.group(0)  # a URL: return it unchanged

print(p.sub(repl, t))

prints:

http://www.egg1.com http://egg2.com spam3 spam4
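
The same idiom can be applied to the sample text from the question. This is only a sketch: the REPLACEMENTS mapping, the replace_outside_urls helper, and the \b word boundaries are assumptions added for illustration and are not part of the answer above.

```python
import re

# Hypothetical mapping, taken from the question's example.
REPLACEMENTS = {"egg1": "replaced1", "egg2": "replaced2"}

# The first alternative consumes whole URLs; the second captures a target word.
PATTERN = re.compile(r"(http://\S+)|\b(egg1|egg2)\b")

def replace_outside_urls(text):
    def repl(match):
        word = match.group(2)
        if word:
            return REPLACEMENTS[word]
        return match.group(0)  # a URL: leave it unchanged
    return PATTERN.sub(repl, text)

sample = ("This is a sample. http://www.egg1.com and http://egg2.com. "
          "This regex will only match this egg1 and egg2.")
result = replace_outside_urls(sample)
print(result)
```

Because the URL alternative comes first and swallows everything up to the next whitespace, the engine never gets a chance to start a match on the egg1 or egg2 inside a URL.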
Meridethmeridian answered 28/7, 2011 at 13:47 Comment(3)
Since I will be using re.sub to replace the matched words, can I do the same with re.sub?Air
@Ferdinand Is it possible to use the filter function with a matched group (like spam1 spam2)? Updated in the question.Air
Yes, of course. You can do anything you like in the filter function, plus you can nest grouping parentheses in the regex. I updated my answer accordingly. I guess you can figure it out by yourself from here on?Meridethmeridian
L
2

This pattern matches http:// URLs without capturing them, so only egg1 ends up in group 1:

(?:http://.*?\s+)|(egg1)
Laurin answered 28/7, 2011 at 14:13 Comment(2)
You should mention that it would require the usage of group(1) to get the correct value, and that it will still return matches for the URLs, just with the value for group 1 being None.Omeromero
@Omeromero Yeah, you are right. I just made an assumption that he knows how regexps work in general :)Laurin
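
As the first comment points out, the URL matches still come back and have to be filtered out through group(1). A minimal sketch of that filtering, assuming made-up sample strings:

```python
import re

pattern = re.compile(r"(?:http://.*?\s+)|(egg1)")

def words_outside_urls(text):
    # URL matches leave group(1) as None, so they are skipped here.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

followed_by_space = words_outside_urls("http://www.egg1.com and egg1 here")
at_end_of_string = words_outside_urls("see http://www.egg1.com")

print(followed_by_space)
print(at_end_of_string)
```

The second call shows a caveat: the URL branch ends in \s+, so a URL at the very end of the string, with no whitespace after it, is not consumed, and the egg1 inside it is captured after all.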
S
1

You need to precede your pattern by a negative lookbehind assertion:

(?<!http://)egg[0-9]

In this regular expression, every time the regex engine finds text matching egg[0-9], it looks back to verify that the text immediately preceding it does not match http://. A negative lookbehind assertion starts with (?<! and ends with ). Whatever lies between these delimiters must not immediately precede the following pattern, and it is not included in the result.

How to use it in your case:

>>> import re
>>> regex = re.compile(r'(?<!http://)egg[0-9]')
>>> a = "Example: http://egg1.com egg2 http://egg3.com egg4foo"
>>> regex.findall(a)
['egg2', 'egg4']
Selfabnegation answered 28/7, 2011 at 13:42 Comment(1)
@Ferdinand yes, you are right. Apparently, the lookbehind assertion is not the right tool for this job :)Selfabnegation
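
A quick check suggests why the lookbehind falls short here: it only inspects the seven characters immediately before the match, so any URL with something between http:// and the word, such as the www. prefix in the question's example, slips through. The sample strings below are made up for illustration.

```python
import re

regex = re.compile(r"(?<!http://)egg[0-9]")

# The word directly follows "http://", so the lookbehind rejects it.
direct = regex.findall("http://egg1.com egg2")

# With a "www." prefix the text before egg1 is "://www.", not "http://",
# so egg1 inside the URL is matched as well.
prefixed = regex.findall("http://www.egg1.com egg2")

print(direct)
print(prefixed)
```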
C
-2

Extending brandizzi's answer, I would just change his regex to this:

(?<!http://[\w\._-]*)(egg1|egg2)
Cradling answered 28/7, 2011 at 13:58 Comment(3)
It's a trap: Look-behind requires fixed-width patterns, so your example won't compile.Meridethmeridian
have you even tried this? sre_constants.error: look-behind requires fixed-width pattern.Figwort
My bad, I tried it with a .Net regex utility, not a Python one.Cradling
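
The compile error mentioned in the comments can be reproduced with Python's standard re module. This is a small sketch; the variable names are made up for illustration.

```python
import re

variable_width = r"(?<!http://[\w\._-]*)(egg1|egg2)"

try:
    re.compile(variable_width)
    compiled = True
except re.error as exc:
    # The standard re module rejects lookbehinds whose width can vary.
    compiled = False
    print("re.error:", exc)

print(compiled)
```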
