Can sed regex simulate lookbehind and lookahead?

Asked 15/2, 2013 at 1:24 Answered 15/2, 2013 at 14:59

Solved regex sed awk regex-negation regex-lookarounds

I'm trying to write a sed script that will capture all "naked" URLs in a text file and replace them with <a href="[URL]">[URL]</a>. By "naked" I mean a URL that is not wrapped inside an anchor tag.

My initial thought was that I should match URLs that do not have a " or a > in front of them, and also do not have a < or a " after them. However, I am having difficulty expressing the concept of "not preceded or followed by" because, as far as I know, sed does not have look-ahead or look-behind.

Sample Input:

[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]

Sample Desired Output:

[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

Observe that the third line is unmodified, because its URL is already inside an <a href> tag. The first, second, and fourth lines, on the other hand, are modified. Finally, observe that all non-URL text is unmodified.

Ultimately, I am trying to do something like:

sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013

I began by verifying that the following will correctly match and remove a URL:

sed 's/http:\/\/[^\s]\+//g'

I then tried this, but it cannot match URLs that start at the very beginning of the file or input:

sed 's/[^\>"]http:\/\/[^\s]\+//g'

Is there a way to work around this in sed, either by simulating lookbehind / lookahead, or explicitly matching beginning of file and end of file?

Gradeigh asked 15/2, 2013 at 1:24 Comment(4)

Why do you use [^\>"]? – Aphid 15/2, 2013 at 1:34

I'm looking for a URL that is not preceded by a quotation mark or a greater-than sign. – Gradeigh 15/2, 2013 at 1:35

Update your question to show some representative sample input and the expected output given that input - that's more important for us to see than what you have tried (though that's useful too). – Insult 15/2, 2013 at 2:49

@EdMorton, observe that the question has been updated with sample input and output. – Gradeigh 15/2, 2013 at 8:21

sed is an excellent tool for simple substitutions on a single line. For any other text manipulation problem, just use awk.

Check the definition I'm using in the BEGIN section below for a regexp that matches URLs. It works for your sample, but I don't know if it captures all possible URL formats. Even if it doesn't, though, it may be adequate for your needs.

$ cat file
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
$
$ awk -f tst.awk file
[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
$
$ cat tst.awk
BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
{
    head = ""
    tail = $0
    while ( match(tail,urlRe) ) {
       url  = substr(tail,RSTART,RLENGTH)
       href = "href=\"" url "\""

       if (index(tail,href) == (RSTART - 6) ) {
          # this url is inside href="url" so skip processing it and the next url match.
          count = 2
       }

       if (! (count && count--)) {
          url = "<a " href ">" url "</a>"
       }

       head = head substr(tail,1,RSTART-1) url
       tail = substr(tail,RSTART+RLENGTH)
    }

    print head tail
}

Insult answered 15/2, 2013 at 14:59 Comment(2)

In the url regex you use _ as a valid host name character, shouldn't it be -? – Swallow 5/2, 2015 at 12:24

As I say at the top of the answer: "Check the definition I'm using in the BEGIN section below for a regexp that matches URLs. It works for your sample but I don't know if it captures all possible URL formats." I'm no URL syntax expert. – Insult 5/2, 2015 at 13:39
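
To illustrate the point raised in the comments above, here is a small sketch comparing the answer's character class with one that also allows a hyphen. The helper name first_url and the sample host my-site.example.com are made up for this illustration, and neither pattern is claimed to cover all valid URLs.

```shell
# first_url: hypothetical helper that prints the first match of the
# regexp passed as $1 on each input line.
first_url() {
  awk -v re="$1" 'match($0, re) { print substr($0, RSTART, RLENGTH) }'
}

# Character class from the answer: the match stops at the hyphen.
echo 'see http://my-site.example.com now' | first_url 'http:[/][/][[:alnum:]._]+'
# prints http://my

# Same class with a trailing "-" added: the hyphenated host is matched whole.
echo 'see http://my-site.example.com now' | first_url 'http:[/][/][[:alnum:]._-]+'
# prints http://my-site.example.com
```

Placing the hyphen last inside the brackets keeps it a literal character rather than a range operator.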

The obvious problem with your command is that you did not escape the parentheses "(" and ")".

This is a quirk of sed's default regex syntax. Unlike in Perl regex, many symbols are literal by default, and you have to escape them to give them their special function. Try:

s/\([^>"]\?\)\(http:\/\/[^\s]\+\)/\1<a href="\2">\2<\/a>/g
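
A minimal sketch of the escaping rule described above, using toy input rather than the URLs from the question. The function names are made up for the illustration. Note also that \? and \+ in the command above are GNU extensions to basic regex, so they may not work in every sed.

```shell
# In a basic regex (sed's default), bare parentheses are literal characters.
literal_parens() { sed 's/(b)/X/'; }

# Escaped parentheses form a capture group that \1 refers back to.
bre_group() { sed 's/\(b\)/[\1]/'; }

# With -E (extended regex) the roles swap: bare parentheses group.
ere_group() { sed -E 's/(b)/[\1]/'; }

printf 'a(b)\n' | literal_parens   # prints aX
printf 'ab\n'   | bre_group        # prints a[b]
printf 'ab\n'   | ere_group        # prints a[b]
```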

Aphid answered 15/2, 2013 at 1:36 Comment(6)

As a clarification, I am trying to match URLs that do not have a " or a > in front of them. – Gradeigh 15/2, 2013 at 1:40

The given solution will not match http://google.com at the beginning of file or beginning of input. – Gradeigh 15/2, 2013 at 1:41

@Gradeigh I see what you mean. sed does not support look ahead/behind, I just edited. The question mark makes it optional – Aphid 15/2, 2013 at 1:43

About the weird \(, an option is to use sed -r so that ( doesn't need to be quoted. (I even have a rsed alias) – Yahweh 15/2, 2013 at 5:44

@texasbruce, when you make it optional, it will also match URLs inside <a href=, which is not the intent. – Gradeigh 15/2, 2013 at 8:17

Incidentally, you can use the -E flag to use "modern" regex. Then you don't need to escape brackets. – Lipetsk 15/9, 2016 at 17:40
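
Pulling the comments above together, one way to approximate a negative lookbehind in sed is to match the preceding character explicitly (or the start of the line) and put it back in the replacement. The sketch below is only that: the helper name linkify is made up, and the URL character class is borrowed from the awk answer with a hyphen added, so it handles bare host names but not paths or query strings. It also avoids [^\s], since inside a POSIX bracket expression a backslash is an ordinary character rather than the start of a whitespace class.

```shell
# linkify: hypothetical helper name, not from the thread.
# Group 1 is either the start of the line or one character that is
# neither " nor >, standing in for a negative lookbehind.
# Group 2 is the URL itself. # is used as the delimiter so that the
# slashes in http:// and </a> need no escaping.
linkify() {
  sed -E 's#(^|[^">])(http://[[:alnum:]._-]+)#\1<a href="\2">\2</a>#g'
}

# Run it on the sample input from the question.
printf '%s\n' \
  '[Beginning of File]http://foo.bar arbitrary text' \
  'http://test.com other text' \
  '<a href="http://foobar.com">http://foobar.com</a>' \
  'Nearing end of file!!! http://yahoo.com[End of File]' | linkify
```

Because the preceding character is consumed by the match, two naked URLs separated by a single character could interfere with each other under the g flag, and a trailing period after a URL would be swallowed into the link. For input like the sample, where that does not occur, the third line should pass through untouched.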
