POSIX Regular Expressions: Excluding a word in an expression?

I am trying to create a regular expression using POSIX (Extended) Regular Expressions that I can use in my C program code.

Specifically, I have come up with the expression below, but I want to exclude the word "http" from the matches. From some searching, POSIX does not seem to offer an obvious way to reject a specific string. The example below uses a "negative lookahead" (i.e., the (?!http:)), but I fear this may only be available in regex dialects other than POSIX. Is negative lookahead allowed? Is the logical NOT operator (i.e. !) allowed in POSIX?

Working regular expression example:

href|HREF|src[[:space:]]*=[[:space:]]*\"(?!http:)[^\"]+\"\[/]

If I cannot use negative-lookahead like in other dialects, what can I do to the above regular expression to filter out the specific word "http:"? Ideally, is there any way without inverse logic and ultimately creating a ridiculously long regular expression in the process? (the one I have above is quite long already, I'd rather it not look more confusing if possible)

[NOTE: I have consulted other related threads on Stack Overflow, but the most relevant ones only ask this question "generically", so the answers given were not necessarily POSIX-flavored. In another thread or two I've seen the above (?!insertWordToExcludeHere) negative lookahead, but I fear it's only for PHP.]

[NOTE 2: I will take any POSIX regular expression phrasing; any help would be appreciated. Does anyone have a suggestion for what a regular expression that filters out "http:" would look like, and how it could fit into my current regular expression in place of the (?!http:)?]

Scandium answered 13/3, 2013 at 5:5 Comment(1)
Actually, it is possible to match a reverse pattern if you need to check the absence of a certain word at a specific location in the string (start, end, after a specific part), but you can't match a string that does not contain some multicharacter substring anywhere inside a string. - Gonroff

According to Regular-Expressions.info, lookaheads and lookbehinds are not part of the POSIX flavour.

You may consider thinking in terms of lexing (tokenization) and parsing if your problem is too complex to be represented cleanly as a regex.
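One way to read that suggestion in C is to let the POSIX expression find the attribute and capture its value, then apply the exclusion in ordinary code. The following is a minimal sketch under that assumption, not a definitive implementation: the function name has_non_http_link is hypothetical, the pattern is a simplified form of the one in the question (double-quoted values only), and only the standard regcomp, regexec, and regfree calls from <regex.h> are used.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `line` contains an href/HREF/src
 * attribute whose double-quoted value does not start with "http:",
 * 0 if it does not, and -1 if the pattern fails to compile. */
int has_non_http_link(const char *line)
{
    /* Group 1 is the attribute name, group 2 is the quoted value. */
    static const char *pattern =
        "(href|HREF|src)[[:space:]]*=[[:space:]]*\"([^\"]+)\"";
    regex_t re;
    regmatch_t m[3];
    int found = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    const char *p = line;
    while (regexec(&re, p, 3, m, 0) == 0) {
        const char *value = p + m[2].rm_so;
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);

        /* This check stands in for the (?!http:) lookahead. */
        if (!(len >= 5 && strncmp(value, "http:", 5) == 0)) {
            found = 1;
            break;
        }
        p += m[0].rm_eo; /* continue scanning after this match */
    }

    regfree(&re);
    return found;
}
```

Calling has_non_http_link on each input line keeps the regular expression short, at the cost of moving the "not http:" rule out of the pattern and into C.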

Cageling answered 13/3, 2013 at 5:11 Comment(3)
Well, the regular expression I posted above is close to what I need, minus the exclusion of the string "http:". Do you have any suggestions for an expression in POSIX that would filter out "http:" and that I could put within my current regular expression? - Scandium
It's possible with long expressions like ([^h"][^"]+|h[^t"][^"]+|ht[^t"][^"]+|... but I wouldn't recommend it. I'd second Patashu's recommendation of thinking in terms of lexing, and specifically recommend you look for an existing library for parsing HTML. It will get other details right: attributes can have single quotes as well as double quotes, something that looks attribute-like may be part of the body text or a comment or a CDATA section, etc. - Health
While I completely agree with both of you, I'm going to have to just come up with a long-form regex for this and insert it where the negative lookahead currently is. What are your thoughts on my expression below? It's inspired by a forum post on SO where someone wanted to filter out "tree", but I modified it to filter out "http": ^([^h]|(h[^t])|(ht[^t])|(htt[^p]))*($|(h($|(t($|p$))))) - Scandium

You want to exclude a particular prefix.

This is possible to do but tricky. The method is to construct a deterministic or nondeterministic finite automaton (DFA or NFA) and then convert it to a regular expression. Algorithms for this are well known. JFLAP is a tool that can assist with the conversion.
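To make the automaton concrete, here is a small hand-written DFA in C for "the value does not start with http". It is an illustrative sketch only: the function name and state numbering are assumptions made for this example, and it is not output from JFLAP. States 0 to 3 count how many leading characters of "http" have been seen, one sink state means the full prefix was seen, and the other sink state means the value has already diverged from the prefix.

```c
#include <stddef.h>

/* Hypothetical DFA: returns 1 if `value` is non-empty and does not
 * begin with "http", otherwise 0. */
int value_avoids_http_prefix(const char *value)
{
    static const char prefix[] = "http";
    enum { SAW_HTTP = 4, DIVERGED = 5 };
    int state = 0; /* states 0..3: that many prefix characters matched */

    for (size_t i = 0; value[i] != '\0'; i++) {
        if (state < SAW_HTTP)
            state = (value[i] == prefix[state]) ? state + 1 : DIVERGED;
        /* SAW_HTTP and DIVERGED are sinks: no further transitions. */
    }

    /* Accepting states are 1, 2, 3 and DIVERGED. */
    return state != 0 && state != SAW_HTTP;
}
```

Reading a regular expression off the accepting states gives the shape used in the grep command in this answer: states 1, 2 and 3 correspond to the alternatives h, ht and htt, and the diverged state corresponds to ([^h]|h[^t]|ht[^t]|htt[^p]) followed by any remaining characters.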

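A working example using grep in extended POSIX mode is below. Note the added parentheses around the href|HREF|src alternatives; without them the alternation splits the whole expression, so href or HREF alone would match. Note also that it rejects any value beginning with "http", not only "http:".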

grep --extended-regexp '(href|HREF|src)[[:space:]]*=[[:space:]]*\"(h|ht|htt|([^h]|h[^t]|ht[^t]|htt[^p])[^"]*)\"' <<'EOInput'
href="foobar"
href="http://FAIL"
HREF="ok to have http later"
HREF="http but not first FAIL"
src="anything not starting with http"
src="httpshouldnotmatchFAIL"
FAIL
EOInput

Its output:

href="foobar"
HREF="ok to have http later"
src="anything not starting with http"
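Since the question is about C, the same expression can be sketched with the POSIX regcomp and regexec calls. This is a minimal example rather than a definitive implementation: the function name matches_non_http_attr is hypothetical, and the pattern is the one from the grep command above, with the quote written as a C string escape instead of a regex escape.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `line` matches, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int matches_non_http_attr(const char *line)
{
    /* Same expression as the grep example; \" here is C escaping,
     * so the regex itself contains a plain double quote. */
    static const char *pattern =
        "(href|HREF|src)[[:space:]]*=[[:space:]]*\""
        "(h|ht|htt|([^h]|h[^t]|ht[^t]|htt[^p])[^\"]*)\"";
    regex_t re;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In a real program the pattern would normally be compiled once and reused across lines; compiling inside the function just keeps the sketch short.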
Tandratandy answered 10/5 at 7:48 Comment(0)
