regex to match word (url) only if it does not contain character

Asked 1/4, 2016 at 13:52 Answered 4/4, 2016 at 9:31

Solved regex url regex-negation regex-lookarounds

I'm using an API that sometimes truncates links inside the text that it returns and instead of "longtexthere https://fancy.link" I get "longtexthere https://fa…".

I'm trying to match the link only if it's complete, in other words only if it does not contain the "…" character.

So far I am able to get links by using the following regex:

((?:https?:)?\/\/\S+\/?)

but obviously it returns every link including broken ones.

I've tried to do something like this:

((?:https?:)?\/\/(?:(?!…)\S)+\/?)

That did stop at the "…" character, but it still returned the link, just without the character: for "https://fa…" it returned "https://fa", whereas I simply want it to ignore that broken link and move on.
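For reference, the behaviour can be reproduced with a short sketch. Python is used here purely for illustration (the thread's engine is a different one), and the sample string is the one from the question; the variable names are made up.

```python
import re

# Sample input from the question; "…" is the single character U+2026.
text = "longtexthere https://fa…"

# Second attempt from the question: each character of the link is only
# accepted if it is not "…", so the match simply stops in front of it.
first_attempt = re.compile(r"((?:https?:)?//(?:(?!…)\S)+/?)")

truncated = first_attempt.findall(text)
print(truncated)  # ['https://fa']
```

The per-character check only limits where the match ends; nothing in the pattern says that a link followed by "…" should be rejected as a whole.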

Been fighting this for hours and just can't get my head around it. :(

Thanks for any help in advance.

Kalli asked 1/4, 2016 at 13:52 Comment(9)

Does your regex engine allow possessive quantifiers? Try (?:https?:)?\/\/[^\s…]++(?!…)\/? – Insane 1/4, 2016 at 14:13

Note you can also remove the \/? at the end, as it will never match anything: \S+ already consumes any trailing slash. If your regex flavor is JavaScript or Python, try (?!\S+…)(?:https?:)?\/\/\S+ – Insane 1/4, 2016 at 14:21

If possessive quantifiers and lookbehind are supported by your regex flavor, you can also try (?:https?:)?\/\/\S++(?<!…) The possessive quantifier prevents backtracking if the lookbehind does not match. – Zorine 1/4, 2016 at 16:16

Wow @WiktorStribiżew that worked!!! You should have posted it as an answer as that's the only correct answer. regex101.com/r/wC7tO5/1 – Kalli 4/4, 2016 at 9:19

Oh, actually @bobblebubble yours is working too! regex101.com/r/zN7jS3/1 – Kalli 4/4, 2016 at 9:20

Thanks guys, you're amazing! :) – Kalli 4/4, 2016 at 9:20

But what is the regex flavor? Which pattern works for you? – Insane 4/4, 2016 at 9:26

@user45173 My solution is similar to Wiktor's first one, which I vote for. Also bear in mind that it is often essential to specify the regex flavor/tool you're working with. Otherwise, those who want to answer can only guess. – Zorine 4/4, 2016 at 10:55

I'm using PHP 5.4, not sure which flavor of regex it uses? – Kalli 5/4, 2016 at 9:11

You can use

(?:https?:)?\/\/[^\s…]++(?!…)\/?

See the regex demo. The possessive quantifier [^\s…]++ will match all non-whitespace and non-… characters without later backtracking and then check if the next character is not …. If it is, no match will be found.

As an alternative, if your regex engine does not allow possessive quantifiers, use a negative lookahead version:

(?!\S+…)(?:https?:)?\/\/\S+\/?

See another regex demo. The lookahead (?!\S+…) will fail the match if 1+ non-whitespace characters are followed by ….
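A minimal sketch of the lookahead version, assuming Python's re module as the engine (one without possessive quantifiers in older versions). The sample text, including the protocol-relative //cdn.example/x link, is invented for illustration, and the trailing \/? is dropped because \S+ already consumes a final slash.

```python
import re

# Made-up sample text; "…" is the single character U+2026.
text = "see https://fancy.link and https://fa… plus //cdn.example/x"

# Refuse to start a match at any position where the run of
# non-whitespace characters goes on to contain "…".
pattern = re.compile(r"(?!\S+…)(?:https?:)?//\S+")

links = pattern.findall(text)
print(links)  # ['https://fancy.link', '//cdn.example/x']
```

Because the lookahead is tested at every candidate start position, the truncated link cannot be picked up from its // part either.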

Chairborne answered 4/4, 2016 at 9:31 Comment(2)

Does exactly what I need! Thanks a lot. I'll also mention @bobblebubble's suggestion from above: (?:https?:)?\/\/\S++(?<!…) as it seems to be similar and works too! – Kalli 5/4, 2016 at 9:22

Yes, it is very similar, as it also uses a possessive quantifier to prevent backtracking into the character class. \S++ matches all non-whitespace characters up to a whitespace or the end of the string, and then the lookbehind checks that the previous character is not an ellipsis. If it is, the match fails. – Insane 5/4, 2016 at 9:25
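The effect of the possessive quantifier can be sketched by comparing it with a plain greedy \S+. This is an illustration in Python, not the thread's engine: because only recent Python versions accept \S++, the sketch swaps in the lookahead-plus-backreference idiom (?=(\S+))\1 to get the same no-backtracking behaviour. The sample text and names are made up.

```python
import re

# Made-up sample text; "…" is the single character U+2026.
text = "ok https://fancy.link broken https://fa… end"

# Plain greedy \S+ : when the lookbehind sees "…", the engine backtracks
# one character and the truncated link is still reported.
greedy = re.compile(r"(?:https?:)?//\S+(?<!…)")

# Emulated possessive \S++ : the lookahead captures the whole run once,
# \1 consumes exactly that run, and no shorter run is ever tried.
atomic = re.compile(r"(?:https?:)?//(?=(\S+))\1(?<!…)")

with_backtracking = greedy.findall(text)
complete_only = [m.group(0) for m in atomic.finditer(text)]

print(with_backtracking)  # ['https://fancy.link', 'https://fa']
print(complete_only)      # ['https://fancy.link']
```

finditer with group(0) is used for the second pattern because the helper capture group would otherwise make findall return only the part after the slashes.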

Try:

 ((?:https?:)?\/\/\S+[^ \.]{3}\/?)

It's the same as your original pattern; you just tell it that the last three characters should not be '.' (period) or ' ' (space).

UPDATE: Your second link worked.

And if you tweak your regex just slightly, it will do what you want:

 ((?:https?:)?\/\/\S+[^ …] \/?)

Yes, it looks just like what you had in there, except that I added a ' ' (space) after the part we do not want. This forces the regular expression to match up to and including the space, which it cannot do with a URL that has the '...' character. Without the space at the end it would match up to but not including the '...', which was why it was not doing what we wanted ;)

Bernadinebernadotte answered 1/4, 2016 at 14:5 Comment(4)

I've modified yours slightly (because it's a special character rather than three dots), although it didn't do the trick regex101.com/r/zJ7lM0/1 – Kalli 1/4, 2016 at 14:41

for some reason the url you have is blocked for me. :( – Bernadinebernadotte 1/4, 2016 at 15:20

Huh, you're the first person who couldn't open regex101.com . Maybe this link will work? regexr.com/3d53k – Kalli 4/4, 2016 at 9:12

@user45173 Sorry, I did not realize the '...' was a single Unicode character. I was able to make it work by adding a space to the pattern you had on the regexr.com site. See my update. – Bernadinebernadotte 4/4, 2016 at 15:11

You can try the following regex:

https?:\/\/\w+(?:\.\w+\/?)+(?!\.{3})(\s|$)

See demo https://regex101.com/r/bS6tT5/3

Crystallization answered 1/4, 2016 at 14:25 Comment(1)

Yes, it was skipping URLs ending with /. Try again. It should match 4; the rest are either not valid URLs or don't match because of the URLs you have set. – Crystallization 4/4, 2016 at 10:35

Please try:

https?:\/\/[^ ]*?…|(https?:\/\/[^ ]+\.[^ ]+)

Here is the demo.

Auroora answered 1/4, 2016 at 14:3 Comment(3)

Updated regex pattern. Please check it out. – Auroora 1/4, 2016 at 15:23

Sorry to bother again but could you look at this please? regex101.com/r/iB3tK6/1 – Kalli 4/4, 2016 at 9:10

@user45173: Nice catch:) How about the new version? – Auroora 4/4, 2016 at 13:35
