regex to match word (url) only if it does not contain character

I'm using an API that sometimes truncates links inside the text that it returns and instead of "longtexthere https://fancy.link" I get "longtexthere https://fa…".

I'm trying to match a link only if it's complete, in other words only if it does not contain the "…" character.

So far I am able to get links by using the following regex:

((?:https?:)?\/\/\S+\/?)

but obviously it returns every link including broken ones.

I've tried to do something like this:

((?:https?:)?\/\/(?:(?!…)\S)+\/?)

That started to ignore the "…" character, but it still returned the link, just without the character. In the case of "https://fa…" it returned "https://fa", whereas I simply want it to skip that broken link and move on.
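The difference between the two attempts can be reproduced with a short sketch. Python's re module is assumed here purely for illustration (the question itself is about PHP), and the sample text is made up.

```python
import re

# Made-up sample: one truncated link and one complete link.
text = "longtexthere https://fa… and then https://fancy.link here"

# First attempt: \S+ consumes the ellipsis as part of the link.
plain = re.compile(r"((?:https?:)?//\S+/?)")

# Second attempt: the tempered token stops before "…" but still matches
# everything in front of it, so the broken link is cut short, not skipped.
tempered = re.compile(r"((?:https?:)?//(?:(?!…)\S)+/?)")

print(plain.findall(text))     # ['https://fa…', 'https://fancy.link']
print(tempered.findall(text))  # ['https://fa', 'https://fancy.link']
```

Neither pattern rejects the truncated link as a whole, which is the behavior being asked for.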

Been fighting this for hours and just can't get my head around it. :(

Thanks for any help in advance.

Kalli answered 1/4, 2016 at 13:52 Comment(9)
Does your regex engine allow possessive quantifiers? Try (?:https?:)?\/\/[^\s…]++(?!…)\/? - Insane
Note you can also remove the \/? at the end, as it will never be matched. If your regex flavor is JavaScript or Python, try (?!\S+…)(?:https?:)?\/\/\S+ - Insane
If possessive quantifiers and lookbehind are supported by your regex flavor, you can also try (?:https?:)?\/\/\S++(?<!…) The possessive quantifier prevents backtracking if the lookbehind does not match. - Zorine
Wow @WiktorStribiżew that worked!!! You should have posted it as an answer as that's the only correct answer. regex101.com/r/wC7tO5/1 - Kalli
Oh, actually @bobblebubble yours is working too! regex101.com/r/zN7jS3/1 - Kalli
Thanks guys, you're amazing! :) - Kalli
But what is the regex flavor? Which pattern works for you? - Insane
@user45173 My solution is similar to Wiktor's first one, which I vote for. Also bear in mind that it is often essential to specify the regex flavor/tool you're working with. Otherwise it's just guesswork for those who want to answer. - Zorine
I'm using PHP 5.4, not sure which flavor of regex it uses? - Kalli
C
4

You can use

(?:https?:)?\/\/[^\s…]++(?!…)\/?

See the regex demo. The possessive quantifier in [^\s…]++ matches all characters that are neither whitespace nor "…", without allowing later backtracking, and the lookahead (?!…) then checks that the next character is not "…". If it is, no match is found.
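To see why the possessive quantifier matters, here is a rough sketch in Python (an assumption made for illustration; the asker is on PHP). Python's re module only accepts ++ from version 3.11 onward, so the sketch emulates the possessive run with the common lookahead-capture-plus-backreference idiom instead. The sample text and the helper whole_matches are made up.

```python
import re

text = "broken https://fa… complete https://fancy.link"

# Without possessiveness, the engine gives back one character so that the
# lookahead (?!…) succeeds, and a shortened link is returned.
backtracking = re.compile(r"(?:https?:)?//[^\s…]+(?!…)")

# Emulated possessive run: the lookahead captures the longest run and the
# backreference \1 consumes it; the engine does not backtrack into a lookahead.
possessive_like = re.compile(r"(?:https?:)?//(?=([^\s…]+))\1(?!…)")

def whole_matches(pattern, s):
    """Return full match texts (group 0), ignoring the helper capture group."""
    return [m.group(0) for m in pattern.finditer(s)]

print(whole_matches(backtracking, text))     # ['https://f', 'https://fancy.link']
print(whole_matches(possessive_like, text))  # ['https://fancy.link']
```

In an engine that supports possessive quantifiers directly, the pattern from the answer can be used as written.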

As an alternative, if your regex engine does not allow possessive quantifiers, use a negative lookahead version:

(?!\S+…)(?:https?:)?\/\/\S+\/?

See another regex demo. The lookahead (?!\S+…) will fail the match if 1+ non-whitespace characters are followed by "…".
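A minimal sketch of the lookahead version, assuming Python's re module for illustration; the sample text and the cdn.example host are made up.

```python
import re

# The leading (?!\S+…) rejects any start position whose run of
# non-whitespace characters contains "…" further along.
complete_link = re.compile(r"(?!\S+…)(?:https?:)?//\S+")

text = "broken https://fa… complete https://fancy.link relative //cdn.example/x"

print(complete_link.findall(text))  # ['https://fancy.link', '//cdn.example/x']
```

Because the check runs at every candidate start position, the truncated link cannot match from a later offset such as the bare // either.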

Chairborne answered 4/4, 2016 at 9:31 Comment(2)
Does exactly what I need! Thanks a lot. I'll also mention @bobblebubble's suggestion from above: (?:https?:)?\/\/\S++(?<!…) as it seems to be similar and works too! - Kalli
Yes, it is very similar, as it also uses a possessive quantifier to prevent backtracking into the matched run. \S++ matches all non-whitespace characters up to a whitespace or the end of the string, and the lookbehind then checks that the previous character is not an ellipsis. If it is, the match fails. - Insane
B
1

Try:

 ((?:https?:)?\/\/\S+[^ \.]{3}\/?)

It's the same as your original pattern; you just tell it that the last three characters should not be '.' (period) or ' ' (space).

UPDATE: Your second link worked.

and if you tweak your regex just slightly it will do what you want:

 ((?:https?:)?\/\/\S+[^ …] \/?)

Yes, it looks just like what you had, except that I added a ' ' (space) after the part we do not want. This forces the regular expression to match up to and including the space, which it cannot do for a URL that ends in the '...' character. Without the space at the end, it would match up to but not including the '...', which is why it was not doing what we wanted ;)

Bernadinebernadotte answered 1/4, 2016 at 14:5 Comment(4)
I've modified yours slightly (because it's a special character rather than three dots), although it didn't do the trick regex101.com/r/zJ7lM0/1 - Kalli
for some reason the url you have is blocked for me. :( - Bernadinebernadotte
Huh, you're the first person who couldn't open regex101.com . Maybe this link will work? regexr.com/3d53k - Kalli
@user45173 Sorry I did not realize the '...' was a single Unicode character. I was able to make it work by adding a space in the pattern you had on the regexr.com side. See my update. - Bernadinebernadotte
C
1

You can try the following regex:

https?:\/\/\w+(?:\.\w+\/?)+(?!\.{3})(\s|$)

See demo https://regex101.com/r/bS6tT5/3

Crystallization answered 1/4, 2016 at 14:25 Comment(1)
Yes, it was skipping URLs ending with /. Try again. It should match 4; the rest are either not valid URLs or don't match because of the URLs you have set. - Crystallization
A
0

Please try:

https?:\/\/[^ ]*?…|(https?:\/\/[^ ]+\.[^ ]+)

Here is the demo.

Auroora answered 1/4, 2016 at 14:3 Comment(3)
Updated regex pattern. Please check it out. - Auroora
Sorry to bother again but could you look at this please? regex101.com/r/iB3tK6/1 - Kalli
@user45173: Nice catch:) How about the new version? - Auroora
