Regular expression to match URLs in Java

Asked 2/10, 2008 at 16:40 Answered 17/2, 2022 at 6:50

100

I use RegexBuddy while working with regular expressions. From its library I copied the regular expression to match URLs. I tested successfully within RegexBuddy. However, when I copied it as Java String flavor and pasted it into Java code, it does not work. The following class prints false:

public class RegexFoo {

    public static void main(String[] args) {
        String regex = "\\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]";
        String text = "http://google.com";
        System.out.println(IsMatch(text,regex));
}

    private static boolean IsMatch(String s, String pattern) {
        try {
            Pattern patt = Pattern.compile(pattern);
            Matcher matcher = patt.matcher(s);
            return matcher.matches();
        } catch (RuntimeException e) {
        return false;
    }       
}   
}

Does anyone know what I am doing wrong?

Deaver answered 2/10, 2008 at 16:40 Comment(3)

Sergio, do not catch RuntimeException. It may introduce subtle bugs and is a bad practice overall. If you just want to ignore the scenario when the expression is illegal the use: } catch ( PatternSyntaxException pse ){} instead. See item 57 of: java.sun.com/docs/books/effective – Hunk 2/10, 2008 at 17:38

Or you could use Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE); to avoid changing the regex to match both uppercase and lowercase. – Riotous 2/10, 2008 at 17:57

I know that this is really old ('08), but for anyone having similar issues, RegexBuddy has the "Use" tab. Make sure you first select the Java 7 flavor, and then in the "Use" panel you can let it generate the Java code for your specific case. This worked nicely for me. – Dennard 1/6, 2016 at 21:34

116

Try the following regex string instead. Your test was probably done in a case-sensitive manner. I have added the lowercase alphas as well as a proper string beginning placeholder.

String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

This works too:

String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

Note:

String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com>

String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>

Equivocate answered 2/10, 2008 at 16:48 Comment(8)

Using your regular expression i get false too. – Deaver 2/10, 2008 at 16:58

Did you catch my last edit. I fat fingered the beginning of the string. I just copied it into Eclipse and I get "true". – Equivocate 2/10, 2008 at 17:12

thanks man, first time i see utility to the comments in stackoverflow – Deaver 2/10, 2008 at 17:15

No problem. If you're using Eclipse I like using the RegEx Tester plugin available here brosinski.com/regex – Equivocate 2/10, 2008 at 17:21

thansk for the link i am using eclipse – Deaver 2/10, 2008 at 17:25

I can not edit your answer. Maybe you could edit yourself and combine my last answer with yours – Deaver 2/10, 2008 at 17:26

Is this regular expression security-sensitive ? I got doubt because of OWASP page owasp.org/www-community/attacks/… – Spaniel 24/7, 2020 at 2:26

this fail if the URL has $ in some parameter – Mainstream 22/3, 2021 at 9:49

104

The best way to do it now is:

android.util.Patterns.WEB_URL.matcher(linkUrl).matches();

EDIT: Code of Patterns from https://github.com/android/platform_frameworks_base/blob/master/core/java/android/util/Patterns.java :

/*
 * Copyright (C) 2007 The Android Open Source Project
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package android.util;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Commonly used regular expression patterns.
 */
public class Patterns {
    /**
     *  Regular expression to match all IANA top-level domains.
     *  List accurate as of 2011/07/18.  List taken from:
     *  http://data.iana.org/TLD/tlds-alpha-by-domain.txt
     *  This pattern is auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py
     *
     *  @deprecated Due to the recent profileration of gTLDs, this API is
     *  expected to become out-of-date very quickly. Therefore it is now
     *  deprecated.
     */
    @Deprecated
    public static final String TOP_LEVEL_DOMAIN_STR =
        "((aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
        + "|(biz|b[abdefghijmnorstvwyz])"
        + "|(cat|com|coop|c[acdfghiklmnoruvxyz])"
        + "|d[ejkmoz]"
        + "|(edu|e[cegrstu])"
        + "|f[ijkmor]"
        + "|(gov|g[abdefghilmnpqrstuwy])"
        + "|h[kmnrtu]"
        + "|(info|int|i[delmnoqrst])"
        + "|(jobs|j[emop])"
        + "|k[eghimnprwyz]"
        + "|l[abcikrstuvy]"
        + "|(mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])"
        + "|(name|net|n[acefgilopruz])"
        + "|(org|om)"
        + "|(pro|p[aefghklmnrstwy])"
        + "|qa"
        + "|r[eosuw]"
        + "|s[abcdeghijklmnortuvyz]"
        + "|(tel|travel|t[cdfghjklmnoprtvwz])"
        + "|u[agksyz]"
        + "|v[aceginu]"
        + "|w[fs]"
        + "|(\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae|\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435|\u0440\u0444|\u0441\u0440\u0431|\u05d8\u05e2\u05e1\u05d8|\u0622\u0632\u0645\u0627\u06cc\u0634\u06cc|\u0625\u062e\u062a\u0628\u0627\u0631|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633|\u0633\u0648\u0631\u064a\u0629|\u0641\u0644\u0633\u0637\u064a\u0646|\u0642\u0637\u0631|\u0645\u0635\u0631|\u092a\u0930\u0940\u0915\u094d\u0937\u093e|\u092d\u093e\u0930\u0924|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd|\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e44\u0e17\u0e22|\u30c6\u30b9\u30c8|\u4e2d\u56fd|\u4e2d\u570b|\u53f0\u6e7e|\u53f0\u7063|\u65b0\u52a0\u5761|\u6d4b\u8bd5|\u6e2c\u8a66|\u9999\u6e2f|\ud14c\uc2a4\ud2b8|\ud55c\uad6d|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)"
        + "|y[et]"
        + "|z[amw])";

    /**
     *  Regular expression pattern to match all IANA top-level domains.
     *  @deprecated This API is deprecated. See {@link #TOP_LEVEL_DOMAIN_STR}.
     */
    @Deprecated
    public static final Pattern TOP_LEVEL_DOMAIN =
        Pattern.compile(TOP_LEVEL_DOMAIN_STR);

    /**
     *  Regular expression to match all IANA top-level domains for WEB_URL.
     *  List accurate as of 2011/07/18.  List taken from:
     *  http://data.iana.org/TLD/tlds-alpha-by-domain.txt
     *  This pattern is auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py
     *
     *  @deprecated This API is deprecated. See {@link #TOP_LEVEL_DOMAIN_STR}.
     */
    @Deprecated
    public static final String TOP_LEVEL_DOMAIN_STR_FOR_WEB_URL =
        "(?:"
        + "(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
        + "|(?:biz|b[abdefghijmnorstvwyz])"
        + "|(?:cat|com|coop|c[acdfghiklmnoruvxyz])"
        + "|d[ejkmoz]"
        + "|(?:edu|e[cegrstu])"
        + "|f[ijkmor]"
        + "|(?:gov|g[abdefghilmnpqrstuwy])"
        + "|h[kmnrtu]"
        + "|(?:info|int|i[delmnoqrst])"
        + "|(?:jobs|j[emop])"
        + "|k[eghimnprwyz]"
        + "|l[abcikrstuvy]"
        + "|(?:mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])"
        + "|(?:name|net|n[acefgilopruz])"
        + "|(?:org|om)"
        + "|(?:pro|p[aefghklmnrstwy])"
        + "|qa"
        + "|r[eosuw]"
        + "|s[abcdeghijklmnortuvyz]"
        + "|(?:tel|travel|t[cdfghjklmnoprtvwz])"
        + "|u[agksyz]"
        + "|v[aceginu]"
        + "|w[fs]"
        + "|(?:\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae|\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435|\u0440\u0444|\u0441\u0440\u0431|\u05d8\u05e2\u05e1\u05d8|\u0622\u0632\u0645\u0627\u06cc\u0634\u06cc|\u0625\u062e\u062a\u0628\u0627\u0631|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633|\u0633\u0648\u0631\u064a\u0629|\u0641\u0644\u0633\u0637\u064a\u0646|\u0642\u0637\u0631|\u0645\u0635\u0631|\u092a\u0930\u0940\u0915\u094d\u0937\u093e|\u092d\u093e\u0930\u0924|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd|\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e44\u0e17\u0e22|\u30c6\u30b9\u30c8|\u4e2d\u56fd|\u4e2d\u570b|\u53f0\u6e7e|\u53f0\u7063|\u65b0\u52a0\u5761|\u6d4b\u8bd5|\u6e2c\u8a66|\u9999\u6e2f|\ud14c\uc2a4\ud2b8|\ud55c\uad6d|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)"
        + "|y[et]"
        + "|z[amw]))";

    /**
     * Good characters for Internationalized Resource Identifiers (IRI).
     * This comprises most common used Unicode characters allowed in IRI
     * as detailed in RFC 3987.
     * Specifically, those two byte Unicode characters are not included.
     */
    public static final String GOOD_IRI_CHAR =
        "a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF";

    public static final Pattern IP_ADDRESS
        = Pattern.compile(
            "((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(25[0-5]|2[0-4]"
            + "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]"
            + "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
            + "|[1-9][0-9]|[0-9]))");

    /**
     * RFC 1035 Section 2.3.4 limits the labels to a maximum 63 octets.
     */
    private static final String IRI
        = "[" + GOOD_IRI_CHAR + "]([" + GOOD_IRI_CHAR + "\\-]{0,61}[" + GOOD_IRI_CHAR + "]){0,1}";

    private static final String GOOD_GTLD_CHAR =
        "a-zA-Z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF";
    private static final String GTLD = "[" + GOOD_GTLD_CHAR + "]{2,63}";
    private static final String HOST_NAME = "(" + IRI + "\\.)+" + GTLD;

    public static final Pattern DOMAIN_NAME
        = Pattern.compile("(" + HOST_NAME + "|" + IP_ADDRESS + ")");

    /**
     *  Regular expression pattern to match most part of RFC 3987
     *  Internationalized URLs, aka IRIs.  Commonly used Unicode characters are
     *  added.
     */
    public static final Pattern WEB_URL = Pattern.compile(
        "((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
        + "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
        + "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?"
        + "(?:" + DOMAIN_NAME + ")"
        + "(?:\\:\\d{1,5})?)" // plus option port number
        + "(\\/(?:(?:[" + GOOD_IRI_CHAR + "\\;\\/\\?\\:\\@\\&\\=\\#\\~"  // plus option query params
        + "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"
        + "(?:\\b|$)"); // and finally, a word boundary or end of
                        // input.  This is to stop foo.sure from
                        // matching as foo.su

    public static final Pattern EMAIL_ADDRESS
        = Pattern.compile(
            "[a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}" +
            "\\@" +
            "[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}" +
            "(" +
                "\\." +
                "[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25}" +
            ")+"
        );

    /**
     * This pattern is intended for searching for things that look like they
     * might be phone numbers in arbitrary text, not for validating whether
     * something is in fact a phone number.  It will miss many things that
     * are legitimate phone numbers.
     *
     * <p> The pattern matches the following:
     * <ul>
     * <li>Optionally, a + sign followed immediately by one or more digits. Spaces, dots, or dashes
     * may follow.
     * <li>Optionally, sets of digits in parentheses, separated by spaces, dots, or dashes.
     * <li>A string starting and ending with a digit, containing digits, spaces, dots, and/or dashes.
     * </ul>
     */
    public static final Pattern PHONE
        = Pattern.compile(                      // sdd = space, dot, or dash
                "(\\+[0-9]+[\\- \\.]*)?"        // +<digits><sdd>*
                + "(\\([0-9]+\\)[\\- \\.]*)?"   // (<digits>)<sdd>*
                + "([0-9][0-9\\- \\.]+[0-9])"); // <digit><digit|sdd>+<digit>

    /**
     *  Convenience method to take all of the non-null matching groups in a
     *  regex Matcher and return them as a concatenated string.
     *
     *  @param matcher      The Matcher object from which grouped text will
     *                      be extracted
     *
     *  @return             A String comprising all of the non-null matched
     *                      groups concatenated together
     */
    public static final String concatGroups(Matcher matcher) {
        StringBuilder b = new StringBuilder();
        final int numGroups = matcher.groupCount();

        for (int i = 1; i <= numGroups; i++) {
            String s = matcher.group(i);

            if (s != null) {
                b.append(s);
            }
        }

        return b.toString();
    }

    /**
     * Convenience method to return only the digits and plus signs
     * in the matching string.
     *
     * @param matcher      The Matcher object from which digits and plus will
     *                     be extracted
     *
     * @return             A String comprising all of the digits and plus in
     *                     the match
     */
    public static final String digitsAndPlusOnly(Matcher matcher) {
        StringBuilder buffer = new StringBuilder();
        String matchingRegion = matcher.group();

        for (int i = 0, size = matchingRegion.length(); i < size; i++) {
            char character = matchingRegion.charAt(i);

            if (character == '+' || Character.isDigit(character)) {
                buffer.append(character);
            }
        }
        return buffer.toString();
    }

    /**
     * Do not create this static utility class.
     */
    private Patterns() {}
}

Sarasvati answered 20/9, 2013 at 11:18 Comment(11)

+1 for you! Thank you so much!!! This is great code! Everybody is trying this with difficult regex while it could be this easy. Awesome! – Tomfool 7/1, 2014 at 16:4

@JPM Except that the OP was looking for a Java solution not an Android specific one (easy to forget to look at the specific tags for a Q). Still, a good thing for those who do code for Android to know about so I upped. – Octavie 9/9, 2014 at 12:9

@Octavie lol when working in android for over 4 years its easy to forget that there is still a whole Java world out there and that Android is a subset of it... – Chuchuah 9/9, 2014 at 14:31

@Octavie luckily, Android is open source and you can fetch the code from github.com/android/platform_frameworks_base/blob/master/core/… :) – Wadleigh 14/1, 2015 at 17:32

@Wadleigh I think your comment is the only thing making this answer acceptable/helpful, as the question is looking for java solution, not android – Eyespot 23/2, 2015 at 17:49

@mmcrae well, I'm glad you found my comment helpful! In fact, I realized that you are absolutely correct, and made an edit to make the post more informative, in case my link stops working or something. – Wadleigh 23/2, 2015 at 21:25

Correct, OP was looking for a Java solution in '08. It's super you found something that works in '14 but perhaps the Android solution should be moved to it's own question. – Equivocate 24/6, 2015 at 17:26

it doesn't check whether url starts from http or https. I tested with www.www and it gave me true. So not useful for me – Weslee 3/4, 2017 at 6:46

deprecated + android specific solution – Homocyclic 19/10, 2017 at 19:17

It returned true even when I left of the URL extension (.com, .edu, etc.) This makes this class useless to me too. I will stick to regular RegEx – Doodlesack 30/3, 2020 at 14:50

this seems to say the url matches if I do just google.com, it should fail if https:// isn't included. – Roughhew 16/6, 2022 at 20:59

I'll try a standard "Why are you doing it this way?" answer... Do you know about java.net.URL?

URL url = new URL(stringURL);

The above will throw a MalformedURLException if it can't parse the URL.

Proportional answered 2/10, 2008 at 16:53 Comment(8)

I have to go through the regular expressions road. What i post here is as simple as possible to make my question clear. In my program I am using the URL regex inside a more complex regex. – Deaver 2/10, 2008 at 17:0

That's cool. I didn't have a better answer regex-wise, so I thought I'd post an alternative. Didn't think I'd get down-ticked for it, though. – Proportional 2/10, 2008 at 17:9

you are right, maybe down-ticked was a bit two much. The "I'll try the standard" just sounded a bit offensive. – Deaver 2/10, 2008 at 17:12

cool (sorry, quick vacation). Ya, definitely wasn't intended that way. I just see that a lot here, and sometimes it even helps. – Proportional 6/10, 2008 at 2:48

"new URL" only throws MalformedURLException if the port < 0 or it doesn't understand the protocol. Other than that anything goes. It won't catch: 1.2.3. 1.2.3.4.5 1.2,3.4.5: etc – Admirable 9/6, 2011 at 15:12

Very funny, but if you're verifying a lot of string most of which are not URL, it will be throwing lots of exceptions which you probably don't desire. – Darkish 12/8, 2016 at 16:38

This isn't really an URL detector, but a URL validator. – Angry 10/6, 2021 at 2:22

This doesn't detect some malformations that might throw other things off. For example, an extra : in the scheme won't be picked up by this, but Selenium webdriver will spank you for it – Pestilential 30/6, 2022 at 9:28

The problem with all suggested approaches: all RegEx is validating

All RegEx -based code is over-engineered: it will find only valid URLs! As a sample, it will ignore anything starting with "http://" and having non-ASCII characters inside.

Even more: I have encountered 1-2-seconds processing times (single-threaded, dedicated) with Java RegEx package (filtering Email addresses from text) for very small and simple sentences, nothing specific; possibly bug in Java 6 RegEx...

Simplest/Fastest solution would be to use StringTokenizer to split text into tokens, to remove tokens starting with "http://" etc., and to concatenate tokens into text again.

If you want to filter Emails from text (because later on you will do NLP staff etc) - just remove all tokens containing "@" inside.

This is simple text where RegEx of Java 6 fails. Try it in divverent variants of Java. It takes about 1000 milliseconds per RegEx call, in a long running single threaded test application:

pattern = Pattern.compile("[A-Za-z0-9](([_\\.\\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\\.\\-]?[a-zA-Z0-9]+)*)\\.([A-Za-z]{2,})", Pattern.CASE_INSENSITIVE);

"Avalanna is such a sweet little girl! It would b heartbreaking if cancer won. She's so precious! #BeliebersPrayForAvalanna");
"@AndySamuels31 Hahahahahahahahahhaha lol, you don't look like a girl hahahahhaahaha, you are... sexy.";

Do not rely on regular expressions if you only need to filter words with "@", "http://", "ftp://", "mailto:"; it is huge engineering overhead.

If you really want to use RegEx with Java, try Automaton

Jacquline answered 20/1, 2013 at 15:23 Comment(2)

Lol. Automaton does not support capture groups. – Bram 13/5, 2014 at 8:58

I don't get your concern. The accepted answer's regex does in fact work well for validating URL's. You seem to deride it, saying it will find only valid URLs! -- that's the goal of OP's question. Am I missing something? – Eyespot 23/2, 2015 at 17:47

This works too:

String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

Note:

String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com>

String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>

So probably the first one is more useful for general use.

Deaver answered 2/10, 2008 at 17:24 Comment(0)

In line with billjamesdev answer, here is another approach to validate an URL without using a RegEx:

From Apache Commons Validator lib, look at class UrlValidator. Some example code:

Construct a UrlValidator with valid schemes of "http", and "https".

String[] schemes = {"http","https"}.
UrlValidator urlValidator = new UrlValidator(schemes);
if (urlValidator.isValid("ftp://foo.bar.com/")) {
   System.out.println("url is valid");
} else {
   System.out.println("url is invalid");
}

prints "url is invalid"

If instead the default constructor is used.

UrlValidator urlValidator = new UrlValidator();
if (urlValidator.isValid("ftp://foo.bar.com/")) {
   System.out.println("url is valid");
} else {
   System.out.println("url is invalid");
}

prints out "url is valid"

Barrelhouse answered 6/5, 2016 at 9:58 Comment(0)

((http?|https|ftp|file)://)?((W|w){3}.)?[a-zA-Z0-9]+\.[a-zA-Z]+

check here:- https://www.freeformatter.com/java-regex-tester.html#ad-output

It sorts out theses entries correctly

google.com
www.google.com
wwwgooglecom
ft.
Www.google.com
.ft
https://www.google.com
https://
https://www.
https://google.com

Barrens answered 12/6, 2019 at 10:36 Comment(3)

It fails for special characters – Geotropism 5/7, 2019 at 8:20

i needed regex for removing urls from song names. Couldnt find one or should i say most failed so i posted what worked for me and on those cases:) – Barrens 29/7, 2019 at 10:45

yeah it does. check it on the link provided. – Barrens 14/8, 2019 at 8:46

When using regular expressions from RegexBuddy's library, make sure to use the same matching modes in your own code as the regex from the library. If you generate a source code snippet on the Use tab, RegexBuddy will automatically set the correct matching options in the source code snippet. If you copy/paste the regex, you have to do that yourself.

In this case, as others pointed out, you missed the case insensitivity option.

Enthrone answered 9/11, 2008 at 15:11 Comment(0)

Here is a proposal of an URL parser regex that recognizes :

Protocol
Host
Port
Path (Document/folder)
Get parameters

^(?>(?<protocol>[[:alpha:]]+(?>\:[[:alpha:]]+)*)\:\/\/)?(?<host>(?>[[:alnum:]]|[-_.])+)(?>\:(?<port>[[:digit:]]+))?(?<path>\/(?>[[:alnum:]]|[-_.\/])*)?(?>\?(?<request>(?>[[:alnum:]]+=[[:alnum:]]+)(?>\&(?>[[:alnum:]]+=[[:alnum:]]+))*))?$

This regex is able to parse an URL such :

jdbc:hsqldb:hsql://localhost:91/index.

There can be many way to engineer a URL regex, thus the one I propose can be lightly adapted to match more accurate URL grammars.

It can be tested on the following page : https://regex101.com/r/Dy7HE0/5

Be aware that langages native API for regex (such as java.util.regex) don't support smart character classes such as [[:alnum:]] and [[:alpha:]].

Use instead \w and \d.

Alcaraz answered 10/2, 2021 at 11:35 Comment(0)

First, an regex example:

regex = “((http|https)://)(www.)?” 
+ “[a-zA-Z0-9@:%._\\+~#?&//=]{2,256}\\.[a-z]” 
+ “{2,6}\\b([-a-zA-Z0-9@:%._\\+~#?&//=]*)”

*The URL must start with either http or https and *then followed by :// and *then it must contain www. and *then followed by subdomain of length (2, 256) and *last part contains top level domain like .com, .org etc.

In JAVA

// Java program to check URL is valid or not
// using Regular Expression
 
import java.util.regex.*;
 
class GFG {
 
    // Function to validate URL
    // using regular expression
    public static boolean
    isValidURL(String url)
    {
        // Regex to check valid URL
        String regex = "((http|https)://)(www.)?"
              + "[a-zA-Z0-9@:%._\\+~#?&//=]"
              + "{2,256}\\.[a-z]"
              + "{2,6}\\b([-a-zA-Z0-9@:%"
              + "._\\+~#?&//=]*)";
 
        // Compile the ReGex
        Pattern p = Pattern.compile(regex);
 
        // If the string is empty
        // return false
        if (url == null) {
            return false;
        }
 
        // Find match between given string
        // and regular expression
        // using Pattern.matcher()
        Matcher m = p.matcher(url);
 
        // Return if the string
        // matched the ReGex
        return m.matches();
    }
 
    // Driver code
    public static void main(String args[])
    {
        String url
            = "https://www.superDev.org";
        if (isValidURL(url) == true) {
            System.out.println("Yes");
        }
        else
            System.out.println("NO");
    }
}

In python 3

# Python3 program to check
# URL is valid or not
# using regular expression
import re
 
# Function to validate URL
# using regular expression
def isValidURL(str):
 
    # Regex to check valid URL
    regex = ("((http|https)://)(www.)?" +
             "[a-zA-Z0-9@:%._\\+~#?&//=]" +
             "{2,256}\\.[a-z]" +
             "{2,6}\\b([-a-zA-Z0-9@:%" +
             "._\\+~#?&//=]*)")
     
    # Compile the ReGex
    p = re.compile(regex)
 
    # If the string is empty
    # return false
    if (str == None):
        return False
 
    # Return if the string
    # matched the ReGex
    if(re.search(p, str)):
        return True
    else:
        return False
 
# Driver code
 
# Test Case 1:
url = "https://www.superDev.org"
 
if(isValidURL(url) == True):
    print("Yes")
else:
    print("No")

Blackout answered 12/8, 2021 at 14:52 Comment(0)

This regular expression works for me:

String regex = "(https?://|www\\.)[-a-zA-Z0-9+&@#/%?=~_|!:.;]*[-a-zA-Z0-9+&@#/%=~_|]";

Heins answered 17/2, 2022 at 6:50 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

In JAVA

In python 3

Recommended topics

Hot tags