Is regular expression recognition of an email address hard?
C

19

62

I recently read somewhere that writing a regexp to match an email address, taking into account all the variations and possibilities of the standard, is extremely hard and significantly more complicated than one would initially assume.

Why is that?

Are there any known and proven regexps that actually do this fully?

What are some good alternatives to using regexps for matching email addresses?

Catanzaro answered 1/10, 2008 at 6:22 Comment(7)
Something interesting about Email regular expression codinghorror.com/blog/archives/000214.htmlVerger
If you're just interested in matching common email patterns, you can have a look at some of the expressions here.Aliment
I think what you read pertains not to "validating an e-mail address according to the standard", but rather "validating an actual e-mail address". The difference is not subtle, even if the wording is. Currently, the answers below are a mix of the two. Perhaps you would clarify the question?Jahvist
I wrote up a blog post on this some time ago -- It's here: How to Validate an Email Address Using Regular Expressions it points out some of the challenges in catching all the different edge cases.Quade
possible duplicate of What is the best regular expression for validating email addresses?Bullis
It is a common idiocy to parse complex text with a SINGLE regexp. But it is easy to parse complex text (such as C source code) with a SET of regexps, e.g. using lex and yacc. This method also does support recursion. Blame Larry. :)Lhasa
Archive link to How to Validate an Email Address Using Regular ExpressionsMccurry
B
65

For the formal e-mail spec, yes, it is technically impossible via Regex due to the recursion of things like comments (especially if you don't remove comments to whitespace first), and the various different formats (an e-mail address isn't always [email protected]). You can get close (with some massive and incomprehensible Regex patterns), but a far better way of checking an e-mail is to do the very familiar handshake:

  • they tell you their e-mail
  • you e-mail them a confirmation link with a GUID
  • when they click on the link you know that:

    1. the e-mail is correct
    2. it exists
    3. they own it

Far better than blindly accepting an e-mail address.
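
The handshake above can be sketched roughly as follows. This is only an illustration: it assumes an in-memory dictionary in place of a real database, leaves out the actual sending of mail, and uses a random token from Python's `secrets` module where the answer mentions a GUID. The names `start_confirmation`, `confirm`, `pending` and `confirmed` are hypothetical.

```python
import secrets

# Hypothetical in-memory stores; a real site would use a database with expiry.
pending = {}       # token -> address awaiting confirmation
confirmed = set()  # addresses whose owner clicked the link


def start_confirmation(address):
    """Create an unguessable token; the caller e-mails a link containing it."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return token


def confirm(token):
    """Handle a click on the link. Returns the confirmed address, or None."""
    address = pending.pop(token, None)  # each token works only once
    if address is not None:
        confirmed.add(address)
    return address
```

Note that nothing here inspects the syntax of the address at all: the mail system either delivers the link or it does not.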

Baptistery answered 1/10, 2008 at 6:26 Comment(6)
Good advice, if you're writing a website, doesn't work so well if you're writing an email server / client :-)Kataway
If you're writing an email client or server, then you shouldn't be fake-parsing the only thing you have to parse (pretty much).Angelicaangelico
How do you email them a confirmation without blindly accepting their email address?Royce
@janm: the email server does the validation for you: If the message was delivered (and the link within clicked) the address was valid.Bijouterie
If you have a trustworthy email server and you can get the email address to it reliably, great. (eg. qmail, postfix with Unix style exec(2)). If not, some care must still be taken, like with any data from an untrusted source.Royce
@Johan: replace "click on the link" with "reply to email"Naturopathy
B
22

There are a number of Perl modules (for example) that do this. Don't try and write your own regexp to do it. Look at

Mail::VRFY will do syntax and network checks (does an SMTP server somewhere accept this address)

https://metacpan.org/pod/Mail::VRFY

RFC::RFC822::Address - a recursive descent email address parser.

https://metacpan.org/pod/RFC::RFC822::Address

Mail::RFC822::Address - regexp-based address validation, worth looking at just for the insane regexp

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

Similar tools exist for other languages. Insane regexp below...

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
Babbette answered 1/10, 2008 at 6:34 Comment(7)
I remember someone saying that regex is both stupid (auto generated), and wrong. Does anyone else remember that?Wasteland
That wouldn't surprise me to be honest - that said all the attempts to validate an email via regexp to the actual standard that I've seen have been insane to some degree - I wouldn't even try and understand that one. A regexp a tenth the size of it probably means you shouldn't be using it ;)Babbette
i would file this under "cryptography"Saccharometer
Personally I'd rather use a simplified regex than that beast (even if it only handles 99.95% of real cases)... or none at all and do the handshake.Baptistery
@Simon, this is correct. You need to preprocess the string to remove comments before you can even apply this regex, and RFC822 is incredibly obsolete; it's from 1982(!)Thailand
'i would file this under "cryptography"'. I'd say that, or that it's "Hello world" written in APL. :-)Dalessio
@the I think you're confusing APL and Perl; in APL "Hello world" is probably one character, but it's a character that doesn't appear anywhere on your keyboardMansur
C
11

Validating e-mail addresses isn't really very helpful anyway. It will not catch common typos or made-up email addresses, since these tend to look syntactically like valid addresses.

If you want to be sure an address is valid, you have no choice but to send a confirmation mail.

If you just want to be sure that the user inputs something that looks like an email rather than just "asdf", then check for an @. More complex validation does not really provide any benefit.
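
A minimal sketch of such a loose check, in Python. The function name `looks_like_email` is made up for illustration, and requiring something on both sides of the @ is a small addition of this sketch, slightly stricter than the bare "check for an @" described above.

```python
def looks_like_email(text):
    """Loose check: there is an '@' with at least one character on each side."""
    local, at, domain = text.rpartition("@")
    return bool(at) and bool(local) and bool(domain)
```

This rejects "asdf" and accepts almost anything else, which is the point: whether the address really works is left to the confirmation mail.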

(I know this doesn't answer your questions, but I think it's worth mentioning anyway)

Carlow answered 1/10, 2008 at 8:38 Comment(3)
I think it does answer the question.Jahvist
I also like to check that there is only 1 @ character and that it is not the first or last character. When I know that the email address is going to be a "typically" formatted email address (i.e. [email protected]), then I also like to check for 1 or more characters after the @ character, followed by a . character ("dot"), followed by at least 1 more character.Penholder
@Adam: If you go down that road you have to do it correctly. See eg. janm's explanation of how you can have more than one @ in a valid email address.Carlow
R
8

There is a context free grammar in BNF that describes valid email addresses in RFC-2822. It is complex. For example:

" @ "@example.com

is a valid email address. I don't know of any regexps that do it fully; the examples usually given require comments to be stripped first. I wrote a recursive descent parser to do it fully once.
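
To illustrate why the quoted example trips up naive code, here is a small Python sketch that finds the @ separating local part and domain while skipping anything inside a double-quoted string. It is not the recursive descent parser mentioned above and covers only quoted strings and backslash escapes; comments, folding whitespace and the rest of the grammar are ignored. The name `split_address` is hypothetical.

```python
def split_address(addr):
    """Split at the first '@' outside a double-quoted string.

    Returns (local_part, domain), or None if there is no such '@'.
    """
    in_quotes = False
    i = 0
    while i < len(addr):
        ch = addr[i]
        if in_quotes and ch == "\\":
            i += 2  # skip the escaped character inside the quoted string
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "@" and not in_quotes:
            return addr[:i], addr[i + 1:]
        i += 1
    return None


# A plain split on "@" sees two separators in the example address.
naive_parts = '" @ "@example.com'.split("@")
```

With the example address, `split_address` returns the quoted string `" @ "` as the local part and `example.com` as the domain, while the naive split yields three pieces.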

Royce answered 1/10, 2008 at 6:28 Comment(0)
F
8

I've now collated test cases from Cal Henderson, Dave Child, Phil Haack, Doug Lovell and RFC 3696. 158 test addresses in all.

I ran all these tests against all the validators I could find. The comparison is here: http://www.dominicsayers.com/isemail

I'll try to keep this page up-to-date as people enhance their validators. Thanks to Cal, Dave and Phil for their help and co-operation in compiling these tests and constructive criticism of my own validator.

People should be aware of the errata against RFC 3696 in particular. Three of the canonical examples are in fact invalid addresses. And the maximum length of an address is 254 or 256 characters, not 320.

Fullerton answered 10/2, 2009 at 16:16 Comment(0)
B
7

It's not all nonsense though as allowing characters such as '+' can be highly useful for users combating spam, e.g. [email protected] (instant disposable Gmail addresses).

Only when a site accepts it though.

Bunyabunya answered 1/10, 2008 at 7:27 Comment(1)
This is fairly common, not only with gmail; I've been doing it for about a decade (I use - rather than + because I prefer it and it's my server so I can, but + is normal).Leandroleaning
M
6

Whether or not to accept bizarre, uncommon email address formats depends, in my opinion, on what one wants to do with them.

If you're writing a mail server, you have to be very exact and excruciatingly correct in what you accept. The "insane" regex quoted above is therefore appropriate.

For the rest of us, though, we're mainly just interested in ensuring that something a user types in a web form looks reasonable and doesn't have some sort of sql injection or buffer overflow in it.

Frankly, does anyone really care about letting someone enter a 200-character email address with comments, newlines, quotes, spaces, parentheses, or other gibberish when signing up for a mailing list, newsletter, or web site? The proper response to such clowns is "Come back later when you have an address that looks like [email protected]".

The validation I do consists of ensuring that there is exactly one '@'; that there are no spaces, nulls or newlines; that the part to the right of the '@' has at least one dot (but not two dots in a row); and that there are no quotes, parentheses, commas, colons, exclamations, semicolons, or backslashes, all of which are more likely to be attempts at hackery than parts of an actual email address.
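
The rules in the previous paragraph could be written down roughly like this in Python. It is a sketch, not the author's code: the names `FORBIDDEN` and `is_reasonable` are invented, and treating both single and double quotes as "quotes" and carriage returns as "newlines" is an assumption.

```python
# Characters the paragraph above rules out: space, NUL, newlines, quotes,
# parentheses, comma, colon, exclamation mark, semicolon, backslash.
# Assumption: "quotes" covers both ' and ", and "\r" counts as a newline.
FORBIDDEN = set(' \0\r\n"\'(),:!;\\')


def is_reasonable(addr):
    """Deliberately strict check for web-form input; rejects some valid addresses."""
    if addr.count("@") != 1:
        return False
    if any(ch in FORBIDDEN for ch in addr):
        return False
    domain = addr.split("@")[1]
    # At least one dot to the right of the '@', but never two in a row.
    return "." in domain and ".." not in domain
```

Addresses with a `+` tag still pass, since `+` is not on the forbidden list.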

Yes, this means I'm rejecting valid addresses with which someone might try to register on my web sites - perhaps I "incorrectly" reject as many as 0.001% of real-world addresses! I can live with that.

Mormon answered 3/10, 2008 at 16:13 Comment(0)
D
4

Quoting and various other rarely used but valid parts of the RFC make it hard. I don't know enough about this topic to comment definitively, other than "it's hard" - but fortunately other people have written about it at length.

As to a valid regex for it, the Perl Mail::RFC822::Address module contains a regular expression which will apparently work - but only if any comments have been replaced by whitespace already. (Comments in an email address? You see why it's harder than one might expect...)

Of course, the simplified regexes which abound elsewhere will validate almost every email address which is genuinely being used...

Derickderide answered 1/10, 2008 at 6:35 Comment(1)
What? A Jon Skeet answer with a score of 0? Preposterous.Grunenwald
N
3

Some flavours of regex can actually match nested brackets (e.g., Perl compatible ones). That said, I have seen a regex that claims to correctly match RFC 822 and it was two pages of text without any whitespace. Therefore, the best way to detect a valid email address is to send email to it and see if it works.
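
As an illustration of the nesting problem: Python's standard `re` module has no recursive patterns, so the sketch below strips parenthesised comments with an explicit depth counter instead. It ignores quoted strings (where parentheses are not comments), and the name `strip_comments` is made up.

```python
import re


def strip_comments(text):
    """Remove parenthesised comments, including nested ones."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash escapes the next character, so \( and \) are not delimiters.
            if depth == 0:
                out.append(text[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


nested = "user(a (nested) comment)@example.com"
# A non-recursive pattern removes only the innermost comment in a single pass.
one_pass = re.sub(r"\([^()]*\)", "", nested)
# The depth counter removes the whole nested comment.
cleaned = strip_comments(nested)
```

The counter is the part a classical regular expression cannot express, which is why the regexes in this thread expect comments to be removed beforehand.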

Newborn answered 1/10, 2008 at 6:28 Comment(0)
E
3

Just to add a regex that is less crazy than the one listed by @mmaibaum:

^[a-zA-Z]([.]?([a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$ 

It is not bulletproof, and certainly does not cover the entire email spec, but it does do a decent job of covering most basic requirements. Even better, it's somewhat comprehensible, and can be edited.
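
To see what "not bulletproof" means in practice, the pattern can be tried against a few sample addresses. The snippet below uses Python's `re` module; the sample addresses are invented for illustration.

```python
import re

# The pattern from this answer, unchanged.
SIMPLE = re.compile(
    r"^[a-zA-Z]([.]?([a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$"
)

samples = [
    "user@example.com",        # accepted
    "first.last@example.com",  # rejected: a dot is only allowed right after the first letter
    "user+tag@example.com",    # rejected: '+' is not in the character class
    "user@example.museum",     # rejected: the final label is limited to 2-4 letters
]
results = {s: bool(SIMPLE.match(s)) for s in samples}
```

The nested quantifier `([a-zA-Z0-9_-]+)*` can also backtrack heavily on long inputs that fail to match, so it may be worth capping the input length before applying a pattern like this.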

Cribbed from a discussion at HouseOfFusion.com, a world-class ColdFusion resource.

Edessa answered 3/10, 2008 at 15:46 Comment(4)
That regex doesn't even cover [email protected], let alone [email protected]. If that's someone's idea of a world-class ColdFusion resource, thank $DEITY I don't program in CF.Decommission
As stated in my description, it was not supposed to be exhaustive. It was supposed to be (relatively) straightforward, and easy to modify.Edessa
Also, are you really going to judge a language based on what a handful of its users came up with years ago to solve something that is no longer a problem in the language?Edessa
I don't have experience creating regexp, but if you want '[email protected]' be correctly validated use (validated with Expresso): ^[a-zA-Z]([.]?([.a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$Aggarwal
T
3

An easy and good way to check email addresses in Java is to use the EmailValidator of the Apache Commons Validator library.

I would always check an email-address in an input-form against something like this before sending an email - even if you only catch some typos. You probably don't want to write an automated scanner for "delivery failed" notification mails. :-)

That answered 6/1, 2009 at 20:22 Comment(0)
W
2

It's really hard because there are a lot of things that can be valid in an email address according to the Email Spec, RFC 2822. Things that you don't normally see such as + are perfectly valid characters for an email address.. according to the spec.

There's an entire section devoted to email addresses at http://regexlib.com, which is a great resource. I'd suggest that you determine what criteria matters to you and find one that matches. Most people really don't need full support for all possibilities allowed by the spec.

Windy answered 1/10, 2008 at 6:29 Comment(2)
-1 for "Most people really don't need full support for all possibilities allowed by the spec."Bijouterie
@David Schmitt : The addresses: Abc\@[email protected], customer/[email protected] and !def!xyz%[email protected] are all valid.. however 99.99% of people won't run into these types of addresses in a production site.Windy
M
2

If you're running on the .NET Framework, just try instantiating a MailAddress object and catching the FormatException if it blows up, or pulling out the Address if it succeeds. Without getting into any nonsense about the performance of catching exceptions (really, if this is just on a single Web form it is not going to make that much of a difference), the MailAddress class in the .NET Framework goes through a quite complete parsing process (it doesn't use a RegEx). Open up Reflector and search for MailAddress and MailBnfHelper.ReadMailAddress() to see all of the fancy stuff it does. Someone smarter than me spent a lot of time building that parser at Microsoft; I'm going to use it when I actually send an e-mail to that address, so I might as well use it to validate the incoming address, too.

Milomilon answered 31/12, 2009 at 15:8 Comment(0)
M
1

Try this one:

"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

Have a look here for the details.

However, rather than implementing the RFC822 standard, maybe it would be better to look at it from another viewpoint. It doesn't really matter what the standard says if mail servers don't mirror the standard. So I would argue that it would be better to imitate what the most popular mail servers do when validating email addresses.

Mook answered 1/10, 2008 at 6:29 Comment(1)
I posted the same link on a similar question: stackoverflow.com/questions/210945/… I found that it explained the situation well!Effectually
K
1

Many have tried, and many come close. You may want to read the wikipedia article, and some others.

Specifically, you'll want to remember that many websites and email servers have relaxed validation of email addresses, so essentially they don't implement the standard fully. It's good enough for email to work all the time though.

Kataway answered 1/10, 2008 at 6:37 Comment(0)
L
1

This class for Java has a validator in it: http://www.leshazlewood.com/?p=23

This is written by the creator of Shiro (formerly Ki, formerly JSecurity)

The pros and cons of testing for e-mail address validity:

There are two types of regexes that validate e-mails:

  1. Ones that are too loose.
  2. Ones that are too strict.

It is not possible for a regular expression to match all valid e-mail addresses and no e-mail addresses that are not valid because some strings might look like valid e-mail addresses but do not actually go to anyone's inbox. The only way to test to see if an e-mail is actually valid is to send an e-mail to that address and see if you get some sort of response. With that in mind, regexes that are too strict at matching e-mails don't really seem to have much of a purpose.

I think that most people who ask for an e-mail regex are looking for the first option, regexes that are too loose. They want to test a string and see if it looks like an e-mail; if it is definitely not an e-mail, they can say to the user: "Hey, you are supposed to put an e-mail here and this definitely is not a valid e-mail. Perhaps you didn't realize that this field is for an e-mail or maybe there is a typo".

If a user puts in a string that looks a lot like a valid e-mail, but it actually is not one, then that is a problem that should be handled by a different part of the application.

Leftist answered 11/3, 2010 at 8:0 Comment(0)
B
0

Can anyone provide some insight as to why that is?

Yes, it is an extremely complicated standard that allows lots of stuff that no one really uses today. :)

Are there any known and proven regexps that actually do this fully?

Here is one attempt to parse the whole standard fully...

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

What are some good alternatives to using regexps for matching email addresses?

Using an existing framework for it in whatever language you are using I guess? Though those will probably use regexp internally. It is a complex string. Regexps are designed to parse complex strings, so that really is your best choice.
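
As one example of "an existing framework", Python's standard library ships `email.utils.parseaddr`, which splits a header-style address into display name and address. As far as I know it is a hand-written parser rather than a single regexp, and it is lenient: it parses, it does not validate. The sample address is invented.

```python
from email.utils import parseaddr

# Split a header-style address into (display name, address).
display_name, address = parseaddr('"Jane Doe" <jane@example.com>')

# A bare address has an empty display name.
bare_name, bare_address = parseaddr("jane@example.com")
```

Because it accepts many strings that are not deliverable addresses, it is best combined with a confirmation mail rather than used as the only check.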

Edit: I should add that the regexp I linked to was just for fun. I do not endorse using a complex regexp like that - some people say that "if your regexp is more than one line, it is guaranteed to have a bug in it somewhere". I linked to it to illustrate how complex the standard is.

Betel answered 1/10, 2008 at 6:28 Comment(2)
Well, no. Regexps are an easy-to-write-quickly way of parsing strings, whether or not complex. They are not designed to handle things that they literally cannot handle because it is mathematically beyond them, or indeed things that require insane, unmaintainable regexes.Angelicaangelico
Is anything designed to handle things mathematically beyond them? :PBetel
T
0

For completeness: PHP also has a built-in function for validating e-mail addresses.

In PHP, use the nice filter_var function with the FILTER_VALIDATE_EMAIL validation type :)

No more insane email regexes in php :D

var_dump(filter_var('[email protected]', FILTER_VALIDATE_EMAIL));

http://www.php.net/filter_var

Testament answered 1/10, 2008 at 9:10 Comment(0)
C
0

There always seems to be an unaccounted-for format when trying to create a regular expression to validate emails. Though there are some characters that are not valid in an email, the basic format is local-part@domain, with roughly 64 characters max for the local part and roughly 253 characters for the domain. Beyond that, it's kind of like the Wild West.
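
The length limits mentioned here are easy to check without any regex. A minimal Python sketch, using the approximate figures from this answer; the names `MAX_LOCAL`, `MAX_DOMAIN` and `within_length_limits` are hypothetical, and the code counts characters, which equals the byte count only for ASCII input.

```python
MAX_LOCAL = 64    # approximate limit for the part before the '@'
MAX_DOMAIN = 253  # approximate limit for the part after the '@'


def within_length_limits(addr):
    """Check only the lengths of local part and domain, split at the last '@'."""
    local, at, domain = addr.rpartition("@")
    if not at:
        return False  # no '@' at all
    return 0 < len(local) <= MAX_LOCAL and 0 < len(domain) <= MAX_DOMAIN
```

Another answer in this thread cites an overall maximum of 254 characters, which could be added as a third condition.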

I think the answer depends on your definition of a validated email address and what your business process has tolerance for. Regular expressions are great for making sure an email is formatted properly and as you know there are many variations of them that can work. Here are a couple of variations:

Variant 1:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Variant 2:

\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\z

Just because an email is syntactically correct doesn't mean it is valid.

An email can adhere to RFC 5322 and pass the regex, but that gives no true insight into the email's actual deliverability. What if you wanted to know if the email was a bogus email or if it was disposable or not deliverable or a known bot? What if you wanted to exclude emails that were vulgar or in some way factious or problematic? By the way, full disclosure: I work for Service Objects, a data validation company. Being a professional in the email validation field, I feel the solution we offer provides better validation than a regex. Feel free to give it a look, I think it can help a lot. You can see more info about this in our dev guide. It actually does a lot of cool email checks and verifications.

Here's an example:

Email: [email protected]

{
  "ValidateEmailInfo":{
      "Score":4,
      "IsDeliverable":"false",
      "EmailAddressIn":"[email protected]",
      "EmailAddressOut":"[email protected]",
      "EmailCorrected":false,
      "Box":"mickeyMouse",
      "Domain":"gmail.com",
      "TopLevelDomain":".com",
      "TopLevelDomainDescription":"commercial",
      "IsSMTPServerGood":"true",
      "IsCatchAllDomain":"false",
      "IsSMTPMailBoxGood":"false",
      "WarningCodes":"22",
      "WarningDescriptions":"Email is Bad - Subsequent checks halted.",
      "NotesCodes":"16",
      "NotesDescriptions":"TLS"
  }
}
Complementary answered 24/1, 2020 at 21:24 Comment(0)
