Regex validation of email addresses according to RFC5321/RFC5322

Does anyone know a regex that validates email addresses according to RFC5321/RFC5322?

Since (nestable) comments make the grammar irregular, only addresses without comments need to be considered.

Of course, if you're interested in validating an address that is actually owned by someone then the only real validation is to send an email to the address and check if the owner received it. I am however purely interested in the RFC standards. For a practical approach this question is more relevant.

On top of comments I am willing to sacrifice folding white space, but apart from that I'm not interested in expressions that reject any addresses that are RFC5321/2-valid. (Arguably it would even make sense in some circumstances to disregard folding white space.)

Ideally the regex would reject anything that's not RFC-valid, but that's less important. It's not so interesting to include an exhaustive list of top-level domains in the regex, for example. Simply accepting any top-level domain will suffice.

I'm not sure if address tags (e.g. [email protected]) are part of the RFCs I mentioned, but I would like the regex to validate these.

IPv6 should definitely be handled correctly (RFC5952).

As I understand it, internationalized email (RFC6530, RFC6531, RFC6532, RFC6533) is still in the experimental phase, but an expression validating these addresses would also be interesting.

To make the answers universally interesting it would be nice if any regular expressions were in POSIX format.

Pillory answered 21/12, 2012 at 15:2 Comment(2)
That's impossible with traditional regex flavours. Email addresses can contain comments with arbitrarily deep nesting, and that is not parsable by a regular grammar. - Deanedeaner
@Deanedeaner - True (and very good point). But if the (possibly nested) comments are first stripped out, then it can be done. This is how the Perl regex solution linked to by Rafał Toboła does it. - Cunha

Nestable comments make the grammar for email-addresses irregular (context-free). If you preclude comments however, the resulting grammar is regular. The primary definition allows for (folding) whitespace between lexical tokens (e.g. a @ b.com). Removing all folding whitespace results in a canonical form.
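To make "precluding comments" concrete, here is a minimal sketch of a pre-processing step that strips (possibly nested) comments before any of the regexes below are applied. The helper name strip_comments is hypothetical and not taken from the RFCs; the sketch assumes that parentheses inside a quoted string are literal and that a backslash escapes the next character inside comments and quoted strings.

```python
def strip_comments(address: str) -> str:
    """Remove RFC 5322 style comments, including nested ones (sketch)."""
    out = []
    depth = 0          # current comment nesting depth
    in_quotes = False  # inside a quoted-string, where parentheses are literal
    i = 0
    while i < len(address):
        ch = address[i]
        if ch == "\\" and i + 1 < len(address) and (in_quotes or depth > 0):
            # quoted-pair: keep both characters in a quoted-string,
            # drop both when they sit inside a comment
            if depth == 0:
                out.append(address[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0 or in_quotes:
        raise ValueError("unbalanced comment or quoted-string")
    return "".join(out)


print(strip_comments("john(a (nested) comment)@example.com"))  # john@example.com
```

A depth counter is exactly the piece of state that a regular expression cannot keep, which is why this step has to happen outside the regex.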

This is the regex for canonical email addresses according to RFC 5322 (precluding comments):

([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*])
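As a quick sanity check, the canonical-form regex can be tried out in Python. This is only a sketch: Python's re module is not a POSIX ERE engine (it reads \t inside a character class as a tab, which is what the pattern intends), and the names CANONICAL and is_canonical_address are made up for the example.

```python
import re

# The canonical-form pattern from above, copied verbatim into a raw string.
CANONICAL = re.compile(
    r'''([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|\[[\t -Z^-~]*])'''
)


def is_canonical_address(text: str) -> bool:
    # fullmatch anchors the pattern at both ends of the string
    return CANONICAL.fullmatch(text) is not None


for candidate in ["test@test", "user+tag@example.com", '"john doe"@example.com',
                  "a@---.---", "a @ b.com", "a..b@example.com"]:
    print(candidate, is_canonical_address(candidate))
```

The first four candidates are accepted; the last two are rejected because the canonical form allows neither whitespace between tokens nor consecutive dots in a dot-atom.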

If you need to accept folding whitespace, then this is the regular expression for email addresses according to RFC 5322 (precluding comments):

((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?"(((([\t ]*\r\n)?[\t ]+)?([]!#-[^-~]|(\\[\t -~])))+(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?)"(([\t ]*\r\n)?[\t ]+)?)@((([\t ]*\r\n)?[\t ]+)?[-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*(([\t ]*\r\n)?[\t ]+)?|(([\t ]*\r\n)?[\t ]+)?\[((([\t ]*\r\n)?[\t ]+)?[!-Z^-~])*(([\t ]*\r\n)?[\t ]+)?](([\t ]*\r\n)?[\t ]+)?)
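The folding-whitespace regex is easier to read when it is assembled from its repeated pieces. The sketch below does that in Python; the variable names (FWS, ATOM, QTEXT and so on) are my own labels, and the assembled string is meant to expand to the same pattern as the one-liner above. Again, Python's re is used only for illustration and is not a POSIX engine.

```python
import re

FWS = r"(([\t ]*\r\n)?[\t ]+)?"              # optional folding whitespace
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"              # one atom of a dot-atom
DOT_ATOM = ATOM + r"(\." + ATOM + ")*"
QTEXT = r"([]!#-[^-~]|(\\[\t -~]))"          # qtext or quoted-pair
QUOTED = '"((' + FWS + QTEXT + ")+" + FWS + "|" + FWS + ')"'
DOMAIN_LITERAL = r"\[(" + FWS + r"[!-Z^-~])*" + FWS + "]"

FWS_ADDRESS = re.compile(
    "(" + FWS + DOT_ATOM + FWS + "|" + FWS + QUOTED + FWS + ")"
    + "@"
    + "(" + FWS + DOT_ATOM + FWS + "|" + FWS + DOMAIN_LITERAL + FWS + ")"
)

for candidate in ["a @ b.com", "a\r\n @b.com", "a b@c.com", "a\r\n@b.com"]:
    print(repr(candidate), FWS_ADDRESS.fullmatch(candidate) is not None)
```

The first two candidates match. The third fails because whitespace may only appear between tokens, not inside a dot-atom, and the fourth fails because a line break must be followed by at least one space or tab.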

Valid email addresses are further restricted in RFC 5321 (SMTP). It basically leaves alone the part before the @-sign, but accepts only host names or address literals after the @-sign. ("---.---" is a valid dot-atom, but not a valid host name and "[...]" is a valid domain literal, but not a valid address literal.)

The grammar presented in RFC 5321 is too lenient when it comes to both host names and IP addresses. I took the liberty of "correcting" the rules in question, using this draft and RFC 1034 (section 3.5) as guidelines. Here's the resulting regex.

([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?(\.[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?)*|\[((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}|IPv6:((((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){6}|::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){5}|[0-9A-Fa-f]{0,4}::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){4}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):)?(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){3}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,2}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){2}|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,3}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,4}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::)((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})|(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,5}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,6}(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}))?::)|(?!IPv6:)[0-9A-Za-z-]*[0-9A-Za-z]:[!-Z^-~]+)])
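The same trick helps with the RFC 5321 regex, where the hex group and the IPv4 octet are repeated many times. The sketch below rebuilds the pattern from named pieces in Python; the names (LABEL, OCTET, H, HC, GENERAL, SMTP_ADDRESS) are hypothetical labels of mine, and the result is intended to expand to the regex above rather than to replace it. It needs an engine with lookahead, which Python's re provides.

```python
import re

LOCAL = r'''([!#-'*+/-9=?A-Z^-~-]+(\.[!#-'*+/-9=?A-Z^-~-]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")'''
LABEL = r"[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?"   # host label, 1 to 63 chars
HOST = LABEL + r"(\." + LABEL + ")*"
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"   # 0 to 255
IPV4 = OCTET + r"(\." + OCTET + "){3}"
H = r"(0|[1-9A-Fa-f][0-9A-Fa-f]{0,3})"                   # hex group, no leading zeros
HC = "(" + H + ":)"                                      # hex group followed by a colon
IPV6 = (
    "(("
    + HC + "{6}"
    + "|::" + HC + "{5}"
    + "|[0-9A-Fa-f]{0,4}::" + HC + "{4}"
    + "|(" + HC + "?" + H + ")?::" + HC + "{3}"
    + "|(" + HC + "{0,2}" + H + ")?::" + HC + "{2}"
    + "|(" + HC + "{0,3}" + H + ")?::" + H + ":"
    + "|(" + HC + "{0,4}" + H + ")?::"
    + ")(" + H + ":" + H + "|" + IPV4 + ")"              # last 32 bits
    + "|(" + HC + "{0,5}" + H + ")?::" + H
    + "|(" + HC + "{0,6}" + H + ")?::"
    + ")"
)
GENERAL = r"(?!IPv6:)[0-9A-Za-z-]*[0-9A-Za-z]:[!-Z^-~]+"  # other tagged literals

SMTP_ADDRESS = re.compile(
    LOCAL + "@(" + HOST + r"|\[(" + IPV4 + "|IPv6:" + IPV6 + "|" + GENERAL + ")])"
)

for candidate in ["test@test", "user@[192.168.0.1]", "user@[IPv6:2001:db8::1]",
                  "a@---.---", "user@[256.1.1.1]", "user@[...]"]:
    print(candidate, SMTP_ADDRESS.fullmatch(candidate) is not None)
```

The first three candidates match and the last three do not, which lines up with the remark above that "---.---" and "[...]" pass RFC 5322 but are not valid host names or address literals.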

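The first two regexes are POSIX EREs. The last one uses a negative lookahead, (?!IPv6:), which POSIX ERE does not support, so it needs a flavour that has lookahead (PCRE, for example). See here for the derivations of the regular expressions.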

Pillory answered 18/11, 2014 at 8:8 Comment(7)
These regexps are not compliant with RFC 6532, because they restrict the contact part to ASCII. - Ruching
@MihailKrivushin Couldn’t agree more. The question was about RFC5321/2 specifically though... - Pillory
Why is there no a-z in the character classes in the first regex? And which characters does the ^-~ range include? Is that range intended? - China
@China The a-z range is included in ^-~. If you search for an ASCII table you can see which characters are included in the ranges. - Pillory
This throws an empty character class warning from ESLint - eslint.org/docs/rules/no-empty-character-class - Inheritrix
The first regex is not valid as it accepts "test@test" as valid. - Perforate
@TomislavBrabec That's because "test@test" is a valid address according to the RFC. See this answer for more about extra validations on top of the RFCs. - Pillory
