How to extract valid email from larger string in Scala
Asked Answered
L

3

5

My scala version 2.7.7

Im trying to extract an email adress from a larger string. the string itself follows no format. the code i've got:

import scala.util.matching.Regex
import scala.util.matching._
val Reg = """\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b""".r
"yo my name is joe : [email protected]" match {
    case Reg(e) => println("match: " + e)
    case _ => println("fail")
}

the Regex passes in RegExBuilder but does not pass for scala. Also if there is another way to do this without regex that would be fine also. Thanks!

Littman answered 17/5, 2010 at 0:8 Comment(0)
R
7

As Alan Moore pointed out, you need to add the (?i) to the beginning of the pattern to make it case-insensitive. Also note that using the Regex directly matches the whole string. If you want to find one within a larger string, you can call findFirstIn() or use one of the similar methods of Regex.

val reg = """(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b""".r
reg findFirstIn "yo my name is joe : [email protected]"  match {
    case Some(email) => println("match: " + email)
    case None => println("fail")
}
Riboflavin answered 17/5, 2010 at 2:37 Comment(1)
Not the solution i went with, but it works! thanks! (silly me, did not check what class was being returned : would have found Some(x))Littman
Z
3

It looks like you're trying to do a case-insensitive search, but you aren't specifying that anywhere. Try adding (?i) to the beginning of the regex:

"""(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b""".r
Zincograph answered 17/5, 2010 at 0:43 Comment(0)
H
1

Well, the ways to do it other than REs are probably a lot messier. The next step up would probably the a combinator parser. A lot of random string dissection code would be even more general and almost certainly a whole lot more painful. In part what's a suitable tactic depends on how complete (and how strict or lenient) your recognizer needs to be. E.g., the common form: Rudolf Reindeer <rudy.caribou@north_pole.rth> is not accepted by your RE (even after the case-sensitivity is relaxed). Full-blown RFC 2822 address parsing is rather challenging for an RE-based approach.

Holbrook answered 17/5, 2010 at 2:53 Comment(1)
Did someone say RFC 2822? (There are two breaks in the stuff. For this to work, they should be single ticks but escaping is hard.) (?:[a-z0-9!#$%&'*+/=?^_``{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_``{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])Berners

© 2022 - 2024 — McMap. All rights reserved.