F# Mapping Regular Expression Matches with Active Patterns

I found this useful article on using Active Patterns with Regular Expressions: http://www.markhneedham.com/blog/2009/05/10/f-regular-expressionsactive-patterns/

The original code snippet used in the article was this:

open System.Text.RegularExpressions

let (|Match|_|) pattern input =
    let m = Regex.Match(input, pattern)
    // Drop group 0 (the whole match) and keep only the captured groups
    if m.Success then Some (List.tail [ for g in m.Groups -> g.Value ]) else None

let ContainsUrl value = 
    match value with
        | Match "(http:\/\/\S+)" result -> Some(result.Head)
        | _ -> None

This tells you whether at least one URL was found and, if so, what the first one was (if I understood the snippet correctly).

Then in the comment section Joel suggested this modification:

Alternative, since a given group may or may not be a successful match:

List.tail [ for g in m.Groups -> if g.Success then Some g.Value else None ]

Or maybe you give labels to your groups and you want to access them by name:

(re.GetGroupNames()
 |> Seq.map (fun n -> (n, m.Groups.[n]))
 |> Seq.filter (fun (n, g) -> g.Success)
 |> Seq.map (fun (n, g) -> (n, g.Value))
 |> Map.ofSeq)
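The named-group snippet above refers to `re` and `m` without defining them. A minimal self-contained sketch, assuming a made-up pattern with two named groups (`scheme` and `host` are illustrative names, not from the article), might look like this:

```fsharp
open System.Text.RegularExpressions

// Hypothetical pattern; the group names are for illustration only.
let re = Regex(@"(?<scheme>https?)://(?<host>[^/\s]+)")
let m = re.Match("see http://www.bob.com/index.html")

// Map each successful group name to its captured value.
// GetGroupNames also returns "0", the name of the whole match.
let namedGroups =
    re.GetGroupNames()
    |> Seq.map (fun n -> n, m.Groups.[n])
    |> Seq.filter (fun (_, g) -> g.Success)
    |> Seq.map (fun (n, g) -> n, g.Value)
    |> Map.ofSeq

printfn "%s" namedGroups.["host"]   // prints www.bob.com
```

Note that the map describes the groups of a single match, not every match in the input.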

After trying to combine all of this I came up with the following code:

let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"

let (|Match|_|) pattern input =
    let re = new Regex(pattern)
    let m = re.Match(input) in
    if m.Success then Some ((re.GetGroupNames()
                                |> Seq.map (fun n -> (n, m.Groups.[n]))
                                |> Seq.filter (fun (n, g) -> g.Success)
                                |> Seq.map (fun (n, g) -> (n, g.Value))
                                |> Map.ofSeq)) else None

let GroupMatches stringToSearch = 
    match stringToSearch with
        | Match "(http:\/\/\S+)" result -> printfn "%A" result
        | _ -> ()


GroupMatches testString;;

When I run my code in an interactive session this is what is output:

map [("0", "http://www.bob.com"); ("1", "http://www.bob.com")]

The result I am trying to achieve would look something like this:

map [("http://www.bob.com", 2); ("http://www.b.com", 1); ("http://www.bill.com", 1);]

Basically a mapping of each unique match found followed by the count of the number of times that specific matching string was found in the text.

If you think I'm going down the wrong path here please feel free to suggest a completely different approach. I'm somewhat new to both Active Patterns and Regular Expressions so I have no idea where to even begin in trying to fix this.

I also came up with this which is basically what I would do in C# translated to F#.

let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"

open System.Collections.Generic
open System.Text.RegularExpressions

let matches =
    let matchDictionary = new Dictionary<string,int>()
    // A verbatim string (@"...") avoids having to escape anything in the pattern
    for mtch in (Regex.Matches(testString, @"(http://\S+)")) do
        for m in mtch.Captures do
            if(matchDictionary.ContainsKey(m.Value)) then
                matchDictionary.Item(m.Value) <- matchDictionary.Item(m.Value) + 1
            else
                matchDictionary.Add(m.Value, 1)
    matchDictionary

Which returns this when run:

val matches : Dictionary<string,int> = dict [("http://www.bob.com", 2); ("http://www.b.com", 1); ("http://www.bill.com", 1)]

This is basically the result I am looking for, but I'm trying to learn the functional way to do this, and I think that should include active patterns. Feel free to try to "functionalize" this if it makes more sense than my first attempt.

Thanks in advance,

Bob

Pose asked Apr 16, 2011 at 2:07

Interesting stuff, I think everything you are exploring here is valid. (Partial) active patterns for regular expression matching work very well indeed, especially when you have a string that you want to match against multiple alternative cases. The only thing I'd suggest with the more complex regex active patterns is that you give them more descriptive names, possibly building up a collection of different regex active patterns with differing purposes.

As for your C# to F# example, you can have a functional solution just fine without active patterns, e.g.

let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"

let matches input =
    Regex.Matches(input, "(http:\/\/\S+)") 
    |> Seq.cast<Match>
    |> Seq.groupBy (fun m -> m.Value)
    |> Seq.map (fun (value, groups) -> value, (groups |> Seq.length))

//FSI output:
> matches testString;;
val it : seq<string * int> =
  seq
    [("http://www.bob.com", 2); ("http://www.b.com", 1);
     ("http://www.bill.com", 1)]
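If a `Map` is wanted, as in the question, the group-then-count step can also be written with `Seq.countBy` and finished with `Map.ofSeq`. This is a sketch rather than part of the original answer; `countMatches` is a hypothetical helper name.

```fsharp
open System.Text.RegularExpressions

let testString = "http://www.bob.com http://www.b.com http://www.bob.com http://www.bill.com"

// Hypothetical helper: count how often each distinct match value occurs.
let countMatches (pattern: string) (input: string) =
    Regex.Matches(input, pattern)
    |> Seq.cast<Match>
    |> Seq.countBy (fun m -> m.Value)   // yields (value, count) pairs
    |> Map.ofSeq

let counts = countMatches @"http://\S+" testString
printfn "%d" counts.["http://www.bob.com"]   // prints 2
```

A `Map` is ordered by key, so the entries will not necessarily appear in order of first occurrence.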

Update

This particular example works fine without active patterns because 1) you are only testing one pattern, and 2) you are dynamically processing the matches.

For a real-world example of active patterns, let's consider a case where 1) we are testing multiple regexes, and 2) we are testing for one regex match with multiple groups. For these scenarios, I use the following two active patterns, which are a bit more general than the first Match active pattern you showed: I do not discard the first group in the match, and I return a list of the Group objects, not just their values. One uses the compiled regex option for static regex patterns; the other uses the interpreted regex option for dynamic regex patterns. Because the .NET regex API is so feature-rich, what you return from your active pattern is really up to what you find useful. But returning a list of something is good, because then you can pattern match on that list.
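If you still want an active-pattern form for the URL-counting case, one possible sketch is a partial pattern that returns every match value in the input. The names `Matches` and `countUrls` are hypothetical and not taken from the article or the question.

```fsharp
open System.Text.RegularExpressions

// Hypothetical partial active pattern: succeeds with the list of all match
// values when the pattern occurs at least once, and fails otherwise.
let (|Matches|_|) (pattern: string) (input: string) =
    if isNull input then None
    else
        let values =
            Regex.Matches(input, pattern)
            |> Seq.cast<Match>
            |> Seq.map (fun m -> m.Value)
            |> List.ofSeq
        if List.isEmpty values then None else Some values

// Count each distinct URL; an input without URLs gives an empty map.
let countUrls input =
    match input with
    | Matches @"http://\S+" urls -> urls |> List.countBy id |> Map.ofList
    | _ -> Map.empty

let urlCounts = countUrls "http://www.bob.com http://www.b.com http://www.bob.com"
printfn "%d" urlCounts.["http://www.bob.com"]   // prints 2
```

With a single pattern this buys little over the plain pipeline; the pattern form starts to pay off once several alternatives are tested in one `match`.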

let (|InterpretedMatch|_|) pattern input =
    if input = null then None
    else
        let m = Regex.Match(input, pattern)
        if m.Success then Some [for x in m.Groups -> x]
        else None

///Match the pattern using a cached compiled Regex
let (|CompiledMatch|_|) pattern input =
    if input = null then None
    else
        let m = Regex.Match(input, pattern, RegexOptions.Compiled)
        if m.Success then Some [for x in m.Groups -> x]
        else None

Notice also how these active patterns consider null a non-match, instead of throwing an exception.

OK, so let's say we want to parse names. We have the following requirements:

  1. Must have first and last name
  2. May have middle name
  3. First, optional middle, and last name are separated by a single blank space in that order
  4. Each part of the name may consist of any combination of one or more letters or digits
  5. Input may be malformed

First we'll define the following record:

type Name = {First:string; Middle:option<string>; Last:string}

Then we can use our regex active pattern quite effectively in a function for parsing a name:

let parseName name =
    match name with
    | CompiledMatch @"^(\w+) (\w+) (\w+)$" [_; first; middle; last] ->
        Some({First=first.Value; Middle=Some(middle.Value); Last=last.Value})
    | CompiledMatch @"^(\w+) (\w+)$" [_; first; last] ->
        Some({First=first.Value; Middle=None; Last=last.Value})
    | _ -> 
        None

Notice one of the key advantages we gain here, which is the case with pattern matching in general, is that we are able to simultaneously test that an input matches the regex pattern, and decompose the returned list of groups if it does.
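To see the three outcomes side by side, here is a self-contained sketch that repeats the record, the `CompiledMatch` pattern, and `parseName` so it runs on its own. The sample names are made up, and the groups are collected with `Seq.cast<Group>`, which is a small variation on the list comprehension used above.

```fsharp
open System.Text.RegularExpressions

type Name = { First: string; Middle: string option; Last: string }

// Same behaviour as CompiledMatch above: null is a non-match, and a
// successful match yields the list of Group objects (group 0 first).
let (|CompiledMatch|_|) (pattern: string) (input: string) =
    if isNull input then None
    else
        let m = Regex.Match(input, pattern, RegexOptions.Compiled)
        if m.Success then Some (m.Groups |> Seq.cast<Group> |> List.ofSeq)
        else None

let parseName name =
    match name with
    | CompiledMatch @"^(\w+) (\w+) (\w+)$" [_; first; middle; last] ->
        Some { First = first.Value; Middle = Some middle.Value; Last = last.Value }
    | CompiledMatch @"^(\w+) (\w+)$" [_; first; last] ->
        Some { First = first.Value; Middle = None; Last = last.Value }
    | _ -> None

// Illustrative inputs only.
let full = parseName "Ada King Lovelace"   // first, middle and last name
let short = parseName "Ada Lovelace"       // no middle name
let bad = parseName "Ada"                  // malformed input
let missing = parseName null               // null input
```

Here `full` carries `Middle = Some "King"`, `short` carries `Middle = None`, and both `bad` and `missing` are `None`.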

Rellia answered Apr 16, 2011 at 3:57. Comments (11):
Would it be possible for you to add a variation that does use active patterns? Not only was I looking for a snippet written in a functional style that works in my scenario, I was also hoping to learn how to apply Active Patterns to a real world situation. For some reason I seem to struggle with that specific topic, and would greatly appreciate an additional example I could learn from. As would many others I'm sure. - Pose
@Pose - sure thing, just waking up at the moment, I'll drink some coffee and get right on it. - Rellia
This is one of the best answers I have ever received for one of my questions here on SO. If you don't have a blog you should start one, because you have a real talent for explaining things. - Pose
@Pose - Thanks so much! I've thought about keeping a general blog, but so far SO has been my main outlet. You can see on my profile I have a few other outlets too. - Rellia
As explained in the second comment in blogs.msdn.com/b/bclteam/archive/2006/10/19/…, I think the CompiledMatch active pattern would compile the regex each time it's applied... so performance would actually backfire most of the time... - Schematize
@MauricioScheffer - my reading of the article and comment you cite appears consistent with my understanding (please correct me where you see me going wrong): 1) since CompiledMatch uses the static method Regex.Match, the Regex instance created under the hood is cached, 2) the RegexOptions.Compiled flag is used in constructing the (created once and cached) Regex instance, not with each match. Hence for each unique regex pattern used with CompiledMatch, there is only one compiled regex created, which is cached for subsequent calls after the first. - Rellia
@StephenSwensen : ahh, my bad, didn't see it was using the static Regex.Match! - Schematize
@StephenSwensen : BTW we could really use some functional "adapter" for the Regex API in FSharpx... I invite you to discuss this over at groups.google.com/group/fsharpx ;-) There are many different active patterns for Regex around the net, we should pick the best and standardize it... - Schematize
Thanks for the invite @MauricioScheffer - I have a few real-world use cases for regex active patterns in Unquote which have allowed me to put some thought into the topic. Currently I am using the two active patterns in this SO answer, but I've been meaning to do some refactoring around them based on practical experience. Once I've done that I'll start a discussion in the fsharpx group. - Rellia
I believe that @MauricioScheffer is correct, the Regex is recompiled every time. There is a Regex cache, but it is limited to the 15 most recent patterns passed to the Regex static functions. Compilation negates any of the benefits you might get from that cache. If one is compiling Regexes, she should be caching instances outside the body of the function and checking the cache before compiling anything new. - Cork
See also "How does RegexOptions.Compiled work?" #513912 - Cork
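The comments disagree on how much the built-in static cache helps. One way to sidestep the question, along the lines of the last suggestion, is to keep compiled `Regex` instances in an explicit cache outside the pattern body. This is only a sketch: `regexCache` and `CachedCompiledMatch` are hypothetical names, and the cache is unbounded, so it assumes a fixed, small set of patterns.

```fsharp
open System
open System.Collections.Concurrent
open System.Text.RegularExpressions

// Hypothetical cache of compiled Regex instances, keyed by pattern text.
let regexCache = ConcurrentDictionary<string, Regex>()

// Builds a compiled Regex for a pattern that is not in the cache yet.
let compileRegex = Func<string, Regex>(fun p -> Regex(p, RegexOptions.Compiled))

// Like CompiledMatch, but reuses one Regex instance per distinct pattern.
let (|CachedCompiledMatch|_|) (pattern: string) (input: string) =
    if isNull input then None
    else
        let re = regexCache.GetOrAdd(pattern, compileRegex)
        let m = re.Match(input)
        if m.Success then Some (m.Groups |> Seq.cast<Group> |> List.ofSeq)
        else None

// Illustrative use: extract the first run of digits, if any.
let firstNumber input =
    match input with
    | CachedCompiledMatch @"(\d+)" [_; digits] -> Some digits.Value
    | _ -> None
```

Calling `firstNumber` repeatedly adds only one entry to `regexCache`, because every call uses the same pattern text.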
