Converting a MatchCollection to string array
A

7

95

Is there a better way than this to convert a MatchCollection to a string array?

MatchCollection mc = Regex.Matches(strText, @"\b[A-Za-z-']+\b");
string[] strArray = new string[mc.Count];
for (int i = 0; i < mc.Count;i++ )
{
    strArray[i] = mc[i].Groups[0].Value;
}

P.S.: mc.CopyTo(strArray,0) throws an exception:

At least one element in the source array could not be cast down to the destination array type.
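A hedged note on the exception above: the collection holds Match objects, so copying into a string[] presumably fails because a Match cannot be cast to a string. A minimal sketch, assuming a Match[] destination is acceptable as an intermediate step (the class and method names here are hypothetical, not from the question):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CopyToSketch
{
    // Hypothetical helper: copies the matches into a Match[] first,
    // then converts each element to its matched text.
    public static string[] ToStringArray(MatchCollection mc)
    {
        // CopyTo works when the destination element type is Match.
        var matchArray = new Match[mc.Count];
        mc.CopyTo(matchArray, 0);

        // Array.ConvertAll maps every Match to its Value string.
        return Array.ConvertAll(matchArray, m => m.Value);
    }
}
```

Usage would look like `string[] strArray = CopyToSketch.ToStringArray(mc);`. This is only one option; the loop in the question does the same job without the intermediate array.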

Asafetida answered 10/7, 2012 at 15:0 Comment(0)
G
205

Try:

var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
Gallicize answered 10/7, 2012 at 15:2 Comment(9)
I would have used OfType<Match>() for this instead of Cast<Match>() ... Then again, the outcome would be the same. - Komatik
@Komatik You know that everything returned will be a Match, so there's no need to check it again at runtime. Cast makes more sense. - Ecg
@DaveBish I posted some sort-of benchmarking code below; OfType<> turns out to be slightly faster. - Komatik
@DaveBish: don't worry about OfType vs Cast performance. Your #1 performance dog is Regex.Matches. - Tajuanatak
For future visitors, this has been debated: https://mcmap.net/q/110525/-why-is-oftype-lt-gt-faster-than-cast-lt-gt - Precipitation
Is there any reason not to do this: var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b") .Cast<Match>() .Select(m => m.Value) .ToArray(); Since there aren't any parentheses in the pattern, I don't see a reason to use the Groups property unless it doesn't do what I think it does. - Avellaneda
@Frontenderman Nope, I was just aligning it with the asker's question. - Gallicize
You would think there would be a simple command to turn a MatchCollection into a string[], as there is for Match.ToString(). It's pretty obvious that the final type needed in a lot of Regex uses would be a string, so it should have been easy to convert. - Algo
@Algo I agree, though the first annoying thing is having to deal with the non-generic ICollection and IEnumerable types. To be totally fair, I'm pretty sure this API was made before C# even supported generics. - Turaco
K
33

Dave Bish's answer is good and works properly.

It's worth noting, though, that replacing Cast<Match>() with OfType<Match>() will speed things up.

The code would become:

var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
    .OfType<Match>()
    .Select(m => m.Groups[0].Value)
    .ToArray();

The result is exactly the same (and it addresses the OP's issue in exactly the same way), but for huge strings it's faster.

Test code:

// put it in a console application
static void Test()
{
    Stopwatch sw = new Stopwatch();
    StringBuilder sb = new StringBuilder();
    string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";

    Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
    strText = sb.ToString();

    sw.Start();
    var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
              .OfType<Match>()
              .Select(m => m.Groups[0].Value)
              .ToArray();
    sw.Stop();

    Console.WriteLine("OfType: " + sw.ElapsedMilliseconds.ToString());
    sw.Reset();

    sw.Start();
    var arr2 = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
              .Cast<Match>()
              .Select(m => m.Groups[0].Value)
              .ToArray();
    sw.Stop();
    Console.WriteLine("Cast: " + sw.ElapsedMilliseconds.ToString());
}

Output follows:

OfType: 6540
Cast: 8743

For very long strings Cast() is therefore slower.

Komatik answered 10/7, 2012 at 15:28 Comment(4)
Very surprising! OfType must do an 'is' comparison somewhere inside as well as a cast (I'd have thought?). Any ideas on why Cast<> is slower? I've got nothing! - Gallicize
I honestly don't have a clue, but it "feels" right to me (OfType<> is just a filter, Cast<> is ... well, is a cast). - Komatik
More benchmarks seem to show that this particular result is due to the regex more than to the specific LINQ extension used. - Komatik
I've written a more complex benchmark that hopefully answers this once and for all: https://mcmap.net/q/111935/-converting-a-matchcollection-to-string-array - Giusto
P
6

I ran the exact same benchmark that Alex posted and found that sometimes Cast was faster and sometimes OfType was faster, but the difference between the two was negligible. However, while ugly, the for loop is consistently faster than both.

Stopwatch sw = new Stopwatch();
StringBuilder sb = new StringBuilder();
string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
strText = sb.ToString();

//First two benchmarks

sw.Start();
MatchCollection mc = Regex.Matches(strText, @"\b[A-Za-z-']+\b");
var matches = new string[mc.Count];
for (int i = 0; i < matches.Length; i++)
{
    matches[i] = mc[i].ToString();
}
sw.Stop();

Results:

OfType: 3462
Cast: 3499
For: 2650
Pimental answered 14/5, 2014 at 13:55 Comment(3)
No surprise that LINQ is slower than a for loop. LINQ may be easier to write for some people and "increase" their productivity at the expense of execution time. That can be a good trade-off sometimes. - Threesquare
So the original post is really the most efficient method. - Algo
I've written a more precise benchmark and it's not as clear-cut as this: https://mcmap.net/q/111935/-converting-a-matchcollection-to-string-array - Giusto
T
4

One could also make use of this extension method to deal with the annoyance of MatchCollection not being generic. Not that it's a big deal, but this is almost certainly more performant than OfType or Cast, because it's just enumerating, which both of those also have to do.

(Side note: I wonder if it would be possible for the .NET team to make MatchCollection inherit generic versions of ICollection and IEnumerable in the future? Then we wouldn't need this extra step to immediately have LINQ transforms available).

public static IEnumerable<Match> ToEnumerable(this MatchCollection mc)
{
    if (mc != null) {
        foreach (Match m in mc)
            yield return m;
    }
}
Turaco answered 14/2, 2018 at 18:23 Comment(0)
G
1

Since netcore3.0/netstandard2.1 the MatchCollection class implements the generic IEnumerable<Match> interface (thanks Poul Bak for pointing that out). This makes .Cast/.OfType unnecessary on these newer .NET targets. I've wrapped it all in a neat extension method:

public static class RegexpExtensions
{
    public static IEnumerable<string> AsStrings(this MatchCollection matches)
#if NETCOREAPP3_0_OR_GREATER
        => matches.Select(m => m.ToString());
#else
        => matches.Cast<Match>().Select(m => m.ToString());
#endif
}

I chose to return an IEnumerable<string> because the MatchCollection is enumerated lazily, while calling .ToArray() forces the whole MatchCollection to be populated, which is expensive (see the MatchCollection remarks). If you really need a string array, simply chain a call to it: matches.AsStrings().ToArray().
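To illustrate the laziness point, here is a minimal sketch that takes only the first few matches. The class and method names are hypothetical, and it uses .Cast<Match>() so that it compiles on older targets too. Going by the MatchCollection remarks mentioned above, the regex should only need to scan as far as the requested matches, though that has not been measured here.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LazyMatchesSketch
{
    // Hypothetical helper: returns at most `count` matched words.
    public static List<string> FirstWords(string text, int count)
    {
        return Regex.Matches(text, @"\b[A-Za-z-']+\b")
            .Cast<Match>()          // non-generic collection -> IEnumerable<Match>
            .Select(m => m.Value)   // matched text of each Match
            .Take(count)            // stop enumerating after `count` items
            .ToList();
    }
}
```

Note that the sketch never touches Count, since reading Count would populate the whole collection.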

The accepted answer is indeed good and correct, but there's an open debate about which is faster, OfType or Cast. In addition, starting with netcore 3.0 or netstandard 2.1, MatchCollection implements the generic ICollection<Match> interface, which allows us to use Select without .Cast. People did some benchmarking, but all the benchmarks I have seen were poorly done: they didn't use a proper benchmarking tool, and they included the regex run time in the results. So I decided to put the approaches to the test with a proper benchmark using the amazing BenchmarkDotNet. I've also included the memory diagnoser to see whether allocations and memory usage differ between approaches. Here are the results (with a netcore3.1+ version added for completeness):

BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.3930/22H2/2022Update)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 7.0.403
  [Host]     : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2 [AttachedDebugger]
  DefaultJob : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2
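| Method      | Mean      | Error    | Gen0      | Gen1      | Gen2     | Allocated |
|-------------|-----------|----------|-----------|-----------|----------|-----------|
| BenchSelect | 89.37 ms  | 0.796 ms | 3500.0000 | 3333.3333 | 166.6667 | 67.9 MB   |
| BenchCast   | 89.72 ms  | 1.132 ms | 3500.0000 | 3333.3333 | 166.6667 | 67.9 MB   |
| BenchOfType | 120.70 ms | 1.069 ms | 3500.0000 | 3333.3333 | 166.6667 | 83.9 MB   |
| BenchNet6   | 88.54 ms  | 1.027 ms | 3500.0000 | 3333.3333 | 166.6667 | 67.9 MB   |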

These results are exactly what I expected: Cast is faster than OfType and allocates less memory (67.9 MB versus 83.9 MB), although the difference in speed is negligible for most use cases. The netcore3.1+ version without .Cast is slightly faster still, but not noticeably so: the gap is within the benchmark's margin of error and changes between runs, with BenchSelect sometimes beating it.

The benchmark code follows (updated with the netcore3.1+ version):

using System.Text.RegularExpressions;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;

BenchmarkRunner.Run<OfTypeVsCastBenchmark>();
//BenchmarkRunner.Run<ConvertMatchColletionToStringArrayBenchmarks>();

[HideColumns("matches")]
[MemoryDiagnoser]
public class OfTypeVsCastBenchmark
{
    public OfTypeVsCastBenchmark() {
        StringBuilder sb = new StringBuilder();
        string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
        Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
        Text = sb.ToString();
    }

    public string Text { get; }

    public IEnumerable<object> Matches() {
        yield return Regex.Matches(Text, @"\b[A-Za-z-']+\b");
    }

    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchSelect(MatchCollection matches) => matches
        .Cast<Match>().Select(m => m.ToString()).ToArray();

    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchCast(MatchCollection matches) => matches
        .Cast<Match>()
        .Select(m => m.ToString())
        .ToArray();

    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchOfType(MatchCollection matches) => matches
          .OfType<Match>()
          .Select(m => m.ToString())
          .ToArray();

#if NETCOREAPP3_0_OR_GREATER
    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchNet6(MatchCollection matches) => matches
        .Select(m => m.ToString()).ToArray();
#endif
}

This answers once and for all which Linq-based solution is faster, which is Cast. But what is the fastest way to convert a MatchCollection to a string[]?

Is foreach or for faster than Linq?

To see whether using LINQ to convert a MatchCollection into a string[] is wasteful, I've written two variants to compare:

public string[] ConvertByForEachIntoListThenToArray() {
    var list = new List<string>();
    foreach (Match match in regex.Matches(Text))
        list.Add(match.Groups[0].Value);
    return list.ToArray();

}

public string[] ConvertByForEachIntoAllocatedArray() {
    var matches = regex.Matches(Text);
    var result = new string[matches.Count];
    for (int i = 0; i < matches.Count; i++)
        result[i] = matches[i].Groups[0].Value;
    return result;
}

Surprisingly, using for or foreach to iterate over the MatchCollection and build the array manually is not noticeably faster than the more expressive LINQ version. What surprised me more is that filling a List<string> and then calling ToArray() on it is faster than preallocating the array and filling it. This happens because the MatchCollection is lazily evaluated when accessed through its enumerator, whereas accessing Count populates the whole collection up front, which is expensive. Also, .NET 7 is noticeably faster than .NET Framework 4.7.1, so if you can, you should migrate to .NET 7!

Here are the results:

BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.3930/22H2/2022Update)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 7.0.403
  [Host]               : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 7.0             : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2
  .NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
| Method                              | Runtime              | Mean       | Allocated |
|-------------------------------------|----------------------|------------|-----------|
| ConvertByLinqWithCastToArray        | .NET 7.0             | 796.9 ms   | 501.97 MB |
| ConvertByForEachIntoListThenToArray | .NET 7.0             | 732.9 ms   | 533.97 MB |
| ConvertByForEachIntoAllocatedArray  | .NET 7.0             | 802.1 ms   | 501.97 MB |
| ConvertByLinqWithCastToArray        | .NET Framework 4.7.1 | 1,199.5 ms | 542.97 MB |
| ConvertByForEachIntoListThenToArray | .NET Framework 4.7.1 | 1,177.6 ms | 542.97 MB |
| ConvertByForEachIntoAllocatedArray  | .NET Framework 4.7.1 | 1,224.8 ms | 510.98 MB |

And here is the class used for this specific benchmark:

[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net471)]
[SimpleJob(RuntimeMoniker.Net70)]
public class ConvertMatchColletionToStringArrayBenchmarks
{
    public ConvertMatchColletionToStringArrayBenchmarks() {
        StringBuilder sb = new StringBuilder();
        string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
        Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
        Text = sb.ToString();
    }

    public string Text { get; }

    Regex regex = new Regex(@"\b[A-Za-z-']+\b", RegexOptions.Compiled);

    [Benchmark]
    public string[] ConvertByLinqWithCastToArray() => regex
        .Matches(Text)
        .Cast<Match>()
        .Select(m => m.Groups[0].Value)
        .ToArray();
    [Benchmark]
    public string[] ConvertByForEachIntoListThenToArray() { 
        var list = new List<string>();
        foreach (Match match in regex.Matches(Text))
            list.Add(match.Groups[0].Value);
        return list.ToArray();

    }
    [Benchmark]
    public string[] ConvertByForEachIntoAllocatedArray() {
        var matches = regex.Matches(Text);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Groups[0].Value;
        return result;
    }
}
Giusto answered 18/1 at 21:16 Comment(4)
Your benchmark is not necessary anymore: learn.microsoft.com/en-us/dotnet/api/… - Cholecalciferol
@PoulBak Why not? - Giusto
.Cast<Match>() is not necessary. - Cholecalciferol
@PoulBak Oh, I see. You're assuming all of us have moved on to netcore3.0+ or netstandard2.1... If you're targeting netstandard2.0 (which most of my libraries do) or if you are stuck with a .NET Framework app (as I have been since 2011, maintaining a huge WebForms product), then this benchmark is somewhat useful. It's not very relevant, because the difference in performance is negligible for the vast majority of use cases, but it's still interesting. I'll add a version for NET6.0 to the benchmark for completeness. - Giusto
C
0

Consider the following code...

var emailAddress = "[email protected]; [email protected]; [email protected]";
List<string> emails = Regex.Matches(emailAddress, @"([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                .Cast<Match>()
                .Select(m => m.Groups[0].Value)
                .ToList();
Correct answered 22/11, 2013 at 2:1 Comment(1)
Ugh... That regex is horrendous to look at. BTW, as there is no foolproof regex for validating emails, use the MailAddress object instead. https://mcmap.net/q/18715/-how-can-i-validate-an-email-address-using-a-regular-expression - Cytogenesis
F
0

If you need a recursive capture, e.g. for tokenizing math equations:

// INPUT (I need this tokenized to do math)
string sTests = "(1234+5678)/ (56.78-   1234   )";

Regex splitter = new Regex(@"([\d,\.]+|\D)+");
Match match = splitter.Match(sTests.Replace(" ", ""));
string[] captures = (from capture in match.Groups.Cast<Group>().Last().Captures.Cast<Capture>()
                     select capture.Value).ToArray();

...because you need to go after the last captures in the last group.
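As a hedged alternative (not this answer's method), the same tokens can probably be obtained without repeated captures. The sketch below drops the outer repetition and lets Regex.Matches return one Match per token, then converts with the Cast/Select approach from the accepted answer. The class and method names are hypothetical.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Hypothetical helper: one match per token instead of one match
    // with many captures. Numbers (digits, commas, dots) are kept
    // together; any other character becomes its own token.
    public static string[] Tokenize(string expression)
    {
        string compact = expression.Replace(" ", "");
        return Regex.Matches(compact, @"[\d,\.]+|\D")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

For the input above, this yields the tokens (, 1234, +, 5678, ), /, (, 56.78, -, 1234, and ).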

Falciform answered 15/8, 2021 at 15:16 Comment(0)
