Since netcore3.0/netstandard2.1 the MatchCollection class implements the generic IEnumerable<Match> interface (thanks Poul Bak for pointing that out). This makes .Cast/.OfType unnecessary on these newer .NET targets. I've wrapped it all in a neat extension method:
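using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexpExtensions
{
    // Projects each match to its matched text; enumeration stays lazy.
    public static IEnumerable<string> AsStrings(this MatchCollection matches)
#if NETCOREAPP3_0_OR_GREATER
        // MatchCollection implements IEnumerable<Match> here, so no cast is needed.
        => matches.Select(m => m.ToString());
#else
        // Older targets expose only the non-generic IEnumerable.
        => matches.Cast<Match>().Select(m => m.ToString());
#endif
}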
I chose to return an IEnumerable<string> because a MatchCollection is enumerated lazily, whereas calling .ToArray forces the whole MatchCollection to be populated, which is expensive (see the MatchCollection remarks). If you really need a string array, simply chain the call: matches.AsStrings().ToArray().
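As a rough illustration of what the extension method does at the call site, here is a minimal sketch that inlines the .Cast<Match>() branch, which compiles on every target. The sample sentence and the variable names are made up for the example; the pattern is the one used in the benchmarks further down.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input, chosen only for illustration.
MatchCollection matches = Regex.Matches("it's a well-known fact", @"\b[A-Za-z-']+\b");

// Lazy: nothing is projected until the sequence is enumerated.
IEnumerable<string> words = matches.Cast<Match>().Select(m => m.ToString());

// Materialize only when an array is really needed.
string[] array = words.ToArray();

Console.WriteLine(string.Join(", ", array)); // it's, a, well-known, fact
```

If only the first few matches are needed, keeping the IEnumerable<string> and enumerating part of it avoids building the array at all.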
The accepted answer is indeed good and correct, but there's an open debate about which is faster, OfType or Cast. In addition, starting with netcore 3.0 or netstandard 2.1, MatchCollection implements the generic ICollection<Match> interface, which allows us to use Select without .Cast. People did some benchmarking, but all the benchmarks I've seen were poorly done: they didn't use a proper benchmarking tool, and they included the regex run time in the results. So I decided to put the approaches to the test with a properly done benchmark using the amazing BenchmarkDotNet. I've also included the memory diagnoser to see whether allocations and memory usage differ between the approaches. Here are the results (with a netcore3.1+ version added for completeness):
BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.3930/22H2/2022Update)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 7.0.403
[Host] : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2 [AttachedDebugger]
DefaultJob : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2
| Method      | Mean      | Error    | Gen0      | Gen1      | Gen2     | Allocated |
|-------------|-----------|----------|-----------|-----------|----------|-----------|
| BenchSelect | 89.37 ms  | 0.796 ms | 3500.0000 | 3333.3333 | 166.6667 | 67.9 MB   |
| BenchCast   | 89.72 ms  | 1.132 ms | 3500.0000 | 3333.3333 | 166.6667 | 67.9 MB   |
| BenchOfType | 120.70 ms | 1.069 ms | 3500.0000 | 3333.3333 | 166.6667 | 83.9 MB   |
| BenchNet6   | 88.54 ms  | 1.027 ms | 3500.0000 | 3333.3333 | 166.6667 | 67.9 MB   |
These results are exactly what I expected: Cast is faster than OfType (about 90 ms versus 121 ms for this input) and allocates less memory (67.9 MB versus 83.9 MB). The netcore3.1+ version without .Cast is slightly faster still, but nothing noticeable: the gap is within the benchmark's margin of error and changes between runs, with BenchSelect sometimes beating it.
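Speed aside, the two operators also behave differently, which is worth keeping in mind when choosing between them: OfType<T> filters out elements that are not a T, while Cast<T> casts every element and throws on a mismatch. A MatchCollection only ever contains Match objects, so both return the same elements there. The sketch below uses a made-up mixed array purely to show the difference.

```csharp
using System;
using System.Linq;

// Hypothetical mixed input; a MatchCollection never looks like this.
object[] mixed = { "alpha", 42, "beta" };

// OfType silently skips the element that is not a string.
string[] filtered = mixed.OfType<string>().ToArray();
Console.WriteLine(string.Join(", ", filtered)); // alpha, beta

// Cast attempts the conversion on every element and fails on the int.
bool castThrew = false;
try
{
    _ = mixed.Cast<string>().ToArray();
}
catch (InvalidCastException)
{
    castThrew = true;
}
Console.WriteLine(castThrew); // True
```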
The benchmark code follows (updated with the netcore3.1+ version):
using System.Text.RegularExpressions;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Jobs;

BenchmarkRunner.Run<OfTypeVsCastBenchmark>();
//BenchmarkRunner.Run<ConvertMatchColletionToStringArrayBenchmarks>();

[HideColumns("matches")]
[MemoryDiagnoser]
public class OfTypeVsCastBenchmark
{
    public OfTypeVsCastBenchmark() {
        StringBuilder sb = new StringBuilder();
        string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
        Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
        Text = sb.ToString();
    }

    public string Text { get; }

    public IEnumerable<object> Matches() {
        yield return Regex.Matches(Text, @"\b[A-Za-z-']+\b");
    }

    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchSelect(MatchCollection matches) => matches
        .Cast<Match>().Select(m => m.ToString()).ToArray();

    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchCast(MatchCollection matches) => matches
        .Cast<Match>()
        .Select(m => m.ToString())
        .ToArray();

    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchOfType(MatchCollection matches) => matches
        .OfType<Match>()
        .Select(m => m.ToString())
        .ToArray();

#if NETCOREAPP3_0_OR_GREATER
    [Benchmark, ArgumentsSource(nameof(Matches))]
    public string[] BenchNet6(MatchCollection matches) => matches
        .Select(m => m.ToString()).ToArray();
#endif
}
This answers once and for all which LINQ-based solution is faster: Cast. But what is the fastest way to convert a MatchCollection to a string[]?
Is foreach or for faster than Linq?
To see whether using LINQ to convert a MatchCollection into a string[] is wasteful, I've written two variants to compare:
public string[] ConvertByForEachIntoListThenToArray() {
    var list = new List<string>();
    foreach (Match match in regex.Matches(Text))
        list.Add(match.Groups[0].Value);
    return list.ToArray();
}

public string[] ConvertByForEachIntoAllocatedArray() {
    var matches = regex.Matches(Text);
    var result = new string[matches.Count];
    for (int i = 0; i < matches.Count; i++)
        result[i] = matches[i].Groups[0].Value;
    return result;
}
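As a quick sanity check, the two loops above can be run side by side on a short string to confirm that they produce the same array. This is only a sketch: the sample text and the variable names are assumptions for the example, and nothing is being timed here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var regex = new Regex(@"\b[A-Za-z-']+\b");
const string text = "this will become a very long string"; // hypothetical short input

// Variant 1: enumerate the matches, collect into a list, then copy to an array.
var list = new List<string>();
foreach (Match match in regex.Matches(text))
    list.Add(match.Groups[0].Value);
string[] viaList = list.ToArray();

// Variant 2: read Count first (which populates the collection), then fill by index.
MatchCollection matches = regex.Matches(text);
var viaArray = new string[matches.Count];
for (int i = 0; i < matches.Count; i++)
    viaArray[i] = matches[i].Groups[0].Value;

Console.WriteLine(viaList.SequenceEqual(viaArray)); // True
```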
Surprisingly, using for or foreach to iterate over the MatchCollection and build the array manually is not noticeably faster than the more expressive LINQ version. The bigger surprise for me is that filling a List<string> and then calling ToArray() on it is faster than preallocating the array and filling it. This happens because the MatchCollection is evaluated lazily when accessed through its enumerator, whereas accessing Count populates the whole collection up front, which is expensive. Also, .NET 7 is noticeably faster than .NET Framework 4.7.1 (roughly 730-800 ms versus 1,180-1,225 ms here), so if you can, you should migrate to .NET 7!
Here are the results:
BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.3930/22H2/2022Update)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 7.0.403
[Host] : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2 [AttachedDebugger]
.NET 7.0 : .NET 7.0.13 (7.0.1323.51816), X64 RyuJIT AVX2
.NET Framework 4.7.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256
| Method                              | Runtime              | Mean       | Allocated |
|-------------------------------------|----------------------|------------|-----------|
| ConvertByLinqWithCastToArray        | .NET 7.0             | 796.9 ms   | 501.97 MB |
| ConvertByForEachIntoListThenToArray | .NET 7.0             | 732.9 ms   | 533.97 MB |
| ConvertByForEachIntoAllocatedArray  | .NET 7.0             | 802.1 ms   | 501.97 MB |
| ConvertByLinqWithCastToArray        | .NET Framework 4.7.1 | 1,199.5 ms | 542.97 MB |
| ConvertByForEachIntoListThenToArray | .NET Framework 4.7.1 | 1,177.6 ms | 542.97 MB |
| ConvertByForEachIntoAllocatedArray  | .NET Framework 4.7.1 | 1,224.8 ms | 510.98 MB |
And here is the class used for this specific benchmark:
[MemoryDiagnoser]
[SimpleJob(RuntimeMoniker.Net471)]
[SimpleJob(RuntimeMoniker.Net70)]
public class ConvertMatchColletionToStringArrayBenchmarks
{
    public ConvertMatchColletionToStringArrayBenchmarks() {
        StringBuilder sb = new StringBuilder();
        string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
        Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
        Text = sb.ToString();
    }

    public string Text { get; }

    Regex regex = new Regex(@"\b[A-Za-z-']+\b", RegexOptions.Compiled);

    [Benchmark]
    public string[] ConvertByLinqWithCastToArray() => regex
        .Matches(Text)
        .Cast<Match>()
        .Select(m => m.Groups[0].Value)
        .ToArray();

    [Benchmark]
    public string[] ConvertByForEachIntoListThenToArray() {
        var list = new List<string>();
        foreach (Match match in regex.Matches(Text))
            list.Add(match.Groups[0].Value);
        return list.ToArray();
    }

    [Benchmark]
    public string[] ConvertByForEachIntoAllocatedArray() {
        var matches = regex.Matches(Text);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Groups[0].Value;
        return result;
    }
}
OfType<Match>() for this instead of Cast<Match>() ... Then again, the outcome would be the same. – Komatik