This post is the top Stack Overflow hit when I search "searching large text files" tagged with c#. Although the problem still exists, some things have changed since this answer was originally written: 300-600 MB is no longer considered a large file, and the performance of System.Text.RegularExpressions.Regex has improved greatly. For these reasons I feel it's fair to update the answer.
In short, System.Text.RegularExpressions.Regex from the current version of .NET is going to be very fast for just about any search you can come up with.
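For example, a plain streaming scan keeps memory usage flat regardless of file size. This is just a minimal sketch; the file name and pattern are placeholders:

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Count matching lines without loading the whole file into memory.
// File.ReadLines is lazy, so only one line is buffered at a time.
var regex = new Regex(@"\berror\b", RegexOptions.IgnoreCase);
long matches = 0;

foreach (var line in File.ReadLines("huge-log.txt")) // placeholder path
{
    if (regex.IsMatch(line))
        matches++;
}

Console.WriteLine($"{matches} matching lines");
```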
Starting with .NET 7, Regex incorporates 4 different engines depending on how it's instantiated. These engines provide highly optimized searching, "in many cases, to the point where it ends up being significantly better than Boyer-Moore in all but the most corner of corner cases."
Of the 4 engines, using RegexOptions.Compiled or GeneratedRegex will produce the fastest code (i.e. the best best-case performance). For most cases this is a good solution.
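Here is a minimal sketch of both options (the pattern and type names are arbitrary; GeneratedRegex requires .NET 7+ and a partial method in a partial type):

```csharp
using System;
using System.Text.RegularExpressions;

// Option 1: RegexOptions.Compiled compiles the pattern to IL at run time.
var compiled = new Regex(@"\d{4}-\d{2}-\d{2}", RegexOptions.Compiled);
Console.WriteLine(compiled.IsMatch("2023-01-15")); // True

// Option 2: [GeneratedRegex] does the same work at build time via a source
// generator, avoiding the run-time compilation cost on startup.
Console.WriteLine(Patterns.IsoDate().IsMatch("2023-01-15")); // True

static partial class Patterns
{
    [GeneratedRegex(@"\d{4}-\d{2}-\d{2}")]
    public static partial Regex IsoDate();
}
```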
However, if your use case needs predictable worst-case behavior or is exposed to hostile input (e.g. regex denial-of-service attacks), then using RegexOptions.NonBacktracking will provide "the best worst-case performance" "in exchange for reduced best-case performance" by switching to an engine based on finite automata, which "guarantees it'll only ever do an amortized-constant amount of work per character in the input."
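To illustrate, the classic catastrophic-backtracking pattern is harmless under NonBacktracking (the input size here is arbitrary):

```csharp
using System;
using System.Text.RegularExpressions;

// "(a+)+$" is a textbook catastrophic-backtracking pattern. Against this
// deliberately non-matching input, a backtracking engine can take
// exponential time; NonBacktracking stays linear in the input length.
var hostile = new string('a', 30) + "!";

var safe = new Regex(@"^(a+)+$", RegexOptions.NonBacktracking);
Console.WriteLine(safe.IsMatch(hostile)); // False, returns immediately
```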
Here is Stephen Toub's full blog post about the many impressive optimizations added to Regex in .NET 7.
To further boost the performance of System.Text.RegularExpressions.Regex through parallelism, or to process files that exceed RAM, you may also want to have a look at Gigantor. The sketch below shows the basic parallelism idea.
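Just to show the idea with nothing but the BCL (this is not Gigantor's API): matching on a shared Regex instance is thread-safe, so the lines can be partitioned across cores with PLINQ:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Partition the match work across cores; Regex matching is thread-safe,
// so a single instance can be shared. The file name is a placeholder.
var regex = new Regex(@"\berror\b", RegexOptions.Compiled);

long matches = File.ReadLines("huge-log.txt")
    .AsParallel()
    .LongCount(line => regex.IsMatch(line));

Console.WriteLine($"{matches} matching lines");
```

A purpose-built library can go further than this simple line-by-line partitioning, so treat this as a baseline rather than a replacement.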