Searching large text file
I am trying to optimize the search for a string in a large text file (300-600 MB), and my current method is taking too long.

Currently I am using IndexOf to search each line for the string, but building an index of every line that contains it takes far too long (around 20 s).

How can I optimize the search speed? I've tried Contains(), but that is slow as well. Any suggestions? I was thinking of a regex match, but I don't see that giving a significant speed boost. Maybe my search logic is flawed.

Example:

while ((line = myStream.ReadLine()) != null)
{
    if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
    {
        LineIndex.Add(CurrentPosition);
        LinesCounted += 1;
    }
}
Bankable answered 19/12, 2012 at 19:10 Comment(5)
What are you searching for exactly? Words? – Johnnyjohnnycake
What is your CompareString? Please show an example of what you are looking for. – Dulcine
Are you sure it's the searching part? How long does it take to do no checking whatsoever and just read the file line by line? – Surber
Without knowing what the file contents are and what you're searching for in it, this is hard to answer. You'll get very different results if you're searching for a phrase in a text document compared to a word in a list of alphabetized words. – Encratia
Sorry, let me specify what I am searching for. I am looking at a large log file; for example, a line could read 61 - order for burger [9=1, 51=0, 59=1], where 9, 51, 59 are hashes for, say, toppings (ketchup (9) = yes, mayo (51) = no, mustard (59) = yes). So a search could be 'order' (displays all orders) or '51=0' (displays all orders where mayo was not used). Without searching, I can load the file in ~5 seconds reading line by line, but with searching the way I have implemented it, it takes much longer. So it is definitely the way I am searching that is slowing it down. – Bankable
The brute-force algorithm you're using performs in O(nm) time, where n is the length of the string being searched and m is the length of the substring/pattern you're trying to find. You need to use a string-search algorithm, such as Knuth-Morris-Pratt or Boyer-Moore.

However, using a regular expression crafted with care might be sufficient, depending on what you are trying to find. See Jeffrey Friedl's tome, Mastering Regular Expressions, for help on building efficient regular expressions (e.g., no backtracking).
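As an illustration of that advice, a carefully built pattern avoids backtracking-prone constructs (nested quantifiers, ambiguous alternation) and is precompiled once and reused. This is only a sketch; the @"\b51=0\b" pattern is a hypothetical search taken from the question's log-file example, not anything from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexSearchExample
{
    // Compiled once and reused across lines; literal pattern with word
    // boundaries, so the engine has nothing to backtrack over.
    static readonly Regex OrderPattern =
        new Regex(@"\b51=0\b", RegexOptions.Compiled | RegexOptions.CultureInvariant);

    static void Main()
    {
        string line = "61 - order for burger [9=1, 51=0, 59=1]";
        Console.WriteLine(OrderPattern.IsMatch(line)); // True
    }
}
```

Reusing one Regex instance matters here: constructing a compiled regex per line would dominate the runtime.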

You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java]).

Circumbendibus answered 19/12, 2012 at 19:23 Comment(1)
Thanks! I will try using a regex search; if it is too slow, I will look into the different search algorithms you listed above. – Bankable
Unfortunately, I don't think there's a whole lot you can do in straight C#.

I have found the Boyer-Moore algorithm to be extremely fast for this task, but there was no way to make even that as fast as IndexOf. My assumption is that this is because IndexOf is implemented in hand-optimized assembler while my code ran in C#.

You can see my code and performance test results in the article Fast Text Search with Boyer-Moore.
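This is not the article's code, but a minimal sketch of the simplified Boyer-Moore-Horspool variant for ordinal, case-sensitive matching may make the idea concrete: precompute how far the pattern can safely slide on a mismatch, then scan right-to-left within each alignment.

```csharp
using System;

static class Horspool
{
    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > text.Length) return -1;

        // Shift table: how far the pattern may slide when the text character
        // aligned with the pattern's last position is c. (char.MaxValue + 1
        // entries covers all UTF-16 code units; ~256 KB, fine for a sketch.)
        var shift = new int[char.MaxValue + 1];
        for (int i = 0; i <= char.MaxValue; i++) shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;                       // full match
            pos += shift[text[pos + pattern.Length - 1]]; // slide and retry
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(IndexOf("order for burger", "burger")); // 10
    }
}
```

The long shifts on mismatches are why Boyer-Moore-style searches examine fewer characters than a naive scan, especially for longer patterns.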

Bree answered 19/12, 2012 at 19:42 Comment(5)
Hm, so you would suggest IndexOf is the fastest way I can search for a simple string? So far using this method has increased my file reading to about 30 s. I guess I will see if there are any alternatives to increase search speed... – Bankable
Yes, if your search is case-sensitive and culture-sensitive. Otherwise, the considerations change. – Bree
No, my search is neither case-sensitive nor culture-sensitive; it's a simple string text search. I was wondering if IndexOf is the fastest that can be implemented for this task in C#. If it is, then I would need to change my design and choose another platform. – Bankable
The answer depends on your needs. I don't know if you read the article I referenced in my original answer. If you are doing a case-insensitive search, then IndexOf() becomes quite a bit slower, in which case you could try the Boyer-Moore algorithms I present in that article. For super-fast search, you'd need to build an index. It depends on your needs. – Bree
Thanks, I was just reading through it. I'll try some tests and see the performance for C# before jumping ship. – Bankable
This post is the top Stack Overflow hit when I search "searching large text files" tagged with c#. Although the problem still exists, some things have changed since this post was originally made: 300-600 MB is no longer considered a large file, and the performance of System.Text.RegularExpressions.Regex has been greatly improved. For these reasons I feel it's fair to update the answer.

In short, using System.Text.RegularExpressions.Regex from the current version of .NET is going to be very fast for just about any search you can come up with. It's gotten really fast.

Starting with .NET 7, Regex incorporates 4 different engines depending on how it's instantiated. These engines provide highly optimized searching, "in many cases, to the point where it ends up being significantly better than Boyer-Moore in all but the most corner of corner cases."

Of the 4 engines, using RegexOptions.Compiled or GeneratedRegex will produce the fastest code (i.e., the best best-case performance). For most cases this is a good solution.

However, if your use case needs maximum stability or is susceptible to input abuse, then using RegexOptions.NonBacktracking will provide "the best worst-case performance" "in exchange for reduced best-case performance" by switching to an engine based on finite automata, which "guarantees it'll only ever do an amortized-constant amount of work per character in the input."
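Opting into that automaton-based engine is a one-line change; a minimal sketch (reusing the question's hypothetical '51=0' search) might look like:

```csharp
using System;
using System.Text.RegularExpressions;

class NonBacktrackingDemo
{
    static void Main()
    {
        // RegexOptions.NonBacktracking (added in .NET 7) selects the
        // finite-automata engine: linear-time scanning, in exchange for
        // giving up features such as backreferences and lookarounds.
        var rx = new Regex("51=0", RegexOptions.NonBacktracking);
        Console.WriteLine(rx.IsMatch("61 - order for burger [9=1, 51=0, 59=1]")); // True
    }
}
```

For plain-literal searches like this one, either engine is fast; NonBacktracking mainly pays off when the pattern or the input could otherwise trigger pathological backtracking.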

Here is Stephen Toub's full blog post about the many impressive optimizations added to Regex in .NET 7.

To further boost the performance of System.Text.RegularExpressions.Regex through parallelism or to process files that exceed RAM you may also want to have a look at Gigantor.

Wilkens answered 7/6, 2023 at 1:21 Comment(0)
Have you seen these questions (and answers)?

Doing it the way you are now seems to be the way to go if all you want to do is read the text file. Other ideas:

  • If it is possible to pre-sort the data, such as when it gets inserted into the text file, that could help.

  • You could insert the data into a database and query it as needed.

  • You could use a hash table.
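The hash-table idea in the list above could be sketched roughly as follows. TokenIndex is a hypothetical helper, and it assumes the question's log lines split cleanly on spaces, commas, and brackets; it trades one indexing pass and extra memory for constant-time lookups on repeated searches.

```csharp
using System;
using System.Collections.Generic;

class TokenIndex
{
    // Maps each token (e.g. "order", "51=0") to the numbers of the lines
    // containing it, so repeated searches become dictionary lookups.
    readonly Dictionary<string, List<int>> index =
        new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    public void AddLine(int lineNumber, string line)
    {
        foreach (var token in line.Split(new[] { ' ', ',', '[', ']' },
                                         StringSplitOptions.RemoveEmptyEntries))
        {
            if (!index.TryGetValue(token, out var lines))
                index[token] = lines = new List<int>();
            lines.Add(lineNumber);
        }
    }

    public IReadOnlyList<int> Find(string token) =>
        index.TryGetValue(token, out var lines)
            ? (IReadOnlyList<int>)lines
            : Array.Empty<int>();

    static void Main()
    {
        var idx = new TokenIndex();
        idx.AddLine(1, "61 - order for burger [9=1, 51=0, 59=1]");
        Console.WriteLine(idx.Find("51=0").Count); // 1
    }
}
```

This only helps whole-token queries like '51=0' or 'order'; substring searches would still need one of the scanning approaches from the other answers.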

Skelp answered 19/12, 2012 at 19:15 Comment(0)
You can use Regex.Match(String). A regex match can be faster.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "One car red car blue car";
        string pat = @"(\w+)\s+(car)";

        // Instantiate the regular expression object.
        Regex r = new Regex(pat, RegexOptions.IgnoreCase);

        // Match the regular expression pattern against a text string.
        Match m = r.Match(text);
        int matchCount = 0;
        while (m.Success)
        {
            Console.WriteLine("Match " + (++matchCount));
            for (int i = 1; i <= 2; i++)
            {
                Group g = m.Groups[i];
                Console.WriteLine("Group " + i + " = '" + g + "'");
                CaptureCollection cc = g.Captures;
                for (int j = 0; j < cc.Count; j++)
                {
                    Capture c = cc[j];
                    Console.WriteLine("Capture " + j + " = '" + c + "', Position = " + c.Index);
                }
            }
            m = m.NextMatch();
        }
    }
}

Amyloid answered 19/12, 2012 at 19:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.