High performance "contains" search in list of strings in C#

Asked 29/9, 2011 at 16:16 Answered 18/12, 2018 at 8:55

I have a list of approx. 500,000 strings, each approx. 100 characters long. Given a search term, I want to identify all strings in the list that contain the search term. At the moment I am doing this with a plain old dataset using the Select method ("MATCH %term%"). This takes about 600ms on my laptop. I'd like to make it faster, maybe 100-200ms.

What would be a recommended approach?

Performance is critical so I can trade memory footprint for better performance if necessary (within reason). The list of strings will not change once initialised so calculating hashes would also be an option.

Does anyone have a recommendation and which C# data structures are best suited to the task?

Imposing answered 29/9, 2011 at 16:16 Comment(9)

This task seems quite suited for parallel execution. How many CPU cores you got? That plus Boyer Moore should give good enough improvement. – Kriss 29/9, 2011 at 16:20

Have you tried loading into a database and let the database engine do what its good at doing? – Virginity 29/9, 2011 at 16:25

Are you looking for words or arbitrary substrings? – Hoem 29/9, 2011 at 16:27

MongusPong: I don't have a database and I don't think this would be any quicker. A DB could not take advantage of indexing because of the "LIKE %...%" clause. I'm looking for a tailor-made optimization. Gabe: Arbitrary substrings. If I were looking for words I could tokenize the strings and build my own index. – Imposing 30/9, 2011 at 8:31

@Kriss Why BM? There are multiple string search algorithms (Aho-Corasick, Wu-Manber) or tries. – Amperage 30/9, 2011 at 14:21

@KonradRudolph: It would be simple, and only a one time setup (or any of the other classic string search algorithms). I dont know the others you mention, but tries would certainly be more inefficient ito complexity and that it is not really suited for a single string search. – Kriss 30/9, 2011 at 15:25

Protip (IIRC, .NET 1.1): A Regex will attempt a Boyer Moore search for simple patterns, eg "foobar". – Kriss 30/9, 2011 at 15:27

@Kriss Sorry, I mistakenly assumed that the OP needed a multi-pattern search. – Amperage 30/9, 2011 at 21:19

Can you share your final option? @nrj101 ? – Laminous 27/9, 2018 at 8:26

I've heard good things about Lucene.NET when it comes to performing quick full-text searches. They've done the work to figure out the fastest data structures and such to use. I'd suggest giving that a shot.

Otherwise, you might just try something like this:

var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();

But it probably won't get you down to 100ms.

Hereditament answered 29/9, 2011 at 16:23 Comment(3)

Thanks for suggestion. Lucene seems like quite a big overhead (in terms of framework) for just this one situation but I could look at it. Do you know if it handles arbitrary substrings? S free-text search will not help if it tokenizes the strings into words. – Imposing 30/9, 2011 at 8:33

@njreed.myopenid.com: I updated my answer based on my comments on another question. Given the requirements, though, you'll likely need an algorithm change to get the performance boost you're looking for. – Hereditament 30/9, 2011 at 14:16

Thanks @StriplingWarrior. The search is now down to about 50ms. I don't need to implement any special algorithm. The combination of using Linq with a list of strings was enough. Guess the dataset was brining more overhead with it than I thought. – Imposing 13/10, 2011 at 14:42

Have you tried the following?

list.FindAll(x => x.Contains("YourTerm")).ToList();

For some reason the List.AsParallel().Where(...) is slower than list.FindAll(...) on my PC.

list.AsParallel().Where(x => x.Contains("YourTerm")).ToList();

Hope this will help you.

Rozellarozelle answered 13/11, 2013 at 14:39 Comment(0)

A trie or suffix tree would help in making this faster - this is essentially what fulltext search (usually) is using.

There are implementations in C# you can use, also see this SO thread: Looking for the suffix tree implementation in C#?

Also as mentioned by @leppie parallel execution will likely be already provide you with the x3 performance gain you are looking for. But then again you will have to measure closely, without that it's anyone's guess.

Blemish answered 29/9, 2011 at 16:28 Comment(1)

Thanks for suggestion. I'll see how much boost parallel execution gives and then see if I need a change to the implementation. – Imposing 30/9, 2011 at 8:40

Have you tried loading your strings into a List<string> and then using the Linq extensions Contains method?

var myList = new List<string>();
//Code to load your list goes here...

var searchTerm = "find this";
var match = myList.Contains(searchTerm);

Swain answered 29/9, 2011 at 16:22 Comment(4)

I think that will only match if one of the strings in the list is exactly what's being searched for. The person asking the question wants to look inside all of those strings as well. – Cherey 29/9, 2011 at 16:29

Yup, to fit the OP, you'd have to do myList.Where(s => s.Contains(searchTerm)). You could throw an AsParallel in there to make the query run in parallel, and you might tweak the parameters to string.Contains if you know that you want an exact match instead of a culture-aware one. Even so, you wouldn't get an order of magnitude of difference in speed. – Hereditament 29/9, 2011 at 16:33

Good suggestion Stripling. The parallel processing might just give me the boost I need. I need to try this. – Imposing 30/9, 2011 at 8:35

Thanks @Paige-Cook, using a list of strings was a great suggestion; it's much faster. The search is now down to about 50ms. I can only accept one answer so I've accepted Stripling because it included the use of Linq to query the list as required in the question. – Imposing 13/10, 2011 at 14:40

public static bool ContainsFast<T>(this IList<T> list, T item)
{
    return list.IndexOf(item) >= 0;
}

Base on tests that I did, this variation of Contains was about 33% faster on my side.

Picofarad answered 29/9, 2011 at 19:19 Comment(3)

It was probably faster only because it was doing an exact comparison rather than a culture-sensitive one. – Hoem 29/9, 2011 at 20:7

It also doesn't check for substrings, which is a requirement of the OP. – Hereditament 30/9, 2011 at 14:7

What is ContainsFast<>? – Lyon 10/1, 2019 at 21:25

According to these benchmarks, the fastest way to check if a string occurs in a string is the following:

for (int x = 0; x < ss.Length; x++)
    for (int y = 0; y < sf.Length; y++
        c[y] += ((ss[x].Length - ss[x].Replace(sf[y], String.Empty).Length) / sf[y].Length > 0 ? 1 : 0);

Thus, you could:

Loop through the list using a Parallel.For construct
Implement the code above to check if a string contains what you're looking for. "SS" is the string[] of strings to search; "SF" is the string[] of strings to search for; c[y] is the total count of each one found.

Obviously you'd have to adapt them to your List[string] (or whatever data structure you're using).

Demoss answered 18/12, 2018 at 8:55 Comment(0)

Recommended topics

Hot tags