High performance "contains" search in list of strings in C#
Asked Answered
I

6

16

I have a list of approx. 500,000 strings, each approx. 100 characters long. Given a search term, I want to identify all strings in the list that contain the search term. At the moment I am doing this with a plain old dataset using the Select method ("MATCH %term%"). This takes about 600ms on my laptop. I'd like to make it faster, maybe 100-200ms.

What would be a recommended approach?

Performance is critical so I can trade memory footprint for better performance if necessary (within reason). The list of strings will not change once initialised so calculating hashes would also be an option.

Does anyone have a recommendation and which C# data structures are best suited to the task?

Imposing answered 29/9, 2011 at 16:16 Comment(9)
This task seems quite suited for parallel execution. How many CPU cores you got? That plus Boyer Moore should give good enough improvement.Kriss
Have you tried loading into a database and let the database engine do what its good at doing?Virginity
Are you looking for words or arbitrary substrings?Hoem
MongusPong: I don't have a database and I don't think this would be any quicker. A DB could not take advantage of indexing because of the "LIKE %...%" clause. I'm looking for a tailor-made optimization. Gabe: Arbitrary substrings. If I were looking for words I could tokenize the strings and build my own index.Imposing
@Kriss Why BM? There are multiple string search algorithms (Aho-Corasick, Wu-Manber) or tries.Amperage
@KonradRudolph: It would be simple, and only a one time setup (or any of the other classic string search algorithms). I dont know the others you mention, but tries would certainly be more inefficient ito complexity and that it is not really suited for a single string search.Kriss
Protip (IIRC, .NET 1.1): A Regex will attempt a Boyer Moore search for simple patterns, eg "foobar".Kriss
@Kriss Sorry, I mistakenly assumed that the OP needed a multi-pattern search.Amperage
Can you share your final option? @nrj101 ?Laminous
H
19

I've heard good things about Lucene.NET when it comes to performing quick full-text searches. They've done the work to figure out the fastest data structures and such to use. I'd suggest giving that a shot.

Otherwise, you might just try something like this:

var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();

But it probably won't get you down to 100ms.

Hereditament answered 29/9, 2011 at 16:23 Comment(3)
Thanks for suggestion. Lucene seems like quite a big overhead (in terms of framework) for just this one situation but I could look at it. Do you know if it handles arbitrary substrings? S free-text search will not help if it tokenizes the strings into words.Imposing
@njreed.myopenid.com: I updated my answer based on my comments on another question. Given the requirements, though, you'll likely need an algorithm change to get the performance boost you're looking for.Hereditament
Thanks @StriplingWarrior. The search is now down to about 50ms. I don't need to implement any special algorithm. The combination of using Linq with a list of strings was enough. Guess the dataset was brining more overhead with it than I thought.Imposing
R
4

Have you tried the following?

list.FindAll(x => x.Contains("YourTerm")).ToList();

For some reason the List.AsParallel().Where(...) is slower than list.FindAll(...) on my PC.

list.AsParallel().Where(x => x.Contains("YourTerm")).ToList();

Hope this will help you.

Rozellarozelle answered 13/11, 2013 at 14:39 Comment(0)
B
2

A trie or suffix tree would help in making this faster - this is essentially what fulltext search (usually) is using.

There are implementations in C# you can use, also see this SO thread: Looking for the suffix tree implementation in C#?

Also as mentioned by @leppie parallel execution will likely be already provide you with the x3 performance gain you are looking for. But then again you will have to measure closely, without that it's anyone's guess.

Blemish answered 29/9, 2011 at 16:28 Comment(1)
Thanks for suggestion. I'll see how much boost parallel execution gives and then see if I need a change to the implementation.Imposing
S
1

Have you tried loading your strings into a List<string> and then using the Linq extensions Contains method?

var myList = new List<string>();
//Code to load your list goes here...

var searchTerm = "find this";
var match = myList.Contains(searchTerm);
Swain answered 29/9, 2011 at 16:22 Comment(4)
I think that will only match if one of the strings in the list is exactly what's being searched for. The person asking the question wants to look inside all of those strings as well.Cherey
Yup, to fit the OP, you'd have to do myList.Where(s => s.Contains(searchTerm)). You could throw an AsParallel in there to make the query run in parallel, and you might tweak the parameters to string.Contains if you know that you want an exact match instead of a culture-aware one. Even so, you wouldn't get an order of magnitude of difference in speed.Hereditament
Good suggestion Stripling. The parallel processing might just give me the boost I need. I need to try this.Imposing
Thanks @Paige-Cook, using a list of strings was a great suggestion; it's much faster. The search is now down to about 50ms. I can only accept one answer so I've accepted Stripling because it included the use of Linq to query the list as required in the question.Imposing
P
1
public static bool ContainsFast<T>(this IList<T> list, T item)
{
    return list.IndexOf(item) >= 0;
}

Base on tests that I did, this variation of Contains was about 33% faster on my side.

Picofarad answered 29/9, 2011 at 19:19 Comment(3)
It was probably faster only because it was doing an exact comparison rather than a culture-sensitive one.Hoem
It also doesn't check for substrings, which is a requirement of the OP.Hereditament
What is ContainsFast<>?Lyon
D
0

According to these benchmarks, the fastest way to check if a string occurs in a string is the following:

for (int x = 0; x < ss.Length; x++)
    for (int y = 0; y < sf.Length; y++
        c[y] += ((ss[x].Length - ss[x].Replace(sf[y], String.Empty).Length) / sf[y].Length > 0 ? 1 : 0);

Thus, you could:

  1. Loop through the list using a Parallel.For construct
  2. Implement the code above to check if a string contains what you're looking for. "SS" is the string[] of strings to search; "SF" is the string[] of strings to search for; c[y] is the total count of each one found.

Obviously you'd have to adapt them to your List[string] (or whatever data structure you're using).

Demoss answered 18/12, 2018 at 8:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.