Fastest way to search in a string collection

Problem:

I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection.

The search method runs every time the user changes the text of a TextBox, and the result should be the strings that contain the text in the TextBox.

I don't have to change the list, just pull the results and put them in a ListBox.

What I've tried so far:

I tried two different collections/containers, into which I dump the string entries from an external text file (once, of course):

  1. List<string> allUsers;
  2. HashSet<string> allUsers;

With the following LINQ query:

allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();

My search event (fires when the user changes the search text):

private void textBox_search_TextChanged(object sender, EventArgs e)
{
    if (textBox_search.Text.Length > 2)
    {
        listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
    }
    else
    {
        listBox_choices.DataSource = null;
    }
}

Results:

Both gave me a poor response time (around 1-3 seconds between each key press).

Question:

Where do you think my bottleneck is? The collection I've used? The search method? Both?

How can I get better performance and a smoother search experience?

Ethelred answered 27/1, 2014 at 11:7 Comment(15)
HashSet<T> won't help you here, because you're searching for part of a string.Cantilever
You may try a performance check while using parallel linq #9139040Counteroffensive
Moreover you don't need ToList(). At least you do not need to create a true list. I would wrap the IEnumerable within a class that implements only the needed parts of IList.Isom
@Counteroffensive partitioning input and locking output collection may quickly become hard to tune.Isom
I wonder why you have these problems - I tested this on my machine and I'm coming up with results like 30ms (LINQ) or 8ms (using PLINQ)Counteroffensive
Do you actually mean Contains? Or do you mean starts-with? (Or perhaps word-wise starts-with.) If you mean starts-with, a tree-based approach might work really well. In particular a Trie, or a variation thereof. Scrolling down I see Groo suggested this also.Iolanthe
Check out Suffix arrays.Alphabetical
Don't ask "what's the fastest way to", because that would take literally weeks to years of research. Rather, say "I need a solution that runs in less than 30 ms", or whatever your performance goal is. You don't need the fastest device, you need a fast enough device.Equilateral
Also, get a profiler. Don't guess about where the slow part is; such guesses are often wrong. The bottleneck might be somewhere surprising.Equilateral
@EricLippert O(N) searches are always slow on large enough inputs. That's why theory wins. Sometimes.Langille
As @Alphabetical mentioned, there are much better data structures for your use-case, such as a suffix array.Octal
@Langille Actually the bottleneck is the repeated call to textBox_search.Text, which makes native API calls to get the content of the textbox. So practice wins yet again. See my answer below for details.Incipit
@Basilevs: I once wrote a lovely O(1) hash table that was extremely slow in practice. I profiled it to find out why and discovered that on every search it was calling a method that -- no kidding -- ended up asking the registry "are we in Thailand right now?". Not caching whether the user is in Thailand was the bottleneck in that O(1) code. The location of the bottleneck can be deeply counterintuitive. Use a profiler.Equilateral
+100 for profiling. In my current product, we were doing millions of calculations over a list. It was not fast enough and we thought we would need to implement some sort of caching mechanism for our calculations. When we profiled the code, we found that the LINQ Sum method was a hot spot. Diving deeper, we learned that the Sum method was getting called millions of times and was generating GC pressure to dispose of all the enumerators. We added a ListSum method which indexed over the list using a for loop, and we saw a 50% speed bump with reduced memory usage.Metalware
Not in a million years would I have guessed that the LINQ Sum method was slowing us down. Please listen to the experts and do not even attempt to guess which part is slow. Profile the code, look at the hot spot, eliminate it. Keep repeating till you achieve the desired performance. I have used both the ANTS profiler and the dotTrace profiler in production, and I highly recommend dotTrace for the 'subsystems' feature (youtube.com/watch?v=iEU_SU6oV1c). It makes it very easy to identify bottlenecks in your system.Metalware

You could consider doing the filtering task on a background thread which would invoke a callback method when it's done, or simply restart filtering if input is changed.

The general idea is to be able to use it like this:

public partial class YourForm : Form
{
    private readonly BackgroundWordFilter _filter;

    public YourForm()
    {
        InitializeComponent();

        // setup the background worker to return no more than 10 items,
        // and to set ListBox.DataSource when results are ready

        _filter = new BackgroundWordFilter
        (
            items: GetDictionaryItems(),
            maxItemsToMatch: 10,
            callback: results => 
              this.Invoke(new Action(() => listBox_choices.DataSource = results))
        );
    }

    private void textBox_search_TextChanged(object sender, EventArgs e)
    {
        // this will update the background worker's "current entry"
        _filter.SetCurrentEntry(textBox_search.Text);
    }
}

A rough sketch would be something like:

public class BackgroundWordFilter : IDisposable
{
    private readonly List<string> _items;
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);
    private readonly Thread _workerThread;
    private readonly int _maxItemsToMatch;
    private readonly Action<List<string>> _callback;

    private volatile bool _shouldRun = true;
    private volatile string _currentEntry = null;

    public BackgroundWordFilter(
        List<string> items,
        int maxItemsToMatch,
        Action<List<string>> callback)
    {
        _items = items;
        _callback = callback;
        _maxItemsToMatch = maxItemsToMatch;

        // start the long-lived background thread
        _workerThread = new Thread(WorkerLoop)
        {
            IsBackground = true,
            Priority = ThreadPriority.BelowNormal
        };

        _workerThread.Start();
    }

    public void SetCurrentEntry(string currentEntry)
    {
        // set the current entry and signal the worker thread
        _currentEntry = currentEntry;
        _signal.Set();
    }

    void WorkerLoop()
    {
        while (_shouldRun)
        {
            // wait here until there is a new entry
            _signal.WaitOne();
            if (!_shouldRun)
                return;

            var entry = _currentEntry;
            var results = new List<string>();

            // if there is nothing to process,
            // return an empty list
            if (string.IsNullOrEmpty(entry))
            {
                _callback(results);
                continue;
            }

            // do the search in a loop to
            // allow early termination when the current
            // entry is changed on a different thread
            foreach (var i in _items)
            {
                // if matched, add to the list of results
                if (i.Contains(entry))
                    results.Add(i);

                // check if the current entry was updated in the meantime,
                // or we found enough items
                if (entry != _currentEntry || results.Count >= _maxItemsToMatch)
                    break;
            }

            if (entry == _currentEntry)
                _callback(results);
        }
    }

    public void Dispose()
    {
        // we are using AutoResetEvent and a background thread
        // and therefore must dispose it explicitly
        Dispose(true);
    }

    private void Dispose(bool disposing)
    {
        if (!disposing)
            return;

        // shutdown the thread
        if (_workerThread.IsAlive)
        {
            _shouldRun = false;
            _currentEntry = null;
            _signal.Set();
            _workerThread.Join();
        }

        // if targeting .NET 3.5 or older, we have to
        // use the explicit IDisposable implementation
        (_signal as IDisposable).Dispose();
    }
}

Also, you should actually dispose the _filter instance when the parent Form is disposed. This means you should open and edit your Form's Dispose method (inside the YourForm.Designer.cs file) to look something like:

// inside "xxxxxx.Designer.cs"
protected override void Dispose(bool disposing)
{
    if (disposing)
    {
        if (_filter != null)
            _filter.Dispose();

        // this part is added by Visual Studio designer
        if (components != null)
            components.Dispose();
    }

    base.Dispose(disposing);
}

On my machine, it works pretty quickly, so you should test and profile this before going for a more complex solution.

That being said, a "more complex solution" would possibly be to store the last couple of results in a dictionary, and then only filter them if it turns out that the new entry differs by only the first or last character.
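A minimal sketch of that caching idea (the class and member names are mine, purely for illustration): keep recent results keyed by query, and when the new query extends a cached one by a character, filter the cached hits instead of the full list.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: cache results per query so that a query that
// extends (or returns to) a previous one can reuse the earlier, smaller list.
public class CachingFilter
{
    private readonly List<string> _all;
    private readonly Dictionary<string, List<string>> _cache =
        new Dictionary<string, List<string>>();

    public CachingFilter(List<string> allUsers)
    {
        _all = allUsers;
    }

    public List<string> Filter(string query)
    {
        List<string> cached;
        if (_cache.TryGetValue(query, out cached))
            return cached; // e.g. the user just pressed backspace

        // A query extended by one character can only narrow its parent's
        // results, so filter those instead of the whole list when possible.
        List<string> source;
        if (query.Length < 2 ||
            !_cache.TryGetValue(query.Substring(0, query.Length - 1), out source))
        {
            source = _all;
        }

        var results = source.FindAll(s => s.Contains(query));
        _cache[query] = results;
        return results;
    }
}
```

In a real form the cache would need to be bounded (the answer suggests keeping only the last couple of result sets) and cleared if the underlying list ever changes.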

Lampas answered 27/1, 2014 at 12:28 Comment(12)
I just tested your solution and it works perfectly! Nicely done. The only problem I have is that I can't make the _signal.Dispose(); to compile (error about protection-level).Ethelred
@etaiso: that's weird, where exactly are you calling _signal.Dispose()? Is it somewhere outside the BackgroundWordFilter class?Lampas
No. Exactly as in your snippet.Ethelred
That's weird. AutoResetEvent inherits from WaitHandle, which implements IDisposable.Dispose(), meaning that Dispose must be a public method. Is this running on Mono, perhaps? I really haven't encountered anything like that before.Lampas
@Groo It's explicit implementation, meaning you can't call it directly. You're supposed to use either a using block, or call WaitHandle.Close()Yearn
@MatthewWatson: Can you back that up with a link of some sort? MSDN says it's a public method, and I've been using it that way for ages (and it works on my machine for this concrete snippet too). I don't see why they would choose to hide it, and then also provide an explicit implementation.Lampas
@Groo your own link to AutoResetEvent backs it up. Scroll down to "Explicit interface implementations"Decelerate
@AlexG: that only means it provides an explicit implementation. The Dispose method is public. Are you saying that new AutoResetEvent(true).Dispose(); won't compile on your machine?Lampas
Ok, now it makes sense, the method was made public in .NET 4. The MSDN page for .NET 4 lists it under public methods, while the page for .NET 3.5 shows it under protected ones. That also explains why there is a conditional define in the Mono source for WaitHandle.Lampas
@etaiso: ok, it turns out that the method was protected prior to .NET 4 (and I presume you're using .NET 3.5, since you didn't complain about the lambda). This means you can simply change that line to (_signal as IDisposable).Dispose();, unless you decide to switch to .NET 4.Lampas
@Groo Sorry, I should have mentioned I was talking about an older version of .Net - sorry about the confusion! However note that he doesn't need to cast - he can call .Close() instead, which itself calls .Dispose().Yearn
@Matthew: sure, that's an even better idea, I guess that's how they planned for it to be used. I wonder if static code analyzer would complain about pulling that off, though. :-DLampas

I've done some testing, and searching a list of 120,000 items and populating a new list with the matches takes a negligible amount of time (about 1/50th of a second even if all strings are matched).

The problem you're seeing must therefore be coming from the populating of the data source, here:

listBox_choices.DataSource = ...

I suspect you are simply putting too many items into the listbox.

Perhaps you should try limiting it to the first 20 entries, like so:

listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text))
    .Take(20).ToList();

Also note (as others have pointed out) that you are accessing the TextBox.Text property for each item in allUsers. This can easily be fixed as follows:

string target = textBox_search.Text;
listBox_choices.DataSource = allUsers.Where(item => item.Contains(target))
    .Take(20).ToList();

However, I timed how long it takes to access TextBox.Text 500,000 times and it only took 0.7 seconds, far less than the 1 - 3 seconds mentioned in the OP. Still, this is a worthwhile optimisation.

Yearn answered 27/1, 2014 at 11:28 Comment(6)
Thanks Matthew. I tried your solution but I don't think the problem is with the population of the ListBox. I think I need a better approach, as this kind of filtering is very naive (e.g. if searching for "abc" returns 0 results, then I should not even look for "abcX", and so on).Ethelred
@Ethelred right (even if Matthew's solution may work great if you do not really need to present all matches), that's why I suggested as a second step to refine the search instead of performing a full search each time.Isom
@Ethelred Well, the searching time is negligible like I said. I tried it with 120,000 strings and searching for a very long string that gave no matches and a very short string that gave many matches both took under 1/50th second.Yearn
Does textBox_search.Text contribute a measurable amount to the time? Getting the Text property on a text box once for each of 120k strings probably sends 120k messages to the edit control window.Rowlock
@Rowlock Yes it does. See my answer for details.Incipit
@Rowlock Good point! I'll add that to the answer for completeness.Yearn

Use a suffix tree as an index. Or rather, just build a sorted dictionary that associates every suffix of every name with the list of corresponding names.

For input:

Abraham
Barbara
Abram

The structure would look like:

a -> Barbara
abraham -> Abraham
abram -> Abram
aham -> Abraham
am -> Abraham, Abram
ara -> Barbara
arbara -> Barbara
bara -> Barbara
barbara -> Barbara
braham -> Abraham
bram -> Abram
ham -> Abraham
m -> Abraham, Abram
ra -> Barbara
raham -> Abraham
ram -> Abram
rbara -> Barbara

Search algorithm

Assume the user input is "bra".

  1. Bisect the sorted keys on the user input to find the input itself, or the position where it would be inserted (its lower bound). Here the bisection lands just after "barbara", at "braham", the first key not less than "bra". This search takes logarithmic time.
  2. Iterate from the found position onwards while the keys still start with the user input. This gives "braham" -> Abraham and "bram" -> Abram.
  3. Concatenate the iteration results (Abraham, Abram) and output them.

Such structures are designed for quick substring search. Lookup performance is close to O(log n). I believe this approach will be fast enough to be used by the GUI thread directly. Moreover, it will work faster than a threaded solution due to the absence of synchronization overhead.
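A sketch of this structure (class and member names are mine, not from the answer): build the suffix-to-names map once, keep the keys in a sorted array, then bisect with Array.BinarySearch and walk forward while the keys still start with the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: every suffix of every name maps to the names that
// contain it; the sorted key array gives us logarithmic-time bisection.
public class SuffixIndex
{
    private readonly string[] _keys;
    private readonly Dictionary<string, List<string>> _map =
        new Dictionary<string, List<string>>();

    public SuffixIndex(IEnumerable<string> names)
    {
        foreach (var name in names)
        {
            var lower = name.ToLowerInvariant();
            for (int i = 0; i < lower.Length; i++)
            {
                var suffix = lower.Substring(i);
                List<string> list;
                if (!_map.TryGetValue(suffix, out list))
                    _map[suffix] = list = new List<string>();
                list.Add(name);
            }
        }
        _keys = _map.Keys.ToArray();
        Array.Sort(_keys, StringComparer.Ordinal);
    }

    public List<string> Search(string input)
    {
        input = input.ToLowerInvariant();

        // 1. bisect: find the input, or the position where it would go
        int pos = Array.BinarySearch(_keys, input, StringComparer.Ordinal);
        if (pos < 0) pos = ~pos;

        // 2. iterate while the keys still begin with the input
        var results = new HashSet<string>();
        for (int i = pos;
             i < _keys.Length && _keys[i].StartsWith(input, StringComparison.Ordinal);
             i++)
        {
            // 3. concatenate (deduplicated) results
            foreach (var name in _map[_keys[i]])
                results.Add(name);
        }
        return results.ToList();
    }
}
```

The HashSet deduplicates names that match under several suffix keys; note the memory cost is roughly quadratic in the name length, which is fine for short names like these.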

Langille answered 27/1, 2014 at 14:42 Comment(10)
From what I know suffix arrays are usually a better choice than suffix trees. Easier to implement and lower memory usage.Alphabetical
I propose a SortedList, which is very easy to build and maintain at the cost of memory overhead, which can be minimized by providing list capacities.Langille
Also, it seems suffix arrays (and the original suffix tree) are designed to handle large texts, while here we have a large number of short chunks, which is a different task.Langille
+1 for the good approach, but I'd use a hash map or actual search tree rather than manually searching a list.Thomas
Is there any advantage of using suffix tree instead of prefix tree?Cowman
@OrangeDog, how do you use a hash map to search for a prefix? I could not find a collection with built-in bisection in .NET; could you propose one to use as a search tree?Langille
@jnovacho, there is no advantage except for the ability to search for a substringLangille
@Langille you add all the prefixes as hash keys. A trie is better for prefix-search though.Thomas
Linking to all substrings would increase collection size proportionally to the word length (while trie might save up on string length we still have to deal with references). As it is already pretty large I would not try that.Langille
The question is only concerned with access time, not memory usage. A search tree is preferred to a hash map because hash collision is going to erode the O(1) complexity advantage the hash map would otherwise have.Thomas

You need either a text search engine (like Lucene.Net) or a database (you may consider an embedded one like SQL CE, SQLite, etc.). In other words, you need an indexed search. Hash-based search isn't applicable here, because you are searching for a substring, while hash-based search works well for exact values.

Otherwise it will be an iterative search with looping through the collection.

Cantilever answered 27/1, 2014 at 11:25 Comment(4)
Indexing is a hash-based search. You just add all sub-strings as keys instead of just the value.Thomas
@OrangeDog: disagree. indexed search can be implemented as hash-based search by index keys, but it isn't necessary, and it isn't a hash-based search by the string value itself.Cantilever
@Cantilever Agree. +1 to cancel the ghost -1.Disassemble
+1 because implementations like a text search engine have smart(er) optimizations than string.Contains. Ie. searching for ba in bcaaaabaa will result in a (indexed) skip list. The first b is considered, but doesn't match because the next is a c, so it will skip to the next b.Gautious

It might also be useful to have a "debounce" type of event. This differs from throttling in that it waits a period of time (for example, 200 ms) for changes to finish before firing the event.

See Debounce and Throttle: a visual explanation for more information about debouncing. I appreciate that this article is JavaScript focused, instead of C#, but the principle applies.

The advantage of this is that it doesn't search when you're still entering your query. It should then stop trying to perform two searches at once.
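A minimal debounce sketch (the class is mine, purely to illustrate the idea): restart a one-shot timer on every keystroke, and only run the search once the input has been quiet for the whole interval.

```csharp
using System;
using System.Timers;

// Illustrative debounce helper: Trigger() restarts a one-shot timer, so the
// action runs only after no new trigger has arrived for the whole interval.
public class Debouncer
{
    private readonly Timer _timer;
    private Action _pending;

    public Debouncer(double intervalMs)
    {
        _timer = new Timer(intervalMs) { AutoReset = false };
        _timer.Elapsed += (s, e) =>
        {
            var action = _pending;
            if (action != null)
                action();
        };
    }

    public void Trigger(Action action)
    {
        _pending = action;
        _timer.Stop();  // every new keystroke restarts the countdown
        _timer.Start();
    }
}
```

In the TextChanged handler you would call something like `_debouncer.Trigger(() => RunSearch(textBox_search.Text))` (both names are hypothetical). Note that System.Timers.Timer raises Elapsed on a thread-pool thread, so updating the ListBox still requires Invoke; a System.Windows.Forms.Timer, which fires on the UI thread, avoids that.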

Bezel answered 27/1, 2014 at 11:13 Comment(1)
See for a C# implementation of an event throttler the EventThrotler class in the Algorithmia library: github.com/SolutionsDesign/Algorithmia/blob/master/…Obligation

Update:

I did some profiling.

(Update 3)

  • List content: Numbers generated from 0 to 2,499,999
  • Filter text: 123 (20,477 results)
  • Core i5-2500, Win7 64bit, 8GB RAM
  • VS2012 + JetBrains dotTrace

The initial test run for 2,500,000 records took 20,000 ms.

Number one culprit is the call to textBox_search.Text inside Contains. This makes a call for each element to the expensive get_WindowText method of the textbox. Simply changing the code to:

    var text = textBox_search.Text;
    listBox_choices.DataSource = allUsers.Where(item => item.Contains(text)).ToList();

reduced the execution time to 1,858 ms.

Update 2 :

The other two significant bottle-necks are now the call to string.Contains (about 45% of the execution time) and the update of the listbox elements in set_Datasource (30%).

We could make a trade-off between speed and memory usage by creating a suffix tree, as Basilevs has suggested, to reduce the number of necessary comparisons and push some processing time from the search after a key-press to the loading of the names from file, which might be preferable for the user.

To increase the performance of loading the elements into the listbox, I would suggest loading only the first few elements and indicating to the user that there are more available. This way you give the user feedback that results are available, so they can refine their search by entering more letters, or load the complete list at the press of a button.

Using BeginUpdate and EndUpdate made no change in the execution time of set_Datasource.

As others have noted here, the LINQ query itself runs quite fast. I believe your bottle-neck is the updating of the listbox itself. You could try something like:

if (textBox_search.Text.Length > 2)
{
    listBox_choices.BeginUpdate(); 
    listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
    listBox_choices.EndUpdate(); 
}

I hope this helps.

Incipit answered 27/1, 2014 at 13:52 Comment(4)
I don't think this will improve anything, as BeginUpdate and EndUpdate are intended to be used when adding items individually or when using AddRange().Ethelred
It depends on how the DataSource property is implemented. It might be worth a try.Incipit
Your profiling results are very different from mine. I was able to search 120k strings in 30ms, but adding them to the listbox took 4500ms. It sounds like you're adding 2.5M strings to the listbox in under 600ms. How is that possible?Rowlock
@Rowlock While profiling I used an input where the filter text eliminated a big part of the original list. If I use an input where the filter text removes nothing from the list I get similar results to yours. I will update my response to clarify what I have measured.Incipit

Run the search on another thread, and show some loading animation or a progress bar while that thread is running.

You may also try to parallelize the LINQ query.

var queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList();

Here is a benchmark that demonstrates the performance advantages of AsParallel():

{
    IEnumerable<string> queryResults;
    bool useParallel = true;

    var strings = new List<string>();

    for (int i = 0; i < 2500000; i++)
        strings.Add(i.ToString());

    var stp = new Stopwatch();

    stp.Start();

    if (useParallel)
        queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList();
    else
        queryResults = strings.Where(item => item.Contains("1")).ToList();

    stp.Stop();

    Console.WriteLine("useParallel: {0}\r\nTime Elapsed: {1}", useParallel, stp.ElapsedMilliseconds);
}
Nakano answered 27/1, 2014 at 11:9 Comment(14)
I know it's a possibility. But my question here is whether and how I can shorten this process.Ethelred
@Ethelred it shouldn't really be a problem unless you're developing on some really low-end hardware; make sure you're not running under the debugger (Ctrl+F5)Nakano
Are you sure the parallel overhead isn't much higher than String.Contains() itself? A parallel execution is possible but not so trivial.Isom
This is not a good candidate for PLINQ since the method String.Contains is not expensive. msdn.microsoft.com/en-us/library/dd997399.aspxCytogenetics
@TimSchmelter when we're talking about tons of strings, it is!Nakano
@TimSchmelter I'm not saying that it will change the implementation of that method, but we're talking about an IEnumerable with lots of strings...Nakano
@TimSchmelter I have no idea what you're trying to prove, using the code I provided will most likely increase the performance for the OP, and here's a benchmark that demonstrates how it works: pastebin.com/ATYa2BGt --- Period ---Nakano
@animaonline: I stand corrected. AsParallel seems to be really faster with those sample strings, although that contradicts MSDN.Cytogenetics
It does not contradict MSDN. The main factor in using PLINQ is "the cost of each delegate multiplied by the number of elements in the source collection". Processing 120k elements with string comparison is definitely a big enough task to justify parallelization.Damselfly
@Damselfly please... let's end this conversation, we got it all figured out XDNakano
It is important to know not only that it is faster, but most of all WHY is it faster. EOT.Damselfly
@Damselfly just FYI, I'm not here to discuss when it's better to make something parallel, tested in release (animaonline benchmark code) and parallel version is slightly slower. Tested in debug and parallel version is pretty slower.Isom
@Adriano don't forget that it also depends on your hardware, number of cores and the available memory, etc.Nakano
@Nakano yes, that's my point. Performance are pretty similar (in my case little bit worse) but very hw-dependent so I wouldn't suggest to make it parallel as solution because of that.Isom

Assuming you are only matching by prefixes, the data structure you are looking for is called a trie, also known as a "prefix tree". The IEnumerable.Where method that you're using now has to iterate through all items in your list on each access.

This thread shows how to create a trie in C#.
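For illustration, a minimal trie sketch (not the implementation from the linked thread; names are mine). Each node stores the words that pass through it, so a prefix lookup is a single walk down the tree:

```csharp
using System;
using System.Collections.Generic;

// Minimal prefix-tree sketch: every node remembers the words below it,
// so matching a prefix takes exactly prefix.Length dictionary lookups.
public class Trie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<string> Words = new List<string>();
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        var node = _root;
        foreach (var c in word.ToLowerInvariant())
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new Node();
            node = child;
            node.Words.Add(word); // every node collects the words below it
        }
    }

    public List<string> FindByPrefix(string prefix)
    {
        var node = _root;
        foreach (var c in prefix.ToLowerInvariant())
            if (!node.Children.TryGetValue(c, out node))
                return new List<string>();
        return node.Words;
    }
}
```

Storing the word list in every node trades memory for lookup speed; the alternative is to store words only at terminal nodes and collect matches by walking the subtree.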

Lampas answered 27/1, 2014 at 11:24 Comment(3)
Assuming he's filtering his records with a prefix.Damselfly
Notice he's using the String.Contains() method instead of String.StartsWith(), so it may not be exactly what we're looking for. Still, your idea is doubtlessly better than ordinary filtering with the StartsWith() extension in a prefix scenario.Damselfly
If he does mean starts with, then the Trie can be combined with the background worker approach for improved performanceIolanthe

The WinForms ListBox control really is your enemy here. It will be slow to load the records, and the scroll bar will fight you to show all 120,000 records.

Try using an old-fashioned DataGridView data-sourced to a DataTable with a single column [UserName] to hold your data:

private DataTable dt;

public Form1() {
  InitializeComponent();

  dt = new DataTable();
  dt.Columns.Add("UserName");
  for (int i = 0; i < 120000; ++i){
    DataRow dr = dt.NewRow();
    dr[0] = "user" + i.ToString();
    dt.Rows.Add(dr);
  }
  dgv.AutoSizeColumnsMode = DataGridViewAutoSizeColumnsMode.Fill;
  dgv.AllowUserToAddRows = false;
  dgv.AllowUserToDeleteRows = false;
  dgv.RowHeadersVisible = false;
  dgv.DataSource = dt;
}

Then use a DataView in the TextChanged event of your TextBox to filter the data:

private void textBox1_TextChanged(object sender, EventArgs e) {
  DataView dv = new DataView(dt);
  dv.RowFilter = string.Format("[UserName] LIKE '%{0}%'", textBox1.Text);
  dgv.DataSource = dv;
}
Gobang answered 27/1, 2014 at 21:26 Comment(2)
+1 while everybody else was trying to optimize the search that only takes 30ms, you're the only person who recognized that the problem is actually in filling the list box.Rowlock
Depends on whether you need all the info in the listbox? A listbox should only have, at most, a few dozen items in it. Groo's implementation limits the output so it's not a problem. I found this thread because I was implementing the exact same strategy with a list of about 50,000. Groo's suggestion works perfect as long as you don't need 120,000 items in a listbox.Jabe

First I would change how ListControl sees your data source: you're converting the resulting IEnumerable<string> to List<string>. Especially when the user has just typed a few characters, this may be inefficient (and unneeded). Do not make expensive copies of your data.

  • I would wrap the .Where() result in a collection that implements only what is required from IList (search). This will save you from creating a new big list each time a character is typed.
  • As an alternative, I would avoid LINQ and write something more specific (and optimized). Keep your list in memory and build an array of matched indices; reuse the array so you do not have to reallocate it for each search.

The second step is to not search the big list when a small one is enough. When the user has typed "ab" and then adds "c", you do not need to search the big list again; searching the filtered list is enough (and faster). Refine the search whenever possible; do not perform a full search each time.
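This refinement step can be sketched like this (class and member names are mine): if the new query extends the previous one, every match must already be among the previous results, so only those need re-checking. This holds for Contains because any string containing "abc" also contains "ab".

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch: narrow the previous result set instead of re-scanning
// the full list when the new query merely extends the previous one.
public class RefiningFilter
{
    private readonly List<string> _all;
    private string _lastQuery = "";
    private List<string> _lastResults;

    public RefiningFilter(List<string> allUsers)
    {
        _all = allUsers;
    }

    public List<string> Filter(string query)
    {
        // extending the query can only shrink the previous result set,
        // so search that smaller list when possible
        var source = (_lastResults != null &&
                      _lastQuery.Length > 0 &&
                      query.StartsWith(_lastQuery))
            ? _lastResults
            : _all;

        var results = source.FindAll(item => item.Contains(query));

        _lastQuery = query;
        _lastResults = results;
        return results;
    }
}
```

A backspace falls back to the full list here; combining this with a small cache of recent results (as Groo's answer suggests) would cover that case too.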

The third step may be harder: keep the data organized for quick search. Now you have to change the structure you use to store your data. Imagine a tree like this:

A        B         C
 Add      Better    Ceil
 Above    Bone      Contour

This may simply be implemented with an array (if you're working with ANSI names; otherwise a dictionary would be better). Build the list like this (for illustration purposes, it matches the beginning of the string):

var dictionary = new Dictionary<char, List<string>>();
foreach (var user in users)
{
    char letter = user[0];
    if (dictionary.ContainsKey(letter))
        dictionary[letter].Add(user);
    else
    {
        var newList = new List<string>();
        newList.Add(user);
        dictionary.Add(letter, newList);
    }
}

Search will be then done using first character:

char letter = textBox_search.Text[0];
if (dictionary.ContainsKey(letter))
{
    listBox_choices.DataSource =
        new MyListWrapper(dictionary[letter].Where(x => x.Contains(textBox_search.Text)));
}

Please note I used MyListWrapper() as suggested in the first step (but I omitted my 2nd suggestion for brevity; if you choose the right size for the dictionary key you may keep each list short and fast enough to - maybe - avoid anything else). Moreover, note that you may try using the first two characters for your dictionary (more lists, and shorter ones). If you extend this you'll have a tree (but I don't think you have such a big number of items).

There are many different algorithms for string searching (with related data structures), just to mention few:

  • Finite state automaton based search: in this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes stored search string. These are expensive to construct—they are usually created using the powerset construction—but are very quick to use.
  • Stubs: Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix, Boyer–Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza–Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates' approach.
  • Index methods: faster search algorithms are based on preprocessing of the text. After building a substring index, for example a suffix tree or suffix array, the occurrences of a pattern can be found quickly.
  • Other variants: some search methods, for instance trigram search, are intended to find a "closeness" score between the search string and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.

A few words about parallel search. It's possible, but it's seldom trivial, because the overhead of making it parallel can easily be much higher than the search itself. I wouldn't perform the search itself in parallel (partitioning and synchronization quickly become too expensive and maybe complex), but I would move the search to a separate thread. If the main thread isn't busy, your users won't feel any delay while they're typing (they won't notice if the list appears after 200 ms, but they'll feel uncomfortable if they have to wait 50 ms after they type). Of course the search itself must be fast enough; in this case you don't use threads to speed up the search but to keep your UI responsive. Please note that a separate thread will not make your query faster; it won't hang the UI, but if your query is slow it'll still be slow on a separate thread (moreover, you have to handle multiple sequential requests too).

Isom answered 27/1, 2014 at 11:22 Comment(5)
As some have already pointed out, the OP does not want to limit the results to prefixes only (i.e. he uses Contains, not StartsWith). On a side note, it's usually better to use the generic ContainsKey method when searching for a key to avoid boxing, and even better to use TryGetValue to avoid two lookups.Lampas
@Groo you're right; as I said, it's for illustration purposes only. The point of that code isn't a working solution but a hint: if you've tried everything else - avoid copies, refine the search, move it to another thread - and it isn't enough, then you have to change the data structure you're using. The example matches the beginning of the string just to stay simple.Isom
@Adriano thanks for the clear and detailed answer! I agree with most of the things you mentioned, but as Groo said, the last part about keeping the data organized is not applicable in my case. Still, I'm thinking of holding a similar dictionary with the contained letters as keys (though there will still be duplicates)Ethelred
after a quick check and calculation, the "contained letter" idea is not good for only one character (and if we go for combinations of two or more letters we will end up with a very large hash table)Ethelred
@Ethelred yes, you may keep a list of two-letter pairs (to quickly reduce the sub-lists) but a true tree may work better (each letter is linked to its successors, no matter where it is inside the string, so for "HOME" you have "H->O", "O->M" and "M->E"; if you're searching for "om" you'll quickly find it). The problem is that it becomes quite a bit more complicated, and it may be too much for your scenario (IMO).Isom
F
4

You could try using PLINQ (Parallel LINQ). This does not guarantee a speed boost, though; you need to find that out by trial and error.
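For reference, a sketch of the same query parallelized with `AsParallel()` (the sample list and `searchText` stand in for the OP's data; `AsParallel` requires .NET 4.0 or later):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // Sample data standing in for the 120,000-user list.
        List<string> allUsers = new List<string> { "alice", "malice", "bob", "alicia" };
        string searchText = "alic";

        // Same query shape as the original, parallelized with AsParallel().
        // Note: PLINQ does not guarantee result order unless AsOrdered() is used.
        List<string> matches = allUsers
            .AsParallel()
            .Where(item => item.Contains(searchText))
            .ToList();

        Console.WriteLine(matches.Count);
    }
}
```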

Fractostratus answered 27/1, 2014 at 11:18 Comment(0)
D
4

I doubt you'll be able to make it faster, but for sure you should:

a) Use the AsParallel LINQ extension method

b) Use some kind of timer to delay filtering

c) Put the filtering method on another thread

Keep a string previousTextBoxValue somewhere. Make a timer with a delay of 1000 ms that fires the search on tick if previousTextBoxValue is the same as your textBox.Text value; if not, reassign previousTextBoxValue to the current value and reset the timer. Start the timer from the textbox's TextChanged event, and it'll make your application smoother. Filtering 120,000 records in 1-3 seconds is OK, but your UI must remain responsive.
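The timer pattern above can be sketched without UI dependencies (the `Debouncer` class is an illustrative assumption; in WinForms you would use `System.Windows.Forms.Timer` instead so the tick runs on the UI thread):

```csharp
using System;
using System.Threading;

// Signal() stands in for TextChanged: every call restarts the countdown,
// so the action runs only after `delayMs` milliseconds of quiet.
class Debouncer
{
    private readonly Timer timer;
    private readonly int delayMs;

    public Debouncer(int delayMs, Action action)
    {
        this.delayMs = delayMs;
        // One-shot timer, initially disarmed.
        timer = new Timer(_ => action(), null, Timeout.Infinite, Timeout.Infinite);
    }

    public void Signal()
    {
        // Restart the countdown; only the last signal in a burst fires.
        timer.Change(delayMs, Timeout.Infinite);
    }
}
```

With this in place a fast typist triggers one filter run per pause instead of one per keystroke.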

Damselfly answered 27/1, 2014 at 11:20 Comment(2)
I don't agree to make it parallel but I absolutely agree with the other two points. It may even be enough to satisfy UI requirements.Isom
Forgot to mention that but I'm using .NET 3.5 so AsParallel is not an option.Ethelred
G
3

You can also try using the BindingSource.Filter property. I have used it and it works like a charm to filter a bunch of records; just update this property with the search text every time it changes. Another option would be to use the AutoCompleteSource property of the TextBox control.
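A sketch of how the Filter approach might look (member names taken from the question; the DataTable wrapper is an assumption — BindingSource.Filter only takes effect when the underlying source supports filtering via IBindingListView, such as a DataTable, not a plain List&lt;string&gt;):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Windows.Forms;

public partial class SearchForm : Form
{
    private readonly BindingSource binding = new BindingSource();

    private void LoadUsers(IEnumerable<string> allUsers)
    {
        // Wrap the strings in a DataTable so the BindingSource can filter.
        DataTable table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        foreach (string user in allUsers)
            table.Rows.Add(user);

        binding.DataSource = table;
        listBox_choices.DataSource = binding;
        listBox_choices.DisplayMember = "Name";
    }

    private void textBox_search_TextChanged(object sender, EventArgs e)
    {
        // LIKE with leading/trailing wildcards gives Contains-style matching.
        binding.Filter = string.Format("Name LIKE '%{0}%'",
            textBox_search.Text.Replace("'", "''")); // escape single quotes
    }
}
```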

Hope it helps!

Glyconeogenesis answered 28/1, 2014 at 7:54 Comment(0)
S
2

I would try sorting the collection, matching only the start of each string, and limiting the number of results.

So, on initialization:

allUsers.Sort();

and search with:

allUsers.Where(item => item.StartsWith(textBox_search.Text))

Maybe you can add some caching.
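A sketch of this approach (the `PrefixSearch` helper is illustrative, not the OP's code; note, as discussed elsewhere in the thread, that it finds prefix matches only, unlike Contains). After sorting, all strings starting with a given prefix form a contiguous range, so a binary search can locate its start in O(log n) and the result count can be capped:

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static List<string> PrefixSearch(List<string> sorted, string prefix, int maxResults)
    {
        List<string> results = new List<string>();
        int index = sorted.BinarySearch(prefix, StringComparer.Ordinal);
        if (index < 0) index = ~index; // bitwise complement: first element >= prefix

        // Walk the contiguous run of matches, capped at maxResults.
        for (int i = index; i < sorted.Count && results.Count < maxResults; i++)
        {
            if (!sorted[i].StartsWith(prefix, StringComparison.Ordinal)) break;
            results.Add(sorted[i]);
        }
        return results;
    }

    static void Main()
    {
        List<string> allUsers = new List<string> { "carol", "bobby", "alice", "bob" };
        allUsers.Sort(StringComparer.Ordinal); // once, at initialization

        Console.WriteLine(string.Join(", ", PrefixSearch(allUsers, "bob", 10).ToArray()));
    }
}
```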

Sweetmeat answered 27/1, 2014 at 16:3 Comment(2)
He's not matching only the beginning of the string (that's why he's using String.Contains()). With Contains() a sorted list doesn't change performance.Isom
Yes, with 'Contains' it's useless. I like the suggestion with the suffix tree https://mcmap.net/q/258243/-fastest-way-to-search-in-a-string-collection There are a lot of interesting answers in this thread, but it depends how much time he can spend on this task.Sweetmeat
W
1

Use Parallel LINQ. PLINQ is a parallel implementation of LINQ to Objects. PLINQ implements the full set of LINQ standard query operators as extension methods in the System.Linq namespace and has additional operators for parallel operations. PLINQ combines the simplicity and readability of LINQ syntax with the power of parallel programming. Just like code that targets the Task Parallel Library, PLINQ queries scale in the degree of concurrency based on the capabilities of the host computer.

Introduction to PLINQ

Understanding Speedup in PLINQ

Also you can use Lucene.Net

Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users. The Lucene search library is based on an inverted index.

Wattmeter answered 27/1, 2014 at 16:56 Comment(0)
F
1

From what I have seen, I agree with the suggestion to sort the list.

However, sorting the list after it has been constructed will be very slow; sort while building it and you will get a better execution time.

Otherwise, if you don't need to display the list or keep its order, use a hashmap.

The hashmap will hash your string and look it up at the exact offset. It should be faster, I think.

Fantasy answered 28/1, 2014 at 4:14 Comment(3)
A hashmap with what key? I want to be able to find keywords that are contained in the strings.Ethelred
For the key you can put the number in the list; if you want something more complicated you can use the number plus the name, the choice is yours.Fantasy
As for the rest, either I didn't read everything or there was a bad explanation (probably both ;) ) [quote] I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection. [/quote] I thought it was just a string search.Fantasy
S
1

Try using the BinarySearch method; it should work faster than the Contains method.

Contains is O(n), while BinarySearch is O(log n).

I think a sorted collection should search faster and add new elements more slowly but, as I understand it, you only have a search performance problem.
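A minimal sketch of the suggestion (the sample data is illustrative). One caveat worth stating plainly: List&lt;T&gt;.BinarySearch requires a sorted list and finds exact matches only, so on its own it cannot replace Contains for substring queries:

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        List<string> allUsers = new List<string> { "carol", "alice", "bob" };
        allUsers.Sort(StringComparer.Ordinal); // BinarySearch needs sorted input

        // O(log n), but an exact-match lookup only.
        int index = allUsers.BinarySearch("bob", StringComparer.Ordinal);
        Console.WriteLine(index >= 0); // non-negative index means the string exists
    }
}
```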

Schoenfelder answered 29/1, 2014 at 8:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.