Search a string as you type the character
Asked Answered
S

3

11

I have contacts stored in my mobile. Lets say my contacts are

Ram

Hello

Hi

Feat

Eat

At

When I type letter 'A' I should get all the matching contacts say "Ram, Feat, Eat, At".

Now I type one more letter T. Now my total string is "AT" now my program should reuse the results of previous search for "A". Now it should return me "Feat, Eat, At"

Design and develop a program for this.

This is interview question at Samsung mobile development

I tried solving with Trie data structures. Could not get good solution for reusing already searched string results. I also tried solution with dictionary data structure, solution has same disadvantage as Trie.

question is how do I search the contacts for each letter typed reusing the search results of earlier searched string? What data structure and algorithm should be used for efficiently solving the problem.

I am not asking for program. So programming language is immaterial for me.

State machine appears to be good solution. Does anyone have suggestion?

Solution should be fast enough for million contacts.

Superheterodyne answered 17/7, 2013 at 15:11 Comment(0)
M
17

It kind of depends on how many items you're searching. If it's a relatively small list, you can do a string.contains check on everything. So when the user types "A", you search the entire list:

for each contact in contacts
    if contact.Name.Contains("A")
        Add contact to results

Then the user types "T", and you sequentially search the previous returned results:

for each contact in results
    if contact.Name.Contains("AT")
        Add contact to new search results

Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.

If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.

You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.

Update after edit to question

If you want to do this with a million contacts, sequential search might not be the best way to go at the start. Although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action. And you can almost certainly to that before the user presses the second key. Especially if you have a background thread doing the search while the foreground thread handles input and writing the first page of matched strings to the display.

Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:

r - ram
a - ram, feat, eat, a
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.

I think you get the idea.

That bigram index gets stored in a dictionary or hash map. There are only 325 possible bigrams in the English language, and of course the 26 letters, so at most your dictionary is going to have 351 entries.

So you have almost instant lookup of 1- and 2-character names. How does this help you?

An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.

The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first, anyway. This is going to cost you an extra 2 million references to the names, in the worst case. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to lookup and display the one-character search results, but not enough to be noticeable by the user.

This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.

Mantilla answered 17/7, 2013 at 15:17 Comment(0)
O
7

Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size this data structure gives asymptotically optimal solution to find all z matches of a substring of length m in a set of strings in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. Also the structure has optimal O(n) space and build time where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably shouldn't have too many matches per contact that have a match, so this may not even be necessary.

Oleaceous answered 17/7, 2013 at 15:40 Comment(6)
I like this idea for large set of names. But, I have one question. A suffix tree would help identify substrings. Do we need to map the sub-strings to the original words that generated them? I.e. in the example the user gave, it would seem that we need to generate full names and not suffixes. Right?Bani
My understanding is that the suffix tree finds all the locations of the matches (locations in the concatenation of contacts where each contact ends with a unique symbol), so these locations can be mapped back easily based on the starting position of each contact in the concatenation to figure out which strings contain matches.Oleaceous
Or maybe figuring out which string is already taken care of by the classical generalized suffix tree query algorithm to find the matches in a set of strings. That seems most likely, but I'd need to go back and read the papers to be sure.Oleaceous
Jon Bentley had a nice article in Dr. Dobb's Journal in April 2001 that described a time and space efficient suffix array (which is very similar in concept to a suffix tree). I can't find the DDJ link right now, but this seems to be a copy of the article: collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/…Davisson
Ah, here's the article on the DDJ site: drdobbs.com/architecture-and-design/algorithm-alley/…Davisson
Yeah my understanding is that suffix arrays can be used to save some storage without sacrificing asymptotic performance, benefiting from some extra trickery.Oleaceous
S
1

What I'm thinking to do is, keeping track of the so far matched string. Suppose in the first step, we identify the strings those have "A" in them and we keep trace of the positions of 'A". Then in the next step we only iterate over these strings and instead of searching them full we only check for occurrence of "T" as the next character to "A" we kept trace in the previous step and so on.

Slab answered 17/7, 2013 at 15:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.