Database/datasource optimized for string matching?

I want to store a large number (~thousands) of strings and be able to perform matches against them using wildcards.

For example, here is a sample content:

  • Folder1
  • Folder1/Folder2
  • Folder1/*
  • Folder1/Folder2/Folder3
  • Folder2/Folder*
  • */Folder4
  • */Fo*4

(each line has additional data too, like tags, but the matching is only against that key)

Here is an example of what I would like to match against the data:

  • Folder1
  • Folder1/Folder2/Folder3
  • Folder3

(* being a wildcard here, it can be a different character)

I naively considered storing them in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes only help for characters to the left of the wildcard, and in my case the wildcard can be anywhere (e.g. %/Folder3).

So I'm looking for a fast solution that can be used from PHP. And I am open: it can be a separate server, a PHP library matching files with regexes, ...

Phreno answered 22/2, 2013 at 12:55 Comment(15)
This isn't really an answer, so I will post it as a comment. You might want to look into the glob function provided by PHP to search directories: php.net/manual/en/function.glob.php. One of the comments on that page notes that you might run out of memory if you try to search over 500,000 files.Grettagreuze
@Grettagreuze even though my example suggests so, it is not about "real" files; that was just an example.Phreno
How about some type of caching solution, that caches search results for specific search criteria? Would be slow on the first lookup, but subsequent lookups would be lightning fast.Trigraph
@Trigraph yes, that's a good idea, but I believe it wouldn't be sufficient by itself: the strings matched against the database vary a lot and I expect many requests, so cache invalidation would be very frequent (unless the cache is very big).Phreno
Sorry everybody, I made a mistake: I inverted the data and the strings to match against it (the wildcards are actually in the data). I edited the question, but I believe it doesn't change the problem much.Phreno
Are there any limitations that you could enforce that would make it easier to index? It's not feasible to build an index that would provide a direct hit for any wildcard, but you could build a partial one that trimmed down the amount of data, and then scan the rest in PHP or a standard SQL query.Levan
If there are only thousands of strings, not hundreds of thousands or millions, you should seriously consider just using MySQL (you can force it to load a table from disk into memory on startup). Even if the indexes aren't ideal, a natively compiled C program like MySQL can iterate through in-memory strings so incredibly fast that it will probably far exceed your performance requirements.Roussillon
@rambocoder interesting, that would be an easy solution. Thousands of strings in memory isn't that much?Phreno
@Matthieu It depends on the length of the strings of course.Trigraph
Food for thought: 10,000 strings x 50 bytes per string = half a MB. Of course the index adds overhead, plus some internal MySQL data structures, but my point is it's going to be maybe a few MB. Who cares? It's tiny.Roussillon
Did you consider full-text search in natural-language mode, since you do not need Boolean operators? dev.mysql.com/doc/refman/5.1/en/fulltext-natural-language.htmlThong
@Thong full-text search will return results that don't match my pattern 100% (because that's natural-language search). I don't think it is appropriate here.Phreno
You are right, sorry... Another idea: what about Oracle and function-based indexes (I don't know if MySQL supports this feature)? You can create a function which uses a regular expression or a custom search.Thong
Oh, you store the wildcard strings in your database, and your "search" strings are the ones without wildcards? Then my answer is not suitable, as I thought you had a lot of strings stored and matched them against wildcard search strings... :-/Floppy
If full-text search doesn't return your desired results, you can maintain another non-transactional table using a trigger or database scheduler so that your full-text search can return 100% of the desired results. I believe it will take less programming effort.Agretha

Have you considered using MySQL's regular expression engine? Try something like this:

SELECT * FROM your_table WHERE your_query_string REGEXP pattern_column

This will return rows whose regex keys match your query string. I expect it will perform better than pulling all of the data with a query and doing the matching in PHP.

More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
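One wrinkle: the stored keys use `*` as the wildcard, while REGEXP expects regex syntax, so each key would need converting before this query can work. A sketch of that conversion (in Python for illustration; the wildcard character and anchoring are assumptions based on the question):

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Convert a stored wildcard key like 'Folder1/*' into an anchored
    regex suitable for MySQL's REGEXP operator."""
    parts = pattern.split("*")                  # split on the wildcard
    escaped = [re.escape(p) for p in parts]     # escape regex metacharacters
    return "^" + ".*".join(escaped) + "$"       # '*' matches any run of chars

print(wildcard_to_regex("Folder1/*"))   # ^Folder1/.*$
print(wildcard_to_regex("*/Fo*4"))      # ^.*/Fo.*4$
```

The converted patterns could be stored in `pattern_column` once, at insert time, so the query itself stays unchanged.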

Haemolysis answered 14/3, 2013 at 4:49 Comment(3)
Yes, I have considered it, but MySQL can't use an index for such operations, so it will do a full scan. I never considered loading the whole table into PHP, however. But I believe a MongoDB query of the same type would be more efficient: it will still do a full scan (no index can be used), but it will perform the matching in parallel (multithreading).Phreno
Can you help us to understand what your ultimate goal is? It looks like you have a dynamic tree data structure and rules functionality that applies rules to elements that are positioned in the tree according to a pattern. Maybe it would be faster to run all of the rules against the data structure rather than running each element of the data structure against all of the rules? You could store the rules in the data structure so that rules that only apply to one branch only exist in that branch. I'm speculating at this point, but I would be interested in any additional info you can provide.Haemolysis
Well I'm building an ACL system using pattern matching: you declare access rules (like User-1 can access Category-1/Article-*). Then you can test "Can User-1 access Category-1/Article-19?"Phreno

You might want to use a multicore approach to run that search in a fraction of the time. For search and matching I would recommend FPGAs, but that's probably the hardest way to do it. Consider THIS ARTICLE using CUDA: you can run such searches 16x faster than usual. On multicore CPU systems you can use POSIX threads or a cluster of computers (MPI, for example), or you can call a Gearman service to run the searches using advanced algorithms.Ashcan

Ashcan answered 26/2, 2013 at 1:20 Comment(1)
Won't there be a lot of overhead in using something like Gearman and multiple processes? I want to query the database a lot (like a MySQL database).Phreno

Were it me, I'd store the key field twice: once forward and once reversed (see MySQL's REVERSE function). You can then search the index with LEFT(main_field) and LEFT(reversed_field). It won't help when you have a wildcard in the middle of the string AND at the beginning (e.g. *Folder1*Folder2), but it will when the wildcard is at the beginning or the end.

E.g. if you want to search */Folder1, then search WHERE LEFT(reversed_field, 8) = '1redloF/'; for Folder1/*/FolderX, search WHERE LEFT(reversed_field, 8) = 'XredloF/' AND LEFT(main_field, 8) = 'Folder1/'.
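The mechanics of the two lookups can be sketched as follows (in Python for illustration; this only handles patterns with a single `*`, and the helper names are mine):

```python
def split_pattern(pattern: str):
    """Split a single-'*' pattern into (prefix, reversed_suffix),
    mirroring the LEFT(main_field) / LEFT(reversed_field) lookups."""
    prefix, _, suffix = pattern.partition("*")
    return prefix, suffix[::-1]

def matches(key: str, pattern: str) -> bool:
    prefix, rev_suffix = split_pattern(pattern)
    reversed_key = key[::-1]                       # what REVERSE() would store
    return (key.startswith(prefix)
            and reversed_key.startswith(rev_suffix)
            and len(key) >= len(prefix) + len(rev_suffix))

print(matches("Folder1/Folder2/FolderX", "Folder1/*/FolderX"))  # True
print(matches("Folder3", "*/Folder1"))                          # False
```

The length check guards against the prefix and suffix overlapping in short keys.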

Bilbao answered 26/2, 2013 at 1:26 Comment(1)
Very interesting idea. Indeed it doesn't cover all the cases, but it will probably be my fallback solution if I don't find anything better.Phreno

If your strings represent some kind of hierarchical structure (as your sample content suggests) without being "real" files, and you say you are open to alternative solutions, why not consider something like a file-based index?

  • Choose a new directory like myindex
  • Create an empty file for each entry using the string key as location & file name in myindex

Now you can find matches using glob; thanks to the hierarchical file structure, a glob search should be much faster than scanning all your database entries. If needed, you can map the results back to your MySQL data; thanks to your MySQL index on the key, this lookup will be very fast.

But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
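The build-and-query steps above could look roughly like this (sketched in Python; the myindex directory comes from the answer, the helper names are mine, and keys whose prefixes collide with existing files are not handled):

```python
import glob
import os

def add_key(index_dir: str, key: str) -> None:
    """Mirror one stored key as an empty file under the index directory."""
    path = os.path.join(index_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "a").close()

def search(index_dir: str, wildcard_query: str):
    """Let the filesystem's glob do the wildcard matching."""
    hits = glob.glob(os.path.join(index_dir, wildcard_query))
    return [os.path.relpath(h, index_dir) for h in hits]

add_key("myindex", "Folder1/Folder2")
add_key("myindex", "Folder2/Folder4")
print(search("myindex", "Folder1/*"))  # ['Folder1/Folder2']
```

In PHP the search step would be the glob() call mentioned in the comments.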

This solution will only pay off against a huge data set (but not too huge, as @Kyle mentioned) with a hierarchical structure that is deep rather than wide.

EDIT: sorry, this only works if the wildcards are in your search terms, not in the stored strings themselves.

Floppy answered 27/2, 2013 at 15:38 Comment(0)

As the wildcards (*) are in your data and not in your queries, I think you should start by breaking up your data into pieces. Create an index table with columns like:

dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),

If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.

If you have a value like "Folder1/*", store "Folder1/" in "wildcardEnd" and again assign the ID of the value in the main table to the "dataGroup" field in the above table.

You can then do a match within your query using:

indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))

This truncates the search string ('Folder1/WhatAmILookingFor/Data') to the length of each wildcardEnd value ("Folder1/" here) and matches it against that field. I assume MySQL is clever enough not to redo the truncation for every row, but to start with the first character and match it against each row (using B-tree indexes).

A value like "*/Folder4" goes into the "wildcardStart" field, but reversed. To cite Missy Elliott: "Is it worth it, let me work it / I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store "4redloF/" in "wildcardStart". Then a WHERE clause like the following will match rows:

indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))

Of course, you could do the REVERSE in your application logic instead.

Now for the tricky part. Something like "*/Fo*4" should get split up into two records:

# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo

# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4

Now when matching, you have to take care that every index record of a dataGroup is matched for a complete hit, and that the matched fragments do not overlap. This could also be solved in SQL, but it is beyond this question.
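The splitting step can be sketched as follows (in Python for illustration; the record layout follows the scheme above, and the function name is mine):

```python
def build_index_records(pattern: str):
    """Decompose one stored pattern into index-table records,
    following the exactString/wildcardEnd/wildcardStart scheme."""
    if "*" not in pattern:
        return [{"exactString": pattern}]
    records = []
    pieces = pattern.split("*")
    # pieces[0] is the literal start, pieces[-1] the literal end;
    # anything in between is a middle fragment, stored both ways
    if pieces[0]:
        records.append({"wildcardEnd": pieces[0]})
    for middle in pieces[1:-1]:
        records.append({"wildcardStart": middle[::-1], "wildcardEnd": middle})
    if pieces[-1]:
        records.append({"wildcardStart": pieces[-1][::-1]})
    return records

print(build_index_records("Folder1/*"))  # [{'wildcardEnd': 'Folder1/'}]
print(build_index_records("*/Fo*4"))
# [{'wildcardStart': 'oF/', 'wildcardEnd': '/Fo'}, {'wildcardStart': '4'}]
```

Each returned record would also carry the dataGroup ID pointing back to the main table row.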

Gahan answered 28/2, 2013 at 10:40 Comment(0)

A database isn't the right tool for these kinds of searches. You can still use a database (any database, any structure) to store the strings, but you have to write the code to do all the searching in memory. Load all the strings from the database (a few thousand strings is really no biggie), cache them, and run your search/match algorithm on them.

You will probably have to code the algorithm yourself, because the standard tools would be overkill for what you are trying to achieve, and there is no guarantee that they can do exactly what you need.

I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work until you get the regexes right, but it will be the fastest way to go.
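A minimal sketch of this approach (in Python for illustration; the PHP version with preg_match would look the same, and `*` as the wildcard character is an assumption from the question):

```python
import re

def compile_rules(patterns):
    """Pre-compile each stored wildcard string into an anchored regex."""
    return [(p, re.compile(".*".join(map(re.escape, p.split("*"))) + r"\Z"))
            for p in patterns]

def find_matches(rules, query):
    """Return every stored pattern that the query string matches."""
    return [p for p, rx in rules if rx.match(query)]

rules = compile_rules(["Folder1", "Folder1/*", "*/Folder4", "*/Fo*4"])
print(find_matches(rules, "Folder1/Folder2/Folder4"))
# ['Folder1/*', '*/Folder4', '*/Fo*4']
```

Compiling once and reusing the rules across requests is what makes this approach fast enough for a few thousand strings.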

Yarn answered 2/3, 2013 at 17:13 Comment(2)
Right now this is what I am doing: I load them into memory, turn the strings into regexes and look for matches for each entry. PHP is single-threaded, so there is no parallel processing. I'm looking into JavaScript solutions for the same thing, but multithreaded.Phreno
I understand. In that case, multithreading in PHP is the problem you should be solving. It will be much easier than rewriting the whole thing on some other platform (be it a DB or something else). I don't know PHP, but I do know you can call other things from it. Write it in Java or C#. C# is best IMO because it makes multithreading very easy and has built-in regex support.Yarn

I suggest reading the keys and their associated payload into a binary tree, ordered alphanumerically by key. If your keys are not terribly "clumped", you can avoid the (slight) additional overhead of building a balanced tree. You can also avoid any tree-maintenance code: if I understand your problem correctly, the data changes frequently, and it is simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for a value is straightforward and much more efficient than running a regex against a bunch of strings. You may even find, while working it through, that the wildcards in the tree lead to shortcuts that prune the search space. A quick search shows lots of resources and PHP snippets to get you started.
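The pruning idea can be sketched as follows (in Python, with a sorted array standing in for the tree; names and the `*` wildcard are assumptions): order the stored keys by the literal part before their first wildcard, then only keys whose literal prefix is a prefix of the query need a full regex check.

```python
import bisect
import re

# stored wildcard keys, ordered by their literal prefix (the part before '*')
patterns = sorted(["Folder1", "Folder1/*", "Folder2/Folder*", "*/Folder4"],
                  key=lambda p: p.split("*", 1)[0])
prefixes = [p.split("*", 1)[0] for p in patterns]

def candidates(query):
    """Prune: only patterns whose literal prefix is a prefix of the query
    can possibly match, and their prefixes all sort at or before the query."""
    hi = bisect.bisect_right(prefixes, query)   # nothing after this can match
    return [patterns[i] for i in range(hi) if query.startswith(prefixes[i])]

def match_rules(query):
    out = []
    for p in candidates(query):                 # full check on survivors only
        rx = ".*".join(map(re.escape, p.split("*"))) + r"\Z"
        if re.match(rx, query):
            out.append(p)
    return out

print(match_rules("Folder1/Folder2/Folder4"))  # ['*/Folder4', 'Folder1/*']
```

Keys starting with a wildcard have an empty prefix and are always candidates, so the pruning helps most when few rules begin with `*`.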

Tactics answered 4/3, 2013 at 20:7 Comment(0)

If you run SELECT folder_col, COUNT(*) FROM your_sample_table GROUP BY folder_col, do you get duplicate folder_col values (i.e. COUNT(*) greater than 1)?

If not, that means you can produce an SQL that would generate a valid sphinx index (see http://sphinxsearch.com/).

Ehrlich answered 22/2, 2013 at 15:3 Comment(2)
Yes, the values are unique. Do you have additional information, e.g. is it possible to match a string with a wildcard (I didn't find it in the docs)? How does performance compare to MySQL for this specific operation? From what I've seen, Sphinx doesn't seem specialized for this task.Phreno
Once you have a valid index, you configure sphinxsearch to run as a daemon. With the Sphinx PHP client library you can build a query to sphinxsearch that returns the MySQL table ids matching your query. Recommended reading to get an idea of the full process: ibm.com/developerworks/library/os-php-sphinxsearch. Also read chapter 5 of the current docs, especially sphinxsearch.com/docs/current.html#extended-syntaxEhrlich

I wouldn't recommend doing text search over a large collection of data in MySQL. You need a database to store the data, but that would be it. For searching, use a search engine like:

Those services will let you do all sorts of funky text searches (including wildcards) in the blink of an eye ;-)

Crossfertilize answered 27/2, 2013 at 18:3 Comment(2)
As said for another answer, Sphinx doesn't allow that (the wildcard is in the database, not in the query term). Solr doesn't either, and from the docs it seems Elasticsearch doesn't either (elasticsearch.org/guide/reference/query-dsl/wildcard-query.html). Furthermore, Elasticsearch states the same limitations for wildcards as MySQL queries.Phreno
The truth is I didn't get the full sense of your question, but my recommendation is still the same: use a search engine for text search. The only question is how you are going to store the data. The easiest way would be breaking the content at "*" into multiple tokens and searching against those. Some additional filters can be defined to enforce ordering. Solr has many custom field types and I'm sure you can find something that fits your needs.Crossfertilize
