MySQL - How to find word with the most similar beginning
Asked Answered
L

4

5

How to find varchar-word that has the most similar beginning of the specified word in MySQL database?

For example:

+-------------------+
|    word_column    | 
+-------------------+
| StackOferflow     |
| StackExchange     |
| MetaStackExchange |
|       ....        |
+-------------------+

query: call get_with_similar_begin('StackExch_bla_bla_bla');
output: 'StackExchange'

query: call get_with_similar_begin('StackO_bla_bla_bla');
output: 'StackOferflow'


UPDATE :

Select * from words where word_column like 'StackExch_bla_bla_bla' will not give the correct result, because 'StackExchange' does not match this filter.

Additional info: I has BTREE-index on word_column and I would like to use it whenever possible

Lorenalorene answered 22/10, 2017 at 9:55 Comment(1)
like and regexp are not suitable because I want to get a word that begins the same way as the input word.Lorenalorene
H
2

In SQL Server we can use CTE like below query to achieve what you want:

declare @search nvarchar(255) = 'StackExch_bla_bla_bla';

-- A cte that contains `StackExch_bla_bla_bla` sub-strings: {`StackExch_bla_bla_bla`, `StackExch_bla_bla_bl`, ...,  `S`}
with cte(part, lvl) as (  
    select @search, 1
    union all 
    select substring(@search, 1, len(@search) - lvl), lvl + 1
    from cte
    where lvl < len(@search)
), t as (   -- Now below cte will find match level of each word_column
    select t.word_column, min(cte.lvl) matchLvl
    from yourTable t
    left join cte
      on t.word_column like cte.part+'%'
    group by t.word_column
)
select top(1) word_column
from t
where matchLvl is not null   -- remove non-matched rows
order by matchLvl;

SQL Server Fiddle Demo

I need more time to find a way in MySQL for it, Hope some MySQL experts answer faster ;).

My best try in MySQL is this:

select tt.word_column
from (
  select t.word_column, min(lvl) matchLvl
  from yourTable t
  join (
    select 'StackExch_bla_bla_bla' part, 1 lvl
    union all select 'StackExch_bla_bla_bl', 2
    union all select 'StackExch_bla_bla_b', 3
    union all select 'StackExch_bla_bla_', 4
    union all select 'StackExch_bla_bla', 5
    union all select 'StackExch_bla_bl', 6
    union all select 'StackExch_bla_b', 7
    union all select 'StackExch_bla_', 8
    union all select 'StackExch_bla', 9
    union all select 'StackExch_bl', 10
    union all select 'StackExch_b', 11
    union all select 'StackExch_', 12
    union all select 'StackExch', 13
    union all select 'StackExc', 14
    union all select 'StackEx', 15
    union all select 'StackE', 16
    union all select 'Stack', 17
    union all select 'Stac', 18
    union all select 'Sta', 19
    union all select 'St', 20
    union all select 'S', 21
  ) p on t.word_column like concat(p.part, '%')
  group by t.word_column
  ) tt
order by matchLvl
limit 1;

I think by creating a stored procedure and using a temp table to store values in p sub-select you can achieve what you want -HTH ;).

MySQL Fiddle Demo

Haber answered 22/10, 2017 at 10:33 Comment(2)
Its a worked solution but its not good. Because my column has btree-index and I think database can use it index only one time (without creating a large number of word-beginnings).Lorenalorene
I know this is not so good, I just add it to make my algorithm more clear, Hope someone give you a good answer ;).Haber
K
2

This is a slight variation on @shA.t's answer. The aggregation is not necessary:

select t.*, p.lvl
from yourTable t join
     (select 'StackExch_bla_bla_bla' as part, 1 as lvl union all
      select 'StackExch_bla_bla_bl', 2 union all
      select 'StackExch_bla_bla_b', 3 union all
      select 'StackExch_bla_bla_', 4 union all
      select 'StackExch_bla_bla', 5 union all
      select 'StackExch_bla_bl', 6 union all
      select 'StackExch_bla_b', 7 union all
      select 'StackExch_bla_', 8 union all
      select 'StackExch_bla', 9 union all
      select 'StackExch_bl', 10 union all
      select 'StackExch_b', 11 union all
      select 'StackExch_', 12 union all
      select 'StackExch', 13 union all
      select 'StackExc', 14 union all
      select 'StackEx', 15 union all
      select 'StackE', 16 union all
      select 'Stack', 17 union all
      select 'Stac', 18 union all
      select 'Sta', 19 union all
      select 'St', 20 union all
      select 'S', 21
     ) p
     on t.word_column like concat(p.part, '%')
order by matchLvl
limit 1;

A faster way is to use case:

select t.*,
       (case when t.word_column like concat('StackExch_bla_bla_bla', '%') then 'StackExch_bla_bla_bla'
             when t.word_column like concat('StackExch_bla_bla_bl', '%') then 'StackExch_bla_bla_bl'
             when t.word_column like concat('StackExch_bla_bla_b', '%') then 'StackExch_bla_bla_b'
             . . .
             when t.word_column like concat('S', '%') then 'S'
             else ''
        end) as longest_match
from t
order by length(longest_match) desc
limit 1;

Neither of these will make effective use of the index.

If you want a version that uses the index, then do the looping at the application layer, and repeated run the query as:

select t.*
from t
where t.word_column like 'StackExch_bla_bla_bla%'
limit 1;

Then stop when you hit the first match. MySQL should be using the index for the like comparison.

You can come pretty close to this using a union all:

(select t.*, 'StackExch_bla_bla_bla' as matching
 from t
 where t.word_column like 'StackExch_bla_bla_bla%'
 limit 1
) union all
(select t.*, 'StackExch_bla_bla_bl'
 from t
 where t.word_column like 'StackExch_bla_bla_bl%'
 limit 1
) union all
(select t.*, 'StackExch_bla_bla_b'
 from t
 where t.word_column like 'StackExch_bla_bla_b%'
 limit 1
) union al
. . .
(select t.*, 'S'
 from t
 where t.word_column like 'S%'
 limit 1
)
order by length(matching) desc
limit 1;
Kilburn answered 22/10, 2017 at 12:11 Comment(1)
Interesting.. your CASE query is at least 10 times faster than any single query with a join, even though it needs a full table/index scan and a temptable sort. However - The UNION solution is instant.Receiver
T
2

Create table/insert data.

CREATE DATABASE IF NOT EXISTS stackoverflow;
USE stackoverflow;

DROP TABLE IF EXISTS word;
CREATE TABLE IF NOT EXISTS word(
      word_column VARCHAR(255)
    , KEY(word_column)
)
;

INSERT INTO word
    (`word_column`)
VALUES
    ('StackOverflow'),
    ('StackExchange'),
    ('MetaStackExchange')
;

This solution depends on generating a large number list. We can do that with this query. This query generates numbers from 1 to 1000. I do this so this query will support searches up to 1000 chars.

Query

SELECT 
 @row := @row + 1 AS ROW
FROM (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) 
 row1
CROSS JOIN (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
  SELECT @row := 0
) AS init_user_param

result

  row  
--------
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
     ...
     ...
     990
     991
     992
     993
     994
     995
     996
     997
     998
     999
    1000

Now we use the last query as delivered table in combination with DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, [number])to find a unique word list.

Query

SELECT 
 DISTINCT  
   SUBSTRING('StackExch_bla_bla_bla', 1, rows.row) AS word
FROM (

  SELECT 
   @row := @row + 1 AS ROW
  FROM (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) 
   row1
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row2
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row3
  CROSS JOIN (
    SELECT @row := 0
  ) AS init_user_param
) ROWS

Result

word                   
-----------------------
S                      
St                     
Sta                    
Stac                   
Stack                  
StackE                 
StackEx                
StackExc               
StackExch              
StackExch_             
StackExch_b            
StackExch_bl           
StackExch_bla          
StackExch_bla_         
StackExch_bla_b        
StackExch_bla_bl       
StackExch_bla_bla      
StackExch_bla_bla_     
StackExch_bla_bla_b    
StackExch_bla_bla_bl   
StackExch_bla_bla_bla  

Now want can join and use REPLACE(word_column, word, '') and CHAR_LENGTH(REPLACE(word_column, word, ''))to generate a list.

Query

SELECT 
 *
 , REPLACE(word_column, word, '') AS replaced
 , CHAR_LENGTH(REPLACE(word_column, word, '')) chars_afterreplace
FROM (
 SELECT 
   DISTINCT  
     SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word
  FROM (

    SELECT 
     @row := @row + 1 AS row_number
    FROM (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) 
     row1
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row2
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row3
    CROSS JOIN (
      SELECT @row := 0
    ) AS init_user_param
  ) ROWS
) words
INNER JOIN
  word
ON
 word.word_column LIKE CONCAT(words.word, '%')

Result

word        word_column    replaced       chars_afterreplace  
----------  -------------  -------------  --------------------
S           StackExchange  tackExchange                     12
S           StackOverflow  tackOverflow                     12
St          StackExchange  ackExchange                      11
St          StackOverflow  ackOverflow                      11
Sta         StackExchange  ckExchange                       10
Sta         StackOverflow  ckOverflow                       10
Stac        StackExchange  kExchange                         9
Stac        StackOverflow  kOverflow                         9
Stack       StackExchange  Exchange                          8
Stack       StackOverflow  Overflow                          8
StackE      StackExchange  xchange                           7
StackEx     StackExchange  change                            6
StackExc    StackExchange  hange                             5
StackExch   StackExchange  ange                              4
StackExch_  StackExchange  StackExchange                    13

Now we can clearly see we want the word with the lowest chars_afterreplace. So we want to do ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC LIMIT 1

Query

SELECT 
 word.word_column
FROM (
 SELECT 
   DISTINCT  
     SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word
FROM (

  SELECT 
    @row := @row + 1 AS row_number
  FROM (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) 
   row1
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row2
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row3
  CROSS JOIN (
    SELECT @row := 0
  ) AS init_user_param
) ROWS

) words
INNER JOIN word
ON word.word_column LIKE CONCAT(words.word, '%')
ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC
LIMIT 1

Results

word_column    
---------------
StackExchange  
Tobiastobie answered 22/10, 2017 at 13:8 Comment(0)
R
0

The following solutions need a table containing sequence numbers from 1 to (at least) the length of your word_column. Assuming the word_column is VARCHAR(190) you need a table with numbers from 1 to 190. If you use MariaDB with the sequence plugin, you can use the table seq_1_to_190. If you don't have it, there are many ways to create it. One simple way is to use the information_schema.columns table:

create table if not exists seq_1_to_190 (seq tinyint unsigned auto_increment primary key)
    select null as seq from information_schema.columns limit 190;

You can also create it on-the-fly in a subquery, but that would complicate your queries.

I will use the session variable @word to store the search string.

set @word = 'StackExch_bla_bla_bla';

But you can replace all its occurrences with the constant search string.

Now we can use the sequence table to create all prefix substrings with

select seq as l, left(@word, seq) as substr
from seq_1_to_190 s
where s.seq <= char_length(@word)

http://rextester.com/BWU18001

and use it for the LIKE condition when you join it with your words table:

select w.word_column
from (
    select seq as l, left(@word, seq) as substr
    from seq_1_to_190 s
    where s.seq <= char_length(@word)
) s
join words w on w.word_column like concat(replace(s.substr, '_', '\_'), '%')
order by s.l desc
limit 1

http://rextester.com/STQP82942

Note that _ is a placeholder and you need to escape it in your search string with \_. You also need to do that for % if your string can contain it, but I will skip this part in my answer.

The query can also be written without the subquery:

select w.word_column
from seq_1_to_190 s
join words w on w.word_column like concat(replace(left(@word, seq), '_', '\_'), '%')
where s.seq <= char_length(@word)
order by s.seq desc
limit 1

http://rextester.com/QVZI59071

These queries do the job and in theorie they should also be fast. But MySQL (In my case MariaDB 10.0.19) creates a bad execution plan and doesn't use the index for the ORDER BY clause. Both queries run in about 1.8 seconds on a 100K rows data set.

Best I could do to improve the performance with a single query is

select (
    select word_column
    from words w
    where w.word_column like concat(replace(left(@word, s.seq), '_', '\_'), '%')
    limit 1
) as word_column
from seq_1_to_190 s
where s.seq <= char_length(@word)
having word_column is not null
order by s.seq desc
limit 1

http://rextester.com/APZHA8471

This one is faster, but still needs like 670 msec. Note that Gordons CASE query runs in 125 msec, though it needs a full table/index scan and filesort.

However I managed to force the engine to use the index for the ORDER BY clause with an indexed temporary table:

drop temporary table if exists tmp;
create temporary table tmp(
    id tinyint unsigned auto_increment primary key,
    pattern varchar(190)
) engine=memory
    select null as id, left(@word, seq) as pattern
    from seq_1_to_190 s
    where s.seq <= char_length(@word)
    order by s.seq desc;

select w.word_column
from tmp force index for order by (primary)
join words w 
    on  w.word_column >= tmp.pattern
    and w.word_column <  concat(tmp.pattern, char(127))
order by tmp.id asc
limit 1

http://rextester.com/OOE82089

This query is "instant" (less than 1 msec) on my 100K rows test table. If I remove FORCE INDEX or use a LIKE condition, it will be slow again.

Note that char(127) seems to work for ASCII strings. You might need to find another character according to your character set.

After all that, I must say that my first thought was to use a UNION ALL query, which was also suggested by Gordon Linoff. However - here is a SQL only solution:

set @subquery = '(
    select word_column
    from words
    where word_column like {pattern}
    limit 1
)';

set session group_concat_max_len = 1000000;
set @sql = (
    select group_concat(
        replace(
            @subquery,
            '{pattern}',
            replace(quote(concat(left(@word, seq), '%')), '_', '\_')
        )
        order by s.seq desc
        separator ' union all '
    )
    from seq_1_to_190 s
    where s.seq <= char_length(@word)
);
set @sql = concat(@sql, ' limit 1');

prepare stmt from @sql;
execute stmt;

http://rextester.com/OPTJ37873

It is also "instant".

If you like strored procedures/functions - Here's a function:

create function get_with_similar_begin(search_str text) returns text
begin
    declare l integer;
    declare res text;
    declare pattern text;

    set l = char_length(search_str);
    while l > 0 and res is null do
        set pattern = left(search_str, l);
        set pattern = replace(pattern, '_', '\_');
        set pattern = replace(pattern, '%', '\%');
        set pattern = concat(pattern, '%');
        set res = (select word_column from words where word_column like pattern);
        set l = l - 1;
    end while;
    return res;
end

Use it as

select get_with_similar_begin('StackExch_bla_bla_bla');
select get_with_similar_begin('StackO_bla_bla_bla');

http://rextester.com/CJTU4629

It is probably the fastest way. Though for long strings a kind of divide and conquer algorinthm might decrease the average number of lookups. But might also be just overkill.

If you want to test your queries on a big table - I used the following code to create my test table (for MariaDB with sequence plugin):

drop table if exists words;
create table words(
    id mediumint auto_increment primary key,
    word_column varchar(190),
    index(word_column)
);

insert into words(word_column)
    select concat('Stack', rand(1)) as word_column
    from seq_1_to_100000;

insert into words(word_column)values('StackOferflow'),('StackExchange'),('MetaStackExchange');
Receiver answered 22/10, 2017 at 23:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.