Suggestions for Querying Database for Names

create index forenameFTX on mytable(forename) indextype is ctxsys.context; create index surnameFTX on mytable(surname) indextype is ctxsys.context; select score(1)+score(2) relevance, id, forename, surname from mytable where contains(forename,'!%john%',1) > 0 and contains(surname,'!%smith%',2) > 0 order by relevance desc;

I have come up with a solution which works pretty well, following the suggestions in the comments. Particularly, @X-Zero's suggestion of creating a table of Soundexes: In my case, I can create new tables, but altering the existing schema is not allowed!

So, my process is as follows:

Create a new table with columns: ID, token, sound and position; with the primary key over (ID, sound,position) and an additional index over (ID,sound).
Go through each person in the biographical table:
- Concatenate their forename and surname.
- Change the codepage to us7ascii, so accented characters are normalised. This is because the Soundex algorithm doesn't work with accented characters.
- Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
- Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with ID.

Like so:

declare
  nameString varchar2(82);
  token varchar2(40);
  posn integer;
  cursor myNames is
    select id,
           forename||' '||surname person_name
    from   mypeople;
begin
  for person in myNames
  loop
    nameString := trim(
                    utl_i18n.escape_reference(
                      regexp_replace(
                        regexp_replace(person.person_name,'[^[:alpha:]]',' '),
                        '\s+',' '),
                      'us7ascii')
                    )||' ';
    posn := 1;
    while nameString is not null
    loop
      token := substr(nameString,1,instr(nameString,' ') - 1);
      insert into personsearch values (person.id,lower(token),soundex(token),posn);
      nameString := substr(nameString,instr(nameString,' ') + 1);
      posn := posn + 1;
    end loop;
  end loop;
end;
/

So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3) and those three entries, with their Soundex, get inserted into personsearch along with their ID.

To search, we do the same process: tokenise the search criteria and then return results where the Soundexes and relative positions match. We order by the position and then the Levenshtein distance (ld) from the original search for each token, in turn.

This query, for example, will search against two tokens (i.e., pre-tokenised search string):

with     searchcriteria as (
         select 'john'  token1,
                'smith' token2
         from   dual)
select   alpha.id,
         mypeople.forename||' '||mypeople.surname
from     peoplesearch alpha
join     mypeople
on       mypeople.student_id = alpha.student_id
join     peoplesearch beta
on       beta.student_id = alpha.student_id
and      beta.position   > alpha.position
join     searchcriteria
on       1 = 1
where    alpha.sound = soundex(searchcriteria.token1)
and      beta.sound  = soundex(searchcriteria.token2)
order by alpha.position,
         ld(alpha.token,searchcriteria.token1),
         beta.position,
         ld(beta.token,searchcriteria.token2),
         alpha.student_id;

To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the position field in the joined table must be greater than the position of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I just post this here so you get the idea :)

As I say, this works pretty well: It returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s; returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:

The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem however is that a trigger needs to be put on the mypeople table to insert/update/delete tokens into the search table whenever the corresponding operation is performed on mypeople. This may slow up the system; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis.
No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s. A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...

Anyway, hope that helps :) Comments welcome!

EDIT My full solution (write up and implementation) is now here, using Metaphone and the Damerau-Levenshtein Distance.

Recommended topics

Hot tags