FTS doesn't work as expected with emails with dots
Asked Answered
P

3

9

We're developing a search as a part of a bigger system.

We have Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit) with this setup:

CREATE TABLE NewCompanies(
    [Id] [uniqueidentifier] NOT NULL,
    [Name] [nvarchar](400) NOT NULL,
    [Phone] [nvarchar](max) NULL,
    [Email] [nvarchar](max) NULL,
    [Contacts1] [nvarchar](max) NULL,
    [Contacts2] [nvarchar](max) NULL,
    [Contacts3] [nvarchar](max) NULL,
    [Contacts4] [nvarchar](max) NULL,
    [Address] [nvarchar](max) NULL,
    CONSTRAINT PK_Id PRIMARY KEY (Id)
);
  1. Phone is a structured comma separated digits string like "77777777777, 88888888888"
  2. Email is structured emails string with commas like "[email protected], [email protected]" (or without commas at all like "[email protected]")
  3. Contacts1, Contacts2, Contacts3, Contacts4 are text fields where users can specify contact details in free form. Like "John Smith +1 202 555 0156" or "Bob, +1-999-888-0156, [email protected]". These fields can contain emails and phones we want to search further.

Here we create full-text stuff

-- FULL TEXT SEARCH
CREATE FULLTEXT CATALOG NewCompanySearch AS DEFAULT;  
CREATE FULLTEXT INDEX ON NewCompanies(Name, Phone, Email, Contacts1, Contacts2, Contacts3, Contacts4, Address)
KEY INDEX PK_Id

Here is a data sample

INSERT INTO NewCompanies(Id, Name, Phone, Email, Contacts1, Contacts2, Contacts3, Contacts4) 
VALUES ('7BA05F18-1337-4AFB-80D9-00001A777E4F', 'PJSC Azimuth', '79001002030, 78005005044', '[email protected], [email protected]', 'John Smith', 'Call only at weekends +7-999-666-22-11', NULL, NULL)

Actually we have about 100 thousands of such records.

We expect users can specify a part of email like "@gmail.com" and this should return all the rows with Gmail email addresses in any of Email, Contacts1, Contacts2, Contacts3, Contacts4 fields.

The same for phone numbers. Users can search for a pattern like "70283" and a query should return phones with these digits in them. It's even for free form Contacts1, Contacts2, Contacts3, Contacts4 fields where we probably should remove all but digits and space characters firstly before searching.

We used to use LIKE for the search when we had about 1500 records and it worked fine but now we have a lot of records and the LIKE search takes infinite to get results.

This is how we try to get data from there:

SELECT * FROM NewCompanies WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), '"[email protected]*"') -- this doesn't get the row
SELECT * FROM NewCompanies WHERE CONTAINS((Phone, Contacts1, Contacts2, Contacts3, Contacts4), '"6662211*"') -- doesn't get anything
SELECT * FROM NewCompanies WHERE CONTAINS(Name, '"zimuth*"') -- doesn't get anything
Psid answered 18/2, 2020 at 9:8 Comment(26)
Why are all your columns an nvarchar(MAX) here? I have never heard of, or met anyone who's name is 1 Billion~ characters long. And, according to this answer, an email address can't be longer 254 characters; so you also have 1 Billion~ wasted characters there.Leal
Oh well yes, they're acutally different. I'll update my question. Thanks!Psid
Updated. Name is 400 symbols max but other fields (including the email field) can store more than just one email or phone.Psid
"other fields can store more than just one email or phone" You really need to fix that design then. Don't store delimited data in your columns. There should be a row for each value. Considering the set up you have here, this means having multiple tables to store the email, telephone number, etc details.Leal
Sounds like you're fighting with full-text search's word breakers. You're unlikely to find anything using @gmail.com as a search term because the @ character is a word breaker. In other words, depending the version of SQL Server you have, words in the index for [email protected] will be either (A) user, gmail and com or (B) user, [email protected], gmail and com. REF: Behavior Changes to Full-Text SearchStreamline
As for why CONTAINS(Email,'"[email protected]*"') returns nothing; characters like . and @ are "stop characters" or "word breakers". At "worst" the above string('[email protected]') would be seen as 5 separate words: s, m, s, gmail, and com.Leal
If, however, you fix your design, you won't have to do a wildcard search for '[email protected]*', you'll be able to simply do WHERE Email = '[email protected]'. Fix the design, fix the problem.Leal
@Larnu this won't fix other issues we have with the search. There are other fields like Phone and Contacts fields we can't to search in.Psid
Why would it not? I'm telling you to normalise the entire design, not just the email column.Leal
@Larnu We still have free form fields like Contacts where a user is allowed to type whatever they want and we have to find stuff in this mess.Psid
No, but then if you're searching free form text, then you're likely going to need LIKE, or ensure that you're searching for words (not email addresses).Leal
@Larnu LIKE doesn't search appropriate time. It takes minutes for it on 100k records.Psid
Yes, it will do, it's not SARGable. but you can't use a Full Text Index for search for email addresses. A . and an @ are word breakers.Leal
@Larnu is there any chance we can disable these word breakers for this case?Psid
You can change then, yes, however, removing . as a word breaker would be a really bad idea. View or Change Registered Filters and Word Breakers. This would mean that a search of 'America' would not work in the sentence 'Last year, I went to America.'Leal
@Larnu but I don't want to search for anything but emails and phones in those fields (Contacts1, Contacts2, Contacts3, Contacts4). Do you think this will work out? Can we disable word breakers only when we index Contacts fields as well as Phone and Email?Psid
"but I don't want to search for anything but emails and phones in those fields" then they should be stored in an appropriate column, like i said before. You have columns for that data, which should be normalised. Word breakers are set at instance/database level. so it would be a significant breaking change to remove ..Leal
Whats the question here?Burgoyne
@Iman Well, the main question here is how we can make FullText search work like LIKE clausePsid
Just a thought: would registering a new "language" with required stopwords and word breakers work for you? I believe you can specify language per column so your custom text handling will be restricted only to required columns. This is probably not the end, and you can look at making your own IFilter as well?Rare
You'd either want to normalize the tables to 1-M for all phone, email etc. records Second option is to split the columns (use string_split(email,','), in combination with Outer Apply. You'd have to specify a theoretical limit on the number of emails a user can have. Then write a search like this: SELECT * FROM NewCompanies WHERE Id IN (SELECT ID from .... where MyOuterApply.EmailCol1 LIKE '%'+@SearchString+'%') OR Id IN (SELECT ID from .... where MyOuterApply.EmailCol2 LIKE '%'+@SearchString+'%') . Create about five individual indexes on each of the fields and include the primary key.Hullo
@Larnu its stated, that there are about 100 thousands of such records. They are not going to change the structure.Berners
Just to be clear: Your queries do not return a result? You are not running into a timeout?Berners
@Berners Not going to, doesn't mean that shouldn't. The reason the OP is having the issue they are is because of the lack of normalisation.Leal
Dumb question perhaps, but have you considered using RegEx instead of FTS? I agree with @Larnu, though. You'd make your life a lot easier if you'd properly store the data in 3NF..Synchro
this works for me: SELECT * FROM NewCompanies WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), '[email protected]')Radtke
Q
2

Actually requests

SELECT [...] CONTAINS([...], '"6662211*"') -- doesn't get anything

against 'Call only at weekends +7-999-666-22-11' and

SELECT [...] CONTAINS(Name, '"zimuth*"') -- doesn't get anything

against 'PJSC Azimuth'

do work as expected.
See Prefix Term. Because 6662211* is not a prefix of +7-999-666-22-11 as well as zimuth* is not a prefix of Azimuth

As for

SELECT [...] CONTAINS([...], '"[email protected]*"') -- this doesn't get the row

This is probably due to word breakers as alwayslearning pointed out in comments. See word-breakers

I don't think that Full-Text Search is applicable for your task.

Why use for FTS in the exact same tasks that LIKE operator is used for? If there were a better index type for LIKE queries... then there would be the better index type, not the totally different technology and syntax.
And in no way it will help you to match "6662211*" against "666some arbitrary char22some arbitrary char11".
Full Text search is not about regex-es (and "6662211*" is not even a correct expression for the job - there is nothing about "some arbitrary char" part) it's about synonyms, word forms, etc.

But is it at all possible to search for substrings effectively?

Yes it is. Leaving aside such prospects as writing your own search engine, what can we do within SQL?

First of all - it is an imperative to cleanup your data! If you want to return to the users the exact strings they have entered

users can specify contact details in free form

...you can save them as is... and leave them along.
Then you need to extract data from the free form text (it is not so hard for emails and phone numbers) and save the data in some canonical form. For email, the only thing you really need to do - make them all lowercase or uppercase (doesn't matter), and maybe split then on the @ sing. But in phone numbers you need to leave only digits
(...And then you can even store them as numbers. That can save you some space and time. But the search will be different... For now let's dive into a more simple and universal solution using strings.)

As MatthewBaker mentioned you can create a table of suffixes. Then you can search like so

SELECT DISTINCT * FROM NewCompanies JOIN Sufficies ON NewCompanies.Id = Sufficies.Id WHERE Sufficies.sufficies LIKE 'some text%'

You should place the wildcard % only at the end. Or there would be no benefits from the Suffixes table.

Let take for example a phone number

+7-999-666-22-11

After we get rid of waste chars in it, it will have 11 digits. That means we'll need 11 suffixes for one phone number

           1
          11
         211
        2211
       62211
      662211
     6662211
    96662211
   996662211
  9996662211
 79996662211

So the space complexity for this solution is linear... not so bad, I'd say... But wait it's complexity in the number of records. But in symbols... we need N(N+1)/2 symbols to store all the suffixes - that is quadratic complexity... not good... but if you have now 100 000 records and do not have plans for millions in the near future - you can go with this solution.

Can we reduce space complexity?

I will only describe the idea, implementing it will take some effort. And probably we'll need to cross the boundaries of SQL

Let's say you have 2 rows in NewCompanies and 2 strings of free form text in it:

    aaaaa
    11111

How big should the Suffixes table be? Obviously, we need only 2 records.

Let's take another example. Also 2 rows, 2 free text strings to search for. But now it's:

    aa11aa
    cc11cc

Let's see how many suffixes do we need now:

         a // no need, LIKE `a%`  will match against 'aa' and 'a11aa' and 'aa11aa'
        aa // no need, LIKE `aa%` will match against 'aa11aa'
       1aa
      11aa
     a11aa
    aa11aa
         c // no need, LIKE `c%`  will match against 'cc' and 'c11cc' and 'cc11cc'
        cc // no need, LIKE `cc%` will match against 'cc11cc'
       1cc
      11cc
     c11cc
    cc11cc

No so bad, but not so good either.

What else can we do?

Let's say, user enters "c11" in the search field. Then LIKE 'c11%' needs 'c11cc' suffix to succeed. But if instead of searching for "c11" we first search for "c%", then for "c1%" and so on? The first search will give as only one row from NewCompanies. And there would be no need for subsequent searches. And we can

       1aa // drop this as well, because LIKE '1%' matches '11aa'
      11aa
     a11aa // drop this as well, because LIKE 'a%' matches 'aa11aa'
    aa11aa
       1cc // same here
      11cc
     c11cc // same here
    cc11cc

and we end up with only 4 suffixes

      11aa
    aa11aa
      11cc
    cc11cc

I can't say what the space complexity would be in this case, but it feels like it would be acceptable.

Quadruplex answered 26/2, 2020 at 22:17 Comment(0)
E
1

In cases like this full text searching is less than ideal. I was in the same boat as you are. Like searches are too slow, and full text searches search for words that start with a term rather than contains a term.

We tried several solutions, one pure SQL option is to build your own version of full text search, in particular an inverted index search. We tried this, and it was successful, but took a lot of space. We created a secondary holding table for partial search terms, and used full text indexing on that. However this mean we repeatedly stored multiple copies of the same thing. For example we stored "longword" as Longword, ongword, ngword, gword.... etc. So any contained phrase would always be at the start of the indexed term. A horrendous solution, full of flaws, but it worked.

We then looked at hosting a separate server for lookups. Googling Lucene and elastisearch will give you good information on these off the shelf packages.

Eventually, we developed our own in house search engine, which runs along side SQL. This has allowed us to implement phonetic searches (double metaphone) and then using levenshtein calculations along side soundex to establish relevance. Overkill for a lot of solutions, but worth the effort in our use case. We even now have an option of leveraging Nvidia GPUs for cuda searches, but this represented a whole new set of headaches and sleepless nights. Relevance of all these will depend on how often you see your searches being performed, and how reactive you need them to be.

Evalynevan answered 25/2, 2020 at 15:18 Comment(0)
R
1

Full-Text Indexes have a number of limitations. You can use wildcards on words that the index finds are whole "parts" but even then you are constrained to the ending part of the word. That is why you can use CONTAINS(Name, '"Azimut*"') but not CONTAINS(Name, '"zimuth*"')

From the Microsoft documentation:

When the prefix term is a phrase, each token making up the phrase is considered a separate prefix term. All rows that have words beginning with the prefix terms will be returned. For example, the prefix term "light bread*" will find rows with text of "light breaded," "lightly breaded," or "light bread," but it will not return "lightly toasted bread."

The dots in the email, as indicated by the title, are not the main issue. This, for example, works:

SELECT * FROM NewCompanies 
WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), '[email protected]') 

In this case, the index identifies the whole email string as valid, as well as "gmail" and "gmail.com." Just "s.m.s" though is not valid.

The last example is similar. The parts of the phone number are indexed (666-22-11 and 999-666-22-11 for example), but removing the hyphens is not a string that the index is going to know about. Otherwise, this does work:

SELECT * FROM NewCompanies 
WHERE CONTAINS((Phone, Contacts1, Contacts2, Contacts3, Contacts4), '"666-22-11*"')
Radtke answered 26/2, 2020 at 19:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.