SQL Server Full Text Search around numbers and underscores
Asked Answered
E

4

10

Using SQL Server 2012 (in general using SQL Server 2008 R2 up to SQL Server 2016)

This question is a more specific re wording of SQL-Server Full Text Index Unexpected results. Please see here for how we have got to this point and what has been tried already.

I am re-posting now we have found the specific fault. Many thanks to @HoneyBadger.

His help has been invaluable getting to this point.

Table structure:

CREATE TABLE TestFullTextSearch (Id INT NOT NULL, AllText NVARCHAR(400))

CREATE UNIQUE INDEX test_tfts ON TestFullTextSearch(Id)

CREATE FULLTEXT CATALOG ftcat_tfts

CREATE FULLTEXT INDEX ON TestFullTextSearch(AllText)
    KEY INDEX test_tfts ON ftcat_tfts
    WITH CHANGE_TRACKING AUTO, STOPLIST OFF

Data:

INSERT INTO TestFullTextSearch 
VALUES (1, ' 123_456 789 '), (2, ' 789 123_456 '),
       (3, ' 123_456 ABC '), (4, ' ABC 123_456 ')

Please note this data is purely to demonstrate the issue, and is not indicative of a live data set. Our live data sets can be over 500,000 rows, with paragraphs of data in a single field being searched - hence using full text searches.

Select 1: Results as expected

SELECT *
FROM TestFullTextSearch
WHERE CONTAINS (AllText, '"123*"')

Id          AllText
----------- ------------
1            123_456 789
2            789 123_456
3            123_456 ABC
4            ABC 123_456

SELECT 2: Misses row 2 in the result set

SELECT *
FROM TestFullTextSearch
WHERE CONTAINS (AllText, '"123_*"')

Id          AllText
----------- ------------
1            123_456 789
3            123_456 ABC
4            ABC 123_456

SELECT 3: Only returns row 2

SELECT *
FROM TestFullTextSearch
WHERE CONTAINS (AllText, '"123\_*"')

Id          AllText
----------- ------------
2            789 123_456

Conclusion: Searching for a number string suffixed with an underscore fails if preceding word is a number string.

Problem: Our clients use full text search and expect a result to be forth coming around part numbers and catalogue references, which may or may not be in a section of text including other number strings. Full text search does not seem to support this in a consistent manner.

Any help gratefully received.

NB: This issue does not happen on SQL SERVER 2008, but does on 2012+

I have also tried switching to the older version of FTS parser. Testing with

SELECT * FROM sys.dm_fts_parser (' "789 123_456" ',1033,0,0)
SELECT * FROM sys.dm_fts_parser (' "789 123_456" ',2057,0,0)

I had with current parser: enter image description here

And after reverting to the legacy parser: enter image description here

So it has had an effect, however I'm still getting the same results.

Are there any other differences in the full text search between 2008 & 2012 that could be having this effect?

Encyclopedist answered 15/2, 2018 at 7:49 Comment(1)
Can't reproduce the problem. All three variants of the query return all 4 rows. Microsoft SQL Server 2008 (SP4) - 10.0.6241.0 (X64) Apr 17 2015 10:56:08 Copyright (c) 1988-2008 Microsoft Corporation Standard Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)Ming
D
5

They changed the full text parsers/stemmers between SQL 2008 and SQL 2012.

With a registry change, you can use the legacy parser, which should work better in your situation.

See https://technet.microsoft.com/en-us/library/gg509108(v=sql.110).aspx for details.

If you need to support both old and new style, then you can revert US English to the old and keep UK English the new (or vice versa)

Using SQL 2016, I reverted UK English and kept US English the same:

exec sp_help_fulltext_system_components 'wordbreaker', 1033

exec sp_help_fulltext_system_components 'wordbreaker', 2057

Returns: Screenshot showing fts components

I created another table using UK English and populated it.

CREATE TABLE TestFullTextSearch2 (Id INT NOT NULL, AllText NVARCHAR(400))

CREATE UNIQUE INDEX test_tfts2 ON TestFullTextSearch2(Id)


CREATE FULLTEXT INDEX ON TestFullTextSearch2(AllText language 2057)
    KEY INDEX test_tfts2 ON ftcat_tfts 
    WITH CHANGE_TRACKING AUTO, STOPLIST OFF

    INSERT INTO TestFullTextSearch2
VALUES (1, ' 123_456 789 '), (2, ' 789 123_456 '),
       (3, ' 123_456 ABC '), (4, ' ABC 123_456 ')

I'm getting the expected 4 results for all 3 queries.

Results of FTS queries

Verify that your changes have taken effect.

exec sp_help_fulltext_system_components 'wordbreaker', 1033

exec sp_help_fulltext_system_components 'wordbreaker', 2057

select t.name, c.* from sys.tables t inner join sys.fulltext_index_columns c on t.object_id = c.object_id
Donell answered 20/2, 2018 at 15:57 Comment(1)
Needed to add an additional step to this. Once the parser change had been made I had to drop and recreate the FTI to refresh the data. Auto changes didn't pick it up as no changes had been made to the source data. Re-population also didn't help had to drop and re-create.Encyclopedist
C
3

The problem here is basically a difference in how MSSQL 2012 stores the index and how the query itself is handling the underscore _.

This becomes clear when inspecting the index keywords & fts parser. For row 2 the keyword 123_456 is not stored as such, because of the numeric value coming in front of it. However the fts parser will search for an exact match on "123_" and does not remove the underscore.

select * from sys.dm_fts_index_keywords_by_document
( 
    DB_ID('TestDatabase'),     
    OBJECT_ID('TestFullTextSearch') 
) order by document_id

select * from sys.dm_fts_parser('"123_*"', 0, 0, 0)

One solution would be to change the word breaker for a specific language. You could easily replace it with the word breaker dll from MSSQL 2008 or 2016 where this problem does not occur. (e.g. take the one for Neutral Language: NaturalLanguage6.dll). Make sure to create the Full Text index for the same language.

To find the registered word breakers and where dll's are located use this query:

EXEC sp_help_fulltext_system_components 'wordbreaker';
Curtin answered 22/2, 2018 at 11:8 Comment(1)
Thanks redrobot. Straycatdba had pointed out the parser issue already, which you both rightly stated was the issue. I will say however that the extra script you provided was what highlighted to me I had to drop and recreate the index as it wasn't updating. I can only mark one as the answer, which i have done for straycat as he got in there first, but have voted up your result to give what credit I can. Many thanks for the help.Encyclopedist
E
0

Why don't you use LIKE operator? Try AllText LIKE '%123[_]%', it will return all four rows.

Another solution wold be using CHARINDEX, like:

where charindex('123_', AllText) > 0

0 would mean that string isn't found within the other one.

Exodontist answered 15/2, 2018 at 7:55 Comment(2)
Purely down to efficiency. Like is very slow compared to full text search, especially in large data sets. The data here is only to demonstrate the problem, rather than being indicative of our actual data. A plus one for the work around but not a feasible answer for our production data. Hopefully it will help someone else though,Encyclopedist
Again there are distinct efficiency issues with charindex. We are using full text search for speed. There are too many records (500K + rows in some cases), each with possibly paragraphs of data that are being searched through. While these solutions you have offered do work, they do not work quickly enough with our data.Encyclopedist
M
0

If you have issue with Full Text search with above mention query by @Michal then there is other alternate solution which you can apply as below:

  1. While insert AllText in database maintain other column which hold flag (boolean) indicating include 123, so at time of SELECT statement just check that flag.
  2. Maintain Computed column with pattern formula, which return true OR false.
Mutter answered 19/2, 2018 at 11:5 Comment(1)
Hi Sandip. Unfortunately this will not work. As stated above the data set provided is only to demonstrate the issue. The field being searched is effectively free text potentially paragraphs worth of data. This could not be mapped to individual fields. There is no way of creating that many columns or the supporting code to search them.Encyclopedist

© 2022 - 2024 — McMap. All rights reserved.