SQL-Server Full Text Index Unexpected results
Asked Answered
P

1

2

MS SQL SERVER 2012

I have a full text index configured on a database which for the main part seems to be functioning correctly. I have a keyword field that contains the following text:

 Veterinary Products Beaphar Fiprotec Spot On Small Dog 67mg x 1 235_ldzr_2 Beaphar 87112311437_ldzr_2

When trying to select the row I am getting unexpected results

select * from products where CONTAINS(keywords,'"235*"')
select * from products where CONTAINS(keywords,'"87112311437_ldzr_2*"')

Both return the expected result. This tells me that the index is setup and the underscore character is not an issue. However I get no results for:

select * from products where CONTAINS(keywords,'"235_*"')
select * from products where CONTAINS(keywords,'"235_ldzr_2*"')

I have automatic change tracking switched on, and have disabled it, rebuilt the index manually, re-enabled it, and still no luck. I have even gone to the point of evaluating the string character by character to get the ascii codes in case a special character had made it in via the import. The term does not appear in the stop list. I am now stuck as to why this item does not show.

I have created a copy in a seperate database and have not been able to reproduce the problem, so can't really post code here - as I cannot show the error.

Does anyone have any further checks I can perform or know of any other points that affect the full text search functionality?

@HoneyBadger I have narrowed it down. there seems to be a problem around digits followed by digits with an underscore. Try the following on your table structure.

DELETE FROM TestFullTextSearch
insert into TestFullTextSearch values (1, '235_ldzr_2 testing ')
insert into TestFullTextSearch values (2, 'test 235_ldzr_2 testing ')
insert into TestFullTextSearch values (3, 'A 235_ldzr_2 testing ')
insert into TestFullTextSearch values (4, '1 235_ldzr_2 testing ')
insert into TestFullTextSearch values (5, '12 235_ldzr_2 testing ')
insert into TestFullTextSearch values (6, '123 235_ldzr_2 testing ')

SELECT * 
from   TestFullTextSearch 
where  contains(AllText, '"235*"')

SELECT * 
from   TestFullTextSearch 
where  contains(AllText, '"235_*"')

Rows 1,2,3 all return as expected. 4,5,6 don't when the underscore is included.

Phonetics answered 14/2, 2018 at 13:55 Comment(3)
Some language parsers treat _ as a delimiter so it cannot be search for literally. 1) run select lcid from sys.syslanguages where langid=@@langid to get the LCID for your db language 2) run select * from sys.dm_fts_parser('"235_*"', your-lcid-here, 0, 0); this will show you how the string is parsed (including the catalogs stop list)Exile
Hi Alex. Good call - but already been down that road. The underscore is being parsed correctly. I even have multiple instances of the underscore in that single field, one of which can be searched on, one of which can't. Good to see we're thinking alike though.Phonetics
I came to the same conclusion it had to do with numbers, please see my last edit.Bleier
B
2

SQL Server has a list of words and symbols which are considered noise, these are excluded from the index, and you can't search for it specifically. You can edit this list, or you can create a new one. See here for the how-to.

Apparently, you can even edit it at file level, see here.

Edit:

I've been able to reproduce the problem:

create table TestFullTextSearch (
Id int not null,
AllText nvarchar(400)
)

create unique index test_tfts on TestFullTextSearch(Id);
create fulltext catalog ftcat_tfts;
create fulltext index on TestFullTextSearch(AllText)
key index test_tfts on ftcat_tfts
with change_tracking auto, stoplist off
go

insert into TestFullTextSearch values (1, 'legacyreport Report Legacy 23049823490  20150713 Cardiac US ')
insert into TestFullTextSearch values (2, '123-45-678 foo bar  19450712 20020723 Exercise Stress US ')
insert into TestFullTextSearch values (3, '2048 jj goodguy xy2000 19490328 20150721 Cardiac US ')
insert into TestFullTextSearch values (4, '12345678 4.0 ALLCALCS  19650409 20031103 Cardiac Difficult US ')
insert into TestFullTextSearch values (5, 'Veterinary Products Beaphar Fiprotec Spot On Small Dog 67mg x 1 235_ldzr_2 Beaphar 87112311437_ldzr_2 ')

On this table, if I execute

select * 
from   TestFullTextSearch 
where  contains(AllText, '"235_ldzr_2*"')

I get no results. However, if I add a backslash:

select  * 
FROM    TestFullTextSearch 
WHERE   CONTAINS(AllText, '"235\_ldzr_2*"')

I do get a result!

I don't understand though. If I add another row:

insert into TestFullTextSearch values (6, 'Veterinary Products Beaphar Fiprotec Spot On Small Dog 67_mg x 1 235_ldzr_2 Beaphar 87112311437_ldzr_2 ')

And I search for WHERE CONTAINS(AllText, '"67_*"'), I get the result back as expected... Other tests as well, don't show the same behaviour as contains(AllText, '"235_ldzr_2*"').

By the way, my LCID = 1033.

One more test case:

Insert into TestFullTextSearch values (15, 'Veterinary Products Beaphar Fiprotec Spot On Small Dog 100_ldzr_2 x 1 225_ldzr_2 Beaphar 87112311437_ldzr_2 ')

This record I can find with

CONTAINS(AllText, '"100_*"')

However, this record, where I added x 1 in front of the 100_ldzr:

Insert into TestFullTextSearch values (16, 'Veterinary Products Beaphar Fiprotec Spot On Small Dog x 1 100_ldzr_2 x 1 225_ldzr_2 Beaphar 87112311437_ldzr_2 ')

I cannot find with the same CONTAINS, but I can find it if I add a backslash.

Conclusion (for now)

It seems there can be an issue when numbers are followed by a space, then followed by numbers ending in an underscore. Consider the following:

select * from sys.dm_fts_parser('"x 235_*"', 1033, 0, 0)

This results in 2 search terms: "x" and "235_"

This however:

select * from sys.dm_fts_parser('"1 235_*"', 1033, 0, 0)

Results in seven search terms: "1 235", "1235", "1", "235", and "_". (1 and 235 also occur as nn1/nn235).

The absence of "235_" explains why it could not be found. When I added a backslash, it was interpreted as word divider , matching 235\_ to 235 (and to _).

I'm pretty confident this is the cause. The solution is probably more difficult.

Bleier answered 14/2, 2018 at 14:29 Comment(5)
Hi HoneyBadger. Your right there are many characters that do get replaced & - / \ etc. but the underscore isn't one of them. We actually do several replacements to access these characters where our clients require them to be searched. e.g. we replace '-' with 'hyph' or '&' with 'and'. This is something we do on a regular basis as we "mis-use" full text searches for matching part number. A plus 1 for the answer as it may help someone else, but unfortunately not the case here.Phonetics
ok there are underscores in the above comment but I just get bold text - sorry for any confusionPhonetics
@MatthewBaker, Been doing some digging, please see edit. I'm a bit flabbergasted now...Bleier
I've just run the same tests. select * FROM TestFullTextSearch WHERE CONTAINS(AllText, '"235_ldzr_2*"') returns 5&6, however select * FROM TestFullTextSearch WHERE CONTAINS(AllText, '"87112311437_ldzr_2*"') returns no rows at all. It's not just me doing something daft then?Phonetics
And yes my lcid = 1033 too.Phonetics

© 2022 - 2024 — McMap. All rights reserved.