Is that possible to use full text index to find closest match strings? What does Statistical Semantics do in Full Text Indexing
Asked Answered
T

1

7

I am looking for SQL Server 2016 full text indexes and they are awesome to make searches for finding multiple words containing strings

When i try to compose the full text index, it shows Statistical Semantics as a tickbox. What does statistical semantics do?

Moreover, I want to find did you mean queries

For example lets say i have a record as house. The user types hause

Can i use full text index to return hause as closest match and show user did you mean house efficiently ? thank you

I have tried soundex but the results it generates are terrible

It returns so many unrelated words

And since there are so many records in my database and i need very fast results, i need something SQL server natively supports

Any ideas? Any way to achieve such thing with using indexes?

I know there are multiple algorithms but they are not efficient enough for me to use online. I mean like calculating edit distance between each records. They could be used for offline projects but i need this efficiency in an online dictionary where there will be thousands of requests constantly.

I already have a plan in my mind. Storing not-found results in the database and offline calculating closest matches. And using them as cache. However, i wonder any possible online/live solution may exists? Consider that there will be over 100m nvarchar records

Towardly answered 22/3, 2017 at 9:35 Comment(2)
There is no SQL Server 2017 version - we have 2012, 2014, 2016 and also "vNext" in the works - which one are you really referring to??Therontheropod
@Therontheropod 2016 fixed my typo ty.Heliozoan
R
2

Short answer is no, Full Text Search cannot search for words that are similar, but different.

Full Text Search uses stemmers and thesaurus files:

The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").

A Full-Text Search thesaurus defines a set of synonyms for a specific language.

Both stemmers and thesaurus are configurable and you can easily have FT match house for a search on hause, but only if you added hause as a synonym for house. This is obviously a non-solution as it requires you to add every possible typo as a synonym...

Semantic search is a different topic, it allows you to search for documents that are semantically close to a given example.

What you want is to find records that have a short Levenshtein distance from a given word (aka. 'fuzzy' search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exists.

Rayborn answered 25/3, 2017 at 17:24 Comment(8)
ty for answer. yes i am aware of algorithms but as you also noted, they require full table scan which is a no no in my online application. so you say there is no such index or feature in SQL server right? from this answer i am also guessing that google uses cached results based on offline training?Heliozoan
What you can try is to build all words that are at Levenshtein 1,2,3 from a given word (possible and reasonable fast if a small character set is used, eg. Latin, not the entire Unicode) and then seek for all these 'close' matches.Rayborn
Also you can add any word that has no match to a table and periodically build all 'close' words for these unmatched words and index it. Next time, will be found. In time, you'll cover all common typos.Rayborn
all words relation would generate so many relations. however it is also what in my mind that storing unfound words and calculating their closest pairs offline later. i wish SQL had some kind of index system but it seems impossible technically.Heliozoan
There are also heuristics, like assuming the first letter is correct. And also creating all the alternatives can be greatly pruned by applying common typo errors (eg. based on keyboard layout)Rayborn
so assuming first letter is correct, what technique i can apply based on indexes?Heliozoan
Assuming first letter is correct you can do a scan only on that letter, assuming an index on that word.Rayborn
like can use an index if of the form LIKE 'A%'Pagano

© 2022 - 2024 — McMap. All rights reserved.