Is it possible to use a full-text index to find the closest matching strings? What does Statistical Semantics do in Full-Text Indexing?

I am looking at SQL Server 2016 full-text indexes, and they are great for finding strings that contain multiple words.

When I try to create the full-text index, it shows Statistical Semantics as a checkbox. What does Statistical Semantics do?

Moreover, I want to support "did you mean" queries.

For example, let's say I have a record "house" and the user types "hause".

Can I use a full-text index to return "house" as the closest match for "hause" and efficiently show the user "did you mean house"? Thank you.

I have tried SOUNDEX, but the results it generates are terrible.

It returns so many unrelated words

Since there are so many records in my database and I need very fast results, I need something SQL Server supports natively.

Any ideas? Is there any way to achieve this using indexes?

I know there are multiple algorithms, but they are not efficient enough for me to use online, for example calculating the edit distance against every record. They could be used for offline projects, but I need this efficiency in an online dictionary that will receive thousands of requests constantly.

I already have a plan in mind: store not-found queries in the database, calculate their closest matches offline, and use those as a cache. However, I wonder whether any online/live solution exists. Consider that there will be over 100 million nvarchar records.

The short answer is no: Full-Text Search cannot search for words that are similar but different.

Full Text Search uses stemmers and thesaurus files:

The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").

A Full-Text Search thesaurus defines a set of synonyms for a specific language.

Both the stemmers and the thesaurus are configurable, and you can easily have Full-Text Search match "house" for a search on "hause", but only if you add "hause" as a synonym for "house". This is obviously a non-solution, as it requires you to add every possible typo as a synonym.

Semantic search is a different topic: it lets you search for documents that are semantically close to a given example. It does not help with misspellings.

What you want is to find records that have a short Levenshtein distance from a given word (a.k.a. "fuzzy" search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exist.
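
To make the full-scan cost concrete, here is a minimal sketch of what such a Levenshtein search does, written in Python rather than T-SQL or CLR purely for illustration. The function names `levenshtein` and `closest_matches` and the `max_distance` parameter are my own assumptions, not part of SQL Server or any library. Note that every candidate word is compared against the search term, which is why this approach struggles at 100 million rows.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_matches(term, words, max_distance=2):
    """Hypothetical helper: scan every word (no index) and return
    (word, distance) pairs within max_distance, nearest first."""
    scored = ((word, levenshtein(term, word)) for word in words)
    hits = [pair for pair in scored if pair[1] <= max_distance]
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    dictionary = ["house", "horse", "mouse", "table"]
    print(closest_matches("hause", dictionary))
```

With the sample list above, "house" comes back first at distance 1, followed by "horse" and "mouse" at distance 2. The work grows with the number of rows times the word length, so in practice you would probably combine this with the offline cache of not-found terms described in the question.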
