Full Text Search of URL field sql server
Asked Answered
C

5

9

Objective: Return all URLs beginning with "https://mywebsite.domain.com/as/product/4/"

Given:

  • Applied full text search on URL field.
  • SQL Server edition: 2014.
  • 20+ Million rows

URL

https://mywebsite.domain.com/as/product/1/production
https://mywebsite.domain.com/as/product/2/items
https://mywebsite.domain.com/as/product/1/affordability
https://mywebsite.domain.com/as/product/3/summary
https://mywebsite.domain.com/as/product/4/schedule
https://mywebsite.domain.com/as/product/4/resources/summary

Query 1:

WHERE CONTAINS (URL, 'https://mywebsite.domain.com/as/product/4')

Result:

All records returned

Query 2 (Added "*" after reading MSDN article)

WHERE CONTAINS (URL, '"https://mywebsite.domain.com/as/product/4*"')

Result:

No records returned

Any assistance would be greatly appreciated.

Corn answered 18/11, 2016 at 0:3 Comment(8)
is url column always starting with https://mywebsite.domain.com/as/product/ or at least with https://mywebsite.domain.com?Engdahl
Yes each record starts with the protocol and domain.Corn
ok, but domain and protocol are fixed?Engdahl
Correct, they are fixed.Corn
also /as/product is always present? or could be different?Engdahl
Anything after the domain and protocol could be different.Corn
so you are looking for all url of a specific product number? for example 4?Engdahl
I am matching any value after the domain.Corn
C
6

You can use CONTAINS with a LIKE subquery for matching only a start:

SELECT * 
FROM (
SELECT * 
FROM myTable WHERE CONTAINS (URL, '"https://mywebsite.domain.com/as/product/4/"')
) AS S1 
WHERE S1.URL LIKE 'https://mywebsite.domain.com/as/product/4/%' 

This way, the SLOW LIKE operator query will be run against a smaller set of records

EDIT1: (if WHERE CONTAINS (URL, '"https://mywebsite.domain.com/as/product/4/"') is not filtering Values)

After a lot of searches. the problem is in / . The forward-slash isn't contained in the Noise Words file, but I guess it's classed as a delimiter or Word breaker and therefore isn't searchable.

Read these Topics:

EDIT2:

I found one suggested solution that is

/ is considered as an english wordbreaker You may change It from Registry

  • Navigate to Registry values HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceRoot>\MSSearch\Language\eng and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceRoot>\MSSearch\Language\enu
  • clear value for WBreakerClass.

Sql server consider https://mywebsite.domain.com/as/product/4 as one word.

Note: above both path i have taken by assuming that you are using English language as word breaker.

Read more about Word Breaker in This MSDN Topic

Carbamidine answered 22/11, 2016 at 23:49 Comment(3)
As MtwStark commented on Rafael's answer, we should add a slash before % or it will find also 44, 4xyx... the OP needs .../4/%Sinistrad
@Corn you don't need any full text search, drop full text index, put a standard non clustered index on your url column and use only the LIKE operator. the CONTAINS is only a needless overhead. With the index on url column LIKE FixedPattern+% is performed with index seek, you cannot improve it any further (maybe unless you set that index itself as clustered) see my answer for further detailsEngdahl
The downside to this approach is that SQL Server performs the LIKE scan first, and then merges it with the results of the CONTAINS - meaning that the slow LIKE operator is always run against all rows first, in addition to the full-text query.Doura
N
3

Use the Like operator:

WHERE URL LIKE 'https://mywebsite.domain.com/as/product/4%'

The % is a wildcard. This should return all records that start with a pattern match up to the first wildcard %.

Natal answered 22/11, 2016 at 23:23 Comment(8)
Let me be the first to welcome you to StackOverflow. Smaller record sets work great with "Like". However, when working with millions of records, the "Like" operators performance becomes an issueCorn
What about replacing the start of the string https://mywebsite.domain.com/as/product/ with a blank, leaving the start of your string as 4/etc/etc ? The LIKE operator may perform better when only traversing one character. A simple - try and fail fast - suggestion.Doehne
@Corn on indexed columns, searching a string from beginning is very fast also with LIKE operator, take a look at my answerEngdahl
you should add a slash before % or it will find also 44, 4xyx... the OP needs .../4/%Engdahl
I agree with MtwStark that an index should make prefix queries with LIKE operator perform very fast, provided it's a clustered index (i.e. B-tree), as this indexes strings in lexicographical order. A hash index or similar will not help with prefix search.Pronounce
@HenningKoehler the index will perform very very fast also if not clustered and also without a clustered index present and also without a primary key. the index is b-tree itself, clustered or not it's only about the physical order of data.Engdahl
Hmm.. re-reading the description I'd say you are right, non-clustered indexes have b-trees as well. So just avoid explicitly creating a hash-index.Pronounce
@WorkSmarter: Although a FULLTEXT index will be remarkably faster than a LIKE in the majority of situations where you are looking for some random part of text, LIKE will easily beat out FULLTEXT when you are specifically searching for the start of a an indexed string column. You can do that different ways, but MtwStark's 2nd and 3rd approach get my vote. It's going to be faster, easier to set up and save you a gazillion on resources and frustration with the FULLTEXT.Anthropoid
G
2

Provided you always search start of the string this will ensure optimizer can use index. I assume URL is VARCHAR

Declare @p varchar(500) ='https://mywebsite.domain.com/as/product/4'

Declare @maxChar char(1);
select @maxChar = max(ch)
from (
    select top(256) ch = char(row_number() over(order by (select null)) - 1)
    from sys.all_objects) t;
select @maxChar;

-- ..
WHERE URL > @p AND URL < @p + @maxChar

When comparing strings, Sql server adds trailing spaces to the shorter one. See https://support.microsoft.com/en-us/kb/316626 . According to http://www.ietf.org/rfc/rfc1738.txt , http://www.ietf.org/rfc/rfc1738.txt all allowed URL symbols are greater than space. So the search parameter, 'https://mywebsite.domain.com/as/product/4' for example, will be less than any URL which starts with this parameter and exceeds parameter length.

Geiger answered 24/11, 2016 at 12:10 Comment(9)
Why using CHAR(255)?Whoever
Good point, and it can be collation specific. See edited answer, compute @maxChar first.Geiger
did you have a reference for comparing varchar using comparaison operators > < <= >= = and if it benefits from indexes?? i am really interested. AndWhoever
After searching. when using Like predicate to search start of string Like 'xx%' it uses indexes also.Whoever
The difference is '1' like '1%' is true, '1' >'1' is false, OP needs only URL which start with but not equal to the search parameter.Geiger
if you have index on url, this is not better nor faster than where url like 'https://mywebsite.domain.com/as/product/4/%', if you have no index the best option would be where substring(url, 30, 13) collate latin1_general_bin = 'as/product/4/' collate latin1_general_binEngdahl
sorry, maybe I have lost something, where did you read "but not equal to" ?Engdahl
@MtwStark. May be just I'm misunderstanding something. I look at URL samples as a representative set and it doesn't contain ones equal to search parameter.Geiger
1st line of question, Objective: Return all URLs beginning with "https://mywebsite.domain.com/as/product/4/" the search string itself is includedEngdahl
E
2

For similar problems I'm used to two solutions, depending on your needs, mainly on performaces or resources or concurrency.. etc etc..

The LIKE operator could be your best friend also with very big tables.

Indexing
First of all, you need to index your url column, working with 20+ millions records it is not easy task, indexing it could cost you 1.5 - 2.0 Gb of disk space, but you will get your query in NO TIME (milliseconds)

With the index on the column to search, LIKE FixedPattern+% is performed with an index seek, you cannot improve it any further.

First solution:

CREATE NONCLUSTERED INDEX [IX_URL] ON [url_table] ([url]);

DECLARE @Domain VARCHAR(100) = 'https://mywebsite.domain.com/'
DECLARE @Path VARCHAR(100) = 'as/product/'
DECLARE @Product VARCHAR(20) = '4'
DECLARE @LikeAll VARCHAR(100) = @Domain + @Path + @Product + '/%'

SELECT url
FROM url_table
WHERE url LIKE @LikeAll

Second solution
The second option is a bit tricky but very effective.
You said protocol and domain of url are fixed and you need to search for something after.
The following is a technique, you can fine tune it to match your needs.
The idea is to add a virtual (computed) column to your url table and then to add an index on it.
This will greatly reduce index dimensions and improve query performances at the cost of a very little overhead of computing in insert/update

ALTER TABLE url_table ADD path AS (SUBSTRING(url, 30, 4000));
CREATE NONCLUSTERED INDEX [IX_PATH] ON [url_table] ([path]);

DECLARE @Domain VARCHAR(100) = 'https://mywebsite.domain.com/'
DECLARE @Path VARCHAR(100) = 'as/product/'
DECLARE @Product VARCHAR(20) = '4'
DECLARE @LikeMid VARCHAR(100) = @Path + @Product + '/%' 

select @Domain + _path -- pay attention!!
FROM url_table
WHERE url LIKE @SrcAll

Please take note, we are selecting @Domain + _path instead of url, to avoid table access and work only on index data.

If you need other columns in url_table your best option is

declare @l table (id int primary key)
insert  into @l
select id 
from url_table 
where _path like @LikeMid

select url
from url_table
where id in (select id from @l)

very fast

Third solution
This is a variant of second one.
In your example data I see the path contains /product/ followed by a number and I'm assuming it as the product number. Maybe you can consider the following

ALTER TABLE url_table ADD _product AS (cast(substring(url,nullif(CHARINDEX('/product/',url,29)+9,9), CHARINDEX('/',url,nullif(CHARINDEX('/product/',url,29)+9,9))-nullif(CHARINDEX('/product/',url,29)+9,9)) as bigint));
CREATE NONCLUSTERED INDEX [IX_PRODUCT] ON [url] ([_product]);

select id, url
from url_table 
where _product = 4

This will produce a computed column with product number of type integer, the index will be only 500Mb and queries on integers will be super fast.
Also the overhead to select all columns from url_table is very very little so you can SELECT * with almost no performances issues.

P.S. You can drop your FullText index and save space and resources..

Engdahl answered 25/11, 2016 at 11:0 Comment(0)
E
1
SELECT * FROM myTable WHERE URL LIKE 'https://mywebsite.domain.com/as/product/4%'
Elli answered 29/11, 2016 at 5:44 Comment(1)
need additional / before %Engdahl

© 2022 - 2024 — McMap. All rights reserved.