SQL Query - Need to improve performance
Asked Answered
H

8

8

I have a data load scenario where I create dynamic sql query to pull data and cache in our service. There is 1 table that contains all product data : ProductHistory (47 columns, 200,000 records + and will keep growing)

What I need: Get the latest products by using the maximum id, maximum version and maximum changeid.

First Attempt:

SELECT distinct Product.* FROM ProductHistory product 
WHERE  product.version = 
(SELECT max(version) from ProductHistory p2 where product.Id = p2.Id 
  and product.changeId = 
(SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))

This took more than 2.51 minutes.

Other Failed Attempt:

select distinct product.* from ProductHistory product 
where CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal) = 
(select MAX(CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal)) from ProductHistory p2 
where product.Id = p2.Id)

It basically uses the same principle as when you order dates, concatenating the numbers ordered by relevance.

For example 11 Jun 2007 = 20070711
And in our case: Id = 4 , version = 127, changeid = 32   => 40127032
The zeros are there not to mix up the 3 different ids

But this one takes 3.10 minutes !!! :(

So, I basically need a way to make my first attempt query better by any chance. I was also wondering with such amount of data, is this the best speed of retrieval that I should expect ?

  1. I ran sp_helpindex ProductHistory and found out the indexes as below :

    PK_ProductHistoryNew - clustered, unique, primary key located on PRIMARY- Id, Version

  2. I wrapped the first query in a SP but still no change.

So, wondering by what other means we can improve the performance of this operation ?

Thanks, Mani p.s : I am just running these queries in SQL management stuido to see the time.

Husbandry answered 20/7, 2011 at 9:55 Comment(2)
The first query take 2.51 minutes or seconds?Beatrizbeattie
Thanks Ben. Edited to correct the time. It is in minutes.Husbandry
C
6

Run the query from Sql Server Management Studio and look at the query plan to see where the bottle neck is. Any place you see a "table scan" or "index scan" it has to go through all data to find what it is looking for. If you create appropriate indexes that can be used for these operations it should increase performance.

Celloidin answered 20/7, 2011 at 10:2 Comment(3)
+1: Another thing to note is whether you need all of the columns from the Product table in your SELECT statement.Anderegg
I am basically a .NET component dev and new to SQL. Any more light on Anders point would help me to debug this further. I have mentioned the indexes that is currently on the table. Any idea what more I could do specifically ?Husbandry
this is great, it runs faster.Obe
V
4

Some things I see:

  • Is the DISTINCT necessary? If you do a DISTINCT * it's unlikely to have any benefit but it will have overhead to check for duplicates across all fields.
  • Instead of having two subselects in your WHERE clause, JOIN to a derived table. This should process only once. I suspect your WHERE clause is processing multiple times.

<-- -->

SELECT Product.* 
FROM ProductHistory product 
INNER JOIN ( SELECT P.Id, 
                    MAX(p.version) as [MaxVer], 
                    MAX(p.Changeid) as [MaxChange]
             FROM Product p
             GROUP BY p.ID) SubQ
    ON SubQ.ID = product.ID
    AND SubQ.MaxChange = Product.ChangeID
    AND SubQ.MaxVer = Product.Version

You should also have an index on Id, Version, ChangeID for this.

Vaishnava answered 20/7, 2011 at 10:19 Comment(1)
@Mani: I would suggest you try this query (and also this with the Changeid removed, both lines: MAX(p.Changeid) as [MaxChange] and AND SubQ.MaxChange = Product.ChangeID). Since you already have a PK of (Id, Version) it will be faster. And it's possible you don't really need the complex approach you try.Reciprocity
J
1

Well, storing everything in the table is not the way to do. Better is to store the last version in a table and use another one (with the same structure) for the history (as I guess you are more interested in current products than old ones). And concept issues will create many workarounds...

Also, do not use DISTINCT because it often hides issues in the query (usually, if duplicates are retrieved, it means you can optimize better).

Now, the best part: how to resolve your problem? I guess you should use the grouping principle giving something like this:

SELECT max(id), max(version), max(changeid) 
  FROM ProductHistory p
  WHERE <filter if necessary for old products or anything else>
  GROUP BY version, changeid
  HAVING version = max(version)
     AND changeid = max(changeid)
     AND id = max(id)

But, if I look at your PK, I'm surprised, the changeid is not relevant as you should deal with the id and version only...

I am not sure if my request is fully correct because I can not test but I guess you can do some testings.

Jato answered 20/7, 2011 at 10:16 Comment(1)
The query you provide is not equivalent to the OP's query.Reciprocity
T
0

I think you need an index on (Id, changeId, version) for this query. Please provide the table definition, the indexes on the table now and the query plan for your query.

Throb answered 20/7, 2011 at 10:20 Comment(0)
A
0

This is getting a bit funky, but I wonder if partitioning would work:

  SELECT Id
  FROM (
      SELECT Id,
      MAX(version) OVER (PARTITION BY changeId) max_version
      FROM ProductHistory
  ) s
  where version = s.max_version
Arlinda answered 20/7, 2011 at 11:21 Comment(0)
I
0

I have a feeling this query will take longer as they number of rows increases, but it's worth a shot:

SELECT * FROM 
(
SELECT Col1, Col2, Col3,
ROW_NUMBER() OVER (PARTITION BY ProductHistory.Id ORDER BY Version DESC, ChangeID DESC) AS RowNumber 
FROM ProductHistory
)
WHERE RowNumber = 1
Incognito answered 20/7, 2011 at 12:45 Comment(0)
O
0

Try this CTE, it should be the fastest option possible and you probably won't even need indexes to get great speed:

with mysuperfastcte as (
 select product.*, 
 row_number() over (partition by id order by version desc) as versionorder,
 row_number() over (partition by id order by changeid desc) as changeorder 
 from ProductHistory as product
)
select distinct product.*
from mysuperfastcte
where versionorder = 1
and changeorder = 1;

NB. I think you may have a bug at this point in your code so please confirm and double check the results you are expecting with my code:

  and product.changeId =  (SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))
  • you are trying to get max(changeid) using a correlated subquery but you are also joining on changeid - that is the same as just getting every row. Presumably you didn't intend that?

Also - obviously reduce the number of columns you are returning to just those you need and then run the following before running your query and check the messages output:

SET STATISTICS IO ON

Look for tables with high logical reads and figure out where an index will help you.

Hint: If my code works for you then depending on the columns you need you could do:

create index ix1 (id, version desc) include (changeid, .... ) on ProductHistory.

I hope this helps!

Ordinal answered 4/9, 2012 at 2:40 Comment(0)
S
-2

Generaly speaking, select max() needs to sort through the whole table. And you are doing it twice

SELECT TOP 1 is way faster, but you need to make sure your index is right and you have a correct ORDER BY. See if you can play with that.

Schreibe answered 20/7, 2011 at 10:15 Comment(3)
-1 - SELECT MAX() can almost always use an index if one exists. Implying it needs a table scan is inaccurate.Vaishnava
I was not saying or implying that select max() will use a table scan, only that it will sort through the whole table, or index if that exists.Schreibe
SELECT TOP 1 won't be any faster, either, since you either rely on the unspecified order of the table or you have an ORDER BY which still requires a SORT in the background.Vaishnava

© 2022 - 2024 — McMap. All rights reserved.