Select 1000 distinct names from 100 million records via standard sql
Asked Answered
F

5

7

I have a table tb_FirstName with one field FirstName. The table has 100 million non null records with lots of repetitions e.g. John occurs 2 million times. The distinct count of FirstName is over 2 million.

How do I select 1000 distinct names as quickly as possible using standard sql?

I'm currently using the following but this is

  • tSQL
  • Maybe not as efficient as it could be.

    SELECT x.FirstName
    FROM (
        SELECT  FirstName,
                rnk = RANK() OVER (ORDER BY Firstname)
        FROM    WHData.dbo.tb_DimUserAccount A
        GROUP BY FirstName
        ) x
    WHERE rnk <=1000
    
Fasto answered 29/3, 2013 at 11:29 Comment(5)
You might find this usefull #595623 but may I ask what's the intention for having a table tb_FirstName full of duplicates?Sidras
@Sidras this question is a slimmed down example of the reality - the table actually has 30 fields but no need to include that in the questionFasto
100MM names? sounds like a spam/marketing database of personal info... not sure if to help you or notJempty
@RemusRusanu we have lots of users over past 10+yrs. Not spam and not marketing. Just analysis.Fasto
For analysis you should extract all the distinct names into a separate table, once, and then join at will with it.Jempty
B
12

Seems like you could use TOP 1000 with DISTINCT:

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
ORDER BY FirstName

Condensed SQL Fiddle Demo

Buyers answered 29/3, 2013 at 11:40 Comment(3)
@Fasto -- np, glad we could help!Buyers
FYI: does not work on columns that can't be compared, like XML or TEXT.Helaina
Works perfectly! Although you have a typo - DISTINCT is spelled wrong ;-)Ahead
S
3

Try this

SELECT TOP 1000 FirstName FROM 
(SELECT 
ROW_NUMBER() OVER(PARTITION BY FirstName ORDER BY FirstName) NO,
 FirstName FROM WHData.dbo.tb_DimUserAccount )
  AS T1 WHERE no =1 

or

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount ORDER BY FirstName
Specter answered 29/3, 2013 at 11:43 Comment(3)
The second approach is very appealing. It is neat and accurate. Plus it is very readable.Lippizaner
OP asks for standard SQL, while ROW_NUMBER() OVER(PARTITION... looks like tSQL feature.Skid
Second is ANSI standard me believes.Coated
L
2

Make sure you have an index defined on FirstName.

SELECT TOP 1000 FirstName
FROM (SELECT DISTINCT FirstName
FROM dbo.tb_DimUserAccount) N
ORDER BY FirstName
Lifesize answered 29/3, 2013 at 11:39 Comment(2)
No performance improvement because full table scan is being done in sub query. This is same as post.Edmon
True, full table scan still required but performance slightly better performance than postLifesize
E
2

You need the data after sorting the results on FirstName fields.

It requires full table scan if Index is not created. If Index is created on FirstName then Unique Index scan can improve the time.

Edmon answered 29/3, 2013 at 11:41 Comment(3)
2+ John's is not a rare scenarioLippizaner
We can't place the clustered Index, so Unique Index will have better performance rather than a normal non clustered index.Edmon
Unique index is ok only if the target column contains only unique elements. According to the name of the column FirstName it is highly unlikely this to be true.Lippizaner
P
1

Option with GROUP BY clause

SELECT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
GROUP BY FirstName
ORDER BY FirstName
Pricilla answered 29/3, 2013 at 11:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.