Select 1000 distinct names from 100 million records via standard sql

Asked 29/3, 2013 at 11:29 Answered 29/3, 2013 at 11:53

Solved sql sql-server standards ansi-sql

I have a table tb_FirstName with one field FirstName. The table has 100 million non null records with lots of repetitions e.g. John occurs 2 million times. The distinct count of FirstName is over 2 million.

How do I select 1000 distinct names as quickly as possible using standard sql?

I'm currently using the following but this is

tSQL

Maybe not as efficient as it could be.

SELECT x.FirstName
FROM (
    SELECT  FirstName,
            rnk = RANK() OVER (ORDER BY Firstname)
    FROM    WHData.dbo.tb_DimUserAccount A
    GROUP BY FirstName
    ) x
WHERE rnk <=1000

Fasto answered 29/3, 2013 at 11:29 Comment(5)

You might find this usefull #595623 but may I ask what's the intention for having a table tb_FirstName full of duplicates? – Sidras 29/3, 2013 at 11:38

@Sidras this question is a slimmed down example of the reality - the table actually has 30 fields but no need to include that in the question – Fasto 29/3, 2013 at 12:10

100MM names? sounds like a spam/marketing database of personal info... not sure if to help you or not – Jempty 29/3, 2013 at 13:42

@RemusRusanu we have lots of users over past 10+yrs. Not spam and not marketing. Just analysis. – Fasto 29/3, 2013 at 13:49

For analysis you should extract all the distinct names into a separate table, once, and then join at will with it. – Jempty 29/3, 2013 at 13:51

Seems like you could use TOP 1000 with DISTINCT:

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
ORDER BY FirstName

Condensed SQL Fiddle Demo

Buyers answered 29/3, 2013 at 11:40 Comment(3)

@Fasto -- np, glad we could help! – Buyers 29/3, 2013 at 13:12

FYI: does not work on columns that can't be compared, like XML or TEXT. – Helaina 22/1, 2019 at 14:54

Works perfectly! Although you have a typo - DISTINCT is spelled wrong ;-) – Ahead 4/3 at 7:25

Try this

SELECT TOP 1000 FirstName FROM 
(SELECT 
ROW_NUMBER() OVER(PARTITION BY FirstName ORDER BY FirstName) NO,
 FirstName FROM WHData.dbo.tb_DimUserAccount )
  AS T1 WHERE no =1

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount ORDER BY FirstName

Specter answered 29/3, 2013 at 11:43 Comment(3)

The second approach is very appealing. It is neat and accurate. Plus it is very readable. – Lippizaner 29/3, 2013 at 11:48

OP asks for standard SQL, while ROW_NUMBER() OVER(PARTITION... looks like tSQL feature. – Skid 29/3, 2013 at 12:10

Second is ANSI standard me believes. – Coated 29/3, 2013 at 12:53

Make sure you have an index defined on FirstName.

SELECT TOP 1000 FirstName
FROM (SELECT DISTINCT FirstName
FROM dbo.tb_DimUserAccount) N
ORDER BY FirstName

Lifesize answered 29/3, 2013 at 11:39 Comment(2)

No performance improvement because full table scan is being done in sub query. This is same as post. – Edmon 29/3, 2013 at 11:43

True, full table scan still required but performance slightly better performance than post – Lifesize 29/3, 2013 at 11:59

You need the data after sorting the results on FirstName fields.

It requires full table scan if Index is not created. If Index is created on FirstName then Unique Index scan can improve the time.

Edmon answered 29/3, 2013 at 11:41 Comment(3)

2+ John's is not a rare scenario – Lippizaner 29/3, 2013 at 11:47

We can't place the clustered Index, so Unique Index will have better performance rather than a normal non clustered index. – Edmon 29/3, 2013 at 11:52

Unique index is ok only if the target column contains only unique elements. According to the name of the column FirstName it is highly unlikely this to be true. – Lippizaner 29/3, 2013 at 11:55

Option with GROUP BY clause

SELECT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
GROUP BY FirstName
ORDER BY FirstName

Pricilla answered 29/3, 2013 at 11:53 Comment(0)

Recommended topics

Hot tags