Handling large SQL select queries / Read sql data in chunks

I'm using .Net 4.0 and SQL server 2008 R2.

I'm running a big SQL select query which returns millions of results and takes a long time to fully run.

Does anyone know how I can read some of the results returned by the query without having to wait for the whole query to complete?

In other words, I want to read the results in chunks of 10,000 records while the query is still running, and keep fetching the next chunks as they become available.

Niemeyer answered 20/4, 2011 at 6:52 Comment(4)
The main issue is probably the "returns millions of results" part - that is a database design smell.... do you really need to return millions of rows?? WHY?? Couldn't you process those on the database server and then return a few hundred results instead??Rhythmandblues
This database needs an archiving strategy...Bradytelic
@Bradytelic unless you are familiar with the needs of that system, the archiving claim is meaninglessAborn
I need all the millions of results in my program; I do not filter anything in the code. I will consider adding an identity column to my database and retrieving in chunks if no better solution comes upNiemeyer

It depends in part on whether the query itself is streaming, or whether it does lots of work in temporary tables then (finally) starts returning data. You can't do much in the second scenario except re-write the query; however, in the first case an iterator block would usually help, i.e.

public IEnumerable<Foo> GetData() {
     // not shown; building command etc
     using(var reader = cmd.ExecuteReader()) {
         while(reader.Read()) {
             Foo foo = // not shown; materialize Foo from reader
             yield return foo;
         }
     }
}

This is now a streaming iterator - you can foreach over it and it will retrieve records live from the incoming TDS data without buffering all the data first.

If you (perhaps wisely) don't want to write your own materialization code, there are tools that will do this for you - for example, LINQ-to-SQL's ExecuteQuery<T>(tsql, args) will do the above pain-free.
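To flesh that out, here is a rough, self-contained sketch of the same pattern with the "not shown" parts filled in; the connection string, the query text, and the Foo columns (Id, Name) are placeholder assumptions for illustration only:

using System.Collections.Generic;
using System.Data.SqlClient;

public class Foo
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class FooRepository
{
    public static IEnumerable<Foo> GetData(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Id, Name FROM dbo.Foo", conn))
        {
            conn.Open();
            // ExecuteReader starts yielding rows as soon as the server produces them;
            // nothing buffers the full result set in memory.
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    yield return new Foo
                    {
                        Id = reader.GetInt32(0),
                        Name = reader.GetString(1)
                    };
                }
            }
        }
    }
}

You'd then just foreach over FooRepository.GetData(connectionString) and process each row as it arrives.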

Aborn answered 20/4, 2011 at 7:5 Comment(8)
+1. Basically the reader starts when the server starts returning results. If it takes long, then either the query is inefficient (takes a long time before the first rows are available) or it is more complex and uses temporary tables, as the answer says.... and then at the end does a select - in which case you need to rewrite it.Zenobia
Using the ExecuteReader() method will cause the thread to block until the query finishes. I want to be able to retrieve some data while the query continues...Niemeyer
@Niemeyer no, it won't; unless (as stated) your query involves temporary tables etc. It is a reader; it will start returning data row-by-row as it becomes available in the TDS stream. The point about temp tables is that if your query itself takes a long time before it returns even the first row, it can't help with that - you need to fix your query.Aborn
@MarcGravell any idea why a query might take 1m in SSMS to return all rows, but take a little less than 1m in ADO just to start streaming the first row (and then continue to stream the rest) from the same query? It seems almost like SQL Server is writing all the results to a temp table before streaming them out.Housewife
@Housewife most often, that means either blocking or different SET options; SET options can make significant differences to how a query behavesAborn
@MarcGravell thanks for the quick reply. I've tried tweaking SET options and using OPTION(RECOMPILE) in addition; SSMS always returns the first of the results in 1-2s while ADO always takes around the full time that SSMS takes to return all of the results just to return the first result. I might go ask on DBA. Thanks againHousewife
@MarcGravell didn't realize buffered was true by default with Dapper until now ;) I'm sure I read through that before but it didn't occur to me.Housewife
@MarcGravell will this yield approach work well for, let's say, 10 GB of data? Could you suggest a good approach for such scenarios? Here is the actual case #53846765Outdoors

You'd need to use data paging.

SQL Server supports the TOP clause (SELECT TOP 10 a, b, c FROM d) and BETWEEN:

SELECT TOP 10000 a, b, c FROM d WHERE key_column BETWEEN X AND Y

With this, you'd be able to retrieve N rows at a time, do some partial processing, then load the next N rows, and so on.

This can be combined with a multithreaded solution: one thread retrieves the next chunk of results while another processes the data already loaded. A rough sketch of the chunked retrieval is shown below.
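Here is a minimal sketch of that paging loop, assuming the table has an increasing int identity column (called Id here); the table and column names (d, a, b, c) are placeholders matching the query above, not a definitive implementation:

using System.Data.SqlClient;

public static class ChunkedReader
{
    public static void ProcessInChunks(string connectionString, int chunkSize)
    {
        int lastId = 0;
        while (true)
        {
            int rowsInChunk = 0;
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                @"SELECT TOP (@chunk) Id, a, b, c
                  FROM d
                  WHERE Id > @lastId
                  ORDER BY Id", conn))
            {
                cmd.Parameters.AddWithValue("@chunk", chunkSize);
                cmd.Parameters.AddWithValue("@lastId", lastId);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // remember the highest key seen so the next round continues from here
                        lastId = reader.GetInt32(0);
                        rowsInChunk++;
                        // process the row here
                    }
                }
            }
            if (rowsInChunk < chunkSize)
                break; // last (partial) chunk reached
        }
    }
}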

Busterbustle answered 20/4, 2011 at 7:12 Comment(3)
BETWEEN can be tricky to use with paging unless you either introduce ROW_NUMBER, or you know that you have an unbroken monotonically increasing sequence (typically IDENTITY, but deleted/filtered data may cause issues)Aborn
You're right... but we don't know what kind of identifiers he's using right now. I've tried to give him some idea. BTW, it'd be better with ROW_NUMBER, wouldn't it?Tractor
Doing so will be much more complicated because I need to keep track of where I left off in the previous round/chunk. That may be problematic because I don't necessarily have an increasing sequence number on all the rows.Niemeyer

If you really have to process millions of records, why don't you load 10,000 each round, process them, and then load the next 10,000? If not, consider using the DBMS to filter the data before loading it, since performance on the database is much better than in your logic layer.

Or follow a lazy-load concept: load only the IDs first, and load the actual data only when you need it. A rough sketch of this follows.
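A minimal sketch of that lazy-load idea, assuming an int key column Id; the table and column names (d, a, b, c) are placeholders, not from the original answer:

using System.Collections.Generic;
using System.Data.SqlClient;

public static class LazyLoader
{
    // Step 1: fetch only the keys up front (cheap compared to the full rows).
    public static List<int> LoadIds(string connectionString)
    {
        var ids = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Id FROM d", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
            }
        }
        return ids;
    }

    // Step 2: fetch the full row only when it is actually needed.
    public static object[] LoadRow(string connectionString, int id)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT a, b, c FROM d WHERE Id = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", id);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                if (!reader.Read())
                    return null;
                var values = new object[reader.FieldCount];
                reader.GetValues(values);
                return values;
            }
        }
    }
}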

Bronder answered 20/4, 2011 at 7:6 Comment(1)
That does not solve the problem of the query taking a long time. Processing them in C# is easy - a datareader is fully capable of dealing with 100 million rows without a problem or delay. The processing on the database needs to be looked into. And sometimes you do need that many rows. I work with financial market data; an option dump for analysis of an instrument has about 10 million rows, and if I need them I need them.Zenobia
