Retrieving billions of rows from remote server?

8

7

I am trying to retrieve around 200 billion rows from a remote SQL Server. To optimize this, I have limited my query to filter on a single indexed column and to select only a subset of columns, so that it looks like this:

SELECT ColA, ColB, ColC FROM <Database> WHERE RecordDate BETWEEN '' AND ''

But it seems that unless I limit the query to a time window of a few hours, it always fails with the following error:

OLE DB provider "SQLNCLI10" for linked server "<>" returned message "Query timeout expired".
Msg 7399, Level 16, State 1, Server M<, Line 1
The OLE DB provider "SQLNCLI10" for linked server "<>" reported an error. Execution terminated by the provider because a resource limit was reached.
Msg 7421, Level 16, State 2, Server <>, Line 1
Cannot fetch the rowset from OLE DB provider "SQLNCLI10" for linked server "<>". 

The timeout is probably hit simply because of how long the query takes to execute. As I do not have control over the server, I was wondering if there is a good way of retrieving this data beyond the simple SELECT I am using. Are there any SQL Server specific tricks I can use? Perhaps telling the remote server to paginate the data instead of my issuing multiple queries, or something else? Any suggestions on how I could improve this?

Estray answered 28/7, 2011 at 21:55 Comment(0)
13

This is more the kind of job SSIS is suited for. Even a simple flow like ReadFromOleDbSource -> WriteToOleDbDestination would handle this, creating the necessary batching for you.

Adaiha answered 28/7, 2011 at 22:15 Comment(1)
+1, I believe that this is a more suited answer, since trying to do a ROW_NUMBER over 200 billion rows is gonna be a really difficult thing to accomplish. – Daumier
5

Why read 200 Billion rows all at once?

You should page them, reading say a few thousand rows at a time.

Even if you do genuinely need to read all 200 Billion rows you should still consider using paging to break up the read into shorter queries - that way if a failure happens you just continue reading where you left off.

See efficient way to implement paging for at least one method of implementing paging using ROW_NUMBER
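
As a rough sketch of that approach (the table name dbo.Records, the date bounds, and the page size below are placeholders I've invented; only the column names come from the question):

DECLARE @PageSize BIGINT = 10000;
DECLARE @PageNum  BIGINT = 1;
DECLARE @StartDate DATETIME = '20110101';
DECLARE @EndDate   DATETIME = '20110108';

SELECT ColA, ColB, ColC
FROM (
    -- Number the rows by the indexed column so each page is deterministic
    SELECT ColA, ColB, ColC,
           ROW_NUMBER() OVER (ORDER BY RecordDate) AS rn
    FROM dbo.Records
    WHERE RecordDate BETWEEN @StartDate AND @EndDate
) AS numbered
WHERE rn BETWEEN (@PageNum - 1) * @PageSize + 1 AND @PageNum * @PageSize;

Increment @PageNum and rerun until a page comes back with fewer than @PageSize rows. Keeping the date window narrow matters here, since numbering a huge range gets expensive for later pages.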

If you are doing data analysis then I suspect you are either using the wrong storage (SQL Server isn't really designed for processing of large data sets), or you need to alter your queries so that the analysis is done on the Server using SQL.

Update: I think the last paragraph was somewhat misinterpreted.

Storage in SQL Server is primarily designed for online transaction processing (OLTP) - efficient querying of massive datasets in massively concurrent environments (for example reading / updating a single customer record in a database of billions, at the same time that thousands of other users are doing the same for other records). Typically the goal is to minimise the amount of data read, reducing the amount of IO needed and also reducing contention.

The analysis you are talking about is almost the exact opposite of this - a single client actively trying to read pretty much all records in order to perform some statistical analysis.

Yes SQL Server will manage this, but you have to bear in mind that it is optimised for a completely different scenario. For example data is read from disk a page (8 KB) at a time, despite the fact that your statistical processing is probably only based on 2 or 3 columns. Depending on row density and column width you may only be using a tiny fraction of the data stored on an 8 KB page - most of the data that SQL Server had to read and allocate memory for wasn't even used. (Remember that SQL Server also had to lock that page to prevent other users from messing with the data while it was being read).

If you are serious about processing / analysis of massive datasets then there are storage formats that are optimised for exactly this sort of thing - SQL Server also has an add-on service called Microsoft Analysis Services that adds additional online analytical processing (OLAP) and data mining capabilities, using storage modes more suited to this sort of processing.

Adeliaadelice answered 28/7, 2011 at 22:1 Comment(7)
Since when is SQL Server not designed for large data sets? I've managed billions of rows in SQL Server and didn't face any challenges I wouldn't have faced in other RDBMSs. – Ipswich
@Aaron -- SQL Server, especially with the add-ons that come with the Enterprise edition, is a superb platform for this sort of analysis. – Hindermost
@James I think we are in agreement. Kragen suggested that SQL Server isn't designed for this much data. Must be thinking of ancient versions. – Ipswich
@Kragen, SQL Server can handle billions of rows just fine and it has been able to do so for a long time now. Back in SQL 2000 (it was 2003 at the time) I was managing billions of rows in a single table. SQL Server 6.5 might have had issues with this much data load, but that was a VERY long time ago. – Atoll
Actually the more I think about this, the more I feel that the statement that I am using the wrong storage is not very accurate. I have handled 160 billion rows with MySQL without problems and close to 180 billion in SQL Server 2005 before. It was just that before, I had the database on disk and never had to transfer it remotely, so I never really had experience issuing such large queries over the network. – Estray
@James / everyone - SQL Server is designed for querying of large data sets, not processing large data sets - perhaps I should have emphasised this! :-) I've updated my answer to clarify what I meant. – Adeliaadelice
@Kragen -- I think we agree really -- it's the Analysis Services add-ons that really make it rock! – Hindermost
4

If I were trying to pull that much data at once, I would personally use a data extraction tool such as BCP to get the data into a local file before trying to manipulate it.

http://msdn.microsoft.com/en-us/library/ms162802.aspx
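
For example, a hedged sketch of such an export (the server, database, table, and file names are all placeholders; only the column names come from the question):

bcp "SELECT ColA, ColB, ColC FROM MyDatabase.dbo.Records WHERE RecordDate BETWEEN '20110101' AND '20110102'" queryout records_batch1.dat -n -S RemoteServer -T

Here queryout exports the result of an arbitrary query, -n uses native format, -S names the server, and -T uses Windows authentication (swap in -U / -P for a SQL login). Running one export per date window keeps each transfer short enough to stay under the timeout.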

Dyeline answered 28/7, 2011 at 22:2 Comment(0)
2

This isn't a SQL Server-specific answer, but even when the RDBMS supports server-side cursors, it's considered poor form to use them. Doing so means that you are consuming resources on the server even while it is just waiting for you to request more data.

Instead you should reformulate your query usage so that the server can transmit the entire result set as soon as it can, and then completely forget about you and your query to make way for the next one. When the result set is too large to process all in one go, keep track of the last row returned by the current batch so that you can fetch the next batch starting at that position.
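
A minimal sketch of that batching pattern, assuming the question's indexed RecordDate column and an invented table name (if RecordDate isn't unique, add a unique tiebreaker column to the WHERE and ORDER BY):

-- @LastDate holds the RecordDate of the last row from the previous batch
DECLARE @LastDate DATETIME = '20110101';

SELECT TOP (10000) ColA, ColB, ColC, RecordDate
FROM dbo.Records
WHERE RecordDate > @LastDate
ORDER BY RecordDate;

-- After each batch, set @LastDate to the max RecordDate returned and repeat
-- until a batch comes back empty.

Unlike ROW_NUMBER paging, each batch here seeks directly to its starting position on the index, so the cost per batch stays roughly constant.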

Tahitian answered 28/7, 2011 at 22:1 Comment(2)
As he only has ODBC access to the server it's difficult to see how he could use anything but a "SELECT". – Hindermost
Server-side cursors, in RDBMSs that support them, are a property of the connection and are completely independent of the SQL statement used to generate the result set. – Tahitian
1

Odds are the remote server has the "Remote Query Timeout" set. How long does it take for the query to fail?

Atoll answered 29/7, 2011 at 3:56 Comment(2)
From SSMS, it is about 10 minutes. Using sqlcmd it is about 3 hours. – Estray
The 10 minute timeout really makes me think it's the remote timeout setting, as 10 minutes is the default. This setting would need to be changed on the remote side if that's the problem. Not sure why sqlcmd would run for 3 hours and then die. Might be doing something else before running: plan generation, waiting on something else, etc. – Atoll
1

Just ran into the same problem; I also got the message at 10:01 after running the query.

Check this link. There's a remote query timeout setting under Connections that's set to 600 seconds by default; you need to change it to zero (unlimited) or whatever value you think is right.
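
If you have the necessary rights on the remote server, the same setting can be checked and changed with T-SQL rather than through the SSMS dialog (this is the standard sp_configure option; 600 is the 10-minute default mentioned above):

-- Check the current value
SELECT name, value_in_use
FROM sys.configurations
WHERE name LIKE 'remote query timeout%';

-- 0 = no timeout; requires ALTER SETTINGS permission
EXEC sp_configure 'remote query timeout', 0;
RECONFIGURE;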

Hermaphrodite answered 6/12, 2011 at 17:49 Comment(0)
0

Try changing the remote query timeout property on the server.

To do that, open SSMS, connect to the server, right-click the server's name in Object Explorer, select Properties -> Connections, and change the value in the "Remote query timeout (in seconds, 0 = no timeout)" text box.


Roney answered 12/8, 2015 at 11:10 Comment(0)
0

Same issue encountered here. I took the remote query and ran it natively and got the error "Arithmetic overflow error converting expression to data type datetime".

I assume the linked server reported that error as "...resource limit was reached."

Correction: the solution was to increase the remote query timeout.

Slant answered 13/6 at 21:3 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Magistral
