Need a row count after SELECT statement: what's the optimal SQL approach?
Asked Answered

I'm trying to select a column from a single table (no joins), and I need the count of the number of rows, ideally before I begin retrieving the rows. I have come up with two approaches that provide the information I need.

Approach 1:

SELECT COUNT( my_table.my_col ) AS row_count
  FROM my_table
 WHERE my_table.foo = 'bar'

Then

SELECT my_table.my_col
  FROM my_table
 WHERE my_table.foo = 'bar'

Or Approach 2

SELECT my_table.my_col, ( SELECT COUNT ( my_table.my_col )
                            FROM my_table
                           WHERE my_table.foo = 'bar' ) AS row_count
  FROM my_table
 WHERE my_table.foo = 'bar'

I am doing this because my SQL driver (SQL Native Client 9.0) does not allow me to use SQLRowCount on a SELECT statement but I need to know the number of rows in my result in order to allocate an array before assigning information to it. The use of a dynamically allocated container is, unfortunately, not an option in this area of my program.

I am concerned that the following scenario might occur:

  • The SELECT for the count executes.
  • Another statement executes, adding or removing a row.
  • The SELECT for the data executes, and suddenly the array is the wrong size.
    In the worst case, this will attempt to write data beyond the array's limits and crash my program.

Does Approach 2 prohibit this issue?

Also, will one of the two approaches be faster? If so, which?

Finally, is there a better approach that I should consider (perhaps a way to instruct the driver to return the number of rows in a SELECT result using SQLRowCount?)

For those that asked, I am using native C++ with the aforementioned SQL driver (provided by Microsoft).

Sherrylsherurd answered 28/10, 2008 at 15:40 Comment(1)
What are you doing with this data that you require both the raw data and a row count? If you need all the raw data (which is what you are selecting) can't you count it as you read it in? If you don't need all the raw data, then don't select it. Is the count just for pagination purposes?Unfounded

There are only two ways to be 100% certain that the COUNT(*) and the actual query will give consistent results:

  • Combine the COUNT(*) with the query, as in your Approach 2. I recommend the form you show in your example, not the correlated-subquery form shown in the comment from kogus.
  • Use two queries, as in your Approach 1, after starting a transaction in SNAPSHOT or SERIALIZABLE isolation level.

Using one of those isolation levels is important because any other isolation level allows new rows created by other clients to become visible in your current transaction. Read the MSDN documentation on SET TRANSACTION ISOLATION LEVEL for more details.
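If it helps to see the single-statement form concretely, here is a small sketch of Approach 2 run against SQLite through Python's sqlite3 module (the table and sample data are invented; the question targets SQL Server, but the shape of the query is the same):

```python
import sqlite3

# Throwaway in-memory database standing in for my_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_col TEXT, foo TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [("a", "bar"), ("b", "bar"), ("c", "baz")])

# Approach 2: the count is computed in the same statement as the data,
# so every row carries a row_count consistent with the rows returned.
rows = conn.execute("""
    SELECT my_table.my_col,
           (SELECT COUNT(my_table.my_col)
              FROM my_table
             WHERE my_table.foo = 'bar') AS row_count
      FROM my_table
     WHERE my_table.foo = 'bar'
""").fetchall()

print(sorted(rows))  # [('a', 2), ('b', 2)]
```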

Premier answered 28/10, 2008 at 17:21 Comment(2)
Without asking, this addressed another curiosity I had in your first bullet: obviously, I would prefer not to have the count query executed repeatedly if it can be optimized out.Sherrylsherurd
Right; I'm not an expert on the MS SQL Server optimizer, but I'd be surprised if it could optimize out that kind of correlated subquery.Premier

If you're using SQL Server, after your query you can select the @@ROWCOUNT function (or, if your result set might have more than 2 billion rows, the ROWCOUNT_BIG() function). This returns the number of rows selected by the previous statement, or the number of rows affected by an insert/update/delete statement.

SELECT my_table.my_col
  FROM my_table
 WHERE my_table.foo = 'bar'

SELECT @@Rowcount

Or, if you want the row count included in the result set, similar to Approach #2, you can use the OVER clause.

SELECT my_table.my_col,
    count(*) OVER(PARTITION BY my_table.foo) AS 'Count'
  FROM my_table
 WHERE my_table.foo = 'bar'

Using the OVER clause will have much better performance than using a subquery to get the row count. Using @@ROWCOUNT will have the best performance, because there won't be any query cost for the SELECT @@ROWCOUNT statement.

Update in response to comment: The example I gave would give the # of rows in the partition - defined in this case by "PARTITION BY my_table.foo". The value of the column in each row is the # of rows with the same value of my_table.foo. Since your example query had the clause "WHERE my_table.foo = 'bar'", all rows in the result set will have the same value of my_table.foo, and therefore the value in the column will be the same for all rows and equal (in this case) to the # of rows in the query.

Here is a better/simpler example of how to include a column in each row that is the total # of rows in the result set. Simply omit the optional PARTITION BY clause.

SELECT my_table.my_col, count(*) OVER() AS 'Count'
  FROM my_table
 WHERE my_table.foo = 'bar'
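As an aside, COUNT(*) OVER() is not limited to SQL Server; here is a minimal sketch of the same query against SQLite (3.25 or later, which added window functions) via Python's sqlite3 module, with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration
conn.execute("CREATE TABLE my_table (my_col TEXT, foo TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [("a", "bar"), ("b", "bar"), ("c", "baz")])

# COUNT(*) OVER() with no PARTITION BY attaches the total size of the
# result set to every row, all within a single statement.
rows = conn.execute("""
    SELECT my_table.my_col, COUNT(*) OVER() AS row_count
      FROM my_table
     WHERE my_table.foo = 'bar'
""").fetchall()

print(sorted(rows))  # [('a', 2), ('b', 2)]
```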
Defiant answered 28/10, 2008 at 17:23 Comment(2)
I would prefer to have the result in my result set. However, it does not appear that using OVER as you've described works when I try to run your query on my table in SQL.Sherrylsherurd
Note: If you select TOP 10 rows only, the OVER clause will not be 10. It will be the full result set as if you didn't have the TOP 10 on the query.Riproaring

Approach 2 will always return a count that matches your result set.

I suggest you link the sub-query to your outer query though, to guarantee that the condition on your count matches the condition on the dataset.

SELECT mt.my_row,
       (SELECT COUNT(mt2.my_row)
          FROM my_table mt2
         WHERE mt2.foo = mt.foo) AS cnt
  FROM my_table mt
 WHERE mt.foo = 'bar';
Stringer answered 28/10, 2008 at 16:33 Comment(2)
That might make it a correlated subquery, which means it'll probably execute the subquery for each row of the result set. A non-correlated subquery may be optimized so it only needs to be run once.Premier
Very interesting; I didn't know that. In that case, I'd suggest using a parameter shared by the main query and the subquery.Stringer

If you're concerned that the number of rows that meet the condition may change in the few milliseconds between execution of the query and retrieval of the results, you could/should execute the queries inside a transaction:

BEGIN TRAN bogus

SELECT COUNT( my_table.my_col ) AS row_count
FROM my_table
WHERE my_table.foo = 'bar'

SELECT my_table.my_col
FROM my_table
WHERE my_table.foo = 'bar'
ROLLBACK TRAN bogus

This would return the correct values, always.

Furthermore, if you're using SQL Server, you can use @@ROWCOUNT to get the number of rows affected by the last statement and redirect the output of the real query to a temp table or table variable, so you can return everything together, with no need for a transaction:

DECLARE @dummy INT

SELECT my_table.my_col
INTO #temp_table
FROM my_table
WHERE my_table.foo = 'bar'

SET @dummy=@@ROWCOUNT
SELECT @dummy, * FROM #temp_table
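The same materialize-then-count idea can be sketched in SQLite via Python's sqlite3 (table and data invented for illustration; SQL Server's SELECT ... INTO #temp_table becomes CREATE TEMP TABLE ... AS SELECT, and the count is taken from the frozen copy instead of @@ROWCOUNT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical data, for illustration
conn.execute("CREATE TABLE my_table (my_col TEXT, foo TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [("a", "bar"), ("b", "bar"), ("c", "baz")])

# Materialize the result first, then count it: the snapshot in the
# temp table can no longer change underneath us.
conn.execute("""
    CREATE TEMP TABLE temp_result AS
    SELECT my_table.my_col
      FROM my_table
     WHERE my_table.foo = 'bar'
""")
count = conn.execute("SELECT COUNT(*) FROM temp_result").fetchone()[0]
data = conn.execute("SELECT my_col FROM temp_result").fetchall()

print(count, sorted(data))  # 2 [('a',), ('b',)]
```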
Chloropicrin answered 28/10, 2008 at 16:57 Comment(1)
The count could change if you’re using READ COMMITTED, right? Or does SQL Server over ODBC mode do transactions differently than T-SQL somehow?Mcadams

Here are some ideas:

  • Go with Approach #1 and resize the array to hold additional results, or use a type that automatically resizes as necessary (you don't mention what language you are using, so I can't be more specific).
  • You could execute both statements in Approach #1 within a transaction to guarantee the counts are the same both times if your database supports this.
  • I'm not sure what you are doing with the data but if it is possible to process the results without storing all of them first this might be the best method.
Reseau answered 28/10, 2008 at 15:50 Comment(0)

If you are really concerned that your row count will change between the select count and the select statement, why not select your rows into a temp table first? That way, you know you will be in sync.

Drysalter answered 28/10, 2008 at 16:26 Comment(0)

Why don't you put your results into a vector? That way you don't have to know the size beforehand.

Admonition answered 28/10, 2008 at 15:44 Comment(11)
I should have mentioned that your solution occurred to me but I do not like the idea of copying my information from the database, to a vector, getting the row count, then copying everything in the vector into an array. I am not able to change the use of a simple array in this case.Sherrylsherurd
The result set from a database query could be huge - it may not even fit into memory - so it is inadvisable to force a result set into memory before you know whether it is going to fit.Retake
If the result set is so huge you should probably be paging it anyway.Admonition
The point is, you don't know how large the result set is yet. It could be huge or it could be empty. There are many cases where knowing the size will change how you handle the results (e.g. how much memory you allocate client size, if you do in-memory or paged handling, etc.).Retake
You only don't know the size of the result set if you have no business knowledge of the system or knowledge of the data to begin with. I would hope that the poster has some idea of what the data is like. Admittedly, it's dangerous to tie yourself into a limited solution like this though.Manteau
Don't confuse the /rowset/ with the /result set/ here. You should know the relative size of the /rowset/ (i.e. varchar(30), int, blob) but you can't expect to know the size of the result set (i.e. the number rows in the result of the query).Retake
For example, how big of an array do you need to hold a select of all the ids and titles of every question in StackOverflow today? Next week? Next year? Say an ID 4bytes and a title is 300bytes. 27K rows will take about 7.8M. 1M rows will take about 300M. The number of results changes over time.Retake
Again, it depends on the situation. It's very possible to have a situation where you have a very good idea of about how many rows you will get back. What if the array was to hold countries in the world? That number varies very slightly year from year. We don't know the specifics of this situation.Manteau
Even so, the underlying storage mechanism for a vector is (normally) an array which doubles in size when it runs out of space. Converting a Vector<T> to a T[] shouldn't be that big a deal. I doubt the performance difference of using a vector would be that great.Admonition
For things that would fit into memory, if you can't use a dynamic array (like he stated) and you don't know the /exact/ size, then you can't safely write any code to handle the incoming result set as a single chunk. If you don't even know if it will fit into memory, Vector<T> vs T[] is moot.Retake
This is a decent answer. The OP already stated they wanted the data to be stored in a raw, fixed array for later processing. A vector lets you do that very nicely by growing the array as necessary and then can expose a pointer (.data()) to the underlying array for that sort of processing. If the result set is too large, then as the vector exceeds that known threshold, processing can then be terminated. Of course, a quick check of the number of rows could save you needless work in this probably unlikely scenario.Maxim

You might want to think about a better pattern for dealing with data of this type.

No self-respecting SQL driver will tell you how many rows your query will return before returning the rows, because the answer might change (unless you use a transaction, which creates problems of its own).

The number of rows won't change - google for ACID and SQL.

Giess answered 28/10, 2008 at 16:14 Comment(2)
Good info on the ACID, not quite on the "self-respecting" comment. Many SQL drivers execute the query server side but don't return the entire result set in the same routine call (i.e. first call SQLExecute then SQLFetch to get the results). This is often hidden from the end-user (e.g. .NET Dataset)Retake
I believe the isolation principle within the ACID concept addresses my concerns about approach #2 sufficiently. If I can count on those results to be unaffected by queries on the database by other users, I am willing to use that approach. Thank you.Sherrylsherurd
IF (@@ROWCOUNT > 0)
BEGIN
SELECT my_table.my_col
  FROM my_table
 WHERE my_table.foo = 'bar'
END
Thrown answered 2/8, 2010 at 14:26 Comment(0)

Just adding this because this page is a top result on Google for this question. In SQLite I used this to get the row count:

WITH temptable AS
  (SELECT one, two
   FROM
     (SELECT one, two, date
      FROM table3
      WHERE dimension=0
      UNION ALL SELECT one, two, date
      FROM table2
      WHERE dimension=0
      UNION ALL SELECT one, two, date
      FROM table1
      WHERE dimension=0)
   ORDER BY date DESC)
SELECT *
FROM temptable
LEFT JOIN
  (SELECT count(*)/7 AS cnt,
          0 AS bonus
   FROM temptable) counter
WHERE 0 = counter.bonus
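A runnable sketch of that pattern via Python's sqlite3, with invented table1/table2/table3 data and the counter subquery simplified to a plain COUNT(*) joined onto every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up stand-ins for the answer's table1/table2/table3.
for name in ("table1", "table2", "table3"):
    conn.execute(f"CREATE TABLE {name} "
                 "(one TEXT, two TEXT, dimension INT, date TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)",
                 [("a", "x", 0, "2015-01-01"), ("b", "y", 1, "2015-01-02")])
conn.execute("INSERT INTO table2 VALUES ('c', 'z', 0, '2015-01-03')")

# The CTE gathers the rows once; joining a one-row COUNT(*) subquery
# attaches the total to every row of the result.
rows = conn.execute("""
    WITH temptable AS
      (SELECT one, two
       FROM (SELECT one, two, date FROM table3 WHERE dimension=0
             UNION ALL SELECT one, two, date FROM table2 WHERE dimension=0
             UNION ALL SELECT one, two, date FROM table1 WHERE dimension=0)
       ORDER BY date DESC)
    SELECT temptable.*, counter.cnt
      FROM temptable
      JOIN (SELECT COUNT(*) AS cnt FROM temptable) counter
""").fetchall()

print(sorted(rows))  # [('a', 'x', 2), ('c', 'z', 2)]
```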
Britishism answered 9/5, 2015 at 11:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.