Fast way to discover the row count of a table in PostgreSQL
E

8

259

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.

I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.

Is it possible to stop counting as soon as my constant value is surpassed?

I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.

Something like this:

SELECT text,count(*), percentual_calculus()  
FROM token  
GROUP BY text  
ORDER BY count DESC;
Epistemic answered 30/10, 2011 at 3:58 Comment(8)
Couldn't you just attempt to select the first n rows where n = constant + 1? If it returns more than your constant, you know you should use your constant, and if it doesn't you're good?Egomania
Do you have an identity or auto-increment field in the table?Tandy
@Sparky: Sequence backed PKs aren't guaranteed to be contiguous, rows can be deleted or there could be gaps caused by aborted transactions.Sneeze
Your update seems to contradict your original question... do you need to know the exact number of rows, or do you only need to know the exact number if it's below a threshold?Indemnify
@Flimzy Both. I need exact number if below the threshold, otherwise, I will use the threshold as limit.Epistemic
Okay, then I think my answer below may help... but as noted, it depends on the complexity of the query you're running whether the query planner's estimates will be close enough to be useful for you.Indemnify
@RenatoDinhaniConceição : Can you explain the Exact problem you are trying to solve? I think my answer below solves what you initially said was your issue. The update makes it look like you want count(*) as well as many other fields. It would help if you can explain exactly what you are trying to do. Thanks.Tharpe
Please show us the table definition. Also: I suspect that your DB is not normalised at all. BTW: text is a reserved word (datatype), though the parsers may accept it, it is still advisable to avoid reserved words as a column name.Folks
C
549

Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact, as seems to be the case here.

(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)

Exact count

Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.

SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate

Extremely fast:

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';

Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM run often enough - where "often enough" is defined by the level of write activity on your table.
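
For instance, refreshing the statistics by hand brings the estimate up to date (a sketch, using the same placeholder table name):

ANALYZE mytable;
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';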

Safer estimate

The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:

SELECT c.reltuples::bigint AS estimate
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relname = 'mytable'
AND    n.nspname = 'myschema';

The cast to bigint formats the real number nicely, especially for big counts.

Better estimate

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;

Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.

Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names.
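
A quick sketch of that variant (an unknown or misspelled name simply yields zero rows instead of raising an error):

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = to_regclass('myschema.mytable');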

Better estimate yet (for very little added cost)

This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
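
A sketch of that, assuming a partitioned parent named myschema.mytable: sum the per-partition estimates from pg_class via pg_inherits (direct children only; nested partitioning would need a recursive query):

SELECT sum(greatest(c.reltuples, 0))::bigint AS estimate  -- never-analyzed partitions report -1
FROM   pg_inherits i
JOIN   pg_class   c ON c.oid = i.inhrelid
WHERE  i.inhparent = 'myschema.mytable'::regclass;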

We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:

These numbers are current as of the last VACUUM or ANALYZE on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different from relpages then reltuples is scaled accordingly to arrive at a current number-of-rows estimate.

Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:

Minimal form

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM   pg_class
WHERE  oid = 'mytable'::regclass;  -- your table here

Safe and explicit

SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
             WHEN c.relpages = 0 THEN float8 '0'  -- empty table
             ELSE c.reltuples / c.relpages END
     * (pg_catalog.pg_relation_size(c.oid)
      / pg_catalog.current_setting('block_size')::int)
       )::bigint
FROM   pg_catalog.pg_class c
WHERE  c.oid = 'myschema.mytable'::regclass;      -- schema-qualified table here

Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:

If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.

If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)

If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)

Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
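
You can check the setting of your cluster directly, for example:

SHOW block_size;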

Table and schema qualifications make it immune to any search_path and scope.

Either way, the query consistently takes < 0.1 ms for me.


TABLESAMPLE SYSTEM (n) in Postgres 9.5+

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);

As @a_horse commented, this added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:

  • No autovacuum running.
  • Immediately after a large INSERT / UPDATE / DELETE.
  • TEMPORARY tables (which are not covered by autovacuum).

This only looks at a random n % selection of blocks (1 % in the example) and counts the rows in them. A bigger sample increases the cost and reduces the error; your pick (see the sketch after the list below). Accuracy depends on more factors:

  • Distribution of row size. If a given block happens to hold wider-than-usual rows, the count is lower than usual, etc.
  • Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
  • General rounding errors.
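
For example, a bigger sample (a sketch with 10 % here) costs more but typically tightens the estimate; scale the count by the inverse of the sampled fraction:

SELECT 10 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (10);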

Typically, the estimate from pg_class will be faster and more accurate.

Answer to actual question

First, I need to know the number of rows in that table, if the total count is greater than some predefined constant,

And whether it ...

... is possible at the moment the count passes my constant value, it will stop the counting (and not wait to finish the counting to inform that the row count is greater).

Yes. You can use a subquery with LIMIT:

SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;

Postgres actually stops counting beyond the given limit: you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
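
A sketch building on that (not from the original answer; the aliases capped_count and above_limit are made up): LIMIT one row past the threshold to tell "exactly n rows" apart from "more than n rows":

SELECT CASE WHEN cnt > 500000 THEN 500000 ELSE cnt END AS capped_count
     , cnt > 500000                                    AS above_limit
FROM  (SELECT count(*) AS cnt
       FROM  (SELECT 1 FROM token LIMIT 500001) t) ct;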

Crocidolite answered 30/10, 2011 at 13:47 Comment(12)
I eventually updated the Postgres Wiki page with the improved query.Crocidolite
With 9.5 getting an estimate fast should be possible using the tablesample clause: e.g. select count(*) * 100 as cnt from mytable tablesample system (1);Hasin
@a_horse_with_no_name: Interesting option. This would randomly pick 1% of all blocks and count rows in it. Much faster than a full count, but not nearly as fast as looking up reg_class. It's hard to tell how accurate the result might be since it depends on the number of dead tuples in each block and general rounding and projection errors. May be a good option without autovacuum or estimates immediately after big writes (before ANALYZE can kick in). Not so good for very big / mostly static tables / uneven data distribution. Did I miss anything? I might add a chapter to the above answer ...Crocidolite
We are only talking about an estimate here, so I guess the tablesample should be good enough. If the dead tuples are spread evenly (statistically speaking) over the blocks, then they shouldn't matter either. It would probably be a slightly more accurate estimate when using tablesample bernoulli I think, but that is going to be slower, than system.Hasin
It doesn't look like BERNOULLI would be useful for this at all. The manual: The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. It might be faster to just count all rows. But SYSTEM can be ..Crocidolite
Will the estimate occasionally be greater than the number of rows? Or is it always less than/equal to the number of rows?Larrylars
@JeffWidman: All of these estimates can be greater than the actual row count for various reasons. Not least, deletes may have happened in the meantime.Crocidolite
@ErwinBrandstetter what if i have a query like this SELECT COUNT(*) FROM users where last_name='a';Tegument
@ErwinBrandstetter realize this question is old, but if you wrapped the query in subquery then did the limit would this still be efficient or would the whole subquery be executed then limited in the outer query. SELECT count(*) FROM (Select * from (SELECT 1 FROM token) query) LIMIT 500000) limited_query; (I ask because I am trying to get a count from an arbitrary query that might have a limit clause in it already)Leavy
@NicholasErdenberger: That depends on the subquery. Postgres may need to consider more rows than the limit anyway (like with ORDER BY something while it can't use an index, or with aggregate functions). Apart from that, only the limited number of rows from the subquery is processed.Crocidolite
All these work on concrete tables; do they work on materialised views? I tried on views too and that doesn't work, so wondering if there is a workaround for viewsKerrin
@RTD: Materialized views: yes. See: https://mcmap.net/q/28675/-when-refreshing-a-materialized-view-are-indexes-also-refreshed-automatically Views: mostly no. Only the query is persisted, not resulting data. You have to execute the query to count rows. (There are shortcuts for trivial views ...)Crocidolite
I
27

I did this once in a postgres app by running:

EXPLAIN SELECT * FROM foo;

Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:

Seq Scan on uids  (cost=0.00..1.21 rows=8 width=75)

You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).

Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless, even to know within a factor of 100 how many rows we'd have returned, so we had to abandon that strategy.

But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.

Indemnify answered 30/10, 2011 at 4:19 Comment(3)
This is actually quite convenient for interactive use :)Hydroid
Yes, be careful with this one. It is not accurate. I ran this against a 7-billion-row table and it was off by 188,069 rows. As he said, it is probably fine for simple queries and small tables, but don't use this as an exact count.Octavla
EXPLAIN (format json) and you don't have to use regexp.Jiggered
N
6

Reference taken from this Blog.

You can use the queries below to find the row count.

Using pg_class:

SELECT reltuples::bigint AS EstimatedCount
FROM   pg_class
WHERE  oid = 'public.TableName'::regclass;

Using pg_stat_user_tables:

SELECT 
    schemaname
    ,relname
    ,n_live_tup AS EstimatedCount 
FROM pg_stat_user_tables 
ORDER BY n_live_tup DESC;
Neume answered 13/2, 2016 at 7:58 Comment(1)
Just quick note that you need to VACUUM ANALYZE your tables for this method to work.Knighthood
H
1

This is the function (not mine) that implements @Jonathan Hall's answer:

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec   record;
    rows  integer;
BEGIN
    -- Walk the EXPLAIN output and pull the planner's row estimate
    -- from the first plan line that carries one.
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
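
A quick usage sketch (the query string is just a placeholder):

SELECT count_estimate('SELECT 1 FROM token');

As a comment above notes, EXPLAIN (FORMAT JSON) avoids the regular expression. A hedged variant along the same lines (the function name count_estimate_json is made up):

CREATE OR REPLACE FUNCTION count_estimate_json(query text) RETURNS bigint AS $$
DECLARE
    plan jsonb;
BEGIN
    -- The JSON plan arrives as a single row and column; take the top node's row estimate
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;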

Hachure answered 9/10, 2023 at 12:6 Comment(0)
T
0

How wide is the text column?

With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).

I'd recommend:

  1. If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.

  2. Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column. Again, this is to decrease the workload (scan through a narrow column index)
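
A sketch of option 2, assuming the token table from the question, Postgres 12+ for generated columns, and hashtextextended() (Postgres 11+); the column name text_hash is made up:

ALTER TABLE token
    ADD COLUMN text_hash bigint GENERATED ALWAYS AS (hashtextextended(text, 0)) STORED;

-- Group by the narrow hash column; min(text) recovers a representative text per group
SELECT min(text) AS text, count(*) AS cnt
FROM   token
GROUP  BY text_hash
ORDER  BY cnt DESC;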

Edit:

Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.

Terry answered 30/10, 2011 at 5:28 Comment(0)
T
-3

In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other databases as well. So, for the example you gave, you could limit the number of rows returned to 500001 and then apply a count(*):

SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
Tharpe answered 30/10, 2011 at 4:11 Comment(5)
SELECT count(*) cnt FROM table will always return a single row. Not sure how LIMIT is going to add any benefit there.Terry
@ChrisBednarski : I verified the oracle version of my answer on an Oracle db. It works great and solves what I thought was OP's problem (0.05 s with count(*) with rownum, 1 s without the use of rownum). Yes, SELECT count(*) cnt FROM table is always going to return 1 row, but with the LIMIT condition, it will return "500001" when table's size is over 500000 and <size> when table's size <= 500000.Tharpe
Your PostgreSQL query is complete nonsense. Syntactically and logically wrong. Please correct or remove it.Crocidolite
@ErwinBrandstetter : Removed, didn't realize PostgreSQL was so different.Tharpe
@allrite: no doubt your Oracle query works fine. LIMIT works differently though. At a basic level, it limits the number of rows returned to the client, not the number of rows queried by the database engine.Terry
C
-3

You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is.

Cachexia answered 1/11, 2021 at 16:38 Comment(1)
If gap-less integer IDs starting with 1 are enforced. Which would be a very exotic case.Crocidolite
E
-7

For SQL Server (2005 or above) a quick and reliable method is:

SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')   
AND (index_id=0 or index_id=1);

Details about sys.dm_db_partition_stats are explained on MSDN.

The query adds up row counts from all partitions of a (possibly) partitioned table.

index_id=0 is an unordered table (heap) and index_id=1 is an ordered table (clustered index).

Even faster (but unreliable) methods are detailed here.

Extraterritoriality answered 29/10, 2014 at 8:45 Comment(1)
Wrong db dude. The question is about postgresHardiness
