Optimization of count query for PostgreSQL

I

3

10

I have a table in postgresql that contains an array which is updated constantly.

In my application i need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:

select count(id) 
from table 
where not (ARRAY['parameter value'] <@ table.array_column)

But when increasing the amount of rows and the amount of executions of that query (several times per second, possibly hundreds or thousands) the performance decreses a lot, it seems to me that the counting in postgresql might have a linear order of execution (I’m not completely sure of this).

Basically my question is:

Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?

Any suggestion you could give me would be really appreciated.

Ivie answered 25/10, 2012 at 18:48 Comment(7)

Not sure, but I think an GIN index on table.array_column will help to speed this up. You will need to run EXPLAIN to find out. See here: dba.stackexchange.com/a/27505/1822 – Collectivity 25/10, 2012 at 20:23

It's going to be hard to make this efficient in postgres as the table gets big. a gin index will only help when testing for "contained in" as opposed to the "not contained in" in your predicate. If it's not crucial that the count be 100% accurate, you could try caching it at the app layer with some TTL. If your write rate on the table isn't too high, you could reasonably use triggers to update another table containing current counts. – Scepter 25/10, 2012 at 22:58

Best to show your version and explain analyze; see stackoverflow.com/tags/postgresql-performance/info – Pickard 26/10, 2012 at 0:42

Is there a fixed list of properties? You can't really index what's not there, so you may be able to reframe this as a list of parameters that the entry does not have. – Pickard 26/10, 2012 at 2:30

I believe the list of properties may be fixed. It could certainly be assumed to be fixed if this helps solve the problem somehow. – Curley 26/10, 2012 at 14:52

Someone else asked the same question there. I suggested using a secondary table to store the count of the table, which would be updated by a trigger. See this for more information about slow counting. – Huddersfield 26/10, 2012 at 14:53

If you can treat it as a fixed list, you can store the set of properties an item does not have in a side-table that's b-tree indexed, then you can do extremely fast searches for IDs that have entries in that table (which means they don't have the listed property). Backwards, but effective. – Pickard 27/10, 2012 at 9:6

P

4

Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?

Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.

CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
  demo_id integer not null references demo(id),
  property integer not null,
  primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);

You can then query the properties:

SELECT count(id) 
FROM demo 
WHERE NOT EXISTS (
  SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)

I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.

Again, a seq scan on the array-containing table does the job just as well.

Pickard answered 26/10, 2012 at 2:17 Comment(1)

thank you for that detailed explanation, yeah aparently in my current situation is better to do the sequential count or think of another way to store the information to make the search faster, again thanks very much this has been really useful – Ivie 26/10, 2012 at 15:51

P

5

PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT ARRAY[...] <@ indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.

Demo:

CREATE TABLE arrtable (id integer primary key, array_column integer[]);

INSERT INTO arrtable(1, ARRAY[1,2,3,4]);

CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);

-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;

explain (buffers, analyze) select count(id) 
from arrtable 
where not (ARRAY[1] <@ arrtable.array_column);

Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).

You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:

SELECT (
  SELECT count(id) FROM arrtable
) - (
  SELECT count(id) FROM arrtable 
  WHERE (ARRAY[1] <@ arrtable.array_column)
);

It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires it also needs an GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:

drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);

Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.

Worse, the query using this index takes up to twice times as long as your original query and at best half as long on the same data set. It's worst for cases where the index is not very selective like ARRAY[1] - 4s vs 2s for the original query. Where the index is highly selective (ie: not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.

The lesson here? Sometimes, the right answer is just to do a sequential scan.

Since that won't do for your hit rates, either maintain a materialized view with a trigger as @debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as @maniek suggests.

Pickard answered 26/10, 2012 at 0:30 Comment(0)

P

4

Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?

Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.

CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
  demo_id integer not null references demo(id),
  property integer not null,
  primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);

You can then query the properties:

SELECT count(id) 
FROM demo 
WHERE NOT EXISTS (
  SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)

I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.

Again, a seq scan on the array-containing table does the job just as well.

Pickard answered 26/10, 2012 at 2:17 Comment(1)

thank you for that detailed explanation, yeah aparently in my current situation is better to do the sequential count or think of another way to store the information to make the search faster, again thanks very much this has been really useful – Ivie 26/10, 2012 at 15:51

C

2

I think with Your current data model You are out of luck. Try to think of an algorithm that the database has to execute for Your query. There is no way it could work without sequential scanning of data.

Can You arrange the column so that it stores the inverse of data (so that the the query would be select count(id) from table where ARRAY[‘parameter value’] <@ table.array_column) ? This query would use a gin/gist index.

Cardioid answered 25/10, 2012 at 19:41 Comment(0)

Recommended topics

Hot tags