Loose index scan in Postgres on more than one field?
Asked Answered
K

3

8

I have several large tables in Postgres 9.2 (millions of rows) where I need to generate a unique code based on the combination of two fields, 'source' (varchar) and 'id' (int). I can do this by generating row_numbers over the result of:

SELECT source,id FROM tablename GROUP BY source,id

but the results can take a while to process. It has been recommended that if the fields are indexed, and there are a proportionally small number of index values (which is my case), that a loose index scan may be a better option: http://wiki.postgresql.org/wiki/Loose_indexscan

WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);

The example operates on a single field though. Trying to return more than one field generates an error: subquery must return only one column. One possibility might be to try retrieving an entire ROW - e.g. SELECT ROW(min(source),min(id)..., but then I'm not sure what the syntax of the WHERE statement would need to look like to work with individual row elements.

The question is: can the recursion-based code be modified to work with more than one column, and if so, how? I'm committed to using Postgres, but it looks like MySQL has implemented loose index scans for more than one column: http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html


As recommended, I'm attaching my EXPLAIN ANALYZE results.

For my situation - where I'm selecting distinct values for 2 columns using GROUP BY, it's the following:

 HashAggregate  (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
 Total runtime: 36127.789 ms
(3 rows)

I don't know how to do a 2-column index scan (that's the question), but for purposes of comparison, using a GROUP BY on one column, I get:

 HashAggregate  (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
 Total runtime: 32350.899 ms
(3 rows)

But for a loose index scan on one column, I get:

 Result  (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
   CTE t
     ->  Recursive Union  (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
           ->  Result  (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                         ->  Index Only Scan using tablename_id on tablename  (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
                               Index Cond: (id IS NOT NULL)
                               Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
                 Filter: (id IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 3
                   ->  Result  (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
                         InitPlan 2 (returns $3)
                           ->  Limit  (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
                                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
                                       Index Cond: ((id IS NOT NULL) AND (id > t.id))
                                       Heap Fetches: 0
   ->  Append  (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
         ->  CTE Scan on t  (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
               Filter: (id IS NOT NULL)
               Rows Removed by Filter: 1
         ->  Result  (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               One-Time Filter: $5
               InitPlan 5 (returns $5)
                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                       Index Cond: (id IS NULL)
                       Heap Fetches: 0
 Total runtime: 2.040 ms

The full table definition looks like this:

CREATE TABLE tablename
(
  source character(25),
  id bigint NOT NULL,
  time_ timestamp without time zone,
  height numeric,
  lon numeric,
  lat numeric,
  distance numeric,
  status character(3),
  geom geometry(PointZ,4326),
  relid bigint
)
WITH (
  OIDS=FALSE
);

CREATE INDEX tablename_height
  ON public.tablename
  USING btree
  (height);

CREATE INDEX tablename_geom
  ON public.tablename
  USING gist
  (geom);


CREATE INDEX tablename_id
  ON public.tablename
  USING btree
  (id);

CREATE INDEX tablename_lat
  ON public.tablename
  USING btree
  (lat);

CREATE INDEX tablename_lon
  ON public.tablename
  USING btree
  (lon);

CREATE INDEX tablename_relid
  ON public.tablename
  USING btree
  (relid);

CREATE INDEX tablename_sid
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default", id);

CREATE INDEX tablename_source
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default");

CREATE INDEX tablename_time
  ON public.tablename
  USING btree
  (time_);

Answer selection:

I took some time in comparing the approaches that were provided. It's at times like this that I wish that more than one answer could be accepted, but in this case, I'm giving the tick to @jjanes. The reason for this is that his solution matches the question as originally posed more closely, and I was able to get some insights as to the form of the required WHERE statement. In the end, the HashAggregate is actually the fastest approach (for me), but that's due to the nature of my data, not any problems with the algorithms. I've attached the EXPLAIN ANALYZE for the different approaches below, and will be giving +1 to both jjanes and joop.

HashAggregate:

 HashAggregate  (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
 Total runtime: 24787.292 ms

Loose Index Scan modification

CTE Scan on t  (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
   Filter: (source IS NOT NULL)
   Rows Removed by Filter: 1
   CTE t
     ->  Recursive Union  (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
           ->  Limit  (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
                 ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
                       Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
                 Filter: (source IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                               Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                               Heap Fetches: 0
                 SubPlan 2
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
                               Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                               Heap Fetches: 0
 Total runtime: 250491.559 ms

Merge Anti Join

Merge Anti Join  (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
   Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 363464177
   ->  Index Only Scan using tablename_pkey on tablename src  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
         Heap Fetches: 0
   ->  Materialize  (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
         ->  Index Only Scan using tablename_pkey on tablename nx  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
               Heap Fetches: 0
 Total runtime: 541750.026 ms
Kazukokb answered 5/12, 2013 at 5:50 Comment(9)
Please show us the indexes and the execution plan (generated using explain analyze). More information on asking performance questions can be found here: wiki.postgresql.org/wiki/SlowQueryQuestionsPlacet
The recursive query makes no sense to me. After rereading it ten times, it appears to be equivalent to SELECT DISTINCT col FROM tablename ORDER BY col; Please also add your intention to the question.Ukase
@a_horse_with_no_name - I added the EXPLAIN ANALYZE results. The first is for my actual query as it stands. The next two are on one column to show the difference in speed between the GROUP BY approach and the loose index scan (32350 ms vs. 2 ms)Kazukokb
@Ukase - You are correct - the results are equivalent to SELECT DISTINCT on col, but the loose index scan provides a large performance improvement (see EXPLAIN ANALYZE results, now added). My question is how to modify the recursive code to use more than one column to get similar performance when looking at distinct values across two fields (or more if the answer is generalizable).Kazukokb
Please add the table definition (it appears tha the FK is as least a bit larger than {source,id} (otherwise you would not need the distinct ;-)Shackle
Which natural keys can be defined for tablename ? It looks like {source,id,time_} is a candidate key, and maybe {geom,time_} as well. {lat,lon} appear to be redundant (functionally dependent on geom) distance looks underdetermined to me, and {height,status} could be a genuine attributes.relid could be a FK to some other table. The reason why I am asking: this might indicate a bad case of 4NF violation.Ukase
The natural key would be {source,id,time_}. Lat,lon aren't entirely redundant, as they are different measurements than what is contained in geom. Distance shouldn't be underdetermined - it's based on a non-euclidean path distance.Kazukokb
So: {{lat,lon},geom,time_} would also be a candidate key ?Ukase
Yes it would be. You can think of lat lon (and height) as where the item should be whereas the geometry is where the item was recorded.Kazukokb
W
5

Rather hideous, but this seems to work:

WITH RECURSIVE
     t AS (
  select a,b from (select a,b from foo order by a,b limit 1) asdf union all 
  select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),
         (select b from foo where (a,b) > (t.a,t.b) order by a,b limit 1) 
     from t where t.a is not null)  
select * from t where t.a is not null;

I don't really understand why the "is not nulls" are needed, as where do the nulls come from in the first place?

Wickman answered 6/12, 2013 at 17:38 Comment(1)
The 'is not null's come from [link]wiki.postgresql.org/wiki/Loose_indexscan, which is the source of the code. They might not be necessary in my case, but they also don't seem to be doing any harm.Kazukokb
U
1
DROP SCHEMA zooi CASCADE;
CREATE SCHEMA zooi ;
SET search_path=zooi,public,pg_catalog;

CREATE TABLE tablename
  ( source character(25) NOT NULL
  , id bigint NOT NULL
  , time_ timestamp without time zone NOT NULL
  , height numeric
  , lon numeric
  , lat numeric
  , distance numeric
  , status character(3)
  , geom geometry(PointZ,4326)
  , relid bigint
        , PRIMARY KEY (source,id,time_) -- <<-- Primary key here
  ) WITH ( OIDS=FALSE);


  -- invent some bogus data
INSERT INTO tablename(source,id,time_)
SELECT 'SRC_'|| (gs%10)::text
        ,gs/10
        ,gt
FROM generate_series(1,1000) gs
        , generate_series('2013-12-01', '2013-12-07', '1hour'::interval) gt
        ;

Select unique values for two key fields:

VACUUM ANALYZE tablename;

EXPLAIN ANALYZE
SELECT source,id,time_
FROM tablename src
WHERE NOT EXISTS (
        SELECT * FROM tablename nx
        WHERE nx.source =src.source
        AND nx.id = src.id
        AND time_  > src.time_
        )
        ;

Generates this plan here (Pg-9.3):

                                                     QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=4981.00..12837.82 rows=96667 width=42) (actual time=547.218..1194.335 rows=1000 loops=1)
   Hash Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 145000
   ->  Seq Scan on tablename src  (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.010..210.810 rows=145000 loops=1)
   ->  Hash  (cost=2806.00..2806.00 rows=145000 width=42) (actual time=546.497..546.497 rows=145000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 9063kB
         ->  Seq Scan on tablename nx  (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.006..259.864 rows=145000 loops=1)
 Total runtime: 1197.374 ms
(9 rows)

The hash-joins will probably disappear once the data outgrows the work_mem:

 Merge Anti Join  (cost=0.83..8779.56 rows=29832 width=120) (actual time=0.981..2508.912 rows=1000 loops=1)
   Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 184051
   ->  Index Scan using tablename_sid on tablename src  (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.055..250.621 rows=145000 loops=1)
   ->  Index Scan using tablename_sid on tablename nx  (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.008..603.403 rows=328906 loops=1)
 Total runtime: 2510.505 ms
Ukase answered 6/12, 2013 at 17:21 Comment(0)
B
0

Lateral joins can give you a clean code to select multiple columns in nested selects, without checking for null as no subqueries in select clause.

-- Assuming you want to get one '(a,b)' for every 'a'.
with recursive t as (
  (select a, b from foo order by a, b limit 1)
  union all
  (select s.* from t, lateral(
    select a, b from foo f
    where f.a > t.a
    order by a, b limit 1) s)
)
select * from t;
Broccoli answered 8/6, 2022 at 14:41 Comment(2)
you only have to do where f.a > t.a? what about checking f.b > t.b, etc.?Coat
@BrandonRos Change where f.a > t.a to where f.a > t.a and f.b > t.b if you want. It will give you an (a, b) ascending array. Changing it may reduce the result rows as fewer f.a meets the requirement, that wouldn't be one (a, b) for every a.Broccoli

© 2022 - 2024 — McMap. All rights reserved.