Index skip scan emulation to retrieve distinct product IDs and min/max for additional columns
Asked Answered
C

1

3

Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

CREATE INDEX idx_tickers_product_id_time ON tickers (product_id, time);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has over two million rows now.

I learned how to use index skip scan emulation (see: SELECT DISTINCT is slower than expected on my table in PostgreSQL) in PostgreSQL in order to quickly retrieve distinct product_id values from this table, rather than using the slower SELECT DISTINCT method.

I also want to retrieve min/max values for other columns. Here's what I came up with. It takes ~2.9 milliseconds over 2.25 rows.

Is there a better way to accomplish this?

WITH product_ids AS (
    WITH RECURSIVE cte AS (
       (   -- parentheses required
           SELECT product_id
           FROM tickers
           ORDER BY 1
           LIMIT 1
       )
       UNION ALL
       SELECT l.*
       FROM cte c
       CROSS JOIN LATERAL (
          SELECT product_id
          FROM tickers t
          WHERE t.product_id > c.product_id  -- lateral reference
          ORDER BY 1
          LIMIT 1
          ) l
       )
    TABLE cte
)
SELECT
    product_id,
    (SELECT (MAX(trade_id) - MIN(trade_id) + 1) FROM tickers WHERE product_id = product_ids.product_id) AS ticker_count,
    (SELECT MIN(time) FROM tickers WHERE product_id = product_ids.product_id) AS min_time,
    (SELECT MAX(time) FROM tickers WHERE product_id = product_ids.product_id) AS max_time
FROM product_ids
ORDER BY ticker_count DESC
Crud answered 31/3, 2021 at 21:30 Comment(9)
Are there any other indexes?Meniscus
There is a combined index on product_id and time. Updated schema in question.Crud
Perfect! We can put it to good use. :)Meniscus
(MAX(trade_id) - MIN(trade_id)? Subtracting IDs? That's not a typo?Meniscus
That's correct. The trade_id for each product always goes up by 1 as new tickers are broadcasted by Coinbase Pro's websocket "ticker" channel. I previously used COUNT in previous iterations of this query, but I came up with this as an optimization because I found that COUNT was very slow once my table started filling up with millions of rows. But if we can avoid doing this subtraction I think it would be better.Crud
@ErwinBrandstetter: Oops... it was supposed to be (MAX(trade_id) - MIN(trade_id) + 1), not (MAX(trade_id) - MIN(trade_id)). I edited my question.Crud
time is the only column that can be NULL? Typo?Meniscus
It should be NOT NULL. I'll fix it.Crud
Writing this as a comment because I get super annoyed when people say "why don't you redesign your schema?". But have you considered having a seperate version table? One that's a simple mapping between the product_id and the trade_id? And you can join that onto the main table.Czarist
M
3

Query

Using the existing index on (product_id, time) we can get two for the price of one, i.e. fetch product_id and minimum time in one index scan:

WITH RECURSIVE product_ids AS (
   (   -- parentheses required
   SELECT product_id, time AS min_time
   FROM   tickers
   ORDER  BY 1, 2
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   product_ids p
   CROSS JOIN LATERAL (
      SELECT t.product_id, t.time
      FROM   tickers t
      WHERE  t.product_id > p.product_id
      ORDER  BY 1, 2
      LIMIT  1
      ) l
   )
SELECT product_id, min_time
    , (SELECT MAX(time) FROM tickers WHERE product_id = p.product_id) AS max_time
    , (SELECT MAX(trade_id) - MIN(trade_id) + 1 FROM tickers WHERE product_id = p.product_id) AS ticker_count
FROM   product_ids p
ORDER  BY ticker_count DESC;

Also, no need for a second CTE wrapper.

Indexes

Currently you have two indexes: The PK index on (product_id, trade_id), and another one on (product_id, time). You might optimize this by reversing the column order in one of both. Like:

PRIMARY KEY (trade_id, product_id)

Logically equivalent, but typically more efficient as it covers a wider range of possible queries. See (again):

We only need the existing index on (product_id, time), so no direct effect on this query.

Meniscus answered 31/3, 2021 at 21:47 Comment(6)
When I run this query it never returns. Same when I do EXPLAIN ANALYZE on it.Crud
You tried the latest version with the fix for the WHERE clause?Meniscus
I guess I didn't, haha. Works great. Thanks!!Crud
Execution time is now ~2.6 ms instead of 2.9 msCrud
Makes sense, only saved one index scan. Still. :)Meniscus
Every little bit helps. You've enlightened me today! Once I digest these new PostgreSQL concepts I'll have some more tools in the tool box. Thanks again :)Crud

© 2022 - 2024 — McMap. All rights reserved.