TimescaleDB: efficiently select last row
I have a postgres database with the timescaledb extension.

My primary index is a timestamp, and I would like to select the latest row.

If I happen to know the latest row happened after a certain time, then I can use a query such as:

query = 'select * from prices where time > %(dt)s'

Here I specify a datetime, and execute the query using psycopg2:

# 2018-01-10 11:15:00
dt = datetime.datetime(2018,1,10,11,15,0)

with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query, {'dt':dt})
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()

print('took {} ms'.format((end-beg).total_seconds() * 1e3))

The timing output:

took 2.296 ms

If, however, I don't know the time to input into the above query, I can use a query such as:

query = 'select * from prices order by time desc limit 1'

I execute the query in a similar fashion

with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query)
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()

print('took {} ms'.format((end-beg).total_seconds() * 1e3))

The timing output:

took 19.173 ms

So that's more than 8 times slower.

I'm no expert in SQL, but I would have thought the query planner would recognize that "limit 1" combined with "order by primary index" amounts to an O(1) operation.

Question:

Is there a more efficient way to select the last row in my table?

In case it is useful, here is the description of my table:

# \d+ prices

                                           Table "public.prices"
 Column |            Type             | Collation | Nullable | Default | Storage | Stats target | Description 
--------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
 time   | timestamp without time zone |           | not null |         | plain   |              | 
 AAPL   | double precision            |           |          |         | plain   |              | 
 GOOG   | double precision            |           |          |         | plain   |              | 
 MSFT   | double precision            |           |          |         | plain   |              | 
Indexes:
    "prices_time_idx" btree ("time" DESC)
Child tables: _timescaledb_internal._hyper_12_100_chunk,
              _timescaledb_internal._hyper_12_101_chunk,
              _timescaledb_internal._hyper_12_102_chunk,
              ...
Beak answered 28/7, 2018 at 20:25 Comment(1)
In 2022 the timescale blog had this post: timescale.com/blog/… – Mcmichael
A
11

An efficient way to get last / first record in TimescaleDB:

First record:

SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time ASC LIMIT 1 ;

Last record:

SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time DESC LIMIT 1 ;

The question has already been answered, but I believe this might be useful for people who land here. Using first() and last() in TimescaleDB takes much longer.
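The chunk-exclusion machinery is specific to TimescaleDB, but the shape of these two queries can be exercised against any SQL engine. A minimal sketch using Python's built-in sqlite3 as a stand-in for Postgres (the `prices` table mirrors the question; the sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time TEXT PRIMARY KEY, aapl REAL)")
# A descending index on time, analogous to prices_time_idx in the question
conn.execute("CREATE INDEX prices_time_idx ON prices (time DESC)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("2018-01-10 11:00:00", 174.3),
     ("2018-01-10 11:05:00", 174.5),
     ("2018-01-10 11:10:00", 174.1)],
)

# First record
first = conn.execute(
    "SELECT time, aapl FROM prices ORDER BY time ASC LIMIT 1"
).fetchone()

# Last record
last = conn.execute(
    "SELECT time, aapl FROM prices ORDER BY time DESC LIMIT 1"
).fetchone()

print(first[0], last[0])  # 2018-01-10 11:00:00 2018-01-10 11:10:00
```

Both statements can be satisfied by a single index probe rather than a full sort, which is why this form is cheap on an indexed time column.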

Alban answered 16/9, 2018 at 14:21 Comment(1)
If I do an EXPLAIN on "SELECT time FROM <TABLE_NAME> WHERE xxx ORDER BY time ASC LIMIT 1" it returns a cost of "24 to 38 million", while "SELECT max(time) FROM <TABLE_NAME> WHERE xxx" is a very light-cost query – Bigamy
R
5

Your first query can exclude all but the last chunk, while your second query has to look in every chunk since there is no information to help the planner exclude chunks. So it's not an O(1) operation but an O(n) operation, with n being the number of chunks for that hypertable.

You could give that information to the planner by writing your query in the following form:

select * from prices WHERE time > now() - interval '1 day' order by time desc limit 1

You might have to choose a different interval depending on your chunk time interval.

Starting with TimescaleDB 1.2 this is an O(1) operation if an entry can be found in the most recent chunk and the explicit time constraint in the WHERE clause is no longer needed if you order by time and have a LIMIT.
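If you cannot know a safe interval up front, one workaround is to retry with a widening window and only fall back to an unbounded scan as a last resort, so the planner can usually exclude most chunks. A sketch under assumptions: the `latest_price` helper and the interval ladder are illustrative, not any library's API, and sqlite3 stands in for Postgres so the example runs anywhere (with psycopg2 the query shapes would be the same, with `%s` placeholders):

```python
import datetime
import sqlite3

def latest_price(conn, now=None):
    """Try increasingly wide time windows, so most chunks can be
    excluded, and fall back to a full ordered scan only at the end."""
    now = now or datetime.datetime.now()
    for window in (datetime.timedelta(days=1),
                   datetime.timedelta(weeks=1),
                   datetime.timedelta(days=30)):
        # ISO-8601 strings compare correctly as text in SQLite
        cutoff = (now - window).isoformat(sep=" ")
        row = conn.execute(
            "SELECT * FROM prices WHERE time > ? "
            "ORDER BY time DESC LIMIT 1", (cutoff,)
        ).fetchone()
        if row is not None:
            return row
    # Nothing recent: unbounded (slow) query as a last resort.
    return conn.execute(
        "SELECT * FROM prices ORDER BY time DESC LIMIT 1"
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (time TEXT, aapl REAL)")
conn.execute("INSERT INTO prices VALUES ('2018-01-10 11:15:00', 174.5)")
print(latest_price(conn))  # stale data: falls through to the unbounded query
```

The ladder of windows should be chosen relative to your chunk time interval; the worst case is still O(n), but the common case stays cheap.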

Radarscope answered 30/7, 2018 at 5:20 Comment(3)
Whilst your suggestion of a query does indeed improve the performance, the main problem with it is that I don't have the "now() - interval '1 day'" information. Sometimes it will be within a day, other times more. Is there any way to just grab the latest timestamp, irrespective of when that actually is? – Beak
Upvoted. How would you do this if you had a group of ids and you wanted the last timestamp of each id? – Ilk
@Beak Do you have the ids you want to group by as metadata in a separate table? – Radarscope
B
1

I tried to solve this problem in multiple ways: using last(), and trying to create indexes to get the last items faster. In the end, I just created another table where I store the first and the last item inserted into the hypertable, keyed by the WHERE condition, which in my case is a relationship.

  • The database writer updates this table as well when it is inserting entries to the hypertable

  • I get first and last item with a simple BTree lookup - no need to go to hypertable at all

Here is my SQLAlchemy code:

import datetime

import sqlalchemy as sa
from sqlalchemy import orm, text
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

# Base and Pair are the declarative base and pair model defined elsewhere


class PairState(Base):
    """Cache the timespan endpoints for intervals we are generating with hypertable.

    Getting the first / last row (timestamp) from hypertable is very expensive:
    https://mcmap.net/q/56882/-timescaledb-efficiently-select-last-row

    Here data is denormalised per trading pair and updated whenever data is
    written to the database. We save some resources by not using true NULL values.
    """

    __tablename__ = "pair_state"

    # This table has 1-to-1 relationship with Pair
    pair_id = sa.Column(sa.ForeignKey("pair.id"), nullable=False, primary_key=True, unique=True)
    pair = orm.relationship(Pair,
                        backref=orm.backref("pair_state",
                                        lazy="dynamic",
                                        cascade="all, delete-orphan",
                                        single_parent=True, ), )

    # First raw event in data stream
    first_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # Last raw event in data stream
    last_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # The last hypertable entry added
    last_interval_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    @staticmethod
    def create_first_event_if_not_exist(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Sets the first event value if not exist yet."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, first_event_at=ts).
            on_conflict_do_nothing()
        )

    @staticmethod
    def update_last_event(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_event_at for a named pair."""
        # Based on the original example of https://mcmap.net/q/56883/-programmingerror-sqlalchemy-on_conflict_do_update
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_event_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_event_at": ts})
        )

    @staticmethod
    def update_last_interval(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_interval_at for a named pair."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_interval_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_interval_at": ts})
        )
Benkley answered 14/6, 2021 at 11:17 Comment(0)
P
0

Create a table where you store the latest timestamp after every insert, and use that timestamp in the query. It's the most efficient way for me:

SELECT <COLUMN> FROM <TABLE_NAME>, <TABLE_WITH_TIMESTAMPS> WHERE <TABLE_NAME>.time = <TABLE_WITH_TIMESTAMPS>.time;
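The idea can be sketched end to end: the writer maintains a one-row side table holding the latest timestamp, and readers join against it instead of scanning the big table. All table and column names here are illustrative, and sqlite3 stands in for Postgres so the sketch is runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (time TEXT, aapl REAL);
    -- single-row side table holding the latest timestamp
    CREATE TABLE latest (id INTEGER PRIMARY KEY CHECK (id = 1),
                         time TEXT);
""")

def insert_price(conn, ts, price):
    # The writer updates the side table in the same transaction.
    # Assumes inserts arrive in time order.
    with conn:
        conn.execute("INSERT INTO prices VALUES (?, ?)", (ts, price))
        conn.execute(
            "INSERT INTO latest (id, time) VALUES (1, ?) "
            "ON CONFLICT (id) DO UPDATE SET time = excluded.time",
            (ts,),
        )

insert_price(conn, "2018-01-10 11:10:00", 174.1)
insert_price(conn, "2018-01-10 11:15:00", 174.5)

row = conn.execute(
    "SELECT prices.* FROM prices, latest WHERE prices.time = latest.time"
).fetchone()
print(row)  # ('2018-01-10 11:15:00', 174.5)
```

If writes can arrive out of order, adding a `WHERE excluded.time > latest.time` guard to the upsert keeps the side table from moving backwards.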
Palmira answered 15/11, 2021 at 12:37 Comment(0)
