TimescaleDB/Postgres taking up far more space than expected
I've been saving tick data in TimescaleDB and have been surprised at how much space it takes up. I'm fairly new to this, but I'm saving roughly 10 million rows a day into a table with the following columns:

[screenshot of the table schema: a time column, instrument and exchange text columns, and price and quantity columns stored as double]

This has been taking up roughly 35 GB a day, which seems excessive. What steps can I take to reduce this? If I changed the double columns to float (real), would that have a big impact? Are there other ways to reduce the size?

EDIT:

The results of running chunk_relation_size_pretty() are:

[screenshot of chunk_relation_size_pretty() output]

and hypertable_relation_size_pretty():

[screenshot of hypertable_relation_size_pretty() output]
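
For reference, these are the TimescaleDB 1.x calls that produce those tables (table name simplified to `ticks` here):

```sql
-- Per-chunk table, index and total sizes for the hypertable:
SELECT * FROM chunk_relation_size_pretty('ticks');

-- Aggregate table/index/toast sizes across all chunks:
SELECT * FROM hypertable_relation_size_pretty('ticks');
```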

It also seems very strange that the index is taking up so much space. I tried querying the data over a certain time range and the results took quite a while to come back (roughly 10 minutes for a day's worth of data). The index is currently a composite index on (instrument, exchange, time DESC).
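
The kind of query that takes that long looks roughly like this (names and values are placeholders, simplified from my actual query):

```sql
-- Pull one day of ticks for a single instrument/exchange pair;
-- this should be able to use the composite index directly:
SELECT *
FROM ticks
WHERE instrument = 'ES'
  AND exchange = 'CME'
  AND time >= '2020-06-01'
  AND time < '2020-06-02'
ORDER BY time DESC;
```

Running it under EXPLAIN (ANALYZE, BUFFERS) should show whether the composite index is actually being used.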

Turbojet answered 4/6, 2020 at 4:57 Comment(6)
We need more data: what is the table definition? What does pgstattuple say about the table?Lading
Can you post the table schema, indexes and actual sizes? The sizes can be obtained with hypertable_relation_size_pretty() and chunk_relation_size_pretty().Bogle
@Bogle sure - I've added more info in the question.Turbojet
I suspect something else is going on in the database, but I can't put my finger on what to look for. Can you run VACUUM ANALYZE? Do you know if it has been run?Bogle
It is weird to see that price and quantity are stored as double. It looks like a direct translation from JavaScript. I would expect quantity to be an integer and price to be a decimal or numeric.Bogle
Did you come up with anything, apart from compression? I'm having the same "issue".Quiddity

You should turn on TimescaleDB's native compression:

https://docs.timescale.com/latest/using-timescaledb/compression
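
A minimal sketch of what that looks like (segmentby columns assumed from the question's schema; on 1.x the policy call was add_compress_chunks_policy, newer versions use add_compression_policy):

```sql
-- Enable native compression, segmenting by the columns you filter on
-- and ordering by time so consecutive ticks compress well:
ALTER TABLE ticks SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'instrument, exchange',
  timescaledb.compress_orderby = 'time DESC'
);

-- Automatically compress chunks once they are older than 7 days:
SELECT add_compression_policy('ticks', INTERVAL '7 days');
```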

Murmurous answered 4/6, 2020 at 5:22 Comment(7)
Thanks - I'll test this. It seems like it will help, but I'm not sure it's the main problem I'm having. Are TimescaleDB tables normally expected to take up much more space, given that compression is off by default?Turbojet
TimescaleDB's basic (uncompressed) table structure is essentially identical to Postgres, so that's the baseline overhead for your row structure. If you do the math, each row takes up roughly the width of each column, plus roughly 27 bytes of additional metadata (e.g. MVCC versioning). create_hypertable also creates a btree index on the timestamp by default; you should double-check that you don't have that default index in addition to your composite one. But it's not surprising that the composite index is large -- you probably have a very large number of distinct instrument/exchange/timestamp combinations.Murmurous
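
A sketch of both checks from that comment (hypothetical table name `ticks`; the arithmetic uses the row shape from the question):

```sql
-- List every index defined on the hypertable. Each one is replicated
-- on every chunk, so a redundant default time index is costly:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'ticks';

-- Back-of-envelope heap size: ~27 bytes of tuple header/MVCC metadata
-- plus column widths, e.g. 8 (timestamptz) + 2*8 (doubles) + ~2*10
-- (short text codes) ~= 71 bytes/row. At 10M rows/day that is roughly
-- 0.7 GB of heap per day before indexes, so 35 GB/day points at the
-- indexes (or something else) rather than the raw row data.
```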
@MikeFreedman I was experimenting with TimescaleDB for time-series data using daily stock data, ~28M rows in total. Postgres took 1.8 GB of space; the TimescaleDB table created with create_hypertable() took 2.2 GB (per hypertable_size('table_name')), which is much higher than plain Postgres. Even more surprising, when I turned compression on with segment_by on ticker, the size blew up to 4.0 GB. chunk_compress_stats() shows every chunk is roughly 80% larger.Synovitis
@MikeFreedman details are captured here: medium.com/p/68405425827Synovitis
@Synovitis - Left a comment on Medium. If I understand what's happening: your chunks are likely way too small (or your data too sparse). The default time interval per chunk is one week. If you only take one data point per day per stock, each segmentby group has only 7 items, so you are probably getting more overhead from the various array types we use as part of compression than savings from the compressibility itself. We typically recommend at least hundreds of rows per distinct segmentby item per chunk. The additional overhead of TimescaleDB vs. plain Postgres is likely that as well.Murmurous
One simple way to test: SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'stock_price_hyper'; Please see here for best practices: docs.timescale.com/latest/using-timescaledb/… docs.timescale.com/latest/using-timescaledb/…Murmurous
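
If that count shows the data spread over many near-empty chunks, a sketch of the fix is to widen the chunk interval so each segmentby group gets hundreds of rows per chunk (hypertable name taken from the comment above):

```sql
-- How many chunks does the hypertable currently have?
SELECT count(*)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'stock_price_hyper';

-- Daily bars at the default 7-day interval give only ~7 rows per
-- ticker per chunk; a 1-year interval yields ~250 trading days each:
SELECT set_chunk_time_interval('stock_price_hyper', INTERVAL '1 year');
```

Note that set_chunk_time_interval only affects chunks created after the call; existing data would need to be re-inserted into a fresh hypertable to benefit.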
Thanks, I have updated the post and corrected it. Thanks again for your prompt and expert advice. BTW: can this be automatically detected and flagged to the user?Synovitis

Try storing the data in other time-series databases such as InfluxDB or VictoriaMetrics (I'm a core developer of VictoriaMetrics). They may provide better on-disk compression than TimescaleDB, according to benchmarks.

Apparatus answered 1/11, 2021 at 11:1 Comment(0)
