TimescaleDB/Postgres taking up far more space than expected
I've been saving tick data in TimescaleDB and have been surprised at how much space it takes up. I'm fairly new to this, but I'm saving roughly 10 million rows a day into a table with the following columns:

[screenshot of the table schema: a time column, instrument and exchange text columns, and price and quantity columns stored as double]

This has been taking up roughly 35 GB a day, which seems excessive. What steps can I take to reduce this? If I changed the double columns to float (real), would that have a big impact? Are there other ways to reduce the size?

EDIT:

The results of running chunk_relation_size_pretty() are:

[screenshot of chunk_relation_size_pretty() output]

and hypertable_relation_size_pretty():

[screenshot of hypertable_relation_size_pretty() output]
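
For reference, these are the TimescaleDB 1.x calls that produce those tables (table name simplified to `ticks` here):

```sql
-- Per-chunk table, index and total sizes for the hypertable:
SELECT * FROM chunk_relation_size_pretty('ticks');

-- Aggregate table/index/toast sizes across all chunks:
SELECT * FROM hypertable_relation_size_pretty('ticks');
```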

It also seems very strange that the index is taking up so much space. I tried querying the data over a certain time range and the results took quite a while to come back (roughly 10 minutes for a day's worth of data). The index is currently a composite index on (instrument, exchange, time DESC).
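
The kind of query that takes that long looks roughly like this (names and values are placeholders, simplified from my actual query):

```sql
-- Pull one day of ticks for a single instrument/exchange pair;
-- this should be able to use the composite index directly:
SELECT *
FROM ticks
WHERE instrument = 'ES'
  AND exchange = 'CME'
  AND time >= '2020-06-01'
  AND time < '2020-06-02'
ORDER BY time DESC;
```

Running it under EXPLAIN (ANALYZE, BUFFERS) should show whether the composite index is actually being used.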

Turbojet answered 4/6, 2020 at 4:57 Comment(6)
We need more data: what is the table definition? What does pgstattuple say about the table?Lading
Can you post the table schema, indexes and actual sizes? The sizes can be obtained with hypertable_relation_size_pretty() and chunk_relation_size_pretty().Bogle
@Bogle sure - I've added more info in the question.Turbojet
I suspect something else is going on in the database, but I can't put my finger on what to look for. Can you run VACUUM ANALYZE? Do you know if it has been run?Bogle
It is weird to see that price and quantity are stored as double. It looks like a direct translation from JavaScript. I would expect quantity to be an integer and price to be a decimal or numeric.Bogle
Did you come up with anything, apart from compression? I'm having the same "issue".Quiddity

You should turn on TimescaleDB's native compression:

https://docs.timescale.com/latest/using-timescaledb/compression
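
A minimal sketch of what that looks like (segmentby columns assumed from the question's schema; on 1.x the policy call was add_compress_chunks_policy, newer versions use add_compression_policy):

```sql
-- Enable native compression, segmenting by the columns you filter on
-- and ordering by time so consecutive ticks compress well:
ALTER TABLE ticks SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'instrument, exchange',
  timescaledb.compress_orderby = 'time DESC'
);

-- Automatically compress chunks once they are older than 7 days:
SELECT add_compression_policy('ticks', INTERVAL '7 days');
```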

Murmurous answered 4/6, 2020 at 5:22 Comment(7)
Thanks - I'll test this. It seems like it will help, but I'm not sure it's the main problem I'm having. Are TimescaleDB tables normally expected to take up much more space, given that compression is off by default?Turbojet
TimescaleDB's basic (uncompressed) table structure is essentially identical to Postgres, so that's the baseline overhead for your row structure. If you do the math, each row takes up roughly the width of each column, plus roughly 27 bytes of additional metadata (e.g. MVCC versioning). create_hypertable also creates a btree index on the timestamp by default; you should double-check that you don't have that default index in addition to your composite one. But it's not surprising that the composite index is large -- you probably have a very large number of distinct instrument/exchange/timestamp combinations.Murmurous
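
A sketch of both checks from that comment (hypothetical table name `ticks`; the arithmetic uses the row shape from the question):

```sql
-- List every index defined on the hypertable. Each one is replicated
-- on every chunk, so a redundant default time index is costly:
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'ticks';

-- Back-of-envelope heap size: ~27 bytes of tuple header/MVCC metadata
-- plus column widths, e.g. 8 (timestamptz) + 2*8 (doubles) + ~2*10
-- (short text codes) ~= 71 bytes/row. At 10M rows/day that is roughly
-- 0.7 GB of heap per day before indexes, so 35 GB/day points at the
-- indexes (or something else) rather than the raw row data.
```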
@MikeFreedman I was experimenting with TimescaleDB for time-series data using daily stock data, ~28M rows in total. Postgres took 1.8 GB of space; the TimescaleDB table created with create_hypertable() took 2.2 GB (per hypertable_size('table_name')), which is much higher than plain Postgres. Even more surprising, when I turned compression on with segment_by on ticker, the size blew up to 4.0 GB. chunk_compress_stats() shows every chunk is roughly 80% larger.Synovitis
@MikeFreedman details are captured here: medium.com/p/68405425827Synovitis
@Synovitis - Left a comment on Medium. If I understand what's happening: your chunks are likely way too small (or your data too sparse). The default time interval per chunk is one week. If you only take one data point per day per stock, each segmentby group has only 7 items, so you are probably getting more overhead from the various array types we use as part of compression than savings from the compressibility itself. We typically recommend at least hundreds of rows per distinct segmentby item per chunk. The additional overhead of TimescaleDB vs. plain Postgres is likely that as well.Murmurous
One simple way to test: SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'stock_price_hyper'; Please see here for best practices: docs.timescale.com/latest/using-timescaledb/… docs.timescale.com/latest/using-timescaledb/…Murmurous
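
If that count shows the data spread over many near-empty chunks, a sketch of the fix is to widen the chunk interval so each segmentby group gets hundreds of rows per chunk (hypertable name taken from the comment above):

```sql
-- How many chunks does the hypertable currently have?
SELECT count(*)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'stock_price_hyper';

-- Daily bars at the default 7-day interval give only ~7 rows per
-- ticker per chunk; a 1-year interval yields ~250 trading days each:
SELECT set_chunk_time_interval('stock_price_hyper', INTERVAL '1 year');
```

Note that set_chunk_time_interval only affects chunks created after the call; existing data would need to be re-inserted into a fresh hypertable to benefit.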
Thanks, I have updated the post and corrected it. Thanks again for your prompt and expert advice. BTW: can this be automatically detected and flagged to the user?Synovitis

Try storing the data in other time-series databases such as InfluxDB or VictoriaMetrics (I'm a core developer of VictoriaMetrics). They may provide better on-disk compression than TimescaleDB, according to benchmarks.

Apparatus answered 1/11, 2021 at 11:1 Comment(0)
