Three months ago our database was running on 16 GB of RAM, 4 vCPUs and 60 GB of disk (an m4.xlarge AWS RDS instance), and it was running fine until, from one moment to the next, queue depth started to climb, along with read and write latency. At that point we rebooted and upgraded the database to 32 GB of RAM and 8 vCPUs (an m4.2xlarge instance) and the problem disappeared.
Of course that's not sustainable over time, but we couldn't find the exact cause. We do know that our traffic, and hence the database load, is increasing, but it is certainly not doubling.
The thing is, after three months the problem appeared again today on the 32 GB / 8 vCPU instance, and again, as a panic button, we upgraded to 64 GB and 16 vCPUs. We also doubled the disk to 120 GB just to be safe. Traffic didn't suddenly spike at the moment of the incident, and I can assure you it is not twice as heavy as it was three months ago. During these three months, though, we analyzed the slow query log and improved some queries by adding indexes.
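For context, the slow query log on RDS PostgreSQL is driven by the log_min_duration_statement parameter, which is set in the DB parameter group rather than with ALTER SYSTEM; something like the following shows the current threshold:

```sql
-- Current slow-query threshold in milliseconds
-- (-1 means statement-duration logging is disabled).
SHOW log_min_duration_statement;
```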
These are the relevant tables and indexes we now have:
We are not database experts, but this doesn't look like traffic that is too heavy for our database (we are running a Ruby on Rails e-commerce app). We note that "Freeable memory" was always above 20 GB and free disk space was always above 15 GB. Yet we don't know why there was 100 MB of swap usage when RAM was free.
Here are the AWS monitoring charts for the last 6 hours; the incident is in the middle, then the scale-up around 18:15.
Comments:
– … total_time in pg_stat_statements. – Assyriology
– Set log_autovacuum_min_duration to a value other than -1 for an idea on vacuuming activity; see the number of seq scans in pg_stat_user_tables too. – Trinitroglycerin
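A sketch of the checks those comments suggest, assuming the pg_stat_statements extension is enabled (the total_time and mean_time columns are named total_exec_time and mean_exec_time on PostgreSQL 13 and later):

```sql
-- Most expensive statements by cumulative execution time
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Tables that are being sequentially scanned a lot
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;

-- Current autovacuum logging threshold; on RDS this is changed
-- in the DB parameter group (-1 disables this logging entirely).
SHOW log_autovacuum_min_duration;
```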