MongoDB: Disk I/O % utilization on Data Partition has gone

Recently I received this alert from MongoDB Atlas:

Disk I/O % utilization on Data Partition has gone above 70 on nvme2n1 

But I have no idea how to pinpoint the query, index, part of the code, or collection that is causing it.

How can I analyze this to find the root cause?

Singultus answered 19/7, 2019 at 21:12

Not an answer as such, but I have seen that many people face a similar problem. In my case the root cause was this: we had a collection of huge documents, each containing an array of data (in fact, a list of coordinates with some metadata), and we updated a document once for every coordinate we added, plus some additional operations.

As far as I know, MongoDB has to load the full document from storage even when only part of it is needed. When we accessed many different large documents, they did not fit into MongoDB's in-memory cache, so each access went to disk, which led to this alert. So we split each document into several smaller ones, and that fixed the issue. While we need frequent access to update or add data, we keep it in separate documents; once processing is done, we merge those documents back into one big document for "history check" purposes.
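A minimal sketch of that split-and-merge idea in plain Python, with no database calls. The field names (`track_id`, `seq`, `coordinates`) and the bucket size of 500 are assumptions for illustration, not what we actually used:

```python
from typing import Any, Dict, Iterable, Iterator, List


def split_into_buckets(
    track_id: str,
    coordinates: List[Dict[str, Any]],
    bucket_size: int = 500,  # hypothetical size; tune to your document sizes
) -> Iterator[Dict[str, Any]]:
    """Yield small documents, each holding at most bucket_size coordinates."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    for seq, start in enumerate(range(0, len(coordinates), bucket_size)):
        chunk = coordinates[start:start + bucket_size]
        yield {
            "track_id": track_id,   # links the bucket to its parent entity
            "seq": seq,             # preserves bucket order for the merge
            "count": len(chunk),
            "coordinates": chunk,
        }


def merge_buckets(track_id: str, buckets: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Rebuild one large document from its buckets once processing is done."""
    ordered = sorted(buckets, key=lambda b: b["seq"])
    merged = [point for bucket in ordered for point in bucket["coordinates"]]
    return {"track_id": track_id, "coordinates": merged}


if __name__ == "__main__":
    points = [{"lat": i, "lon": -i} for i in range(1200)]
    buckets = list(split_into_buckets("track-1", points))
    print([b["count"] for b in buckets])
    print(merge_buckets("track-1", buckets)["coordinates"] == points)
```

With documents shaped like this, adding a coordinate only touches the latest small bucket instead of rewriting one ever-growing document.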

Singultus answered 17/10, 2020 at 14:14

Update 08/24/2023

The "Disk Utilization %" metric has been retired, and the “Disk Queue Depth” and “Disk IOPS” metrics could be used to monitor the performance of the Disk.

At MongoDB, we are proponents of continuously improving your user experience. As part of this commitment, we have made an important adjustment to our database monitoring metrics; we have retired the "Disk Utilization %" metric from our monitoring charts and alerts.

Moving forward, we recommend that you use the “Disk Queue Depth” and “Disk IOPS” metrics as a more comprehensive and actionable alternative to the previous metric. Our team has carefully evaluated the metrics that best align with the real-world performance scenarios you encounter and the "Disk Queue Depth" metric provides a better measure of disk saturation and the “Disk IOPS” metric provides a better measure of disk utilization. By focusing on these metrics, you can gain more valuable insights into the performance of your system and identify potential bottlenecks.

More details are in How to Monitor MongoDB.


Recently, we got the alert "Disk I/O % utilization on Data Partition has gone above 90" on MongoDB Atlas after an instance reboot during maintenance. After a discussion with Atlas support, we now understand this metric clearly.


Understanding Disk I/O % Utilization

The definitions of Disk I/O % Utilization and Disk I/O % utilization on Data Partition, per the docs:

Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.

Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.

Two traps in iostat: %util and svctm

Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

This means if there was even just one I/O operation in progress for a given time period, the operating system would report 100% Disk Util, as the disk was in use 100% of that time.

Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.

Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so having a ~100% utilization is not unusual, because it just means that the disk is constantly processing at least one operation during the 100% interval.
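To make this concrete, here is a small sketch in plain Python that computes utilization the way the text above describes it: the share of a time window during which at least one request is in flight. The request timings are made-up numbers for illustration, not measurements from a real disk:

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (start_ms, end_ms) of one I/O request


def busy_fraction(requests: List[Request], window_ms: int) -> float:
    """Fraction of the window with at least one request in flight."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(requests):
        if cur_end is None or start > cur_end:
            # Gap before this request: close the previous busy stretch.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back request extends the busy stretch.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / window_ms


if __name__ == "__main__":
    window = 1000  # one second, in milliseconds

    # Ten 100 ms requests issued one after another.
    serial = [(i * 100, (i + 1) * 100) for i in range(10)]

    # The same time slots, but with 32 requests in flight at once.
    parallel = [(i * 100, (i + 1) * 100) for i in range(10) for _ in range(32)]

    for name, reqs in (("serial", serial), ("parallel", parallel)):
        print(name, "requests:", len(reqs), "util:", busy_fraction(reqs, window))
```

Both workloads report a utilization of 1.0, although the second completes 32 times as many requests in the same second. That is why the percentage alone says little about how close a parallel device is to its IOPS limit.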


Conclusion

We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU when diagnosing potential disk performance-related issues.


Possible actions to help resolve Disk Utilization % alerts

  • Optimize your queries
    • Create an Index to Support Read Operations
    • Pay attention to Query Selectivity and Covered Query
  • Use the Atlas Performance Advisor to view slow queries and suggested indexes.
  • Review Indexing Strategies for possible further indexing improvements.
  • Analyze Query Performance to review how your queries are using your indexes.
  • Use the Profiler to find and optimize queries with long execution times
  • Increase hardware resources, such as instance size and IOPS on Atlas
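As a rough sketch of checking how a query uses indexes, the helper below inspects a dictionary shaped like the output of `explain("executionStats")`. It assumes the classic plan layout (`queryPlanner.winningPlan` with nested `inputStage` / `inputStages`); newer server versions may nest the plan differently, and the sample dictionaries are hand-written, not real output:

```python
from typing import Any, Dict, List


def plan_stages(plan: Dict[str, Any]) -> List[str]:
    """Collect stage names from a winning plan, walking its child stages."""
    stages = [plan.get("stage", "UNKNOWN")]
    if "inputStage" in plan:
        stages.extend(plan_stages(plan["inputStage"]))
    for child in plan.get("inputStages", []):
        stages.extend(plan_stages(child))
    return stages


def summarize_explain(explain: Dict[str, Any]) -> Dict[str, Any]:
    """Flag full collection scans and covered queries in an explain result."""
    stages = plan_stages(explain["queryPlanner"]["winningPlan"])
    stats = explain.get("executionStats", {})
    docs_examined = stats.get("totalDocsExamined")
    return {
        "stages": stages,
        # COLLSCAN means no index was used and every document was read.
        "collection_scan": "COLLSCAN" in stages,
        # Covered: answered from the index alone, no documents fetched.
        "covered": "IXSCAN" in stages and "FETCH" not in stages and docs_examined == 0,
        "docs_examined": docs_examined,
        "returned": stats.get("nReturned"),
    }


if __name__ == "__main__":
    # Hand-written sample resembling an unindexed query (illustrative only).
    unindexed = {
        "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
        "executionStats": {"nReturned": 3, "totalKeysExamined": 0, "totalDocsExamined": 100000},
    }
    print(summarize_explain(unindexed))
```

A plan that reports `collection_scan` as true with many documents examined and few returned is a typical candidate for a new index.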

Source: Mongo Doc

Ledaledah answered 28/11, 2022 at 9:31

As the alert says, it is due to high utilization of the disk. The most common cause is unoptimized queries with a poor Query Targeting ratio, or simply reading or writing a lot of documents from or to the disk in a relatively short time window. To identify these queries, start with the Profiler and look for operations with a poor Examined:Returned ratio. You can also check the Performance Advisor to see whether it suggests any indexes for the inefficient operations. Since the Profiler's window is limited to the last 24 hours, you can also look through your logs for slow queries. Ultimately, the effort to solve this is three-pronged:

  • Optimize query execution with efficient indexing and filtering strategies
  • Keep the volume of data read or written in one go in check
  • Increase the IOPS of the cluster
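As an illustration of the Examined:Returned check, here is a small sketch that ranks entries shaped like documents in the database profiler's `system.profile` collection, using its `docsExamined`, `nreturned`, `ns`, `op` and `millis` fields. The threshold of 100 and the sample entries are assumptions made up for the example:

```python
from typing import Any, Dict, List, Tuple


def rank_by_targeting(
    entries: List[Dict[str, Any]],
    min_ratio: float = 100.0,  # hypothetical cutoff; pick what fits your workload
) -> List[Tuple[float, str, str, int]]:
    """Return (ratio, namespace, op, millis) for poorly targeted operations."""
    flagged = []
    for entry in entries:
        examined = entry.get("docsExamined", 0)
        returned = entry.get("nreturned", 0)
        # Treat "returned nothing" as 1 to avoid dividing by zero.
        ratio = examined / max(returned, 1)
        if ratio >= min_ratio:
            flagged.append((ratio, entry.get("ns", "?"), entry.get("op", "?"), entry.get("millis", 0)))
    # Worst targeting first.
    return sorted(flagged, key=lambda row: row[0], reverse=True)


if __name__ == "__main__":
    # Made-up profiler entries for illustration.
    sample = [
        {"ns": "app.tracks", "op": "query", "docsExamined": 250000, "nreturned": 5, "millis": 840},
        {"ns": "app.users", "op": "query", "docsExamined": 10, "nreturned": 10, "millis": 2},
        {"ns": "app.events", "op": "query", "docsExamined": 90000, "nreturned": 0, "millis": 310},
    ]
    for ratio, ns, op, millis in rank_by_targeting(sample):
        print(f"{ns} {op}: examined/returned = {ratio:.0f}, {millis} ms")
```

The operations at the top of the list are the ones to inspect first for missing indexes or overly broad filters.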

For the official reference, check out the documentation here.

Kinard answered 24/8, 2022 at 6:56
