Update 08/24/2023
The "Disk Utilization %" metric has been retired. Use the "Disk Queue Depth" and "Disk IOPS" metrics to monitor disk performance instead.
At MongoDB, we are proponents of continuously improving your user experience. As part of this commitment, we have made an important adjustment to our database monitoring metrics; we have retired the "Disk Utilization %" metric from our monitoring charts and alerts.
Moving forward, we recommend that you use the "Disk Queue Depth" and "Disk IOPS" metrics as a more comprehensive and actionable alternative to the previous metric. Our team has carefully evaluated the metrics that best align with the real-world performance scenarios you encounter: the "Disk Queue Depth" metric provides a better measure of disk saturation, and the "Disk IOPS" metric provides a better measure of disk utilization. By focusing on these metrics, you can gain more valuable insights into the performance of your system and identify potential bottlenecks.
For more details, see How to Monitor MongoDB.
Recently, we received the MongoDB Atlas alert "Disk I/O % utilization on Data Partition has gone above 90" after an instance rebooted for maintenance. After discussing it with Atlas support, we now have a clear understanding of this metric.
Understanding Disk I/O % Utilization
The definitions of Disk I/O % Utilization and Disk I/O % utilization on Data Partition, per the docs:
Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.
Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.
Two traps in iostat: %util and svctm
For devices serving requests serially, saturation occurs when %util is close to 100%. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
This means that if there was even just one I/O operation in progress throughout a given time period, the operating system would report 100% disk utilization, as the disk was in use 100% of that time.
Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.
Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so ~100% utilization is not unusual: it just means that the disk was processing at least one operation throughout the whole interval.
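To make the difference concrete, here is a minimal Python sketch of how utilization, average queue depth, and IOPS can be derived from two snapshots of the cumulative per-device counters that Linux exposes in /proc/diskstats. The `DiskSample` class and `disk_metrics` function are hypothetical names used only for illustration, and the sample numbers are made up; this is not how Atlas computes its charts.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Cumulative counters for one device, as listed in /proc/diskstats."""
    reads_completed: int   # field 1: reads completed
    writes_completed: int  # field 5: writes completed
    io_ticks_ms: int       # field 10: ms during which at least one I/O was in flight
    weighted_io_ms: int    # field 11: ms of in-flight time, summed over all requests


def disk_metrics(prev: DiskSample, curr: DiskSample, interval_ms: float) -> dict:
    """Derive iostat-style metrics from two samples taken interval_ms apart."""
    busy_ms = curr.io_ticks_ms - prev.io_ticks_ms
    weighted_ms = curr.weighted_io_ms - prev.weighted_io_ms
    ios = (curr.reads_completed - prev.reads_completed) + (
        curr.writes_completed - prev.writes_completed
    )
    return {
        # Share of the interval with at least one request in flight; capped at 100.
        "util_percent": min(100.0, 100.0 * busy_ms / interval_ms),
        # Average number of requests in flight during the interval.
        "avg_queue_depth": weighted_ms / interval_ms,
        # Completed read and write operations per second.
        "iops": ios / (interval_ms / 1000.0),
    }


start = DiskSample(0, 0, 0, 0)
# Made-up workload A: exactly one request in flight for the whole second.
light = disk_metrics(start, DiskSample(150, 50, 1000, 1000), 1000)
# Made-up workload B: 32 requests in flight for the whole second.
heavy = disk_metrics(start, DiskSample(4000, 2400, 1000, 32000), 1000)
print(light)
print(heavy)
```

Both made-up workloads report 100% utilization, yet their average queue depths (1 versus 32) and IOPS (200 versus 6400) differ widely, which is why utilization alone says little about how close a parallel device is to its limits.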
Conclusion
We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU, when diagnosing potential disk performance-related issues.
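For readers who want to check IOWait directly on a Linux host, the following is a rough sketch that computes the iowait share of CPU time from two snapshots of the aggregate "cpu" line in /proc/stat. The function names are hypothetical and the sample lines are made up; Atlas derives its System CPU chart on its own, so treat this only as an illustration of what the metric measures.

```python
def parse_cpu_line(line: str) -> list:
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in parts[1:]]


def iowait_percent(prev_line: str, curr_line: str) -> float:
    """Percentage of CPU time spent in iowait between two snapshots."""
    prev = parse_cpu_line(prev_line)
    curr = parse_cpu_line(curr_line)
    deltas = [c - p for p, c in zip(prev, curr)]
    # Columns: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice.
    # Guest time is already counted inside user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


# Made-up snapshots taken some seconds apart.
before = "cpu 100 0 50 800 50 0 0 0 0 0"
after = "cpu 200 0 100 1400 300 0 0 0 0 0"
print(iowait_percent(before, after))
```

A high iowait share alongside a deep disk queue is a stronger hint of a disk bottleneck than a high utilization percentage on its own.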
Possible actions to help resolve Disk Utilization % alerts
- Optimize your queries
- Create an Index to Support Read Operations
- Pay attention to Query Selectivity and Covered Query
- Use the Atlas Performance Advisor to view slow queries and suggested indexes.
- Review Indexing Strategies for possible further indexing improvements.
- Analyze Query Performance to review how your queries are using your indexes.
- Analyze Profile to optimize queries with long execution times.
- Increase hardware resources, such as instance size and IOPS on Atlas.
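As a starting point for the query-related actions above, here is a small Python sketch that summarizes the output of an `explain("executionStats")` call. It assumes the classic explain layout, with `queryPlanner.winningPlan` nested through `inputStage` / `inputStages`, plus the `nReturned`, `totalKeysExamined`, and `totalDocsExamined` fields in `executionStats`; newer server versions may nest the plan differently. The helper names and the sample documents are hypothetical.

```python
def plan_stages(plan: dict) -> list:
    """Return every stage name in a winning plan, walking nested input stages."""
    stages = [plan.get("stage")]
    if "inputStage" in plan:
        stages += plan_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):
        stages += plan_stages(child)
    return stages


def summarize_explain(explain: dict) -> dict:
    """Flag collection scans, covered queries, and poor selectivity."""
    stats = explain["executionStats"]
    stages = plan_stages(explain["queryPlanner"]["winningPlan"])
    returned = stats["nReturned"]
    docs = stats["totalDocsExamined"]
    keys = stats["totalKeysExamined"]
    return {
        # A COLLSCAN stage means no index was used for this query.
        "collection_scan": "COLLSCAN" in stages,
        # A covered query is answered from index keys without reading documents.
        "covered": docs == 0 and keys > 0,
        # Documents read per document returned; values near 1 indicate good selectivity.
        "docs_examined_per_returned": docs / returned if returned else float(docs),
    }


# Made-up explain output for an unindexed query.
unindexed = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {
        "nReturned": 10,
        "totalKeysExamined": 0,
        "totalDocsExamined": 100000,
    },
}
print(summarize_explain(unindexed))
```

A query that shows a collection scan or a large documents-examined-per-returned ratio reads far more from disk than it needs to, so it is a natural candidate for a new or better index.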
Source: Mongo Doc