Main causes for leader node to be at high CPU

D

2

7

I often see the leader node of our Redshift cluster peak at 100% CPU. I've identified one possible cause for this: many concurrent queries, and so many execution plans for the leader to calculate. This hypothesis seems very likely as the times we have the most queries coming seem to be the same as the ones we see the leader at 100%.

To fix this at best, we are wondering: are there other main possible causes for a high CPU on the leader?

(I'm precising the situation is when only the leader node is at high CPU and the workers seem fine)

Dingo answered 14/10, 2019 at 8:19 Comment(2)

Facing the same issue and unable to resolve this. The last option is to increase nodes in the cluster. – Marlenamarlene 30/7, 2020 at 10:0

I also faced it recently, where some queries were executing and then aborted from client and i was unable to terminated them. However, it's not clear to me whether leader node reached 100 during query execution or after abort. – Bolanger 24/8, 2020 at 9:12

C

6

The Redshift leader node is the same size and class of compute as the compute nodes. Typically this means that the leader is over provisioned for the role it plays but since its role is so important and impactful if things slows down, it is good that it is over provisioned. The leader needs to compile and optimized the queries and perform final steps in queries (final sort for example). It communicates with the session clients and handles all their requests. If the leader becomes overloaded all these activities slow down creating significant performance issues. It is not good that your leader is hitting 100% CPU often enough for you to notice. I bet the seems sluggish when this happens.

There are a number of ways I've seen "leader abuse" and it usually becomes a problem when bad patterns are copied between users. In no particular order:

Large data literals in queries (INSERT ... VALUES ...). This puts your data through the query compiler on the leader node. This is not what it is design to do and is very expensive for the leader. Use the COPY command to bring data into the cluster. (Just bad, don't do this)
Over use of COMMIT. A commits cause an update to the coherent state of the database and needs to run through the "commit queue" and creates work for the leader and the compute nodes. Having COMMITs every other statement can cause this queue to back up and work to generally back up.
Too many slots defined in the WLM. Redshift can typically only efficiently run between 1 and 2 dozen queries at once. Setting the total slot count very high (like 50) can lead to very inefficient operation and high CPU loads. Depending on workload this can show up for compute or occasionally the lead node.
Large data output through SELECT statements. SELECTs return data but when this data is many GBs in size the management of this data movement (and sorting) is done by the leader node. If large amounts of data need to be extracted from Redshift it should be done with an UNLOAD statement.
Overuse of large cursors. Cursors can be an important tool and needed for many BI tools but cursors are located on the leader and overuse can lead to reduced leader attention on other tasks.
Many / large UNLOADs with parallel off. UNLOADs generally come from the compute nodes straight to S3 but with "parallel off" all the data is routed to the leader node where it is combined (sorted) and sent to S3.

While none of the above of problems in and of themselves, it is when these are overused, used in ways they are not intended, or all at once that the leader starts to be impacted. It also comes down to what you intend to do with your cluster - if it support BI tools then you may have a lot of cursors but this load on the leader is part of the cluster's intent. Issue often arise when the cluster's intent is to all things to everybody.

If your workload for Redshift is leader function heavy and you are efficiently using the leader node (no large literals, using COPY and UNLOAD, etc.) then high leader workload is what you want. You're getting the most out of the critical resource. However, most use Redshift to perform analytics on large data which is the function of the compute nodes. A highly loaded leader can detract significantly from this mission and needs to be addressed.

Another way that leader can get stressed is when clusters are configured with many smaller node types instead of fewer bigger nodes. Since the leader is the same size as the compute nodes many smaller nodes means you have a small leader doing the work. Something to consider but I'd make sure you don't have unneeded leader node stressers before investing in a resize.

Consumptive answered 3/12, 2021 at 16:11 Comment(2)

About cursors, what is the alternative ? Unloading the query-result to S3, instead of returning it to the DB client? – Bib 26/1 at 17:33

When I said large cursors I meant large. So don’t worry about normal cursor activity. A large Redshift cursor holds more than your local workbench can manage. That said cursors are rarely used when extracting data and these large extractions should be unloaded to s3. Large cursor activity often comes about when building BI reports with database updates on. Or just exploring some data on the db. Turn off auto updates, add limits, or add a where clause. If the RS cluster is all yours it’s not a problem but don’t stress the system with nonproductive work when others need to get things done. – Consumptive 27/1 at 13:35

B

1

Whenever you execute some commands which require calculation on the leader node, whether for dispatching data, computing statistics, or aggregating results from the workers, like COPY, UNLOAD, VACUUM, ANALYZE, you'll see an increase in CPU usage. More information about this here: https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html

Browne answered 3/12, 2021 at 9:15 Comment(0)

Recommended topics

Hot tags