When ColdFusion is maxing out the CPU, how do I find out what it's chewing/choking on?
Asked Answered
S

5

10

I'm running CF 9.0.1 on Ubuntu on an "Medium" Amazon EC2 instance. CF has been seizing-up intermittently (several times per day...but notably not isolated to hours of peak usage). At such times, running top gets me this (or something similar):

PID     USER    PR  NI  VIRT    RES     SHR S   %CPU    %MEM    TIME+COMMAND
15855   wwwrun  20  0   1762m   730m    20m S   99.3    19.4    13:22.96 coldfusion9

So, it's obviously consuming most of the server resources. The following error has been showing up in my cfserver.log in the lead-up to each seize-up:

java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.

If I run /opt/coldfusion9/bin/coldfusion status, I get:

Pg/Sec  DB/Sec  CP/Sec  Reqs  Reqs  Reqs  AvgQ   AvgReq AvgDB  Bytes  Bytes 
Now Hi  Now Hi  Now Hi  Q'ed  Run'g TO'ed Time   Time   Time   In/Sec Out/Sec
0   0   0   0   -1  -1  150   25    0     0      -1352560      0      0

In the administrator, under Server Settings > Request Tuning, the setting for Maximum number of simultaneous Template requests is 25. So this makes sense so far. I could just increase the thread pool to cover these sort of load spikes. I could make it 200. (Which I did just now as a test.)

However, there's also this file /opt/coldfusion9/runtime/servers/coldfusion/SERVER-INF/jrun.xml. And some of the settings in there appear to conflict. For example, it reads:

<service class="jrunx.scheduler.SchedulerService" name="SchedulerService">
  <attribute name="bindToJNDI">true</attribute>
  <attribute name="activeHandlerThreads">25</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="minHandlerThreads">20</attribute>
  <attribute name="threadWaitTimeout">180</attribute>
  <attribute name="timeout">600</attribute>
</service>

Which a) has fewer active threads (what does this mean?), and b) has a max threads that exceed the simultaneous request limit set in the admin. So, I'm not sure. Are these independent configs that need to be made to match manually? Or is the jrun.xml file supposed to be written by the CF Admin when changes are made there? Hmm. But maybe this is different because presumably the CF Scheduler should only use a subset of all available threads, right?...so we'd always have some threads for real live users? We also have this in there:

<service class="jrun.servlet.http.WebService" name="WebService">
  <attribute name="port">8500</attribute>
  <attribute name="interface">*</attribute>
  <attribute name="deactivated">true</attribute>
  <attribute name="activeHandlerThreads">200</attribute>
  <attribute name="minHandlerThreads">1</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="mapCheck">0</attribute>
  <attribute name="threadWaitTimeout">300</attribute>
  <attribute name="backlog">500</attribute>
  <attribute name="timeout">300</attribute>
</service>

This appears to have changed when I changed the CF Admin setting...maybe...but it's the activeHandlerThreads that matches my new maximum simulataneous requests setting...rather than the maxHandlerThreads, which again exceeds it. Finally, we have this:

<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
  <attribute name="activeHandlerThreads">200</attribute>
  <attribute name="minHandlerThreads">1</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="mapCheck">0</attribute>
  <attribute name="threadWaitTimeout">300</attribute>
  <attribute name="backlog">500</attribute>
  <attribute name="deactivated">false</attribute>
  <attribute name="interface">*</attribute>
  <attribute name="port">51800</attribute>
  <attribute name="timeout">300</attribute>
  <attribute name="cacheRealPath">true</attribute>
</service>

So, I'm not certain which (if any) of these I should change and what exactly the relationship is between maximum requests and maximum threads. Also, since several of these list the maxHandlerThreads as 1000, I'm wondering if I should just set the maximum simultaneous requests to 1000. There must be some upper limit that depends on available server resources...but I'm not sure what it is and I don't really want to play around with it since it's a production environment.

I'm not sure if it pertains to this issue at all, but when I run a ps aux | grep coldfusion I get the following:

wwwrun   15853  0.0  0.0   8704    760    pts/1     S   20:22   0:00 /opt/coldfusion9/runtime/bin/coldfusion9 -jar jrun.jar -autorestart -start coldfusion
wwwrun   15855  5.4 18.2   1678552 701932 pts/1     Sl  20:22   1:38 /opt/coldfusion9/runtime/bin/coldfusion9 -jar jrun.jar -start coldfusion

There are always these two and never more than these two processes. So there does not appear to be a one-to-one relationship between processes and threads. I recall from an MX 6.1 install I maintained for many years that additional CF processes were visible in the process list. It seemed to me at the time like I had a process for each thread...so either I was wrong or something is quite different in version 9 since it's reporting 25 running requests and only showing these two processes. If a single process can have multiple threads in the background, then I'm given to wonder why I have two processes instead of one?...just curious.

So, anyway, I've been experimenting while composing this post. As noted above I adjusted the maximum simultaneous requests up to 200. I was hoping this would solve my problem, but CF just crashed again (rather it slogged down and requests started timing out...so effectively "crashed"). This time, top looked similar (still consuming more than 99% of the CPU), but CF status looked different:

Pg/Sec  DB/Sec  CP/Sec  Reqs  Reqs  Reqs  AvgQ   AvgReq AvgDB  Bytes  Bytes
Now Hi  Now Hi  Now Hi  Q'ed  Run'g TO'ed Time   Time   Time   In/Sec Out/Sec
0   0   0   0   -1  -1  0     150   0     0      0      0      0      0

Obviously, since I'd increased the maximum simultaneous requests, it was allowing more requests to run simultaneously...but it was still maxing out the server resources.

Further experiments (after restarting CF) showed me that the server became unusably slogged after about 30-35 "Reqs Run'g", with all additional requests headed for an inevitable timeout:

Pg/Sec  DB/Sec  CP/Sec  Reqs  Reqs  Reqs  AvgQ   AvgReq AvgDB  Bytes  Bytes
Now Hi  Now Hi  Now Hi  Q'ed  Run'g TO'ed Time   Time   Time   In/Sec Out/Sec
0   0   0   0   -1  -1  0     33    0     0      -492   0      0      0

So, it's clear that increasing the maximum simultaneous requests has not helped. I guess what it comes down to is this: What is it having such a hard time with? Where are these spikes coming from? Bursts of traffic? On what pages? What requests are running at any given time? I guess I simply need more information to continue troubleshooting. If there are long-running requests, or other issues, I'm not seeing it in the logs (although I do have that option checked in the admin). I need to know which requests exactly are those responsible for these spikes. Any help would be much appreciated. Thanks.

~Day

Sarnen answered 5/6, 2012 at 17:35 Comment(8)
I just wanted to note that I have posted this question at forums.adobe.com/message/4464997 but I haven't seen any responses yet, so I'm cross-posting here in the hope that this community is more helpful. Thanks.Sarnen
Do you run fusionreactor? It's kinda helpful to see which requests are running and where they are stuck. I have a similar problem with 100% CPU usage but it's mostly all the time ... So in my case FR doesn't help alot. Maybe you are luckier and these 100% spikes are just caused by imageresize with highestquality or some imageprocessing tags/functions like this.Aviary
Thanks for your response. I just went to take a look at FusionReactor and found it's not free (which would be fine, of course, if it would actually help). It looks like there's a fully functional demo, which is cool...but I've haven't been able to get it to install yet, so we'll see. Thanks again.Sarnen
I also wanted to post a follow-up here in fairness to the CF community over on the Adobe Forums, I did get a response from @jcaito, who also suggested FusionReactor (in addition to the CF native Server Monitoring, which, unfortunately, is not enabled in Standard Edition).Sarnen
Day - while you are futzing with fusion reactor I would suggest turning on page logging in the CF admin (log any page taking longer than N number seconds). It is a blunt tool that often provides an additional starting point in your code.Canned
Thanks @Mark. I've done that. Unfortunately, as you know, the request timeouts don't catch everything. It appears that what's happening is that one of the requests is locking something in the template cache, and then all the other requests are just waiting on that lock and are in limbo as it were. No request timeouts occur for any of the requests...Sarnen
...Then, when I finally go in and kill the one that had the lock, the others are released from limbo. At that point, if they are below the request timeout, they finish processing, and if they are over it (which most all of them usually are) then they simply proceed to timeout. But they won't timeout on their own and the requests just pile up until the active threads are used and the thread queue is full and everything seizes up. I posted my some stack traces so you can see what I'm talking about if you're interested.Sarnen
That has the earmarks of a database "block" - have you monitored your DB during the lockup? - I'll take a look at your stack traces after I get off the golf course :)Canned
D
5

I've had a number of 'high-cpu in production' type bugs and the way i've always dealt with them is this:

  1. Use jstack PID >> stack.log to dump 5 of stack traces, 5 seconds apart. Number of traces and timing not critical.

  2. Open the log in Samurai. You get a view of the threads at each time you did a dump. Threads processing your code start web- (for requests using the built-in server) and jrpp- for requests coming in through Apache/IIS.

  3. Read the history of each thread. You're looking for the stack being very similar in each dump. If a thread looks like it's handling the same request the whole time, the bits that vary near the top will point to where an infinite loop is happening.

Feel free to dump a stack trace somewhere online and point us to it.

The other technique I've used to understand what's going on is to modify apache's httpd.conf to log time taken: %D and record session id: %{jsessionid} which allows you to trace individual users in the run-up to hangs and to do some nice stats/graphs with the data (I use LogParser to crunch the numbers and output to CSV, followed by Excel to graph the data):

LogFormat "%h %l %u %t "%r" %>s %b %D %{jsessionid}" customAnalysis
CustomLog logs/analysis_log customAnalysis

One other technique I've just remembered is to enable CF Metrics, which will get you some measure of what the server was up to in the runup to a hang. I set this to log every 10 seconds and change the format to be CSV, so I can grep the metrics from the event log and then run them through Excel to graph server load in the runup to crashes.

Barny

Dalt answered 6/6, 2012 at 7:30 Comment(7)
Thanks Barny. I have not yet tried jStack and Samurai as you suggested. But I do have stack traces now from the FusionReactor demo. I don't see anywhere to enable the CF Metrics in the admin. Would I want to hand-edit the jrun.xml file to do that?Sarnen
I installed the JDK and that provided jStack. But Samurai appears (at first glance) to be GUI tool that I don't expect to work without XWindows. Am I wrong?Sarnen
You're right, it's a swing app that needs a GUI. I normally take the log files off the server and analyse them on a workstation locally. You can analyse the dumps by hand, but it's hard to keep track of the threads between successive dumps, which is where samurai really helps.Dalt
Looking at the stack dumps you've provided, it looks like jrpp-431, jrpp-310, jrpp-455, jrpp-494, jrpp-476, jrpp-477, jrpp-452 and jrpp-490 are all locked on line 28 of Application.cfm. Going by the stack traces, I'm guessing it's a custom tag. Line 28 of Application.cfm is the last line of your code on the trace though, after that its CF internals, and from the looks of it, it's trying to check whether the template is in the trusted cache, so you could try disabling that, although that's a bit drastic. Also, do you have any <cflocks in place around line 28?Dalt
You're right about the CF Metrics being a straight edit in jrun.xml. I'll see if I can dig out the config we use in our app, which gives a CSV or TSV format to the metrics.Dalt
Your stack traces look rather similar to this fixed defect: helpx.adobe.com/coldfusion/kb/… - if nothing obvious crops up from reviewing your code, I'd suggest raising a ticket with Adobe and seeing what their response is.Dalt
Thanks again, Barny. I'm not using any <cflock> tags. I think you (and others) have pretty much answered my primary question here with regard to how to find out which scripts are causing problems. The stack traces were a major breakthrough. So in the interest of keeping StackOverflow tidy, I've started a new question that better reflects my current stage of troubleshooting: How to resolve deadlock issue in ColdFusion 9: coldfusion.util.AbstractCache$LockSarnen
C
2

To find out what is maxing out your procs requires a lot of information that is "internal" to your system. It's hard to do it from outside looking at things like queued requests etc. One thing is certain - altering simultaneous request setting to a very high number is not going to do the trick :) All it will do is remove something that is designed to keep CF from gloaming onto too much processor.

Here's my list of things that max out CPU usage.

  • Client vars in the registry. I have a couple of excellent articles on why this problem can "crop up" out of no where. check out my blog (coldfusion muse).
  • intermitent problems connecting to the database. This is actually a little bit exacerbated in the cloud where networking and restrictions on bandwidth usage can sort of "throttle" connections to a DB. Most CF apps make heavy use of the DB. If something interferes with or slows connections the result is the number of connections rises til it hits that simultaneous number then requests start to queue - but this issue isn't necessarily related to CF itself - it's a symptom.
  • JVM issues - tuning your JVM to handle garbage collection, have enough New and Perm gen space etc is very important... although frankly the items above are often the first at fault.

There are many other reasons this can happen - among them (as you surmise) code issues that crop up as certain scripts are run. Long running requests, file uploads, heavy lifting scheduled tasks, index bot traffic generating traffic or spawning too many sessions.... the list goes on.

Hopefully something on this list I provided will strike you as possible. good luck.

(and yes FR or even the CF monitor are good tools to help you tease all this out :).

Canned answered 6/6, 2012 at 3:15 Comment(0)
S
0

A few weeks ago, I had a server that kept maxing out the CPU utilization on the JRun process, and would periodically restart it, only to have it ramp right back up to 100% within 24 hours. I fussed with JVM settings and the like too, until finally discovering, much to my embarrassed surprise, an infinite loop in my code. There was a WHILE loop with a condition that would never fail to be met. Oops.

So maybe you made a simple mistake in your code, and it's got nothing to do with the server config, fwiw.

+1 for the FusionReactor demo. That'll at least give you some clues.

Stepchild answered 5/6, 2012 at 21:38 Comment(1)
Yes. Thanks, Dave. I expect it is my code. But my problem is really that I don't know which bit of code it is. Thus I would like to see what it's working on when it's working so hard. If I can get FR demo working, that may well be enough to put me on the right track.Sarnen
M
0

You would have to increase active thread pool size. Please check the below links

http://www.talkingtree.com/blog/index.cfm/2005/11/28/Request-timed-out-waiting-for-an-available-thread-to-run

http://helpx.adobe.com/coldfusion/kb/coldfusion-mx-6-1-request.html

Happy coding!!!

Malinger answered 6/6, 2012 at 12:31 Comment(1)
Thanks for the answer Ravia, but I've tried that and my server can only handle as many threads as it can handle; increasing it won't help. The advice I've seen is to take the total CPU in MHz and divide it by 100...in my case this ends up being 34 point something, so I've now set my active thread pool (i.e. simultaneous requests) to 33 to make sure I'm within the limits of my server resources. My question here (while I took a while to get around to it) is really about the best way to identify which requests/threads are to blame for the deadlocks I'm experiencing.Sarnen
P
0

Have you tried using the ColdFusion Server monitor that comes with Coldfusion ?

Portative answered 30/12, 2013 at 0:59 Comment(2)
Requests for clarification should really be made as a comments, rather than an answer. (Though if you read the comments above, they mentioned Server monitor was not an option as they are running Standard Edition).Usually
Thanks anyway though. ;-)Sarnen

© 2022 - 2024 — McMap. All rights reserved.