RabbitMQ (beam.smp) and high CPU/memory load issue

I have a Debian box that has been running tasks with celery and rabbitmq for about a year. Recently I noticed tasks were not being processed, so I logged into the system and noticed that celery could not connect to rabbitmq. I restarted rabbitmq-server and, even though celery was no longer complaining, it was still not executing new tasks. The odd thing was that rabbitmq was devouring CPU and memory resources like crazy. Restarting the server would not solve the problem. After spending a couple of hours looking for a solution online to no avail, I decided to rebuild the server.

I rebuilt a new server with Debian 7.5, rabbitmq 2.8.4, and celery 3.1.13 (Cipater). For about an hour everything worked beautifully again, until celery started complaining again that it couldn't connect to rabbitmq:

[2014-08-06 05:17:21,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Trying again in 6.00 seconds...

I restarted rabbitmq with service rabbitmq-server start and hit the same issue again: rabbitmq started up and began swelling again, constantly pounding the CPU and slowly taking over all RAM and swap:

PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
21823 rabbitmq  20   0  908m 488m 3900 S 731.2 49.4   9:44.74 beam.smp

Here's the output of rabbitmqctl status:

Status of node 'rabbit@li370-61' ...
[{pid,21823},
 {running_applications,[{rabbit,"RabbitMQ","2.8.4"},
                        {os_mon,"CPO  CXC 138 46","2.2.9"},
                        {sasl,"SASL  CXC 138 11","2.2.1"},
                        {mnesia,"MNESIA  CXC 138 12","4.7"},
                        {stdlib,"ERTS  CXC 138 10","1.18.1"},
                        {kernel,"ERTS  CXC 138 10","2.15.1"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:30] [kernel-poll:true]\n"},
 {memory,[{total,489341272},
          {processes,462841967},
          {processes_used,462685207},
          {system,26499305},
          {atom,504409},
          {atom_used,473810},
          {binary,98752},
          {code,11874771},
          {ets,6695040}]},
 {vm_memory_high_watermark,0.3999999992280962},
 {vm_memory_limit,414559436},
 {disk_free_limit,1000000000},
 {disk_free,48346546176},
 {file_descriptors,[{total_limit,924},
                    {total_used,924},
                    {sockets_limit,829},
                    {sockets_used,3}]},
 {processes,[{limit,1048576},{used,1354}]},
 {run_queue,0},

Some entries from /var/log/rabbitmq:

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=INFO REPORT==== 8-Aug-2014::00:11:36 ===
vm_memory_high_watermark set. Memory used:422283840 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
memory resource limit alarm set on node 'rabbit@li370-61'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Aug-2014::00:11:43 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 8-Aug-2014::00:11:44 ===
vm_memory_high_watermark clear. Memory used:290424384 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:11:44 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'

=INFO REPORT==== 8-Aug-2014::00:11:59 ===
vm_memory_high_watermark set. Memory used:414584504 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:11:59 ===
memory resource limit alarm set on node 'rabbit@li370-61'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Aug-2014::00:12:00 ===
vm_memory_high_watermark clear. Memory used:411143496 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:12:00 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'

=INFO REPORT==== 8-Aug-2014::00:12:01 ===
vm_memory_high_watermark set. Memory used:415563120 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:12:01 ===
memory resource limit alarm set on node 'rabbit@li370-61'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Aug-2014::00:12:07 ===
Server startup complete; 0 plugins started.

=ERROR REPORT==== 8-Aug-2014::00:15:32 ===
** Generic server rabbit_disk_monitor terminating 
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
                               50000000,46946492416,100,10000,
                               #Ref<0.0.1.79456>,false}
** Reason for termination == 
** {unparseable,[]}

=INFO REPORT==== 8-Aug-2014::00:15:37 ===
Disk free limit set to 50MB

=ERROR REPORT==== 8-Aug-2014::00:16:03 ===
** Generic server rabbit_disk_monitor terminating 
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
                               50000000,46946426880,100,10000,
                               #Ref<0.0.1.80930>,false}
** Reason for termination == 
** {unparseable,[]}

=INFO REPORT==== 8-Aug-2014::00:16:05 ===
Disk free limit set to 50MB

UPDATE: It seems the problem was solved by installing the newest version of rabbitmq (3.3.4-1) from the rabbitmq.com repository. Originally I had the one from the Debian repositories (2.8.4) installed. So far rabbitmq-server is working smoothly. I will update this post if the issue comes back.

UPDATE: Unfortunately, after about 24 hours the issue reappeared: rabbitmq shut down, and restarting the process would make it consume resources until it shut down again within minutes.

Mingy answered 6/8, 2014 at 14:03

Comment from Norean: I experienced this issue today and it turned out to be that our (singleton) RabbitMQ instance had exhausted its burst allowance on EC2. Thought I'd mention it as it might help someone else who lands on this page.

Finally I found the solution. These posts helped me figure it out: RabbitMQ on EC2 Consuming Tons of CPU and https://serverfault.com/questions/337982/how-do-i-restart-rabbitmq-after-switching-machines

What happened was that rabbitmq was holding on to all the task results, which were never freed, to the point that it became overloaded. I cleared all the stale data in /var/lib/rabbitmq/mnesia/rabbit/, restarted rabbit, and it works fine now.
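
Roughly, the cleanup amounts to something like this (a sketch: it destroys all broker state, including queues, exchanges, and users, and the exact mnesia directory name depends on your node name, e.g. rabbit@li370-61 in the status output above):

sudo service rabbitmq-server stop
sudo rm -rf /var/lib/rabbitmq/mnesia/rabbit/
sudo service rabbitmq-server start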

My solution was to disable storing results altogether with CELERY_IGNORE_RESULT = True in the Celery config file to ensure this does not happen again.
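
In a Celery 3.x config file that amounts to something like this (a minimal sketch; the guest credentials are the defaults implied by the error message in the question):

# celeryconfig.py
BROKER_URL = 'amqp://guest:guest@127.0.0.1:5672//'  # default broker from the question

# Never store task results, so result queues cannot
# pile up in RabbitMQ again.
CELERY_IGNORE_RESULT = True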

Mingy answered 6/8, 2014 at 17:49

Comment from Quintie: How do I stop the high load issue for rabbitmq if I am not using celery? Any ideas?

You can also reset the node:

Warning: This clears all data and configuration! Use with caution.

sudo service rabbitmq-server start
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app

You might need to run these commands right after rebooting if your system is not responding.

Swindle answered 1/9, 2015 at 14:57

Comment from Oblong: rabbitmqctl reset destroys configuration and data! I have to assume you didn't know this, because it's otherwise pretty reckless to suggest it without a warning :(

You are running out of memory because of celery. I had a similar issue, and it was a problem with the queues used by the celery result backend.

You can check how many queues there are with the rabbitmqctl list_queues command; pay attention to whether that number grows without bound. If it does, check how you are using celery.

Regarding celery: if you are not collecting the results as asynchronous events, don't configure a backend to store those unused results.
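
For example, a Celery app with no result backend might look roughly like this (a sketch; the module and task names are made up for illustration):

from celery import Celery

# No 'backend' argument: results have nowhere to be stored,
# so no result queues accumulate in RabbitMQ.
app = Celery('tasks', broker='amqp://guest:guest@127.0.0.1:5672//')

@app.task(ignore_result=True)  # the per-task opt-out also works
def add(x, y):
    return x + y  # the return value is simply discarded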

Hedgehog answered 8/8, 2014 at 10:13

Comment from Mingy: Thanks, I figured it out yesterday. I added CELERY_IGNORE_RESULT = True to my celeryconfig.py, and from what you're saying, not specifying a backend will do the trick as well. A good approach, if results are needed, would be to use a database (mongo, redis) as the backend.

I experienced a similar issue, and it turned out to be due to some rogue RabbitMQ client applications. It seems that, due to an unhandled error, the rogue application was continuously trying to connect to the RabbitMQ broker. Once the client applications were restarted, everything went back to normal, since the applications stopped malfunctioning and therefore stopped trying to connect to RabbitMQ in an endless loop.
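
The client-side fix for this pattern is to reconnect with a backoff instead of hammering the broker in a tight loop. A minimal sketch of the idea, using pika purely for illustration (the original applications aren't named in this answer):

import time
import pika

def connect_with_backoff(host='127.0.0.1', max_delay=60):
    # Retry the broker connection with exponential backoff
    # instead of an endless tight loop.
    delay = 1
    while True:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except pika.exceptions.AMQPConnectionError:
            time.sleep(delay)                  # back off before retrying
            delay = min(delay * 2, max_delay)  # cap the delay

connection = connect_with_backoff()
channel = connection.channel()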

Heinie answered 23/9, 2015 at 11:21

Another possible cause: the management plugin.

I'm running RabbitMQ 3.8.1 with the management plugin enabled. On a 10-core server I had up to 1000% CPU usage with one queue, 3 idle consumers, and not a single message being sent.

When I disabled the management plugin by executing rabbitmq-plugins disable rabbitmq_management, the usage dropped to 0%, with occasional spikes of 200%.
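
In full, with a quick verification step (a sketch; sudo may or may not be required on your setup):

sudo rabbitmq-plugins disable rabbitmq_management
sudo rabbitmq-plugins list    # the management plugin should no longer show as enabled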

Waistband answered 8/11, 2019 at 10:02
