How to run a Celery worker with a Django app scalable by AWS Elastic Beanstalk?

How can I use Django with AWS Elastic Beanstalk so that Celery tasks also run, but only on the main node?

Heraclea answered 15/12, 2016 at 10:19 Comment(1)
If you want something lighter than celery, you can try the pypi.org/project/django-eb-sqs-worker package - it uses Amazon SQS for queueing tasks.Peat

This is how I set up Celery with Django on Elastic Beanstalk, with scalability working fine.

Please keep in mind that the 'leader_only' option for container_commands works only on environment rebuild or deployment of the app. If the service runs long enough, the leader node may be removed by Elastic Beanstalk. To deal with that, you may have to apply instance protection to your leader node. Check: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html#instance-protection-instance
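
For example, a minimal boto3 sketch that marks the leader instance as protected from scale-in (the instance ID and Auto Scaling group name are placeholders you have to look up for your environment):

import boto3

# Placeholders -- look these up for your environment (EC2 / Auto Scaling console).
LEADER_INSTANCE_ID = "i-0123456789abcdef0"
ASG_NAME = "awseb-e-xxxxxxxxxx-stack-AWSEBAutoScalingGroup-XXXXXXXXXXXX"

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Protect the leader instance so a scale-in event does not terminate it.
autoscaling.set_instance_protection(
    InstanceIds=[LEADER_INSTANCE_ID],
    AutoScalingGroupName=ASG_NAME,
    ProtectedFromScaleIn=True,
)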

Add a bash script for the celery worker and beat configuration.

Add file root_folder/.ebextensions/files/celery_configuration.txt:

#!/usr/bin/env bash

# Get django environment variables
celeryenv=`cat /opt/python/current/env | tr '\n' ',' | sed 's/export //g' | sed 's/$PATH/%(ENV_PATH)s/g' | sed 's/$PYTHONPATH//g' | sed 's/$LD_LIBRARY_PATH//g' | sed 's/%/%%/g'`
celeryenv=${celeryenv%?}

# Create celery configuration script
celeryconf="[program:celeryd-worker]
; Set full path to celery program if using virtualenv
command=/opt/python/run/venv/bin/celery worker -A django_app --loglevel=INFO

directory=/opt/python/current/app
user=nobody
numprocs=1
stdout_logfile=/var/log/celery-worker.log
stderr_logfile=/var/log/celery-worker.log
autostart=true
autorestart=true
startsecs=10

; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600

; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true

; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998

environment=$celeryenv

[program:celeryd-beat]
; Set full path to celery program if using virtualenv
command=/opt/python/run/venv/bin/celery beat -A django_app --loglevel=INFO --workdir=/tmp -S django --pidfile /tmp/celerybeat.pid

directory=/opt/python/current/app
user=nobody
numprocs=1
stdout_logfile=/var/log/celery-beat.log
stderr_logfile=/var/log/celery-beat.log
autostart=true
autorestart=true
startsecs=10

; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600

; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true

; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998

environment=$celeryenv"

# Create the celery supervisord conf script
echo "$celeryconf" | tee /opt/python/etc/celery.conf

# Add configuration script to supervisord conf (if not there already)
if ! grep -Fxq "[include]" /opt/python/etc/supervisord.conf
  then
  echo "[include]" | tee -a /opt/python/etc/supervisord.conf
  echo "files: celery.conf" | tee -a /opt/python/etc/supervisord.conf
fi

# Reread the supervisord config
supervisorctl -c /opt/python/etc/supervisord.conf reread

# Update supervisord in cache without restarting all services
supervisorctl -c /opt/python/etc/supervisord.conf update

# Start/Restart celeryd through supervisord
supervisorctl -c /opt/python/etc/supervisord.conf restart celeryd-beat
supervisorctl -c /opt/python/etc/supervisord.conf restart celeryd-worker

Take care of script execution during deployment, but only on the main node (leader_only: true). Add file root_folder/.ebextensions/02-python.config:

container_commands:
  04_celery_tasks:
    command: "cat .ebextensions/files/celery_configuration.txt > /opt/elasticbeanstalk/hooks/appdeploy/post/run_supervised_celeryd.sh && chmod 744 /opt/elasticbeanstalk/hooks/appdeploy/post/run_supervised_celeryd.sh"
    leader_only: true
  05_celery_tasks_run:
    command: "/opt/elasticbeanstalk/hooks/appdeploy/post/run_supervised_celeryd.sh"
    leader_only: true

File requirements.txt:

celery==4.0.0
django_celery_beat==1.0.1
django_celery_results==1.0.1
pycurl==7.43.0 --global-option="--with-nss"

Configure Celery for the Amazon SQS broker (get your desired endpoint from the list: http://docs.aws.amazon.com/general/latest/gr/rande.html) in root_folder/django_app/settings.py:

...
CELERY_RESULT_BACKEND = 'django-db'
CELERY_BROKER_URL = 'sqs://%s:%s@' % (aws_access_key_id, aws_secret_access_key)
# Use the region where your SQS queue lives; Ireland ("eu-west-1") is used here.
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "region": "eu-west-1",
    'queue_name_prefix': 'django_app-%s-' % os.environ.get('APP_ENV', 'dev'),
    'visibility_timeout': 360,
    'polling_interval': 1
}
...
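
Note that if the AWS credentials contain characters such as '/' or '+', they may need to be URL-quoted before being interpolated into the broker URL. A small sketch using kombu's safequote helper (the environment variable names are illustrative):

import os
from kombu.utils.url import safequote

# Quote the credentials so characters like '/' or '+' in the secret key
# do not break the sqs:// broker URL.
aws_access_key_id = safequote(os.environ["AWS_ACCESS_KEY_ID"])
aws_secret_access_key = safequote(os.environ["AWS_SECRET_ACCESS_KEY"])

CELERY_BROKER_URL = 'sqs://%s:%s@' % (aws_access_key_id, aws_secret_access_key)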

Celery configuration for the django_app Django app

Add file root_folder/django_app/celery.py:

from __future__ import absolute_import, unicode_literals
import os
from celery import Celery

# set the default Django settings module for the 'celery' program.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'django_app.settings')

app = Celery('django_app')

# Using a string here means the worker doesn't have to serialize
# the configuration object to child processes.
# - namespace='CELERY' means all celery-related configuration keys
#   should have a `CELERY_` prefix.
app.config_from_object('django.conf:settings', namespace='CELERY')

# Load task modules from all registered Django app configs.
app.autodiscover_tasks()

Modify file root_folder/django_app/__init__.py:

from __future__ import absolute_import, unicode_literals

# This will make sure the app is always imported when
# Django starts so that shared_task will use this app.
from django_app.celery import app as celery_app

__all__ = ['celery_app']
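
To verify the whole chain works after deployment, a throwaway task can help. This is not part of the setup itself, just a minimal sketch with illustrative names, e.g. in root_folder/django_app/tasks.py:

from __future__ import absolute_import, unicode_literals

from celery import shared_task


@shared_task
def add(x, y):
    # Trivial task used only to check that the worker consumes from the queue.
    return x + y

Calling add.delay(2, 3) from a python manage.py shell on the leader should then produce a log entry in /var/log/celery-worker.log.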


Heraclea answered 15/12, 2016 at 10:19 Comment(23)
Could you take a look at this question? I followed your example but got the following error: #43482040Haplo
@BorkoKovacev Thanks, I've updated the fix for the supervisorctl restart.Heraclea
@Heraclea this may be a little late, but I did exactly what you said above and it has worked fine so far, except that when I deploy I get this happening to me: #44269039. It seems to have something to do with pycurl not being found?Tallie
@Heraclea small edit - adding | sed 's/%/%%/g' to the celeryenv line helps prevent a problem a few people are running into with this config, see #41231989Punic
Is the LAMIA dictionary something you created, or does EB provide that?Anagoge
"If service works long enough, leader node may be removed by Elastic Beanstalk. " - > You can protect specific instances from being removed by the load balancer.Anagoge
I'm unsure how I connect to a specific SQS queue; I created an IAM user to obtain an extra access_key_id and secret_key, but your configuration file never mentions the queue name? Ah, nvm, it's farther below. The prefix equals the name of the SQS queue?Anagoge
Since we are using celery, which relies on kombu, the queue will be created automatically. The queue name 'queue_name_prefix': 'django_app-%s-' % os.environ.get('APP_ENV', 'dev') will give 'django_app-dev-celery'. It is worth remembering your app envs if you use them.Heraclea
I see, I created an SQS queue manually; I'll try with a different prefix then. Do you know which permissions the IAM user needs to create a queue?Anagoge
Thanks for mentioning about instance protection.Heraclea
I get: Could not connect to the endpoint URL: "sqs.sqs.us-west-1.amazonaws.com.amazonaws.com" - is it possible you have one 'sqs' too many?Anagoge
Use the AWS region endpoint identifier instead. I modified the post for that.Heraclea
Yup, tried that after the comment, seems to work. I am now getting botocore.exceptions.ClientError: An error occurred (InvalidClientTokenId) when calling the ListQueues operation: The security token included in the request is invalid. Any ideas?Anagoge
Let us continue this discussion in chat.Anagoge
"but only on main node (leader_only: true)" why only on main node, and the others node ?Hydrology
Since task initialization should happen once only.Heraclea
Why do you copy the run_supervised_celeryd.sh script to /opt/elasticbeanstalk/hooks/appdeploy/post/ AND also run it as a container_command? Just copying it to /opt/elasticbeanstalk/hooks/appdeploy/post/ should be enough. Running it as a container_command is unnecessary, and is also too early - the app isn't completely installed at that point.Horacehoracio
pycurl==7.43.0 --global-option="--with-nss" didn't work for me; I had to place another command at the top of celery.config containing PYCURL_SSL_LIBRARY=nss /opt/python/run/venv/bin/pip install pycurl==7.43.0 (and leave pycurl out of requirements.txt entirely)Deafanddumb
Does this create an additional EC2 instance? How do I scale it if needed?Peat
After following the instructions, this is the error I get; could anyone help? Thanks! #64674007Kovar
This is one solution to avoid getting your leader node removed: ajbrown.org/2017/02/10/…Minima
This solution will not work easily on Amazon Linux 2, which is now the only configuration available by default through AWS EB. Notably, supervisor is not installed by default.Unbeliever
This is for Amazon Linux 1. On Amazon Linux 2 it needs a rework of all the paths, and you can put the script directly in your project directory as .platform/hooks/predeploy/run_supervised_celeryd.sh (with the whole .txt content). With that you can skip the 02-python commands.Junia

This is how I extended the answer by @smentek to allow for multiple worker instances and a single beat instance. The same caveat applies: you have to protect your leader. (I don't have an automated solution for that yet.)

Please note that envvar updates to EB via the EB CLI or the web interface are not reflected by celery beat or the workers until an app server restart has taken place. This caught me off guard once.

A single celery_configuration.sh file outputs two scripts for supervisord. Note that celery-beat has autostart=false; otherwise you end up with many beats after an instance restart:

# get django environment variables
celeryenv=`cat /opt/python/current/env | tr '\n' ',' | sed 's/export //g' | sed 's/$PATH/%(ENV_PATH)s/g' | sed 's/$PYTHONPATH//g' | sed 's/$LD_LIBRARY_PATH//g' | sed 's/%/%%/g'`
celeryenv=${celeryenv%?}

# create celery beat config script
celerybeatconf="[program:celeryd-beat]
; Set full path to celery program if using virtualenv
command=/opt/python/run/venv/bin/celery beat -A lexvoco --loglevel=INFO --workdir=/tmp -S django --pidfile /tmp/celerybeat.pid

directory=/opt/python/current/app
user=nobody
numprocs=1
stdout_logfile=/var/log/celery-beat.log
stderr_logfile=/var/log/celery-beat.log
autostart=false
autorestart=true
startsecs=10

; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 10

; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true

; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=998

environment=$celeryenv"

# create celery worker config script
celeryworkerconf="[program:celeryd-worker]
; Set full path to celery program if using virtualenv
command=/opt/python/run/venv/bin/celery worker -A lexvoco --loglevel=INFO

directory=/opt/python/current/app
user=nobody
numprocs=1
stdout_logfile=/var/log/celery-worker.log
stderr_logfile=/var/log/celery-worker.log
autostart=true
autorestart=true
startsecs=10

; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600

; When resorting to send SIGKILL to the program to terminate it
; send SIGKILL to its whole process group instead,
; taking care of its children as well.
killasgroup=true

; if rabbitmq is supervised, set its priority higher
; so it starts first
priority=999

environment=$celeryenv"

# create files for the scripts
echo "$celerybeatconf" | tee /opt/python/etc/celerybeat.conf
echo "$celeryworkerconf" | tee /opt/python/etc/celeryworker.conf

# add configuration script to supervisord conf (if not there already)
if ! grep -Fxq "[include]" /opt/python/etc/supervisord.conf
  then
  echo "[include]" | tee -a /opt/python/etc/supervisord.conf
  echo "files: celerybeat.conf celeryworker.conf" | tee -a /opt/python/etc/supervisord.conf
fi

# reread the supervisord config
/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf reread
# update supervisord in cache without restarting all services
/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf update

Then in container_commands we only restart beat on leader:

container_commands:
  # create the celery configuration file
  01_create_celery_beat_configuration_file:
    command: "cat .ebextensions/files/celery_configuration.sh > /opt/elasticbeanstalk/hooks/appdeploy/post/run_supervised_celeryd.sh && chmod 744 /opt/elasticbeanstalk/hooks/appdeploy/post/run_supervised_celeryd.sh && sed -i 's/\r$//' /opt/elasticbeanstalk/hooks/appdeploy/post/run_supervised_celeryd.sh"
  # restart celery beat if leader
  02_start_celery_beat:
    command: "/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf restart celeryd-beat"
    leader_only: true
  # restart celery worker
  03_start_celery_worker:
    command: "/usr/local/bin/supervisorctl -c /opt/python/etc/supervisord.conf restart celeryd-worker"
Spontaneity answered 30/10, 2018 at 1:34 Comment(3)
I wonder how you deployed this on AWS. Did you make use of Worker Environments as shown here: docs.aws.amazon.com/elasticbeanstalk/latest/dg/…? What do you mean by a beat instance? Running beat just sends tasks to the queue, so I don't understand why one should have a separate machine for this. Do you have a separate EC2 instance running the web application?Sateia
How do you set this up? How do you make sure you won't have multiple instances of celery running when scaling occurs?Peat
Multiple instances of celery workers are fine. You only want one beat, though. Honestly I stopped using Elastic Beanstalk a while back and have moved everything to Kubernetes; I recommend you do the same. @GregHolst worker environments ended up being unsuitable for some reason.Spontaneity

If someone is following smentek's answer and getting the error:

05_celery_tasks_run: /usr/bin/env bash does not exist.

know that, if you are using Windows, your problem might be that the celery_configuration.txt file has Windows line endings (CRLF) when it should have Unix line endings (LF). If you are using Notepad++, open the file and click "Edit > EOL Conversion > Unix (LF)". Save, redeploy, and the error is gone. (The sed -i 's/\r$//' command used in the answer above achieves the same thing during deployment.)

Also, a couple of warnings for really-amateur people like me:

  • Be sure to include "django_celery_beat" and "django_celery_results" in "INSTALLED_APPS" in your settings.py file (see the snippet after this list).

  • To check celery errors, connect to your instance with "eb ssh" and then "tail -n 40 /var/log/celery-worker.log" and "tail -n 40 /var/log/celery-beat.log" (where "40" refers to the number of lines you want to read from the file, starting from the end).
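
For reference, the relevant settings.py fragment might look like this (a sketch; keep your existing apps and just append the two entries):

INSTALLED_APPS = [
    # ... your Django, third-party and project apps ...
    'django_celery_beat',
    'django_celery_results',
]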

Hope this helps someone, it would've saved me some hours!

Aglitter answered 25/1, 2019 at 9:10 Comment(0)

As stated in the accepted answer, the solution needs a lot of work outside of coding. Now there is a nice library that handles this:
https://github.com/ybrs/single-beat

You install the library and create a Redis server with ElastiCache.
Your Procfile can then look like this, with an environment variable targeting the cache server:

web: gunicorn --bind :8000 --workers 3 --threads 2 appname.wsgi:application
celery_beat: SINGLE_BEAT_REDIS_SERVER=$SINGLE_BEAT_REDIS single-beat celery -A proj beat -l INFO --scheduler django_celery_beat.schedulers:DatabaseScheduler
celery_worker: celery -A proj worker -l INFO -P solo
Kelton answered 18/10, 2023 at 8:49 Comment(0)

After spending three days searching for a solution, I finally found a straightforward method. Create a new directory in your project's main folder named .platform/hooks/postdeploy. Inside this directory, add a file without any extension called 'postdeploy' and paste the script below into it. Make sure to replace the placeholder with your actual project name.

With this setup, you can easily check the status of Celery by connecting to your instance via SSH and running the command 'systemctl status celery.service'. This approach simplifies the process of managing and monitoring Celery using the systemd service manager.

postdeploy

#!/usr/bin/env bash

echo "[Unit]
Description=Celery service for My App
After=network.target
StartLimitInterval=0

[Service]
Type=simple
Restart=always
RestartSec=30
User=root
WorkingDirectory=/var/app/current
ExecStart=$PYTHONPATH/celery -A your_project_folder.celery worker --loglevel=INFO
ExecReload=$PYTHONPATH/celery -A your_project_folder.celery worker --loglevel=INFO
EnvironmentFile=/opt/elasticbeanstalk/deployment/env

[Install]
WantedBy=multi-user.target
" | sudo tee /etc/systemd/system/celery.service

# Start celery service
sudo systemctl start celery.service

# Enable celery service to load on system start
sudo systemctl enable celery.service

Also, do not forget to set up a config file inside .ebextensions. Its content should look like this:

any_name.config

{
  "commands": {
    "01_install_ctl": {
      "command": "sudo yum install -y /usr/bin/systemctl"
    }
  }
}
Oppression answered 30/11, 2023 at 8:53 Comment(0)
