I have a script that runs for a few minutes as a job on the cluster in the production environment. Between 0 and 100 such jobs, one script per job, run at the same time on the cluster. Usually, there are either no jobs running or a burst of about 4-8 such jobs.
I want to prevent such jobs from running when I deploy a new version of the code into production.
How can I do that in the most maintainable way?
My initial idea was this:
- Use a semaphore file or a lock file that is created at the beginning of deployment and removed after the code has been deployed. A deploy runs for 0.5 to 10 min, depending on the complexity of the current deploy tasks.
- A separate cron job automatically deletes this lock file if it is older than, for example, 30 min, in case the deploy fails to remove it. For example, if the deploy is abruptly killed, this file should not hang around forever blocking the jobs.
- The production code checks for this lock file and waits until it is gone. So the jobs wait no more than 30 min.
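The three steps above can be sketched as follows. This is only a minimal illustration of my idea, not tested production code: the lock path, the function names, and the polling interval are all hypothetical, and the age check inside the job stands in for the separate cron cleanup.

```python
import os
import time
from pathlib import Path

LOCK_PATH = Path("/tmp/deploy.lock")  # hypothetical location
MAX_LOCK_AGE = 30 * 60                # seconds; mirrors the 30 min cleanup


def begin_deploy(lock_path=LOCK_PATH):
    """Create the lock file; raise FileExistsError if it already exists."""
    # O_CREAT | O_EXCL makes creation atomic: exactly one caller succeeds.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)


def end_deploy(lock_path=LOCK_PATH):
    """Remove the lock file; tolerate it having been cleaned up already."""
    Path(lock_path).unlink(missing_ok=True)


def lock_is_active(lock_path=LOCK_PATH, max_age=MAX_LOCK_AGE, now=None):
    """Return True if the lock file exists and is younger than max_age."""
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False
    now = time.time() if now is None else now
    return (now - mtime) < max_age


def wait_for_deploy(lock_path=LOCK_PATH, poll_seconds=5, max_age=MAX_LOCK_AGE):
    """Block a job until no active deploy lock remains."""
    while lock_is_active(lock_path, max_age):
        time.sleep(poll_seconds)
```

The deploy would call `begin_deploy()` first and `end_deploy()` last, and each job would call `wait_for_deploy()` before doing any work.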
I am concerned about possible race conditions, and I am considering a database-based solution instead. In the case of my application, I would use PostgreSQL. This database-based solution may be more complex to implement and maintain, but may be less prone to race conditions.
Perhaps there is a standard mechanism to achieve this in Capistrano, which is used for deployment of this code?
Notes:
When you answer the question, please compare the maintainability of your suggested solution with that of the simple solution I propose above (using lock files).
I am not sure if I need to take the race conditions into account. That is, is this system (with lock files) really race condition-prone? Or is it an unlikely possibility?
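To make the concern concrete, the window I am worried about is the gap between a job's check and the end of its run. Here is a minimal sketch of one unlucky ordering, replayed in a single process rather than with real jobs. The lock path and function names are hypothetical.

```python
import os
import tempfile


def job_may_start(lock_path):
    """The check a job performs before it starts: no lock file, go ahead."""
    return not os.path.exists(lock_path)


def simulate_interleaving():
    """Replay one unlucky ordering of events; return what each step saw."""
    lock_path = os.path.join(tempfile.mkdtemp(), "deploy.lock")  # hypothetical
    events = []

    # 1. The job checks for the lock just before the deploy begins.
    job_started = job_may_start(lock_path)
    events.append(("job checks lock, starts", job_started))

    # 2. The deploy begins a moment later and creates the lock file.
    open(lock_path, "w").close()
    events.append(("deploy creates lock", True))

    # 3. The job is still running (it takes minutes) while code is replaced.
    job_overlaps_deploy = job_started and os.path.exists(lock_path)
    events.append(("job runs during deploy", job_overlaps_deploy))
    return events
```

In this ordering the lock file only stops jobs that check after the deploy has begun. A job that checked just before, or that was already running, still overlaps the deploy.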
FAQs:
Is there a particular reason these jobs shouldn't run during deployment?
I have had cases where multiple jobs ran mid-deployment and failed because of that. Finding and rerunning such failed jobs is time-consuming. Delaying them during deployment carries only a small and rare performance hit, and is by far the most acceptable solution. For our system, maintainability is priority number one.