Prevent jobs on the cluster from running on production code during deployment

I have a script that runs for a few minutes as a job on the cluster in the production environment. Between 0 and 100 such jobs (one script per job) may be running at the same time on the cluster. Usually, there are no jobs running, or a burst of about 4-8 of them.

I want to prevent such jobs from running when I deploy a new version of the code into production.

How do I do that in a way that optimizes for maintainability?

My initial idea was this:

  1. Use a semaphore file or a lock file that is created at the beginning of deployment and then removed after the code has been deployed. A deploy runs for 0.5-10 min, depending on the complexity of the current deploy tasks.
  2. This lock file is also automatically deleted by a separate cron job if it is older than, for example, 30 minutes, in case the deploy fails to remove it. For example, if the deploy is rudely killed, this file should not hang around forever, blocking the jobs.
  3. The production code checks for this lock file and waits until it is gone, so the jobs wait no more than 30 min (see the sketch after this list).
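
A minimal Ruby sketch of step 3 (the lock file path and the polling interval are illustrative assumptions, not part of our actual setup):

# Wait for the deploy lock file to disappear before doing any work.
LOCK_FILE = "/var/app/shared/deploy.lock".freeze
MAX_WAIT  = 30 * 60 # seconds; mirrors the 30-minute cron cleanup

waited = 0
while File.exist?(LOCK_FILE)
  raise "deploy lock present for over 30 minutes" if waited >= MAX_WAIT
  sleep 10
  waited += 10
end
# ... proceed with the job's work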

I am concerned about possible race conditions, and am considering a database-based solution instead. In the case of my application, I would use PostgreSQL. This database-based solution may be more complex to implement and maintain, but may be less prone to race conditions.

Perhaps there is a standard mechanism to achieve this in Capistrano, which is used for deployment of this code?

Notes:

  • When you answer the question, please compare the maintainability of your suggested solution with that of the simple solution I propose above (using lock files).

  • I am not sure if I need to take race conditions into account. That is, is this system (with lock files) really race-condition-prone, or is that an unlikely possibility?

FAQs:

Is there a particular reason these jobs shouldn't run during deployment?

I have had cases where multiple jobs ran mid-deployment and failed because of it. Finding and rerunning such failed jobs is time-consuming. Delaying them during deployment carries only a small and rare performance hit, and is by far the most acceptable solution. For our system, maintainability is priority number one.

Janejanean answered 10/4 at 17:24 Comment(1)
Re: Lock file, keep in mind the cron may run every 30 minutes but you may begin your deployment at minute (e.g.) 29. Your lock file would only exist for a minute before the cron wipes it and you're back to your initial problem.Waites

Working with advisory locks at the simplest level, using psql.

Session 1:

select pg_advisory_lock(3752667);

Contents of advisory_lock_test.sql file:

select pg_advisory_lock(3752667);
select "VendorID" from nyc_taxi_pl limit 10;

Then session 2:

psql -d test -U postgres -p 5452 -f advisory_lock_test.sql 
Null display is "NULL".

Then in session 1:

select pg_advisory_unlock(3752667);

Back to session 2:

Null display is "NULL".
 pg_advisory_lock 
------------------
 
(1 row)

 VendorID 
----------
        1
        2
        2
        2
        2
        2
        1
        1
        2
        2
(10 rows)

Note:

The examples here use session-level locks. Transaction-level locks are also available using pg_advisory_xact_lock.

Basically, you create a lock in a session with pg_advisory_lock(3752667), where the key can be one 64-bit integer or two 32-bit integers. These could come from values that you fetch from a table, so that a number is scoped to a particular action, e.g. select pg_advisory_lock((select lock_number from a_lock where action = 'deploy'));. Then, in the second or other sessions, you try to acquire a lock on the same number. If the number is in use (not unlocked, or the original session has not exited), the other sessions will wait until the original session releases the lock. At that point the rest of their commands will run.

In your case, create a number, possibly stored in a table, that is associated with deploying. When you run a deployment, lock on that number before you run the changes, then unlock at the end of the deployment. If the deployment fails and the session ends, the lock will also be released. The other scripts would need to start by attempting to lock on that number too. If it is in use, they will wait until it is released, then run the rest of their commands and unlock. How manageable this is depends on the number of scripts you are dealing with and on getting people to stick to the process.
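
A minimal Ruby sketch of this workflow using the pg gem (the connection settings and the lock number are just the example values from above):

require "pg"

# Session-level advisory lock around the job's work; the lock is released
# explicitly below, or implicitly if the session dies.
conn = PG.connect(dbname: "test", user: "postgres", port: 5452)
conn.exec("select pg_advisory_lock(3752667)") # blocks while a deploy holds it
begin
  # ... run the job's commands here
ensure
  conn.exec("select pg_advisory_unlock(3752667)")
  conn.close
end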

Emalia answered 16/4 at 18:29 Comment(5)
Thank you very much for the answer with detailed examples! How can I release the lock if it gets left behind by a deployment that has failed to remove it? In the case of lock files, I proposed to do this: "This lock file is also automatically deleted by a separate cron job after, for example, 30 min, if the deploy fails to remove this file. For example, if the deploy is rudely killed, this file should not hang around forever blocking the jobs. That is, the file is deleted by a separate cron job if it is older than 30 minutes."Janejanean
1) If the deployment failed and the session ended, then the lock will be released. 2) You could still have a cron job as backup. In it, do the select pg_advisory_unlock(3752667); that will throw a warning if the lock has already been released. If you want to look before you leap, you can select from pg_locks: select * from pg_locks where locktype = 'advisory' and objid = 3752667;Emalia
You also mentioned transaction locks. When should I use transaction locks, w.r.t. the situation that I describe (deployment)? I found this, but no further guidance: "Locks can be taken at session level (so that they are held until released or the session ends) or at transaction level (so that they are held until the current transaction ends; there is no provision for manual release).", PostgreSQL: Documentation: 16: 9.27. System Administration FunctionsJanejanean
1) The lock query should be select * from pg_locks where locktype = 'advisory' and objid = 3752667 and database = (select oid from pg_database where datname = 'test'); as advisory locks are per database. 2) Read this Advisory Locks as it goes into more detail on transaction vs session locks. Transaction locks automatically release at end of transaction. My guess is you will be wanting session locks.Emalia
"cron job as backup. In it do the select pg_advisory_unlock(3752667);" you can't pg_advisory_unlock() that lock from cron. Outside the session that holds the lock it will always return false with a warning. You can check if it's in pg_locks, but to release it from outside the owner session, you'd have to run pg_terminate_backend(), killing off the whole session holding the lock. If it's transaction-level, you can pg_cancel_backend() to target the transaction. Or set up timeouts insteadRepay

I'll address the OS-based flag file approach:

  1. High maintainability and portability: You just need a modern unix distro that comes with standard flock, which means most of them. You can interact with it from the shell, but it's also exposed in most modern programming languages: Python fcntl.flock, Ruby File.flock, etc.

  2. No race conditions: if you planned to handle files directly, consider using flock instead (see the Ruby sketch after this list). Rather than creating/hunting/deleting files to raise/lower the flag, it uses an actual OS-level lock system on them. It offers built-in support for shared/exclusive locks, blocking, blocking with a timeout, or non-blocking acquisition, and command-wrapping.

    • If a worker finishes/dies without deleting the file, it still releases the lock on it, so its successor can still grab the lock and doesn't need to create a new file.
    • Deleting the flag file doesn't affect the worker that has it open with a lock: it keeps the lock and the file until it's done with it, and it's also free to ask for the file to be deleted afterwards, even if a delete is already pending.
    • If a worker is hanging and you delete its flag file during cleanup, future workers are free to create a file with the same name/identifier and get a lock on it, regardless of whether you kill the hanging worker, finalising its flag file deletion, or let it live on, holding on to its outdated image of the file.
  3. Cleanup: use at in tandem with cron. As @Waites pointed out, if you let cron handle deletion directly, its cycles can run out of sync with your deployments, resulting in premature flag cleanup. If you instead tell it only to spot the flag files and schedule them (via at) to be wiped 30 minutes after being spotted, you guarantee a >=30 min window for the flag owner to complete its work.

    If you're varying the advisory lock identifiers (file names in this scenario) a lot, you can have a periodic cleanup job that wipes the oldest unlocked files, to prevent clutter. If you re-use the identifiers, any leftover files from past workers will be re-used and eventually deleted by future workers.

  4. Apart from cleaning up the flag files, you might want to consider monitoring and cleaning up their associated workers, which might still be hanging around.
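
A minimal Ruby sketch of the shared/exclusive pattern described above (the file path is an illustrative assumption): the deploy takes an exclusive lock, while each job takes a shared lock, so jobs overlap freely with each other but never with a deploy.

LOCK_PATH = "/var/app/shared/deploy.flag".freeze

# Deploy side: exclusive lock held for the whole deploy.
File.open(LOCK_PATH, File::RDWR | File::CREAT) do |f|
  f.flock(File::LOCK_EX) # blocks until all shared (job) locks are released
  # ... run deploy tasks ...
end # the lock is released when the file is closed, even if the process dies

# Job side: shared lock.
File.open(LOCK_PATH, File::RDWR | File::CREAT) do |f|
  f.flock(File::LOCK_SH) # blocks only while a deploy holds the exclusive lock
  # ... run the job ...
end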

Repay answered 23/4 at 10:41 Comment(8)
@VonC Thank you for your answer below. I found it useful. See also: I wanted to give a bounty to an answer, but it was deleted by CommunityBot, and the author is temporarily suspended. What should I do? - Meta Stack Exchange and What is the evidence that this specific answer is AI-generated? - Meta Stack Overflow.Janejanean
@VonC I was going to award your (now deleted) answer below a +200 bounty, but, as I explained here, I was prevented from doing so. Obviously, I am disappointed. Since then, I already posted the next bounty (+400), but it will likely go to another answer (sorry!). Yet, I still would love your answer to be somehow restored/salvaged and then upvoted by both me and the rest of the community (it is a useful answer).Janejanean
@VonC Could you please consider posting a separate answer, based on the deleted one, but without the text that triggered the deletion? Perhaps it could be rephrased somehow? I know that I may be asking a lot, and all for the potential reward of only a single upvote (mine). But the answer is good, IMHO. And I think others will upvote it, perhaps not immediately. Thank you in advance for your time! And, needless to say, I am glad to see that you are back. :)Janejanean
@Janejanean Thank you and, again, apologies. For obvious reasons, I cannot repost an illegitimate answer, even reworded. But you should be able to see the deleted one, and post in your own words, based on your setup and experience, something which will help other readers.Vulgarism
@Vulgarism I also think your answer would be worth restoring. It showed a unique, custom approach, a clear and specific use case, plus it did a small comparison against the other answers. By the time you removed it, I think the human input outweighed or even displaced the AI-generated parts.Repay
@Repay Thank you for this feedback. Do understand I did not remove the answer. It was removed for me, by moderators who estimate (rightly) that any amount of AI-generated part outweighs everything else.Vulgarism
@Vulgarism Thank you for the quick reply! I can indeed see the deleted answer. Due to my lack of knowledge in this subject, I will refrain from posting the deleted answer in my own words. This is because some of the details are above my head. If someone else (someone more qualified) does it, I will welcome it, of course.Janejanean
@Vulgarism As for your recent post: no apologies in my case are needed! I am grateful to you both for the original answer, and for posting the explanations of what happened on Meta SO. In my opinion, explaining this was the right thing to do, and it also shows the community how to deal with past mistakes and move forward in a productive way.Janejanean

As I described in the comments, this can be implemented as a feature flag. A very popular solution for Rails is the flipper gem.

Pseudocode for your job could look like this (I don't know your job's actual code):

class ProcessingJob < ApplicationJob
  queue_as :default

  def perform
    # Skip the run entirely while the flag is disabled during a deploy.
    return unless Flipper.enabled?(:jobs_processing)

    # ... the job's actual code
  end
end

Flipper has an admin UI for enabling/disabling feature flags. So, for example, you can create a jobs_processing feature, enable it, and then turn it off at some point before a deploy.

While the feature flag is disabled during the deploy, you can be sure that no jobs are performed. After the deploy, you can enable it again.
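
For example (Flipper.disable and Flipper.enable are the standard Flipper calls for toggling a flag; whether you run them from the admin UI, a console, or a deploy hook is up to your setup):

# Just before the deploy starts:
Flipper.disable(:jobs_processing)

# ... deploy runs; ProcessingJob#perform returns early in the meantime ...

# After the deploy finishes:
Flipper.enable(:jobs_processing)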

You might find Flipper too complicated a solution for this one feature; in that case you can do something simpler without the gem: just create a table in the database and update your admin page for enabling/disabling features.

class CreateFeatureFlags < ActiveRecord::Migration[7.1]
  def change
    create_table :feature_flags do |t|
      t.string :name, null: false, index: { unique: true }
      t.timestamps
    end
  end
end

class FeatureFlag < ApplicationRecord
  # A flag counts as enabled when a row with its name exists.
  def self.enabled?(name)
    where(name: name).exists?
  end
end

FeatureFlag.enabled?('jobs_processing')
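
With that model, toggling the flag is just creating or deleting the row (these calls are illustrative):

FeatureFlag.find_or_create_by!(name: "jobs_processing") # enable
FeatureFlag.where(name: "jobs_processing").delete_all   # disable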
Glaswegian answered 11/4 at 12:47 Comment(6)
Thank you very much for an interesting, elegant and useful solution! I like it more than the solution I proposed in my question (lock file-based).Janejanean
I did have one question, which is important for our group. As my question says, "For our system, maintainability is priority number one.". Could you please compare maintainability of your suggested solution with that of the simple solution I proposed in the question (using lock files). For a non-expert, it might seem that the lock file-based solution may be easier to maintain. Lock file-based solution needs no external (potentially, later also non-supported) gems, no db or extra tables. Lock file-based solution can also be manipulated on the filesystem without having to access the database.Janejanean
Another question is if we need to take the race conditions into account? That is, is this system (with lock files) really race condition-prone? Or is it an unlikely possibility? If the latter is the case, maybe a simpler, lock file-based solution is more appropriate. Note that even though I am trying to critically evaluate both solutions, I personally prefer your solution, as I already mentioned. Thank you again!Janejanean
Race conditions do not matter here: in both cases (a feature flag or a lock file), when you disable jobs you just wait until the already-started jobs end, then run the deploy.Glaswegian
About the lock file: imagine you need to give a non-developer the ability to enable/disable feature flags (maybe you will have more flags in the future). What do you choose: setting a flag in a simple UI, or providing access to the server?Glaswegian
Thank you! Giving access to this to a non-developer is a good, convincing argument against the lock file solution.Janejanean