I'm using condor to do batches of ~100 processes for a few hours. After these processes are finished, I need to start the next batch of runs with results from the first batch, and this process is repeated tens of times. My condor pool is >100 cores, and I'd like to limit my condor cluster to only do 100 processes at a time, so that condor only starts working on the next process after one of the first processes is finished. Is this possible?
This sounds like you're just running a job that checkpoints: the next job reads in that checkpoint, does some work, and writes out a new checkpoint, and so on, 10 times over. I'm not sure why you need to break it up the way you have. Why not just have a wrapper script that looks for a checkpoint file and resumes from it, or starts from scratch if there isn't one?
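A minimal sketch of such a wrapper, assuming the state fits in a small JSON file. The file name checkpoint.json, the state layout, and run_stage are hypothetical placeholders for your real computation:

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"stage": 0, "result": None}


def save_state(state, path=CHECKPOINT):
    """Write to a temp file, then rename, so a killed job never leaves a partial checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run_stage(state):
    """Placeholder for the real work; here it only advances a counter."""
    return {"stage": state["stage"] + 1, "result": state["stage"] * 2}


def main(total_stages=10, path=CHECKPOINT):
    """Resume from the checkpoint if present and run the remaining stages."""
    state = load_state(path)
    while state["stage"] < total_stages:
        state = run_stage(state)
        save_state(state, path)
    return state


if __name__ == "__main__":
    main()
```

If the job is evicted and restarted, it picks up at the last completed stage instead of starting over.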
The other option is to use the Requirements expression in your submit file and list only the 100 machines or cores that your job is allowed to run on. Something like:
Requirements = (machine == "astrolab01") || (machine == "astrolab02") || (machine == "astrolab03")
will ensure you never run more than 3 jobs at once, as long as each machine has a single core. If the machines have multiple cores, you need to name individual slots instead, something like:
Requirements = (name == "slot1@astrolab01") || (name == "slot1@astrolab02")
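Typing out 100 clauses by hand is error-prone, so you could generate the line. This is only a sketch: the host names and the choice of two slots per host are made up, and on a real pool the slot names usually include the full domain name, so check them against your own pool first.

```python
def requirements_for(slots):
    """Join slot names into an expression that matches any one of them."""
    clauses = ['(name == "{}")'.format(s) for s in slots]
    return "Requirements = " + " || ".join(clauses)


# Hypothetical pool layout: 50 hosts, using the first two slots on each.
hosts = ["astrolab{:02d}".format(i) for i in range(1, 51)]
slots = ["slot{}@{}".format(n, h) for h in hosts for n in (1, 2)]

if __name__ == "__main__":
    # Paste the printed line into the submit file.
    print(requirements_for(slots))
```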
You need to use DAGMan (the Directed Acyclic Graph Manager). It lets you define parent-child relationships between jobs, so that the second job does not start until the results from the first job are available.
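DAGMan can also limit the number of active jobs for you: the -maxjobs option of condor_submit_dag (or the DAGMAN_MAX_JOBS_SUBMITTED configuration variable) caps how many node jobs DAGMan has submitted at any one time. Note that MAX_JOBS_RUNNING is a condor_schedd configuration setting that applies to the whole schedd, not a DAGMan option.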
This is all documented in section 2.10 of the 8.4 manual. You will likely need a script of some sort to build the DAG file, and a location to store interim results from the runs, because jobs cannot pass data directly from parent to child. Instead, the output of the first run is gathered into a work directory, and the next job takes its input from that work directory.
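As a rough sketch of such a script: the node names, the submit file name run.sub, and the batch and job macros below are all made up for illustration. Every job in a batch is made a child of every job in the previous batch, so batch N+1 cannot start until batch N has finished.

```python
def build_dag(n_batches, jobs_per_batch, submit_file="run.sub"):
    """Return DAG file text in which each batch depends on the whole previous batch."""
    lines = []
    previous = []
    for b in range(n_batches):
        current = ["b{}_j{}".format(b, j) for j in range(jobs_per_batch)]
        for j, node in enumerate(current):
            # One DAG node per job, all sharing the same (hypothetical) submit file.
            lines.append("JOB {} {}".format(node, submit_file))
            # Macros the submit file can reference as $(batch) and $(job).
            lines.append('VARS {} batch="{}" job="{}"'.format(node, b, j))
        if previous:
            lines.append(
                "PARENT {} CHILD {}".format(" ".join(previous), " ".join(current))
            )
        previous = current
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Small example; redirect the output to a file such as batches.dag.
    print(build_dag(2, 3), end="")
```

The resulting file is submitted with condor_submit_dag. Inside run.sub, the $(batch) macro could be used to pick the work directory that holds the previous batch's output.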