I'd like to find a user-space tool (preferably in Python; barring that, anything I could easily modify if it doesn't already do what I need) to replace a short script I've been using that does the two things below:
- polls fewer than 100 computers (Fedora 13, as it happens) for load, available memory, and whether it looks like someone is using them (see the sketch after this list)
- selects good hosts for jobs and runs those jobs over ssh. The jobs are arbitrary command-line programs that read from and write to a shared filesystem - typically image-processing scripts or similar: CPU-intensive, sometimes memory-intensive tasks.
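For concreteness, the polling step amounts to something like the sketch below. This is a minimal sketch, not my actual script: the helper names (poll, looks_free) and the thresholds are made up, and shelling out to ssh with BatchMode to read /proc/loadavg, free, and who is just one way to run the checks.

import subprocess

def poll(host, timeout=5):
    """Ask a host for its 1-minute load, free memory (MB), and login count.

    Returns a (load, free_mb, users) tuple, or None if the host is unreachable.
    """
    remote = "cat /proc/loadavg; free -m | awk 'NR==2 {print $4}'; who | wc -l"
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=%d" % timeout,
           host, remote]
    try:
        out = subprocess.check_output(cmd, timeout=timeout + 5).decode().splitlines()
    except (subprocess.SubprocessError, OSError):
        return None
    load = float(out[0].split()[0])   # 1-minute load average
    free_mb = int(out[1])             # free memory in MB
    users = int(out[2])               # number of login sessions
    return load, free_mb, users

def looks_free(host, max_load=0.5, min_free_mb=512):
    """Heuristic: a host is a candidate if load is low, memory is
    available, and nobody is logged in."""
    stats = poll(host)
    if stats is None:
        return False
    load, free_mb, users = stats
    return load < max_load and free_mb >= min_free_mb and users == 0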
For example, using my current script, I can run the following at a Python prompt:
>>> import hosts
>>> hosts.run_commands(['users']*5)
or from the command line:
% hosts.py "users" "users" "users" "users" "users"
to run the command users 5 times (after finding 5 computers on which the command could be run, by checking the CPU load and available memory of at least 5 computers from a config file). There should be no job server other than the script I just ran, and no worker daemons or processes on the computers that will run these commands.
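In other words, the script behaves roughly like the sketch below. Again, a hedged sketch rather than the real thing: HOSTS stands in for the config file, looks_free for the polling check sketched above, and dispatching each command with plain ssh from a thread pool is just one way to keep everything in the calling process.

import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["node01", "node02", "node03"]  # hypothetical names; really read from a config file

def looks_free(host):
    """Stand-in for the load/memory/user polling check sketched earlier."""
    cmd = ["ssh", "-o", "BatchMode=yes", host, "cat /proc/loadavg"]
    try:
        load = float(subprocess.check_output(cmd, timeout=10).decode().split()[0])
    except (subprocess.SubprocessError, OSError, ValueError):
        return False
    return load < 0.5

def run_commands(commands):
    """Pick one lightly loaded host per command and run each command over ssh.

    No job server and no daemons: everything runs from the calling process.
    """
    free_hosts = [h for h in HOSTS if looks_free(h)]
    if len(free_hosts) < len(commands):
        raise RuntimeError("not enough idle hosts for %d commands" % len(commands))
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [pool.submit(subprocess.run, ["ssh", host, cmd],
                               capture_output=True, text=True)
                   for host, cmd in zip(free_hosts, commands)]
        return [f.result() for f in futures]

With something like that, hosts.run_commands(['users']*5) above picks five idle hosts and runs users once on each.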
I'd additionally like to be able to track jobs, re-run them on failure, etc., but these are extra features (very standard in a real job scheduler) that I don't actually need.
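(For what it's worth, retry-on-failure would only be a thin loop over the run_commands sketch above; another hedged sketch, reusing that hypothetical function:)

def run_with_retries(commands, attempts=3):
    """Re-run any command whose ssh invocation exited non-zero, up to
    `attempts` times. Returns the list of commands that never succeeded.
    """
    pending = list(commands)
    for _ in range(attempts):
        if not pending:
            break
        results = run_commands(pending)
        # keep only the commands that failed for the next round
        pending = [cmd for cmd, res in zip(pending, results)
                   if res.returncode != 0]
    return pending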
I've found good ssh libraries for Python, things like classh and PuSSH, but they don't have the (very simple) load-balancing features I'd like. On the other side of what I want are Condor and Slurm, as suggested by crispamares before I clarified that I want something lighter. Those would be doing things the proper way, but from reading about them, it sounds like spinning them up in user space only when I need them would range from annoying to impossible. This isn't a dedicated cluster, and I don't have root access on these hosts.
If I can't find something else, I'm currently planning to use a wrapper around classh, with some basic polling of the computers whenever I need to know how busy they are.