How to overcome server load issues when running multiple cron jobs simultaneously?
I've got a site that displays data from a game server. The game has different "domains" (which are actually just separate servers) that the users play on.

Right now, I've got 14 cron jobs, staggered at different minutes of the hour, each running every 6 hours. All 14 files that are run are pretty much the same, and each takes around 75 minutes (an hour and 15 minutes) to complete its run.

I had thought about just using one file run from cron and looping through each server, but that single file would then run for 18 hours or so. My current VPS is set to only allow 1 vCPU, so I'm trying to get this done while staying within my allotted server load.

Seeing that the site needs updated data available every 6 hours, an 18-hour run isn't doable.

I started looking into message queues and passing some information to a background process that will perform the work in question. I started off trying to use resque and php-resque, but my background worker died as soon as it was started. So, I moved on to ZeroMQ, which seems to be more what I need, anyway.

I've set up ZMQ via Composer, and everything during the installation went fine. In my worker script (which will be called by cron every 6 hours), I've got:

// $filesToUse here: an array of user-submitted files (populated earlier, not shown)
$dataContext = new ZMQContext();
$dataDispatch = new ZMQSocket($dataContext, ZMQ::SOCKET_PUSH);
$dataDispatch->bind("tcp://*:50557");

$dataDispatch->send(0);

foreach($filesToUse as $filePath){
    $dataDispatch->send($filePath);
    sleep(1);
}

$filesToUse = array(); // rebuild: pick one random .json from each mapBlocks/* directory
$blockDirs = array_filter(glob('mapBlocks/*'), 'is_dir');
foreach($blockDirs as $k => $blockDir){
    $files = glob($rootPath.$blockDir.'/*.json');
    $key = array_rand($files);
    $filesToUse[] = $files[$key];
}

$mapContext = new ZMQContext();
$mapDispatch = new ZMQSocket($mapContext, ZMQ::SOCKET_PUSH);
$mapDispatch->bind("tcp://*:50558");

$mapDispatch->send(0);

foreach($filesToUse as $blockPath){
    $mapDispatch->send($blockPath);
    sleep(1);
}

$filesToUse is an array of files submitted by users that contain information to be used in querying the game server. As you can see, I'm looping through the array and sending each file path to the ZeroMQ listener script, which contains:

$context = new ZMQContext();

$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver->connect("tcp://127.0.0.1:50557"); // connect() needs a concrete address; the * wildcard is only valid for bind()

$sender = new ZMQSocket($context, ZMQ::SOCKET_PUSH);
$sender->connect("tcp://127.0.0.1:50559");

while(true){
    $file = $receiver->recv();

    $startTime = time(); // time each work unit separately, not once per process

    // -------------------------------------------------- do all work here
    // ... ~75 min of data processing for each recv()-ed work unit
    // ----------------------------------------------------------------------

    $endTime = time();
    $totalTime = $endTime - $startTime;
    $sender->send('Processing of domain '.listener::$domain.' completed on '.date('M-j-y', $endTime).' in '.$totalTime.' seconds.');
}

Then, in the final listener file:

$context = new ZMQContext();
$receiver = new ZMQSocket($context, ZMQ::SOCKET_PULL);
$receiver->bind("tcp://*:50559");

while(true){
    $log = fopen($rootPath.'logs/sink_'.date('F-jS-Y_h-i-A').'.txt', 'a');
    fwrite($log, $receiver->recv());
    fclose($log);
}

When the worker script is run from cron, I get no confirmation text in my log.

Q1) Is this the most efficient way to do what I'm trying to do?
Q2) Am I trying to use or implement ZeroMQ incorrectly here?

And, as it would seem, using cron to call 14 files simultaneously is causing the load to far exceed the allotment. I know I could probably just set the jobs to run at different times throughout the day, but if at all possible, I would like to keep all updates on the same schedule.


UPDATE:

I have since gone ahead and upgraded my VPS to 2 CPU cores, so the load aspect of the question isn't really all that relevant anymore.

The code above has also been changed to the current setup.

After the code update, I am now getting an email from cron with the error:

Fatal error: Uncaught exception 'ZMQSocketException' with message 'Failed to bind the ZMQ: Address already in use'

Discant answered 27/5, 2016 at 0:52 Comment(4)
1 CPU or 1 core? I would simply upgrade that. – Morsel
You have jobs which require CPU cycles, and you will have to spend those cycles anyway. The question is: how much of those 75 minutes is I/O and how much is actual computing? And what about optimizing the update process? We really have too little info to say which approach is better, and I am afraid the answer will be a short one here... – Farceur
a) What is your quantitative metric for comparing { a more | a less | the most } efficient way to do something (a minimum scope of refactoring? a minimum software-design cost? a shortest time to RTO? minimum externally spent expenses?) b) What is your vCPU utilisation graph (collect graphs from the VPS-management console for each of your vCPU cores and post this, plus all the following, as an update here)? c) What is your vHDD utilisation graph? d) What is your vLAN utilisation graph? e) How many threads can your VPS run (vCPU/HT)? – Barragan
Would you kindly disambiguate which process does a .connect() to the localhost-exposed port 50558? Could you also update the figures / post 24/7 graphs (this still makes sense for the very week in which you ran both the 1-vCPU and the 2-vCPU modus operandi) / raw print-screens with details about resource-utilisation patterns? Thanks for reconsidering the importance of quantitative facts for your MCVE issue. – Barragan
Running your scripts through cron or through ZeroMQ will make absolutely no difference in how much CPU you need. The only difference between the two is that a cron job starts your script at fixed intervals, while a message queue starts your script based on some user action.

At the end of the day, you need more available threads to run your scripts. But before you go down that path, take a look at the scripts themselves: maybe there's a more efficient way of writing them so that they don't use as many resources. And have you looked at your CPU utilisation rate? Most web-hosting services have built-in metrics that you can pull up through their console; you might not be using as many resources as you think.
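For a quick look without the hosting console, the load averages on a Linux VPS can be read straight from procfs; the figures are relative to the number of vCPUs, so 1.0 already saturates a 1-vCPU box (the sample values in the comment are illustrative):

```shell
# 1-, 5- and 15-minute load averages, plus runnable/total task counts
cat /proc/loadavg
# e.g. 0.42 0.61 0.70 1/123 4567
# a sustained 1-minute load above the vCPU count means the CPU is oversubscribed
```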

The fact that one file looping through all the servers would take much longer than the cumulative time of running the files separately suggests that your scripts aren't parallelised properly: a single instance of your script is not using all the available resources, so you only see speed gains when multiple instances run at once.
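If the 14 per-domain scripts have to stay on one shared schedule, a pragmatic middle ground (a sketch; the paths are illustrative) is a single cron entry that hands all 14 scripts to xargs with a bounded worker count, so only N instances compete for the vCPU at any moment:

```shell
#!/bin/sh
# run_domains.sh -- launch every per-domain update script,
# but never more than 2 at a time (match -P to your vCPU count)
ls /var/www/cron/domain_*.php | xargs -n 1 -P 2 php

# crontab entry (every 6 hours):
# 0 */6 * * * /var/www/cron/run_domains.sh
```

With -P 2, jobs start as workers free up, so the total wall-clock time stays near the cumulative time while the instantaneous load is capped.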

Diva answered 29/5, 2016 at 2:22 Comment(0)
Yes, this does not seem to be a state-of-the-art use of ZeroMQ's powers.

The good news is it is possible to redesign the solution to become closer to best practices.

MOTIVATION

ZeroMQ is without question a very powerful and very smart toolbox for designing, controlling, and managing the performance and events of scalable, lightweight distributed processing systems. Many resources have been published on the best engineering practices for designing ZeroMQ systems.

Lightweight, however, does not mean a silver bullet or a perpetuum mobile with zero overhead.

ZeroMQ still consumes additional resources, and on target ecosystems with a minimalistic resource footprint (a hidden hyper-threading limitation in some VPS vCPU / vCPU-core emulations, to give just one vivid example), one may realise that no benefit remains once the threading-concurrency costs of the additional ZeroMQ I/O threads (1+ per Context() instance) are accounted for.

Exception handling?
No, rather exception prevention and blocking avoidance are the alpha and omega of a production-grade, non-stop, distributed processing system. The experience may taste bitter at first, but you will learn a lot about software-design practices with ZeroMQ. One such lesson is resource management and graceful termination: each process is responsible for releasing all resources it has allocated on its own, so a port blocked by a respective .bind() has to be systematically freed and released in a clean manner.

( Plus, one will soon realise that a port release is not instant, due to operating-system overheads outside of one's code control, so do not rely on such a port being immediately ready for re-use ( one may find a lot of posts here on ports blocked this very way ) ).
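A minimal sketch of such a clean release with the pecl zmq extension, using the same port as the question's code (ZMQSocket::unbind() and ZMQ::SOCKOPT_LINGER are part of the binding's API, but check availability against your installed extension version; this is an illustration, not the asker's code):

```php
<?php
$ctx  = new ZMQContext();
$push = new ZMQSocket($ctx, ZMQ::SOCKET_PUSH);

// do not let unsent messages block process exit indefinitely
$push->setSockOpt(ZMQ::SOCKOPT_LINGER, 0);

$push->bind("tcp://*:50557");

// release the port explicitly on any exit path, normal or fatal,
// so the next cron run does not hit "Address already in use"
register_shutdown_function(function() use ($push) {
    $push->unbind("tcp://*:50557");
});
```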


Facts on resources utilisation envelopes [FIRST]:

While quantitative facts on the processing-performance / resource-utilisation envelopes are for the moment still missing, the pictures below may help one identify the key importance of such knowledge.

[ Figure: vCPU-workload envelope once markets started the next 24/5 on Sunday 22:00 GMT+0000 ]

[ Figure: still +55% of CPU power available; vCPU-workload and other resource-usage envelopes ]


Cron, queues & a relative priorities setup hack [NEXT]:

Without much detail on whether the 75-minute WORK-UNIT suffers from CPU-bound or I/O-bound issues, the system configuration may moderate the cron jobs' relative priorities, so that system performance is "focused" on the primary jobs during peak hours. There is a chance to create a separate queue with an adapted nice priority. A good trick on this was presented by @Pederabo:

cron usually runs with nice 2, but this is controlled by the queuedefs file. Queues a, b, and c are for at, batch, and cron.
- you should be able to add a line for a queue, say Z, which defines a special queue, and set its nice value to zero.
- you should be able to run the script from the console with at -q Z ...
- if that works well, put the at command in crontab.
The at command itself will run with cron's default priority, but it only takes a few seconds. Then the job it creates will run with whatever you set in the queuedefs file for queue Z.
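On System V-style cron implementations (e.g. Solaris, AIX) that honour a queuedefs file, the trick above might look roughly like this; the file location and PHP script path are illustrative, and Linux at treats queue letters differently (later letters simply mean higher niceness), so verify against your platform's man pages:

```shell
# queuedefs entry (location varies by platform, e.g. /etc/cron.d/queuedefs)
# format: <queue>.<njobs>j<nice>n<wait>w
Z.2j0n        # queue Z: at most 2 concurrent jobs, nice value 0

# submit the heavy worker into queue Z instead of running it from plain cron:
echo "php /var/www/cron/worker.php" | at -q Z now
```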


Avoid unnecessary overheads [ALWAYS]:

There is always a reason not to waste CPU-clks, the more so in minimalistic system designs. Using the tcp:// transport class on the very same localhost may be acceptable as a PoC during a prototyping phase, but never for going into a 24/7 production phase. Try avoiding all services one never uses: why climb up to L3, consuming even more operating-system resources (ZeroMQ is not zero-copy in this phase, so double allocations appear here), just to deliver on the same localhost? The ipc:// and inproc:// transport classes are much better suited to this modus operandi (also see below on going truly distributed).
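For comparison, here is the question's PUSH/PULL pair sketched over ipc:// instead of tcp:// (the endpoint path is illustrative; both processes must be able to reach it on the local filesystem, and the pecl zmq extension is assumed):

```php
<?php
// producer side (replaces bind("tcp://*:50557"))
$ctx  = new ZMQContext();
$push = new ZMQSocket($ctx, ZMQ::SOCKET_PUSH);
$push->bind("ipc:///tmp/dataDispatch.ipc");   // no TCP/L3 stack involved

// consumer side (replaces connect("tcp://127.0.0.1:50557"))
$ctx2 = new ZMQContext();
$pull = new ZMQSocket($ctx2, ZMQ::SOCKET_PULL);
$pull->connect("ipc:///tmp/dataDispatch.ipc");
```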


The main issue (the processing design, using ZeroMQ tools)

Based on the given high-level description of the intention, there seems to be a way to avoid the cron mechanism entirely and let the whole processing pipeline / distribution / collection become a non-stop ZeroMQ distributed processing system, where you can instead build on an autonomous CLI interface (an r/KBD terminal for communicating ad hoc with the non-stop processing system), so as to:

  • remove one's dependency on operating system's features / limitations
  • reduce overall overheads associated with concurrent system-level process maintenance
  • share a single, central, Context() ( so paying a minimum cost of just one additional I/O-thread ), as the processing seems not to be messaging-intensive / ultra-low-latency sensitive

Your ZeroMQ ecosystem may also help you build a right-scaling or even adaptive-scaling feature, as scalable distributed processing does not limit you to just your VPS localhost device ( in case your VPS hyper-threading limits do not allow such colocated processing to meet the performance envelope of your 24/7 flow of WORK-UNITs ).

All that, just by modifying the adequate transport class from ipc:// to tcp://, lets one distribute tasks ( WORK-UNITs ) literally around the globe, to whatever processing node you may "plug in" to increase your processing power .. all without a single SLOC of change in your source code.


That is worth one's time to re-decide the design strategy, isn't it?

Barragan answered 29/5, 2016 at 17:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.