I'm doing GIS-based data analysis in which I calculate nationwide prediction maps (e.g. weather maps). Because the target area is very large (a whole country), I use supercomputers (via Slurm) and parallelization to calculate the maps. That is, I split the prediction map into multiple pieces, each calculated in its own process (the processes are embarrassingly parallel), and within each process the piece is further split into smaller tiles so that multiple CPU cores can work on it.
I use Python's joblib library to take advantage of the multiple cores at my disposal, and most of the time everything works smoothly. Schematically, each process does something like the sketch below (`compute_tile` and `compute_map_piece` are placeholder names, not my real functions; the actual per-tile computation is of course much heavier):
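```python
from joblib import Parallel, delayed

def compute_tile(tile):
    # Placeholder for the real prediction computation on one small tile;
    # here it just echoes its input.
    return tile

def compute_map_piece(tiles):
    # Fan the tiles of one map piece out over all cores of the node.
    # This Parallel call is the one that appears in the traceback below.
    sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
        delayed(compute_tile)(t) for t in tiles
    )
    return sub_rasters
```

But sometimes, randomly in roughly 1.5% of runs, I get the following error: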
Traceback (most recent call last):
  File "main.py", line 557, in <module>
    sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGBUS(-7)}
What causes this problem, any ideas? And how can I make sure it does not happen? It is irritating because, for example, if I have 200 map pieces being calculated and 197 succeed while 3 fail with this error, I then need to calculate those 3 pieces again.
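If nothing else works, I suppose I could wrap each piece in a retry along these lines (a minimal sketch; `compute_map_piece` is the placeholder from the sketch above, and the import path for the exception is the one shown in the traceback), but I would much rather understand and prevent the crash itself than paper over it:

```python
from joblib.externals.loky.process_executor import TerminatedWorkerError

def compute_piece_with_retry(tiles, max_retries=3):
    # Re-run a map piece whose worker pool died with TerminatedWorkerError.
    # This only works around the crash; it does not explain the SIGBUS.
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return compute_map_piece(tiles)
        except TerminatedWorkerError as err:
            last_error = err
            print(f"Piece failed on attempt {attempt}/{max_retries}, retrying")
    raise last_error
```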