I am using the ipyparallel module to speed up an all by all list comparison but I am having issues with huge memory consumption.
Here is a simplified version of the script that I am running:
From a SLURM script start the cluster and run the python script
ipcluster start -n 20 --cluster-id="cluster-id-dummy" &
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy"
In python, make two list of lists for the simplified example
import ipyparallel as ipp
from itertools import compress
list1 = [ [i, i, i] for i in range(4000000)]
list2 = [ [i, i, i] for i in range(2000000, 6000000)]
Then define my list comparison function:
def loop(item):
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
Then connect to my ipython engines, push list2 to each of them and map my function:
rc = ipp.Client(profile='default', cluster_id = "cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True
mydict = dict(list2 = list2)
dview.push(mydict)
trueorfalse = list(lview.map(loop, list1))
As mentioned, I am running this on a cluster using SLURM and getting the memory usage from the sacct command. Here is the memory usage that I am getting for each of the steps:
Just creating the two lists: 1.4 Gb Creating two lists and pushing them to 20 engines: 22.5 Gb Everything: 62.5 Gb++ (this is where I get an OUT_OF_MEMORY failure)
From running htop on the node while running the job, it seems that the memory usage is going up slowly over time until it reaches the maximum memory and fails.
I combed through this previous thread and implemented a few of the suggested solutions without success
Memory leak in IPython.parallel module?
I tried clearing the view with each loop:
def loop(item):
lview.results.clear()
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
I tried purging the client with each loop:
def loop(item):
rc.purge_everything()
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
And I tried using the --nodb and --sqlitedb flags with ipcontroller and started my cluster like this:
ipcontroller --profile=pierrj --nodb --cluster-id='cluster-id-dummy' &
sleep 60
for (( i = 0 ; i < 20; i++)); do ipengine --profile=pierrj --cluster-id='cluster-id-dummy' & done
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy" --profile=pierrj
Unfortunately none of this has helped and has resulted in the exact same out of memory error.
Any advice or help would be greatly appreciated!