Large memory consumption by the IPython Parallel module

I am using the ipyparallel module to speed up an all-by-all list comparison, but I am having issues with huge memory consumption.

Here is a simplified version of the script that I am running:

From a SLURM script, start the cluster and run the Python script:

ipcluster start -n 20 --cluster-id="cluster-id-dummy" &
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy"

In Python, make two lists of lists for the simplified example:

import ipyparallel as ipp
from itertools import compress

list1 = [ [i, i, i] for i in range(4000000)]
list2 = [ [i, i, i] for i in range(2000000, 6000000)]

Then define my list comparison function:

def loop(item):
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False

Then connect to my IPython engines, push list2 to each of them, and map my function:

rc = ipp.Client(profile='default', cluster_id = "cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True
mydict = dict(list2 = list2)
dview.push(mydict)
trueorfalse = list(lview.map(loop, list1))

As mentioned, I am running this on a cluster using SLURM and getting the memory usage from the sacct command. Here is the memory usage that I am getting for each of the steps:

Just creating the two lists: 1.4 GB
Creating the two lists and pushing them to 20 engines: 22.5 GB
Everything: 62.5 GB++ (this is where I get an OUT_OF_MEMORY failure)

From running htop on the node while running the job, it seems that the memory usage is going up slowly over time until it reaches the maximum memory and fails.

I combed through this previous thread and implemented a few of the suggested solutions, without success:

Memory leak in IPython.parallel module?

I tried clearing the view with each loop:

def loop(item):
    lview.results.clear()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False

I tried purging the client with each loop:

def loop(item):
    rc.purge_everything()
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False

And I tried using the --nodb and --sqlitedb flags with ipcontroller and started my cluster like this:

ipcontroller --profile=pierrj --nodb --cluster-id='cluster-id-dummy' &
sleep 60
for (( i = 0 ; i < 20; i++)); do ipengine --profile=pierrj --cluster-id='cluster-id-dummy' & done
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy" --profile=pierrj

Unfortunately, none of this has helped; I get the exact same out-of-memory error.

Any advice or help would be greatly appreciated!

Hellas asked 10/7, 2020 at 22:42

Looking around, there seem to be lots of people complaining about LoadBalancedViews being very memory inefficient, but I have not been able to find any useful suggestions on how to fix this.

However, given your example, I suspect that is not the place to start. I assume that your example is a reasonable approximation of your code. If your code is doing list comparisons with several million data points, I would advise you to use something like numpy to perform the calculations rather than iterating in Python.

If you restructure your algorithm to use numpy vector operations, it will be much, much faster than indexing into a list and performing the calculation in Python. numpy is a C library, and calculations done within the library benefit from compile-time optimisations. Furthermore, performing operations on whole arrays also benefits from the processor's predictive caching (your CPU expects you to use adjacent memory and preloads it; you can lose this benefit if you access the data piecemeal).

I have done a very quick hack of your example to demonstrate this. It compares your loop calculation with a very naïve numpy implementation of the same question. The Python loop method is competitive for small numbers of entries, but the numpy version quickly heads towards 100x faster at the number of entries you are dealing with. I suspect that looking at the way you structure your data will outweigh the performance gain you are getting through parallelisation.

Note that I have chosen a matching value in the middle of the distribution; performance differences will obviously depend on the distribution.

import numpy as np
import time

def loop(item, list2):
    # Linear scan through list2, comparing first elements (the approach from the question)
    for i in range(len(list2)):
        if list2[i][0] == item[0]:
            return True
    return False

def run_comparison(scale):
    # A list of lists, and a numpy array holding just the first elements
    list2 = [ [i, i, i] for i in range(4 * scale)]
    arr2 = np.array([i for i in range(4 * scale)])

    # Matching value in the middle of the range
    test_value = (2 * scale)

    # Time the vectorised membership test
    np_start = time.perf_counter()
    res1 = test_value in arr2
    np_end = time.perf_counter()
    np_time = np_end - np_start

    # Time the pure-Python loop
    loop_start = time.perf_counter()
    res2 = loop((test_value, 0, 0), list2)
    loop_end = time.perf_counter()
    loop_time = loop_end - loop_start

    assert res1 == res2

    # How many times faster the numpy version was at this scale
    return (scale, loop_time / np_time)

print([run_comparison(v) for v in [100, 1000, 10000, 100000, 1000000, 10000000]])

returns:

[
  (100, 1.0315526939407524),
  (1000, 19.066806587378263),
  (10000, 91.16463510672537),
  (100000, 83.63064249916434),
  (1000000, 114.37531283123414),
  (10000000, 121.09979997458508)
]
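
To connect this back to the original question: the same membership test can be done for every element of list1 in a single call with np.isin, with no per-item Python loop at all. The following is my own rough sketch rather than part of the benchmark above, and it assumes, as in the question's example, that only the first element of each sub-list matters:

import numpy as np

# First elements of the question's two lists, as integer arrays
keys1 = np.arange(4000000)               # stands in for [item[0] for item in list1]
keys2 = np.arange(2000000, 6000000)      # stands in for [row[0] for row in list2]

# One vectorised call replaces the per-item loop/map entirely
trueorfalse = np.isin(keys1, keys2)      # boolean array, one entry per item of list1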
Ransom answered 8/10, 2022 at 14:34

Comment (Hellas): Thank you for this explanation! At the time I was not familiar with numpy and have since implemented it in my code. Similarly, I ended up using GNU parallel instead. I left the post up, though, since this seemed like a general problem with IPython Parallel that others would need help with.

Assuming that a single task on the two lists is being divided up between the worker threads, you will want to ensure that the individual workers are using the same copy of the lists. In most cases it looks like ipyparallel will pickle objects sent to workers (relevant doc). If you are able to use one of the types that are not copied (as stated in the doc)

buffers/memoryviews, bytes objects, and numpy arrays.

the memory issue might be resolved, since a reference is distributed rather than a copy. This answer also assumes that the individual tasks do not need to modify the lists while working (i.e. access is thread-safe).

TL;DR It looks like moving the objects passed to the parallel workers into a numpy array may resolve the explosion in memory.
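
For illustration only, here is a minimal sketch of how the question's push step could be adapted along these lines. It reuses list1, list2, and the cluster setup from the question and pushes only the first column of list2 as a numpy array; whether this actually avoids the extra copies in practice depends on how ipyparallel serialises the array, so treat it as something to measure rather than a guaranteed fix:

import numpy as np
import ipyparallel as ipp

# Cluster setup as in the question
rc = ipp.Client(profile='default', cluster_id="cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True

# Push only the first column of list2, as a numpy array instead of a list of lists
keys2 = np.array([row[0] for row in list2])
dview.push(dict(keys2=keys2))

def loop(item):
    # Vectorised membership test against the pushed array on each engine
    return bool((keys2 == item[0]).any())

trueorfalse = list(lview.map(loop, list1))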

Joker answered 11/10, 2022 at 19:15
