Reduce multiprocessing for statsmodels glm

Asked 12/12, 2017 at 10:17 Answered 1/5, 2019 at 9:40

I am currently doing a proof of concept for one of our business processes that requires logistic regression. I have been using statsmodels GLM to perform classification on our data set (see the code below). The data set consists of ~10M rows and around 80 features (of which 70+ are dummies, i.e. "1" or "0", derived from the categorical variables). On a smaller data set GLM works fine, but when I run it against the full data set, Python throws the error "cannot allocate memory".

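import statsmodels.api as sm
import statsmodels.formula.api as smf
# formula: model formula string; data: pandas DataFrame with the ~10M rows
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()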

This got me thinking that statsmodels might be designed to use all available CPU cores, with each subprocess creating its own copy of the data set in RAM (please correct me if I am mistaken). Is there a way to make GLM use only a minimal number of cores? I am not after performance; I just want to be able to run GLM against the full data set.

For reference, below is the machine configuration and some more information if needed.

CPU: 10 cores
RAM: 40 GB (usable/free ~25 GB as there are other processes running on the same machine)
swap: 16 GB
dataset size: 1.4 GB (based on pandas DataFrame.info(memory_usage='deep'))

Front answered 12/12, 2017 at 10:17

GLM uses multiprocessing only through the linear algebra libraries.

The following is copied from my FAQ issue description at https://github.com/statsmodels/statsmodels/issues/2914, which also links to other issues where this shows up.

(quote:)

Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.

However, some of the underlying BLAS/LAPACK libraries in numpy/scipy also use multiple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations, especially when we want to use parallel processing at a higher level.

How can we restrict the number of cores used by the linear algebra libraries?

This depends on which linear algebra library is used. See the mailing list thread https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ

OpenBLAS: try setting the environment variable OMP_NUM_THREADS=1

Accelerate on OSX: set VECLIB_MAXIMUM_THREADS

MKL in Anaconda:

import mkl
mkl.set_num_threads(1)
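
As a rough sketch, the three options above could be combined into one helper that is called at the very top of a script. The function name `limit_blas_threads` is hypothetical, and the sketch assumes the environment variables are only honored if they are set before numpy (and therefore statsmodels) is first imported in the process, which is the usual behavior for these libraries.

```python
import os


def limit_blas_threads(n_threads=1):
    """Apply the thread limits named above; return True if mkl was configured.

    Hypothetical helper, not part of statsmodels.
    """
    value = str(n_threads)
    # OpenBLAS and Accelerate read these environment variables.
    for name in ("OMP_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
        os.environ[name] = value
    # The mkl module is only available in MKL-based installs such as Anaconda.
    try:
        import mkl
    except ImportError:
        return False
    mkl.set_num_threads(n_threads)
    return True


limit_blas_threads(1)
# Import numpy / statsmodels only after this point.
```

Note that this only limits the threads used by the linear algebra library. It does not change how much memory the arrays themselves need.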

Lyrebird answered 12/12, 2017 at 18:4

This is because statsmodels uses IRLS to estimate a GLM, and each IRLS iteration calls its WLS regression routine, which in turn solves the least-squares problem with a matrix decomposition such as QR (see the code below). The decomposition is done directly on X, and your X has 10 million rows and 80 columns, which puts a lot of stress on memory and CPU.

Here is the source code from statsmodels:

        if method == 'pinv':
            pinv_wexog = np.linalg.pinv(self.wexog)
            params = pinv_wexog.dot(self.wendog)
        elif method == 'qr':
            Q, R = np.linalg.qr(self.wexog)
            params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
        else:
            params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,
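
To put a number on the memory pressure described in this answer, here is a back-of-the-envelope estimate. It assumes the design matrix is held as a dense float64 array (8 bytes per value), and that X, its weighted copy `wexog`, and one work array of the same shape (for example Q from a reduced QR, or the pseudo-inverse) exist at the same time. The count of three copies is an assumption for illustration, not a measured figure, and `dense_matrix_bytes` is a hypothetical helper.

```python
def dense_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for one dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize


n_rows, n_cols = 10_000_000, 80  # sizes quoted in the question

one_copy = dense_matrix_bytes(n_rows, n_cols)
print(f"one float64 copy of X: {one_copy / 1e9:.1f} GB")

# Assumption: X, wexog and one same-shaped work array are alive together.
assumed_copies = 3
print(f"{assumed_copies} copies: {assumed_copies * one_copy / 1e9:.1f} GB")
```

Under these assumptions a single copy is about 6.4 GB and three copies about 19.2 GB, which is already close to the ~25 GB reported as free, even though the DataFrame itself is only 1.4 GB.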

Poignant answered 1/5, 2019 at 9:40
