What is an efficient way of running a logistic regression for large data sets (200 million by 2 variables)?

I am currently trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. I am trying to run a logistic regression but am having extreme difficulty doing so in R/Stata/MATLAB, even with the help of EC2 instances on Amazon. I believe the problem lies in how the logistic regression functions are defined in the language itself. Is there another way to run a logistic regression quickly? Currently the problem is that my data quickly fills up whatever memory it is using. I have even tried using up to 30 GB of RAM to no avail. Any solutions would be greatly welcome.

Glissade answered 30/7, 2014 at 20:32 Comment(4)
With only one predictor, do you expect the result to be very different depending on whether you use 200 million rows or 1 million? – Rumsey
My advice would be to run 10 (or 100) regressions using random samples of 1 million rows (or even just 1e5 rows) and see how your coefficient estimates vary; a sketch of this check follows these comments. My guess is they'll be nearly identical, and adding more data won't add anything interesting. – Rumsey
What effect sizes are you looking for? You would probably want to do some power calculations to see how many samples you really need to test your hypothesis. – Compact
Have you tried R's biglm package? – Brambly
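
A minimal Stata sketch of that subsampling check (my addition, not the commenter's code; it assumes the simulated full.dta with variables depvar and indepvar created in the answer below, and that the full data fit in memory):

set seed 12345
forvalues s = 1/10 {
    use full, clear                            // reload the full simulated data
    sample 1000000, count                      // keep a random 1,000,000 rows
    quietly logit depvar indepvar
    display "sample `s': b = " %9.5f _b[indepvar] "   se = " %9.6f _se[indepvar]
}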

If your main issue is the ability to estimate a logit model given computer memory constraints, rather than the speed of estimation, you can take advantage of the additivity of maximum likelihood estimation and write a custom evaluator for Stata's ml command. A logit model is simply maximum likelihood estimation using the logistic distribution, and the fact that you have only one independent variable simplifies the problem. I've simulated the problem below. You should create two do-files out of the following code blocks.
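
To spell out the additivity (my notation, not the answerer's): the logit log-likelihood is a sum over observations, so it can be accumulated one piece at a time, where Lambda is the logistic CDF (invlogit in Stata):

\ln L(\beta_0, \beta_1) = \sum_{k=1}^{K} \sum_{i \in \text{piece } k} \Big[ y_i \ln \Lambda(\beta_0 + \beta_1 x_i) + (1 - y_i) \ln\big(1 - \Lambda(\beta_0 + \beta_1 x_i)\big) \Big], \qquad \Lambda(z) = \frac{1}{1 + e^{-z}}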

If you have no issue loading the whole dataset - and you shouldn't; my simulation used only ~2 GB of RAM for 200 million observations and 2 variables, though mileage may vary - the first step is to break the dataset down into manageable pieces. For instance:

depvar = your dependent variable (0s or 1s)
indepvar = your independent variable (some numeric data type)

cd "/path/to/largelogit"

clear all
set more off

set obs 200000000

// We have two variables, an independent variable and a dependent variable.
gen indepvar = 10*runiform()
gen depvar = .

// As indepvar increases, the probability of depvar being 1 also increases.
replace depvar = 1 if indepvar > ( 5 + rnormal(0,2) )
replace depvar = 0 if depvar == .

save full, replace
clear all

// Need to split the dataset into manageable pieces

local max_opp = 20000000    // maximum observations per piece

local obs_num = `max_opp'

local i = 1
while `obs_num' == `max_opp' {

    clear

    local h = `i' - 1

    local obs_beg = (`h' * `max_opp') + 1
    local obs_end = (`i' * `max_opp')

    capture noisily use in `obs_beg'/`obs_end' using full

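    // If the requested range runs past the last observation (error 198), retry
    // loading from obs_beg through the end of the file (l); if even that fails,
    // every piece has already been saved, so exit the loop.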
    if _rc == 198 {
        capture noisily use in `obs_beg'/l using full
    }
    if _rc == 198 { 
        continue,break
    }

    save piece_`i', replace

    sum
    local obs_num = `r(N)'

    local i = `i' + 1

}

From here, to minimize your memory usage, close Stata and reopen it. When you create such large datasets, Stata keeps some memory allocated for overhead even after you clear the dataset. You can type memory after the save full and again after the clear all to see what I mean.
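
For instance (my sketch, just to show the check the answer describes):

save full, replace
memory              // allocation right after creating the 200 million rows
clear all
memory              // Stata still holds some of that allocation with no data loaded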

Next you must define your own custom ml program, which will feed in each of these pieces one at a time, calculate and sum the log-likelihoods of the observations in each piece, and add them all together. You need to use the d0 ml method rather than the lf method, because the optimization routine with lf requires all of the data to be loaded into Stata.

clear all
set more off

cd "/path/to/largelogit"

// This local stores the names of all the pieces 
local p : dir "/path/to/largelogit" files "piece*.dta"

local i = 1
foreach j of local p {    // Loop through all the names to count the pieces

    global pieces = `i'    // This is important for the program
    local i = `i' + 1

}

// Generate our custom MLE logit program. This uses the d0 ml method.

program define llogit_d0

    args todo b lnf 

    tempvar y xb llike
    tempname tot_llike it_llike

quietly {

    forvalues i=1/$pieces {

        capture drop _merge
        capture drop depvar indepvar
        capture drop `y'
        capture drop `xb'
        capture drop `llike' 
        capture scalar drop `it_llike'

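        // Swap in the next piece's depvar and indepvar, matching on observation number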
        merge 1:1 _n using piece_`i'

        generate int `y' = depvar

        generate double `xb' = (indepvar * `b'[1,1]) + `b'[1,2]    // The linear combination of the independent variable, its coefficient, and the constant

        generate double `llike' = .

        replace `llike' = ln(invlogit( `xb')) if `y'==1    // the log of the probability should the dependent variable be 1
        replace `llike' = ln(1-invlogit(`xb')) if `y'==0   // the log of the probability should the dependent variable be 0

        sum `llike' 
        scalar `it_llike' = `r(sum)'    // The sum of the logged probabilities for this iteration

        if `i' == 1     scalar `tot_llike' = `it_llike'    // Total log likelihood for first iteration
        else            scalar `tot_llike' = `tot_llike' + `it_llike' // Total log likelihood is the sum of all the iterated log likelihoods `it_llike'

    }

    scalar `lnf' = `tot_llike'   // The total log likelihood which must be returned to ml

}

end

// This should work

use piece_1, clear

ml model d0 llogit_d0 (beta : depvar = indepvar )
ml search
ml maximize

I just ran the above two blocks of code and received the following output:

[Screenshot: output of ml maximize for the large logit]

Pros and Cons of this approach:
Pros:

    - The smaller the `max_opp' size, the lower the memory usage. I never used more than ~1 GB with the simulation above.
    - You end up with unbiased estimators, the full log-likelihood of the estimator for the entire dataset, and the correct standard errors - basically everything important for making inferences.

Cons:

    - What you save in memory you must sacrifice in CPU time. I ran this on my personal laptop with Stata SE (one core) and an i5 processor, and it took overnight.
    - The Wald Chi2 statistic for the model is wrong, but I believe you can calculate it from the correct estimates and standard errors mentioned above (see the sketch after this list).
    - You don't get a Pseudo R2 as you would with logit.
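
As an illustration of the second point (my sketch, not part of the original answer): because the coefficient and its standard error are correct, a one-degree-of-freedom Wald chi2 for indepvar can be recovered right after ml maximize:

// Wald test of H0: coefficient on indepvar = 0, run after -ml maximize- above
test indepvar
// or by hand from the stored coefficient and standard error
scalar wald_chi2 = (_b[indepvar] / _se[indepvar])^2
display "Wald chi2(1) = " wald_chi2 "   p-value = " chi2tail(1, wald_chi2)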

To test whether the coefficients truly are the same as a standard logit, set obs to something relatively small (say 100000) and set max_opp to something like 1000. Run my code and look at the output, then run logit depvar indepvar and look at its output: they are the same apart from what I mention in "Cons" above. Setting obs equal to max_opp will also correct the Wald Chi2 statistic.
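
Concretely, that comparison might look like the following (my sketch, not part of the original answer; it assumes the first do-file has been re-run with the smaller obs and max_opp values):

// Benchmark: Stata's built-in logit on the same (now small) simulated dataset
use full, clear
logit depvar indepvar
// Compare the coefficient on indepvar, the constant, and their standard errors
// with the -ml maximize- output from the custom d0 evaluator above.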

Klarrisa answered 19/9, 2014 at 5:55 Comment(0)
