I think I have a pretty reasonable idea on how to do go about accomplishing this, but I'm not 100% sure on all of the steps. This question is mostly intended as a sanity check to ensure that I'm doing this in the most efficient way, and that my math is actually sound (since my statistics knowledge is not completely perfect).
Anyways, some explanation about what I'm trying to do:
I have a lot of time series data that I would like to perform some linear regressions on. In particular, I have roughly 2000 observations on 500 different variables. For each variable, I need to perform a regression using two explanatory variables (two additional vectors of roughly 2000 observations). So for each of 500 different Y's, I would need to find a
and b
in the following regression Y = aX_1 + bX_2 + e
.
Up until this point, I have been using the OLS function in the statsmodels package to perform my regressions. However, as far as I can tell, if I wanted to use the statsmodels package to accomplish my problem, I would have to call it hundreds of times, which just seems generally inefficient.
So instead, I decided to revisit some statistics that I haven't really touched in a long time. If my knowledge is still correct, I can put all of my observations into one large Y matrix that is roughly 2000 x 500. I can then stick my explanatory variables into an X matrix that is roughly 2000 x 2, and get the results of all 500 of my regressions by calculating (X'Y)/(X'X)
. If I do this using basic numpy stuff (matrix multiplication using * and inverses using matrix.I), I'm guessing it will be much faster than doing hundreds of statsmodel OLS calls.
Here are the questions that I have:
- Is the numpy stuff that I am doing faster than the earlier method of calling statsmodels many times? If so, is it the fastest/most efficient way to accomplish what I want? I'm assuming that it is, but if you know of a better way then I would be happy to hear it. (Surely I'm not the first person to need to calculate many regressions in this way.)
- How do I deal with missing data in my matrices? My time series data is not going to be nice and complete, and will be missing values occasionally. If I just try to do regular matrix multiplication in numpy, the NA values will propagate and I'll end up with a matrix of mostly NAs as my end result. If I do each regression independently, I can just drop the rows containing NAs before I perform my regression, but if I do this on the large 2000 x 500 matrix I will end up dropping actual, non-NA data from some of my other variables, and I obviously don't want that to happen.
- What is the most efficient way to ensure that my time series data actually lines up correctly before I put it into the matrices in the first place? The start and end dates for my observations are not necessarily the same, and some series might have days that others do not. If I were to pick a method for doing this, I would put all the observations into a pandas data frame indexed by the date. Then pandas will end up doing all of the work aligning everything for me and I can extract the underlying ndarray after it is done. Is this the best method, or does pandas have some sort of overhead that I can avoid by doing the matrix construction in a different way?