Python Pandas: how to turn a DataFrame with "factors" into a design matrix for linear regression?

Asked 17/4, 2012 at 18:26 Answered 19/4, 2019 at 6:49

Solved python dataframe regression factors

If memory serves me, in R there is a data type called factor which, when used within a DataFrame, can be automatically unpacked into the necessary columns of a regression design matrix. For example, a factor containing True/False/Maybe values would be transformed into:

for the purpose of using lower level regression code. Is there a way to achieve something similar using the pandas library? I see that there is some regression support within Pandas, but since I have my own customised regression routines I am really interested in the construction of the design matrix (a 2d numpy array or matrix) from heterogeneous data, with support for mapping back and forth between columns of the numpy object and the Pandas DataFrame from which it is derived.

Update: Here is an example of a data matrix with heterogeneous data of the sort I am thinking of (the example comes from the Pandas manual):

>>> df2 = DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c' : np.random.randn(7)})
>>> df2
       a  b         c
0    one  x  0.000343
1    one  y -0.055651
2    two  y  0.249194
3  three  x -1.486462
4    two  y -0.406930
5    one  x -0.223973
6    six  x -0.189001
>>>

The 'a' column should be converted into 4 floating point columns (in spite of the meaning, there are only four unique atoms), the 'b' column can be converted to a single floating point column, and the 'c' column should be an unmodified final column in the design matrix.

Thanks,

SetJmp

Amphiboly answered 17/4, 2012 at 18:26 Comment(2)

It's not clear what you mean by "The 'a' column should be converted into 4 floating point columns" ... Do you mean 4 floating point values? I don't see how splitting the first column into multiple columns will allow a design matrix. My understanding is that the first two columns here are categorical variables. Do you mean that you want 4 binary variables, each equal to 1 only if that row of the data had the corresponding category in the first column? – Loth 17/4, 2012 at 21:36

Converting a factor with k levels into k distinct columns/variables is called discretization. – Lichtenfeld 10/3, 2013 at 6:20

There is a new module called patsy that solves this problem. The quickstart linked below solves exactly the problem described above in a couple lines of code.

Here is an example usage:

import pandas
import patsy

# salary2.txt is a re-formatted data set from the textbook
# Introductory Econometrics: A Modern Approach
# by Jeffrey Wooldridge
dataFrame = pandas.read_csv("salary2.txt")
y, X = patsy.dmatrices("sl ~ 1+sx+rk+yr+dg+yd", dataFrame)
# X.design_info provides the metadata behind the X columns
print(X.design_info)

generates:

> DesignInfo(['Intercept',
>             'sx[T.male]',
>             'rk[T.associate]',
>             'rk[T.full]',
>             'dg[T.masters]',
>             'yr',
>             'yd'],
>            term_slices=OrderedDict([(Term([]), slice(0, 1, None)), (Term([EvalFactor('sx')]), slice(1, 2, None)),
> (Term([EvalFactor('rk')]), slice(2, 4, None)),
> (Term([EvalFactor('dg')]), slice(4, 5, None)),
> (Term([EvalFactor('yr')]), slice(5, 6, None)),
> (Term([EvalFactor('yd')]), slice(6, 7, None))]),
>            builder=<patsy.build.DesignMatrixBuilder at 0x10f169510>)
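
Applied to the df2 example from the question, a minimal sketch might look like the following. It assumes patsy is installed, and the fixed values in column 'c' are made up so that the result is reproducible. The column_names and term_name_slices attributes of design_info give the mapping from design-matrix columns back to the DataFrame columns.

```python
import numpy as np
import pandas as pd
import patsy

# Same layout as df2 in the question; the values in 'c' are illustrative.
df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7],
})

# Only a right-hand side is needed here, so dmatrix (not dmatrices) is enough.
X = patsy.dmatrix("a + b + c", df2)

# Plain 2-D float array that a low-level solver can consume.
X_array = np.asarray(X)

# Names of the design-matrix columns, and the column slice each term occupies.
column_names = X.design_info.column_names
term_slices = X.design_info.term_name_slices

print(column_names)
for term, columns in term_slices.items():
    print(term, columns)
```

With the default treatment coding and an intercept, 'a' contributes three indicator columns rather than four, because one level is absorbed into the intercept as the reference.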

Amphiboly answered 28/7, 2012 at 22:32 Comment(1)

patsy is superb when it comes to transforming continuous values to discrete. – Ivanivana 17/3, 2016 at 6:47

import pandas
import numpy as np

num_rows = 7
df2 = pandas.DataFrame(
    {
        'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
        'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
        'c': np.random.randn(num_rows),
    }
)

# Or use list(set(df2['a'].values)), but that doesn't guarantee ordering.
a_attribute_list = ['one', 'two', 'three', 'six']
b_attribute_list = ['x', 'y']

# One (num_rows, 1) indicator column per level of each factor.
a_membership = [np.reshape((df2['a'].values == elem).astype(np.float64), (num_rows, 1)) for elem in a_attribute_list]
b_membership = [np.reshape((df2['b'].values == elem).astype(np.float64), (num_rows, 1)) for elem in b_attribute_list]
c_column = np.reshape(df2['c'].values, (num_rows, 1))

design_matrix_a = np.hstack(tuple(a_membership))
design_matrix_b = np.hstack(tuple(b_membership))
design_matrix = np.hstack((design_matrix_a, design_matrix_b, c_column))

# Print out the design matrix to see that it's what you want.
for row in design_matrix:
    print(row)

I get this output:

[ 1.          0.          0.          0.          1.          0.          0.36444463]
[ 1.          0.          0.          0.          0.          1.         -0.63610264]
[ 0.          1.          0.          0.          0.          1.          1.27876991]
[ 0.          0.          1.          0.          1.          0.          0.69048607]
[ 0.          1.          0.          0.          0.          1.          0.34243241]
[ 1.          0.          0.          0.          1.          0.         -1.17370649]
[ 0.          0.          0.          1.          1.          0.         -0.52271636]

So, the first column is an indicator for the DataFrame locations that were 'one', the second column is an indicator for the DataFrame locations that were 'two', and so on. The fifth and sixth columns are indicators of DataFrame locations that were 'x' and 'y', respectively, and the final column is just the random data.
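
A more general variant of the same idea, which also records the mapping back from design-matrix columns to DataFrame columns, could be sketched as follows. The names build_design_matrix and column_map are hypothetical, and the fixed values in 'c' are made up for reproducibility.

```python
import numpy as np
import pandas as pd


def build_design_matrix(df, categorical, numeric):
    """Return (matrix, column_map) for the requested columns.

    column_map maps each source column name to a tuple of
    (slice of design-matrix columns, list of levels or None).
    """
    blocks = []
    column_map = {}
    start = 0
    for name in categorical:
        # pd.unique keeps the levels in order of first appearance.
        levels = list(pd.unique(df[name]))
        values = df[name].to_numpy()
        # One 0/1 indicator column per level.
        block = np.column_stack([values == level for level in levels])
        blocks.append(block.astype(np.float64))
        column_map[name] = (slice(start, start + len(levels)), levels)
        start += len(levels)
    for name in numeric:
        # Numeric columns are passed through unchanged.
        blocks.append(df[name].to_numpy(dtype=np.float64).reshape(-1, 1))
        column_map[name] = (slice(start, start + 1), None)
        start += 1
    return np.hstack(blocks), column_map


df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7],
})
design_matrix, column_map = build_design_matrix(df2, ["a", "b"], ["c"])
print(design_matrix)
print(column_map)
```

Like the code above, this keeps every level of every factor, so the indicator columns of 'a' and of 'b' each sum to one and the matrix is rank deficient; a pseudoinverse-based solver can cope with that, but other solvers may need one level dropped per factor.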

Loth answered 17/4, 2012 at 19:17 Comment(10)

The values attribute returns nested ndarrays where the innermost array holds dtype=object. The factors are converted into strings, and the float data are floats within this inner array. – Amphiboly 17/4, 2012 at 19:49

It doesn't work like that for me. I edited the question above to illustrate. – Loth 17/4, 2012 at 19:53

It works for you because in your example all the data happens to be float type. However, with string data present I get a different structure as the return type. What I am looking for is a logical mapping that transforms the data frame into a 2d ndarray of floats that could then be put into a low level solver expecting design matrix X and dependent variables y. By low level, I mean pseudoinverse code that only knows how to work on 2d float ndarrays (not recarrays). This lower level encoding is what is referred to as a "design matrix" in statistics references. – Amphiboly 17/4, 2012 at 20:40

Here is a discussion highlighting how R code translates factors into a design matrix "behind the scenes" before sending it to low level solver code. Though the example factor has only 2 levels, I believe the correct behaviour can be expected for 3 or more levels. r.789695.n4.nabble.com/… – Amphiboly 17/4, 2012 at 20:47

It seems that a numpy recarray might be appropriate. I'll look at whether values can be easily exported to a recarray – Loth 17/4, 2012 at 21:0

Also, it would be helpful if you could provide some working code that creates a small example of a DataFrame that has strings, etc., in places analogous to the one you're working with. So that we can test methods. – Loth 17/4, 2012 at 21:15

I have modified the question with a little example of such a data frame. – Amphiboly 17/4, 2012 at 21:29

I modified the answer with my best guess about what you wanted from the categorical variables. If I understood you, you wanted indicator columns, and the above should do the trick. – Loth 17/4, 2012 at 22:34

Thanks for putting a lot of effort into this. Your solution is now pretty close to what I am after though also a bit non-general, as your comment about dictionary ordering alludes. Having an encoding of the reverse mapping in the general case is also important. I did track down what I think is the equivalent function in R: model.matrix. I am still holding out that a better/more elegant solution will pop up which is why I am not clicking accept just yet. – Amphiboly 18/4, 2012 at 23:27

That's fine, no worries. I'd be interested in a more elegant solution too. There certainly ought to be a more Pythonic way to do it. – Loth 18/4, 2012 at 23:51

patsy.dmatrices may in many cases work well. If you just have a vector - a pandas.Series - then the code below may work, producing a degenerate design matrix without an intercept column.

import numpy as np
import pandas as pd


def factor(series):
    """Convert a pandas.Series to pandas.DataFrame design matrix.

    Parameters
    ----------
    series : pandas.Series
        Vector with categorical values

    Returns
    -------
    pandas.DataFrame
        Design matrix with ones and zeroes.

    See Also
    --------
    patsy.dmatrices : Converts categorical columns to numerical

    Examples
    --------
    >>> import pandas as pd
    >>> design = factor(pd.Series(['a', 'b', 'a']))
    >>> design.loc[0, '[a]']
    1.0
    >>> list(design.columns)
    ['[a]', '[b]']

    """
    levels = sorted(set(series))  # sorted so the column order is deterministic
    design_matrix = np.zeros((len(series), len(levels)))
    for row_index, elem in enumerate(series):
        design_matrix[row_index, levels.index(elem)] = 1
    name = series.name or ""
    columns = ["%s[%s]" % (name, level) for level in levels]
    df = pd.DataFrame(design_matrix, index=series.index, 
                      columns=columns)
    return df

Geognosy answered 20/3, 2014 at 19:3 Comment(0)

Pandas 0.13.1 from February 3, 2014 has a method:

>>> pd.Series(['one', 'one', 'two', 'three', 'two', 'one', 'six']).str.get_dummies()
   one  six  three  two
0    1    0      0    0
1    1    0      0    0
2    0    0      0    1
3    0    0      1    0
4    0    0      0    1
5    1    0      0    0
6    0    1      0    0
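
For a whole DataFrame, the top-level pd.get_dummies function can encode several columns at once. The following is a sketch only: it assumes a pandas version recent enough to accept the dtype argument, and the fixed values in 'c' are made up for reproducibility.

```python
import pandas as pd

df2 = pd.DataFrame({
    "a": ["one", "one", "two", "three", "two", "one", "six"],
    "b": ["x", "y", "y", "x", "y", "x", "x"],
    "c": [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7],
})

# Encode only 'a' and 'b'; 'c' is passed through unchanged.
# drop_first=True removes one level per factor, so 'b' becomes a
# single column, as requested in the question.
design = pd.get_dummies(df2, columns=["a", "b"], drop_first=True, dtype=float)

# 2-D float array for a low-level solver; design.columns keeps the names.
X = design.to_numpy()

print(design.columns.tolist())
print(X)
```

The untouched column 'c' comes first in the result, so reorder design before calling to_numpy() if it has to be the final column. Omitting drop_first gives the full four indicator columns for 'a'.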

Geognosy answered 12/6, 2014 at 16:20 Comment(0)

import pandas as pd
import numpy as np

def get_design_matrix(data_in, columns_index, ref):
    # Every level except the reference level `ref` gets an indicator column.
    columns_index_temp = columns_index.copy()
    columns_index_temp.remove(ref)
    design_matrix = pd.DataFrame(np.zeros(shape=[len(data_in), len(columns_index) - 1]))
    design_matrix.columns = columns_index_temp
    for ii in columns_index_temp:
        # Boolean mask of the rows whose value equals level ii.
        loci = [x == ii for x in data_in]
        design_matrix.loc[loci, ii] = 1
    return design_matrix

get_design_matrix(data_in = ['one','two','three','six','one','two'],
                  columns_index = ['one','two','three','six'],
                  ref = 'one')


Out[3]: 
   two  three  six
0  0.0    0.0  0.0
1  1.0    0.0  0.0
2  0.0    1.0  0.0
3  0.0    0.0  1.0
4  0.0    0.0  0.0
5  1.0    0.0  0.0
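
The same reference-level coding can probably be had from pandas itself by fixing the category order and dropping the first level. This is a sketch, not a drop-in replacement: the prefix "a" and the variable names are chosen here for illustration.

```python
import pandas as pd

levels = ["one", "two", "three", "six"]  # the first entry acts as the reference level
data = ["one", "two", "three", "six", "one", "two"]

# An explicit CategoricalDtype fixes both the set of levels and their order.
factor = pd.Series(data, name="a").astype(pd.CategoricalDtype(categories=levels))

# drop_first=True removes the indicator column of the first category.
design = pd.get_dummies(factor, prefix="a", drop_first=True, dtype=float)
print(design)
```

To use a different reference level, move it to the front of levels.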

Coracorabel answered 19/4, 2019 at 6:49 Comment(0)
