Pandas read_csv, reading a boolean with missing values specified as an int
I am trying to import a csv into a pandas dataframe. I have boolean variables denoted with 1's and 0's, where missing values are identified with a -9. When I try to specify the dtype as boolean, I get a host of different errors, depending on what I try.

Sample data: test.csv

var1, var2
0,   0
0,   1
1,   3
-9,  0
0,   2
1,   7

I try to specify the dtype as I import:

dtype_dict = {'var1':'bool','var2':'int'}
nan_dict = {'var1':[-9]}
foo = pd.read_csv('test.csv',dtype=dtype_dict, na_values=nan_dict)

I get the following error:

ValueError: cannot safely convert passed user dtype of |b1 for int64 dtyped data in column 0

I have also tried specifying the true and false values,

foo = pd.read_csv('test.csv',dtype=dtype_dict,na_values=nan_dict,
                 true_values=[1],false_values=[0])

but then I get a different error:

Exception: Must be all encoded bytes

The source code for the error says something about catching the occasional none, but nones or nulls are exactly what I want.

Gavriella answered 23/12, 2016 at 15:49 Comment(0)
H
4

You can specify the converters parameter for the var1 column:

from io import StringIO
import numpy as np
import pandas as pd

pd.read_csv(StringIO("""var1, var2
0,   0
0,   1
1,   3
-9,  0
0,   2
1,   7"""), converters = {'var1': lambda x: bool(int(x)) if x != '-9' else np.nan})
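
Note that a column built this way mixes Python bools with NaN, so it will most likely come back with object dtype rather than a true boolean dtype. As a rough sketch, assuming pandas 1.0 or later (where the nullable "boolean" extension dtype exists), the result can be cast after the read. The names parse_flag and csv_text are made up for this illustration:

```python
from io import StringIO

import numpy as np
import pandas as pd

csv_text = """var1, var2
0,   0
0,   1
1,   3
-9,  0
0,   2
1,   7"""


def parse_flag(raw):
    # Hypothetical helper: converters receive the raw cell text as a string.
    raw = raw.strip()
    return np.nan if raw == "-9" else bool(int(raw))


df = pd.read_csv(StringIO(csv_text), converters={"var1": parse_flag})

# Assumption: pandas >= 1.0, which provides the nullable "boolean" dtype.
# The cast turns NaN into pd.NA and keeps True/False as real booleans.
df["var1"] = df["var1"].astype("boolean")
```

The converter still runs once per cell in Python, so on large files it may be noticeably slower than letting the parser handle the dtype itself.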


Housecoat answered 23/12, 2016 at 16:10 Comment(0)
T
0

Can you do something like this?

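# header=0 makes the given names replace the file's own header row
df = pd.read_csv("test.csv", names=["var1", "var2"], header=0)
# .ix was removed in pandas 1.0; .loc is the label-based replacement
df.loc[df.var1 == 0, 'var1Bool'] = False
df.loc[df.var1 == 1, 'var1Bool'] = True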

This should create a new column; if you are satisfied with it, you can copy it over the old one.

   var1  var2 var1Bool
0     0     0    False
1     0     1    False
2     1     3     True
3    -9     0      NaN
4     0     2    False
5     1     7     True
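
The two assignments can also be collapsed into a single mapping. A minimal sketch, assuming the file has the header row shown in the question (csv_text stands in for test.csv here); any value that is not a key in the dictionary, such as -9, comes out as NaN:

```python
from io import StringIO

import pandas as pd

csv_text = """var1, var2
0,   0
0,   1
1,   3
-9,  0
0,   2
1,   7"""

# skipinitialspace=True strips the blanks after each comma,
# so the second column is named "var2" rather than " var2".
df = pd.read_csv(StringIO(csv_text), skipinitialspace=True)

# Series.map leaves NaN for every value missing from the dict (-9 here).
df["var1Bool"] = df["var1"].map({0: False, 1: True})
```
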
Treat answered 23/12, 2016 at 16:2 Comment(2)
Thanks. I would ideally like to do this on import because the data are kinda big, but this is a good workaround. Gavriella
I'd also like to understand what I'm getting wrong with the import and why it's not working. But I'll mark this as the answer if I don't get any more insights in the next couple hours. Gavriella
S
0

The error "Must be all encoded bytes" occurs because the parser expects the entries of true_values and false_values to be strings, not numbers.

Your true/false values should be specified like this:

foo = pd.read_csv('test.csv',dtype=dtype_dict,na_values=nan_dict,
             true_values=['1'],false_values=['0'])
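
Keep in mind that a plain bool column has no way to store a missing value, so the -9 rows still need somewhere to go. One possible way around that, sketched here under the assumption of a reasonably recent pandas (the nullable "boolean" dtype was added in 1.0), is to request that dtype on import instead of 'bool'. csv_text stands in for test.csv:

```python
from io import StringIO

import pandas as pd

csv_text = """var1, var2
0,   0
0,   1
1,   3
-9,  0
0,   2
1,   7"""

foo = pd.read_csv(
    StringIO(csv_text),
    skipinitialspace=True,  # header is "var1, var2", so strip the blank
    # Assumption: pandas >= 1.0; "boolean" is the nullable extension dtype.
    dtype={"var1": "boolean", "var2": "int"},
    na_values={"var1": [-9]},  # -9 marks a missing value in var1 only
    true_values=["1"],   # strings, as explained above
    false_values=["0"],
)
```

With this, var1 should hold True, False and pd.NA, and no second pass over the data is needed.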
Sacksen answered 5/4, 2023 at 20:5 Comment(0)