Pandas: TypeError: float() argument must be a string or a number

I have a DataFrame that looks like this:

user_id    date       browser  conversion  test  sex  age  country
   1    2015-12-03       IE        1         0    M   32.0   US

Here is my code:

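import pandas as pd
from sklearn import tree

# Parse the date strings into Timestamps
data['date'] = pd.to_datetime(data.date)
# Use every column except the target as a feature
columns = [c for c in data.columns.tolist() if c not in ["test"]]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data[columns], data["test"])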

I am getting this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-560-95a8a54aa939> in <module>()
      4 from sklearn import tree
      5 clf = tree.DecisionTreeClassifier(max_depth=2, min_samples_leaf = (len(data)/100) )
----> 6 clf = clf.fit(data[columns],data["test"])

C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\tree\tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    152         random_state = check_random_state(self.random_state)
    153         if check_input:
--> 154             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    155             if issparse(X):
    156                 X.sort_indices()

C:\Users\SnehaPriya\Anaconda2\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

TypeError: float() argument must be a string or a number

How do I overcome this error?

Scolecite answered 21/12, 2016 at 6:41 Comment(0)

IIUC, you also need to exclude the date column:

columns = [c for c in columns if c not in ["test", "date"]]

because its Timestamp values cannot be converted to float, which is what the full error message says:

TypeError: float() argument must be a string or a number, not 'Timestamp'
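
A minimal sketch of that exclusion, run on a small made-up frame (the rows and the reduced column set are assumptions for illustration, not the asker's data). The string columns from the question (browser, sex, country) are left out here; they would most likely need to be encoded as numbers separately before fitting.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "user_id": [1, 2],
    "date": ["2015-12-03", "2015-12-04"],
    "conversion": [1, 0],
    "test": [0, 1],
    "age": [32.0, 27.0],
})
data["date"] = pd.to_datetime(data["date"])

# Drop the target and the Timestamp column from the feature list.
columns = [c for c in data.columns if c not in ("test", "date")]

# The remaining columns now cast to float without raising TypeError.
X = data[columns].astype("float64")
y = data["test"]
```

X and y can then be passed to clf.fit(X, y) as in the question.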

Watercraft answered 21/12, 2016 at 7:6 Comment(0)

A solution which keeps the date(time) column:

data['date'] = pd.to_numeric(pd.to_datetime(data['date']))
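
For reference, a small sketch of what this conversion produces, using made-up dates. pd.to_numeric returns the integer representation of each datetime, counted from the Unix epoch; the unit (typically nanoseconds) depends on the column's datetime resolution, so the sketch also shows a unit-independent alternative that counts whole days. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical dates for illustration.
dates = pd.to_datetime(pd.Series(["2015-12-03", "1970-01-01"]))

# Integer representation: ticks since 1970-01-01 in the column's resolution.
as_int = pd.to_numeric(dates)

# Unit-independent alternative: whole days since the Unix epoch.
days_since_epoch = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(days=1)

print(as_int.dtype)
print(days_since_epoch.tolist())
```

Smaller numbers such as day counts are often easier to inspect than raw tick counts, although a decision tree only cares about the ordering of the values.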

Soldier answered 11/2, 2021 at 12:52 Comment(0)
Ideas to preserve datetime as features in the model

If the dates matter only in terms of how much time has passed since the observation, one way to keep the datetime column as a feature in the model is to convert it into the time elapsed between now and each datetime.

data['date'] = (pd.Timestamp('now') - pd.to_datetime(data['date'])).dt.total_seconds()
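
A sketch of the same idea with a fixed reference timestamp instead of 'now', so that the feature values are reproducible between runs. The reference date and the sample dates are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical reference point; replace with whatever cutoff suits your data.
reference = pd.Timestamp("2016-01-01")

# Hypothetical observation dates.
dates = pd.to_datetime(pd.Series(["2015-12-03", "2015-12-31"]))

# Seconds elapsed between each observation and the reference point.
elapsed_seconds = (reference - dates).dt.total_seconds()

print(elapsed_seconds.tolist())
```

With pd.Timestamp('now') the values change every time the code runs, which may matter if a fitted model is later applied to features computed at a different time.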

Or you can convert the datetimes directly into integers.

data['date'] = pd.to_datetime(data['date']).astype('int64')

N.B. When converting strings to datetime, passing format= makes the conversion much faster (about 25 times faster in the linked benchmark). See this post for the benchmark and this post for ideas on passing the format when your datetime column doesn't have a uniform format.
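
A minimal sketch of passing format=, assuming the strings follow the YYYY-MM-DD layout shown in the question; the sample values are made up.

```python
import pandas as pd

# Hypothetical date strings in the same layout as the question's date column.
raw = pd.Series(["2015-12-03", "2016-01-15"])

# An explicit format lets pandas skip per-value format inference.
parsed = pd.to_datetime(raw, format="%Y-%m-%d")

print(parsed.dt.year.tolist())
```

If a value does not match the given format, to_datetime raises by default; passing errors="coerce" turns such values into NaT instead.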

Cacuminal answered 15/2, 2023 at 19:41 Comment(0)
