How to handle date variable in machine learning data pre-processing [closed]

I have a dataset that contains, among other variables, the timestamp of each transaction in the format 26-09-2017 15:29:32. I need to find possible correlations and predict sales (let's say with logistic regression). My questions are:

  1. How should I handle the date format? Should I convert it to a single number (as Excel does automatically)? Should I split it into several variables such as day, month, year, hour, minute and second? Are there any other suggestions?
  2. What if I would like to add a distinct week number per year? Should I add a variable like 342017 (week 34 of year 2017)?
  3. Should I do the same as in question 2 for the quarter of the year?
#         Datetime               Gender        Purchase
1    23/09/2015 00:00:00           0             1
2    23/09/2015 01:00:00           1             0
3    25/09/2015 02:00:00           1             0
4    27/09/2015 03:00:00           1             1
5    28/09/2015 04:00:00           0             0
Semidiurnal answered 26/9, 2017 at 14:15 Comment(6)
This question is very broad. 1) Pick a language (R or Python). 2) Asking us how, or the best way, to process your data is not what this site is for; it invites too much opinion. 3) Asking for a book, tool or reference is off-topic for the site as well. Please have a look at this post on what is appropriate for SO: stackoverflow.com/help/on-topic - Hotchpotch
Thank you for your quick reply. The question is very specific to the machine learning issue, and I am asking how people treat this kind of problem. The reason for tagging R and Python is that there may be packages that help overcome the obstacle of data transformation. - Semidiurnal
I understand your question, and its importance in modeling. But this is not a programming question, i.e. you have no code, errors, or incorrect/unexpected/inconsistent results or outputs. You do not even say what kind of algorithm you are training or what your expected outputs/goals are. If you want to discuss the pros and cons of various representations of dates for machine learning/modeling, I would suggest Data Science Stack Exchange. - Hotchpotch
Actually I do. I am talking about logistic regression. Indeed, my 4th question is off-topic, and I thank you for pointing that out. Do you have anything to contribute to the rest of my question, though? E.g. would it be better if I used the number 42270 instead of 23/09/2015 00:00:00? Should I add another variable to show e.g. the day name? - Semidiurnal
First, R and Python show dates in a human-readable format but represent them internally as seconds, minutes or days from an origin time (e.g. 1970-01-01). You can represent your date column as day of the week, quarter (1:4), week (1:52), time from a major holiday, time from the last full moon, day of the month, day of the year (1:365), time between sales, season, time from the start of a sale or promotion, etc. The real question is how you want to interpret your model variables. Lastly, R and Python have packages that make working with dates very easy. - Hotchpotch
Check the Feature-engine library: feature-engine.readthedocs.io/en/latest/api_doc/datetime/… and feature-engine.readthedocs.io/en/latest/api_doc/creation/… - Encyclopedic
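As a rough illustration of the numeric representations mentioned in the comments above, the sketch below converts the first timestamp from the question into seconds and days since 1970-01-01 and into an Excel-style serial number. It assumes pandas is available and that the Excel serial counts days from 1899-12-30 (the usual convention for the default 1900 date system); the variable names are only illustrative.

```python
import pandas as pd

# First timestamp from the question's sample data (day/month/year order).
ts = pd.to_datetime("23/09/2015 00:00:00", format="%d/%m/%Y %H:%M:%S")

epoch = pd.Timestamp("1970-01-01")

# Whole seconds and whole days elapsed since the Unix epoch.
unix_seconds = (ts - epoch) // pd.Timedelta(seconds=1)
days_since_epoch = (ts - epoch).days

# Excel-style serial number: days since 1899-12-30 (assumed convention).
excel_serial = (ts - pd.Timestamp("1899-12-30")).days

print(unix_seconds, days_since_epoch, excel_serial)
```

A single number like this preserves ordering and elapsed time, but on its own it does not expose weekly or yearly patterns to a linear model such as logistic regression.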
C
31

Some random thoughts:

Dates are a good source for feature engineering, but I don't think there is one single method for using dates in a model. Business user expertise would be great: are there observed trends that can be coded into the data?

Possible suggestions of features include:

  • weekends vs weekdays
  • business hours and time of day
  • seasons
  • week of year number
  • month
  • year
  • beginning/end of month (pay days)
  • quarter
  • days to/from an action event (distance)
  • missing or incomplete data
  • etc.

All this depends on the data set and most won't apply.
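Several of the features listed above can be derived with pandas' `.dt` accessor. The following is a minimal sketch, not a recommended feature set: it reuses the sample rows from the question, assumes the timestamps are in day/month/year order, and the 09:00-16:59 "business hours" window is a made-up example that would need to match the actual business.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Datetime": [
        "23/09/2015 00:00:00",
        "23/09/2015 01:00:00",
        "25/09/2015 02:00:00",
        "27/09/2015 03:00:00",
        "28/09/2015 04:00:00",
    ],
    "Gender": [0, 1, 1, 1, 0],
    "Purchase": [1, 0, 0, 1, 0],
})

# Parse with an explicit format so that 23/09 is not read as month 23.
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%d/%m/%Y %H:%M:%S")
dt = df["Datetime"].dt

df["year"] = dt.year
df["quarter"] = dt.quarter
df["month"] = dt.month
df["week_of_year"] = dt.isocalendar().week.astype(int)  # ISO 8601 week number
df["day_of_week"] = dt.dayofweek                        # Monday=0 ... Sunday=6
df["is_weekend"] = (dt.dayofweek >= 5).astype(int)
df["hour"] = dt.hour
# Hypothetical business-hours window (09:00-16:59); adjust to the real one.
df["is_business_hours"] = dt.hour.between(9, 16).astype(int)
df["is_month_start"] = dt.is_month_start.astype(int)
df["is_month_end"] = dt.is_month_end.astype(int)

print(df.drop(columns=["Datetime"]))
```

Keeping `year` and `week_of_year` as separate columns avoids a combined code such as 342017, whose numeric value has no meaningful ordering.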

Some links:

http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction

https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models

http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/

Capybara answered 26/9, 2017 at 14:30 Comment(5)
@Charles I am currently attempting the "days from an action event" feature. However, some of the entries don't have the action of interest, or that action is occurring for the first time. How do I represent this as a feature? Surely using a 0 implies that there are 0 days between the previous and the current action, so I can't use it in that way. - Bacchus
This depends on the model you're building. Some models will accept NULL values, others won't. For a regression you may need a flag; see: stats.stackexchange.com/questions/299663/… - Capybara
@Charles I do collect all of these as screenshots. Just keep going. - Intermarriage
Is there any harm in including all variants and letting the model choose what works best, besides the issue of possibly unnecessary features in the model and wasted computation? - Declaratory
Generally it seems that it shouldn't be an issue, although feature selection becomes more important with more features. Also, Forecastegy published a nice YouTube video with more ideas: youtube.com/watch?v=ft77eXtn30Q - Capybara
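The "flag" idea from the comments above can be sketched as follows. This is only one possible approach under stated assumptions: the `events` table, its `customer` and `date` columns, and the choice to fill the missing gap with 0 are all hypothetical, and the indicator column is what lets a regression tell "no previous event" apart from "0 days since the previous event".

```python
import pandas as pd

# Hypothetical event log: one row per action, per customer.
events = pd.DataFrame({
    "customer": ["a", "a", "a", "b"],
    "date": pd.to_datetime(
        ["2015-09-23", "2015-09-25", "2015-09-28", "2015-09-27"]
    ),
})

events = events.sort_values(["customer", "date"])

# Days since the same customer's previous event; NaN for a first event.
gap = events.groupby("customer")["date"].diff().dt.days

# Indicator for "there is no previous event", kept as its own feature.
events["no_previous_event"] = gap.isna().astype(int)

# Placeholder value for the missing gap; the flag above marks these rows.
events["days_since_previous"] = gap.fillna(0)

print(events)
```

Models that accept missing values natively could use `gap` directly instead of the filled column.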
G
17

Cyclic Feature Encoding

Data whose values come from a fixed set that repeats in a cycle is known as cyclic data. Time-related features are mainly cyclic in nature: months of a year, days of a week, hours of a day, minutes of an hour, and so on. Each of these features has a fixed set of values, and every observation takes a value from that set. Such features appear in many ML problems, and handling them properly has proved to help improve accuracy.

Implementation

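import numpy as np

def encode(data, col, max_val):
    # Map a cyclic column onto the unit circle as a (sin, cos) pair.
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# Assumes `data` is a pandas DataFrame whose `datetime` column has a datetime dtype.
data['month'] = data.datetime.dt.month
data = encode(data, 'month', 12)

data['day'] = data.datetime.dt.day
data = encode(data, 'day', 31)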

The Logic

A common method for encoding cyclical data is to transform the data into two dimensions using a sine and cosine transformation. Map each cyclical variable onto a circle such that the lowest value for that variable appears right next to the largest value. We compute the x- and y- components of that point using sin and cos trigonometric functions.
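To see why the circle mapping helps, the small sketch below compares distances between encoded months. It is only an illustration of the idea described above; `cyclic_encode` and `circle_distance` are hypothetical helper names, and months are numbered 1-12 here.

```python
import numpy as np

def cyclic_encode(values, period):
    """Return (sin, cos) coordinates of `values` on a circle of length `period`."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between the encoded points for `a` and `b`."""
    sin_v, cos_v = cyclic_encode([a, b], period)
    return float(np.hypot(sin_v[0] - sin_v[1], cos_v[0] - cos_v[1]))

dec_to_jan = circle_distance(12, 1, 12)  # adjacent months across the year end
jan_to_feb = circle_distance(1, 2, 12)   # adjacent months within the year
jan_to_jul = circle_distance(1, 7, 12)   # opposite sides of the circle

print(dec_to_jan, jan_to_feb, jan_to_jul)
```

With the raw month number, December (12) and January (1) are 11 apart; after encoding, they are exactly as close as January and February, while months half a year apart are the farthest from each other.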

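For months, we number them 0-11 and place them around a circle. (pandas' dt.month returns 1-12; with max_val = 12 this only rotates the points around the circle.)

[Figure: months mapped onto a circle]

We can do that using the following transformations, where max_val is the number of values in the cycle (12 for months):

x_sin = sin(2π · x / max_val)
x_cos = cos(2π · x / max_val)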

More on Feature Engineering Cyclic Features

Guesstimate answered 26/12, 2021 at 10:6 Comment(1)
This answer deserves more credit! - Nupercaine
