Machine learning project: split training/test sets before or after exploratory data analysis?
Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?

I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.

I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?

I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.

I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).

What do you do in your own work and why?

Thanks for helping a new programmer!

Indiraindirect answered 21/1, 2019 at 1:08
Assuming the complete data is small enough to work with easily (i.e., fits in memory), I always use the complete set for EDA. I only ever split into test/train when I'm modeling. – Roundshouldered
This question is slightly off-topic for SO (it errs towards "opinion-based" and isn't about coding itself), but it is still a very good question. Here's one perspective; you may find others, or more receptive audiences, on Cross Validated or RStudio Community. – Skidway
EDA doesn't need splitting. EDA just guides you on future steps: observing trends may inform your feature selection and/or engineering, and EDA may also help you clean up your data. ML is only as good as your data. – Shavonda
This is a great question on the proper machine learning data science pipeline. – Apart

To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).

Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.

Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.

Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new batch of data, the testing data.

This discussion answers your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that happen to work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
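As a minimal sketch of this workflow in plain Python (the 70/15/15 fractions and the `three_way_split` helper are illustrative choices, not something the answer prescribes):

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows and split them into train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))
train, val, test = three_way_split(data)

# EDA, feature engineering, and model selection touch only `train`
# (plus `val` for comparing candidate models); `test` stays untouched
# until the very last, one-time performance estimate.
print(len(train), len(val), len(test))  # 70 15 15
```

In practice you would do the same with a DataFrame or array library, but the principle is identical: the split happens first, and exploration operates on the training partition only.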

Wildman answered 23/2, 2020 at 10:52

I would take the problem the other way round: is it bad to use the test set?

  • The objective of modeling is to end up with a model with low variance (and small bias): that's why we keep a test set aside, to assess how the model behaves on new data (i.e., its variance). If you use the test set during modeling, you have nothing left for that assessment, and you are overfitting to your data.

  • The objective of EDA is to understand the data you're working with: the distributions of features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe, on say 70% of your data, properties that do not hold for the remaining 30% (the test set). Since the split is random, that is very unlikely unless you have been extremely unlucky.
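The claim in the second bullet can be checked empirically: with a reasonably sized random split, the summary statistics of the two partitions agree closely. A small sketch with synthetic data (the 70/30 split and the normal distribution are illustrative assumptions):

```python
import random
import statistics

random.seed(0)
# Synthetic feature: 10,000 draws from a normal distribution (mean 50, sd 10)
values = [random.gauss(50, 10) for _ in range(10_000)]

# Random 70/30 split
random.shuffle(values)
cut = int(len(values) * 0.7)
train, test = values[:cut], values[cut:]

# With a random split, the two partitions' statistics should be very close;
# a large gap would suggest a biased (non-random) split.
gap = abs(statistics.mean(train) - statistics.mean(test))
print(f"mean gap between train and test: {gap:.3f}")
```

For very small datasets, rare categories, or grouped/time-ordered data, this agreement is weaker, which is where stratified or time-based splits come in.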

Eritrea answered 21/10, 2020 at 13:58

From my understanding of the machine learning pipeline, exploratory data analysis should be done before splitting the data into train and test sets.

Here are my reasons:

  • The data may not be clean to begin with: it might have missing values, mismatched datatypes, and outliers.
  • You need to understand how every feature relates to the target variable. This helps you gauge each feature's importance with respect to the business problem and can suggest additional derived features.
  • Data visualization will also help you extract insights from the dataset.

Once the above operations are done, the dataset can be split into train and test. Splitting after this stage ensures the features are defined consistently in both partitions.
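This EDA-before-split order can be sketched as follows (the toy records, field names, and 75/25 split are invented for illustration):

```python
import random

# Toy dataset: a few records with a missing value and a string-coded category
records = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": None, "plan": "pro", "churned": 1},
    {"age": 52, "plan": "basic", "churned": 0},
    {"age": 41, "plan": "pro", "churned": 1},
]

# EDA on the full dataset: count missing values and list category levels
missing_age = sum(1 for r in records if r["age"] is None)
plan_levels = sorted({r["plan"] for r in records})
print(missing_age, plan_levels)  # 1 ['basic', 'pro']

# Only after this inspection is the data split for modeling
random.Random(0).shuffle(records)
cut = int(len(records) * 0.75)
train, test = records[:cut], records[cut:]
```

Note that the answers above disagree on exactly this point: the trade-off is that inspecting the full dataset guarantees consistent feature definitions, at the cost of having "seen" the test rows during exploration.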

Harrisharrisburg answered 12/12, 2019 at 4:16

© 2022 - 2024 — McMap. All rights reserved.