Using TimeSeriesSplit within cross_val_score
I'm fitting a time series model and trying to cross-validate it with the TimeSeriesSplit function. I believe the easiest way to apply it is through the cross_val_score function, via its cv argument.

The question is simple: am I passing the cv argument correctly? Should I call split(scaled_train), split(X_train), or split(input_data)? Or should I cross-validate in some other way?

This is the code I am writing:

  def fit_model1(data: pd.DataFrame):
      df = data
      scores_fit_model1 = []
      for sizes in test_sizes:
          # Generate Test Design
          input_data = df.drop('next_count', axis=1)
          output_data = df[['next_count']]
          X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)

          # Scaling
          scaler = MinMaxScaler()
          scaled_train = scaler.fit_transform(X_train)
          scaled_test = scaler.transform(X_test)

          # Build Model
          lr = LinearRegression()
          lr.fit(scaled_train, y_train.values.ravel())
          predictions = lr.predict(scaled_test)

          # Cross Validation Definition
          time_split = TimeSeriesSplit(n_splits=10)

          # Performance metrics
          r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
          scores_fit_model1.append(r2)

      return scores_fit_model1
Cytherea answered 6/10, 2022 at 22:40 Comment(0)

TimeSeriesSplit is simply a splitter that yields a growing window of sequential folds. You can therefore pass it as-is to cv, or pass time_split.split(scaled_train), which amounts to the same thing: the splits are made over an array with the same number of rows as your training data (which cross_val_score takes as its second positional parameter). It doesn't matter whether TimeSeriesSplit sees the scaled or the original data, since the splitter only uses the number of rows to generate index arrays; what matters is that cross_val_score gets the scaled data.
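
The growing-window behaviour is easy to see directly. A minimal sketch (assuming scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(10, 2)  # 10 sequential samples, 2 features
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print(train_idx, test_idx)
# The training window grows with each fold, and each test fold is the
# next block of samples, so no future observations leak into training.
```

Note that split only looks at the array's length, which is why it makes no difference whether you hand it the scaled or the raw training data.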

I made some minor simplifications in your code as well - scaling before the train_test_split, and making the output data a Series (so you don't need values.ravel):

def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    for sizes in test_sizes:
        # Generate Test Design
        input_data = df.drop('next_count',axis=1)
        output_data = df['next_count']
        scaler = MinMaxScaler()
        scaled_input = scaler.fit_transform(input_data)
        X_train, X_test, y_train, y_test = train_test_split(scaled_input, output_data, test_size=sizes, random_state=0, shuffle=False)

        #Build Model
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        predictions = lr.predict(X_test)

        #Cross Validation Definition
        time_split = TimeSeriesSplit(n_splits=10)

        #performance metrics
        r2 = cross_val_score(lr, X_train, y_train, cv=time_split, scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)

    return scores_fit_model1
Bashuk answered 7/10, 2022 at 8:21 Comment(2)
Thank you, Josh. That is, the cv argument will always cross-validate the dataset that cross_val_score takes as its second positional parameter, right? But why do you suggest applying MinMaxScaler() before train_test_split? – Cytherea
Exactly. My thinking was that it would be simpler to scale everything together, but on second thought that could introduce a small information leak; the way you did it is the correct way. – Bashuk
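
To avoid that leak inside cross-validation itself, one option (a sketch, not part of the original answer; the data here is synthetic for illustration) is to wrap the scaler and model in a scikit-learn Pipeline, so the scaler is refit on each training fold rather than on data the fold will later test on:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic sequential data: 100 samples, 3 features, known linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The Pipeline refits MinMaxScaler on each fold's training window only,
# so scaling parameters never see the fold's test samples.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=10), scoring='r2')
print(scores.mean())
```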
