I'm fitting a model to a time series and want to cross-validate it using TimeSeriesSplit.
I believe the easiest way to apply it is to pass it to cross_val_score
through the cv argument.
My question is simple: is the way I am passing the cv argument correct? Should I call split(scaled_train),
split(X_train), or split(input_data)? Or should I cross-validate in another way?
This is the code I am writing:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler

def fit_model1(data: pd.DataFrame):
    df = data
    scores_fit_model1 = []
    # test_sizes is defined elsewhere in my script
    for sizes in test_sizes:
        # Generate test design
        input_data = df.drop('next_count', axis=1)
        output_data = df[['next_count']]
        X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=sizes, random_state=0, shuffle=False)
        # Scaling (the scaler is fitted on the training portion only)
        scaler = MinMaxScaler()
        scaled_train = scaler.fit_transform(X_train)
        scaled_test = scaler.transform(X_test)
        # Build model
        lr = LinearRegression()
        lr.fit(scaled_train, y_train.values.ravel())
        predictions = lr.predict(scaled_test)
        # Cross-validation definition
        time_split = TimeSeriesSplit(n_splits=10)
        # Performance metrics: mean R^2 over the time-ordered folds
        r2 = cross_val_score(lr, scaled_train, y_train.values.ravel(), cv=time_split.split(scaled_train), scoring='r2', n_jobs=1).mean()
        scores_fit_model1.append(r2)
    return scores_fit_model1
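
For comparison, here is a minimal sketch of one other way the cross-validation could be set up. It is not necessarily the right answer for this data. It assumes that cross_val_score accepts the TimeSeriesSplit object itself as cv (so no explicit split(...) call is needed), and that wrapping MinMaxScaler and LinearRegression in a pipeline makes the scaler refit on each training fold rather than on the whole training set. The DataFrame below is synthetic and the columns count and lag_1 are hypothetical; only next_count comes from my code above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical); rows are assumed to be in time order.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "count": rng.normal(size=n).cumsum(),
    "lag_1": rng.normal(size=n),
})
df["next_count"] = 0.8 * df["count"] + 0.1 * rng.normal(size=n)

X = df.drop("next_count", axis=1)
y = df["next_count"]

# Putting the scaler inside the pipeline means it is fitted on each
# training fold only, so no statistics from the validation fold are used.
model = make_pipeline(MinMaxScaler(), LinearRegression())

# The splitter object is passed directly; cross_val_score calls split() itself.
time_split = TimeSeriesSplit(n_splits=10)
scores = cross_val_score(model, X, y, cv=time_split, scoring="r2", n_jobs=1)

# Sanity check: every fold trains only on rows that precede its test rows.
for train_idx, test_idx in time_split.split(X):
    assert train_idx.max() < test_idx.min()

print(scores.mean())
```

In this sketch the splitter only produces row positions, so it should not matter whether it sees the raw or the scaled features; what matters is that the rows passed to cross_val_score are in chronological order.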