Cross-validation came into the picture because of the variable randomness of train_test_split. Cross-validation means splitting the entire dataset into k folds/subsets, where in each iteration k-1 subsets are used for training the model and 1 subset is used for testing it. If I understand correctly, x_train and y_train are obtained from the train_test_split function, so how can we be sure that using cross-validation helps us?
First, we split the data into train and test sets using train_test_split. Then we keep the test set aside and don't use it until we have our final model ready.
We use only the train set both to train our models and to perform cross-validation.
The reason we do cross-validation is to ensure that our model doesn't overfit on a single set of data and is able to generalize well on new data.
Hence in k-fold cross-validation, we train our model k times, where each time k-1 folds are used for training and the remaining fold is used for validation. Therefore, in each iteration the model is trained and evaluated on a different subset of the data.
This helps prevent the model from overfitting and shows whether it generalizes well to unseen data.
Only after we have finalized the model do we test its performance on the test set created with train_test_split.
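The workflow above can be sketched with scikit-learn; the dataset and model here are illustrative assumptions, not part of the original question:

```python
# Sketch of the described workflow (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 1. Hold out a test set and do not touch it until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)

# 2. Cross-validate on the training set only: each of the 5 folds
#    takes a turn as the validation fold while the other 4 train.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# 3. After model selection, fit on the full training set and
#    evaluate once on the held-out test set.
model.fit(X_train, y_train)
print("Test accuracy: %.3f" % model.score(X_test, y_test))
```

Note that `cross_val_score` clones the model internally, so the final `fit` on the full training set is a fresh fit, not a continuation of any fold's training.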