Machine learning algorithms are data-greedy: the more data you feed them, the better they understand the problem and the better the model they produce. In the candy dataset you are limited to only 85 rows, so when you train a model on such data its learning is limited to only a few points, and when you evaluate it on the test set, which may contain patterns the model has never seen before, it generalizes very poorly.
Let's walk through an example:
Suppose you are building a model for a school that predicts the worst-performing students, and while preparing the model you only use data for students whose score is below average (< 50%).
Up to this point the model is prepared. Now suppose we feed this same model the data of a student who is among the toppers. What do you think? How well will this model predict?
Badly, right?
Because the data you use to test the model is new to it; the model has never seen or been trained on such data before. That is exactly what is happening with the candy data.
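A minimal sketch of that effect in Python, using made-up study-hours and exam-score numbers and a decision-tree model (both are assumptions for illustration only, not the candy data itself):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: hours studied vs. exam score (percent); purely illustrative numbers.
hours = rng.uniform(0, 10, size=200).reshape(-1, 1)
score = np.clip(10 * hours.ravel() + rng.normal(0, 5, size=200), 0, 100)

# Train only on below-average students (< 50%), exactly as in the example above.
below_avg = score < 50
model = DecisionTreeRegressor(random_state=0).fit(hours[below_avg], score[below_avg])

# Evaluate on the toppers the model has never seen: R^2 collapses because every
# prediction stays inside the below-average range the tree was trained on.
print("R^2 on below-average students:", model.score(hours[below_avg], score[below_avg]))
print("R^2 on toppers:", model.score(hours[~below_avg], score[~below_avg]))
```

The tree scores well on the slice it was trained on but falls apart on the toppers, because it can only predict values it has already seen.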
So in such a situation, you can try the following steps:
- Check for outliers in your dataset, both training and test (see the first sketch after this list).
- Create dummy data (which is beyond the scope of your current course).
- Perform feature selection and keep only the features that are actually important (see the second sketch after this list).
- Most important: be careful about the imputation technique you use to fill null values. Check for a pattern in the missing values first; traditional methods such as filling with the median, mode, or mean are not good in that situation (see the third sketch after this list).
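For the outlier check, one common approach is the 1.5 * IQR rule applied to every numeric column; this is a minimal sketch assuming pandas DataFrames named `train` and `test` that stand in for your own splits:

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Flag values outside 1.5 * IQR for every numeric column."""
    numeric = df.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (numeric < low) | (numeric > high)

# Run it on both splits and compare how many outliers each contains.
# `train` and `test` are hypothetical DataFrames standing in for your candy splits.
# print(iqr_outliers(train).sum())
# print(iqr_outliers(test).sum())
```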
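For feature selection, a simple starting point is univariate selection with scikit-learn's `SelectKBest`; the 85-row, 10-feature stand-in data below is generated only to mirror the size of the candy set:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Stand-in data: 85 rows, 10 candidate features, only 3 of them truly informative.
X, y = make_regression(n_samples=85, n_features=10, n_informative=3, noise=10.0, random_state=0)

# Keep only the k features with the strongest univariate relationship to the target.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print("Kept feature indices:", selector.get_support(indices=True))
```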
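For the imputation step, a quick way to look for a pattern in the missing values is to inspect how missingness co-occurs across columns; the tiny DataFrame below is made up (the column names only mimic the candy set), and the `IterativeImputer` shown at the end is one possible alternative to a plain mean/median/mode fill:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values; replace with your own candy DataFrame.
df = pd.DataFrame({
    "sugarpercent": [0.7, np.nan, 0.4, 0.9, np.nan],
    "pricepercent": [0.8, 0.2, np.nan, 0.9, 0.1],
    "winpercent":   [66.0, 32.0, 45.0, np.nan, 39.0],
})

# 1. How much is missing per column?
print(df.isnull().sum())

# 2. Does missingness in one column co-occur with missingness in another?
#    Strong correlation between these indicators suggests the values are not missing
#    at random, in which case a plain mean/median/mode fill can distort the data.
print(df.isnull().astype(int).corr())

# 3. If a pattern exists, a model-based imputer is one alternative to a constant fill.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
df_imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df), columns=df.columns)
print(df_imputed)
```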