Imbalanced Classification Problems
The number of examples that belong to each class may be referred to as the class distribution.
Imbalanced classification refers to a classification predictive modeling problem where the number of examples in the training dataset for each class label is not balanced.
That is, where the class distribution is not equal or close to equal and is instead biased or skewed.
For example, we may collect measurements of flowers and have 80 examples of one flower species and 20 examples of a second flower species, and only these examples comprise our training dataset. This represents an example of an imbalanced classification problem.
- Majority Class: The class (or classes) in an imbalanced classification predictive modeling problem that has many examples.
- Minority Class: The class in an imbalanced classification predictive modeling problem that has few examples.
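As a minimal sketch of these definitions, the class distribution of a dataset can be tallied directly from its labels. The label names `species_a` and `species_b` and the helper `class_distribution` are hypothetical, chosen only to mirror the 80/20 flower example above.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, fraction)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}


# Hypothetical labels mirroring the 80/20 flower example.
labels = ["species_a"] * 80 + ["species_b"] * 20

distribution = class_distribution(labels)

# The majority class has the most examples, the minority class the fewest.
majority = max(distribution, key=lambda label: distribution[label][0])
minority = min(distribution, key=lambda label: distribution[label][0])

for label, (count, fraction) in distribution.items():
    print(f"{label}: {count} examples ({fraction:.0%})")
print("majority:", majority, "| minority:", minority)
```

Printing the distribution before modeling is a cheap way to see whether a dataset is skewed and which label plays the minority role.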
Challenge of Imbalanced Classification
When working with an imbalanced classification problem, the minority class is typically of the most interest. This means that a model’s skill in correctly predicting the class label or probability for the minority class is more important than its skill on the majority class or classes.
The minority class is harder to predict because there are few examples of this class, by definition. This means it is more challenging for a model to learn the characteristics of examples from this class, and to distinguish them from examples of the majority class (or classes).
The abundance of examples from the majority class (or classes) can swamp the minority class. Most machine learning algorithms for classification predictive models are designed and demonstrated on problems that assume an equal distribution of classes. This means that a naive application of a model may focus on learning the characteristics of the abundant observations only, neglecting the examples from the minority class that is, in fact, of more interest and whose predictions are more valuable.
This implies that the learning process of most classification algorithms is often biased toward the majority class examples so that minority ones are not well modeled into the final system.
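To illustrate the extreme case of this bias, the sketch below scores a stand-in "model" that ignores its input and always predicts the most common label. This is not any particular learning algorithm. The labels and the helper `fraction_correct` are hypothetical, reusing the 80/20 split from the flower example.

```python
from collections import Counter

# Hypothetical true labels with an 80/20 class split.
y_true = ["majority"] * 80 + ["minority"] * 20

# A stand-in "model" that always predicts the most common label it has seen.
most_common_label = Counter(y_true).most_common(1)[0][0]
y_pred = [most_common_label] * len(y_true)


def fraction_correct(y_true, y_pred, label=None):
    """Fraction of correct predictions, optionally restricted to one true class."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if label is None or t == label]
    return sum(t == p for t, p in pairs) / len(pairs)


overall = fraction_correct(y_true, y_pred)
on_majority = fraction_correct(y_true, y_pred, label="majority")
on_minority = fraction_correct(y_true, y_pred, label="minority")

print(f"overall: {overall:.0%}")
print(f"majority class: {on_majority:.0%}")
print(f"minority class: {on_minority:.0%}")
```

Under these assumptions the stand-in model is right on 80% of all examples while getting every minority example wrong, which is why overall correctness alone can hide poor performance on the class of most interest.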