If you look at the documentation for the DecisionTreeClassifier class in scikit-learn, you'll find a criterion parameter that controls how the quality of a split is measured. The RandomForestClassifier documentation describes the same parameter. Both note that the default criterion is "gini", for the Gini impurity.
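For context, here is a minimal sketch of passing the parameter explicitly; the toy Iris dataset used here is just an assumption for illustration, not something the documentation itself shows:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is the default for both estimators; "entropy" is the common alternative.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
forest = RandomForestClassifier(criterion="gini", random_state=0).fit(X, y)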
What is that?!
Gini Impurity
The Gini index, or Gini impurity, measures the probability that a randomly chosen element from a set would be classified incorrectly if it were labeled at random according to the distribution of classes in that set.
But what is actually meant by 'impurity'? If all the elements in a set belong to a single class, the set is called pure.
The Gini index varies between 0 and 1: it is 0 when all elements belong to a single class (or when only one class exists), and it approaches 1 only as the elements are spread evenly across a large number of classes. For a two-class problem the maximum is 0.5, reached when the two classes are equally represented.
Formula for Gini Index or Impurity
Gini = 1 − Σ (p_i)², summed over all classes i,
where p_i is the probability of an object being classified to class i.
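As a sanity check on the formula, here is a small sketch (my own illustration, not part of the original article) that computes the Gini impurity from a list of class labels:

from collections import Counter

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2), where p_i is the share of class i among the labels.
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))  # 0.0   -> pure set
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5   -> 50/50 split of two classes
print(gini_impurity(["a", "b", "c"]))       # 0.667 -> maximum for three classes (1 - 1/3)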
While building the decision tree, at each node we prefer the attribute/feature and split point that yield the lowest weighted Gini impurity in the resulting child nodes; the attribute chosen this way at the top of the tree becomes the root node, as sketched below.
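To make that concrete, here is a rough sketch of comparing two candidate binary splits by the weighted Gini impurity of their children (the labels are invented purely for illustration); the split with the lower score would be preferred:

from collections import Counter

def gini_impurity(labels):
    # Same helper as in the sketch above.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    # Children's impurities, weighted by how many samples fall on each side of the split.
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(weighted_gini(["a", "a"], ["b", "b"]))  # 0.0 -> perfect separation, this split wins
print(weighted_gini(["a", "b"], ["a", "b"]))  # 0.5 -> classes still mixed on both sides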
To learn more about Gini Impurity, please visit this link.