Gini Index (CART, IBM IntelligentMiner)

  • If a data set D contains examples from n classes, the gini index, gini(D), is defined as

\[gini(D)=1-\sum_{j=1}^{n}p_{j}^{2}\]

  • where p_j is the relative frequency of class j in D
  • If a data set D is split on attribute A into two subsets D1 and D2, the gini index of the split, gini_A(D), is defined as

\[gini_{A}(D)=\frac{|D_{1}|}{|D|}gini(D_{1})+\frac{|D_{2}|}{|D|}gini(D_{2})\]

  • Reduction in Impurity:

\[\Delta gini(A)=gini(D)-gini_{A}(D)\]

  • The attribute that provides the smallest gini_A(D) (equivalently, the largest reduction in impurity) is chosen to split the node; this requires enumerating all possible splitting points for each attribute
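
The selection rule above can be illustrated with a minimal Python sketch. It assumes a single numeric attribute and binary splits at midpoints between adjacent distinct values, which is one common way to enumerate split points and not necessarily the procedure used by CART or IntelligentMiner. The function names (`gini`, `gini_split`, `best_threshold`) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set: 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty subset (assumption)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted gini index of a binary split of D into D1 (left) and D2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try every midpoint between adjacent distinct values of one numeric attribute.

    Returns (threshold, gini_A, reduction) for the split with the smallest gini_A,
    or None if the attribute has fewer than two distinct values.
    """
    parent = gini(labels)
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        g = gini_split(left, right)
        if best is None or g < best[1]:
            best = (t, g, parent - g)
    return best


# Toy data (made up): one numeric attribute and a two-class label.
ages = [22, 25, 30, 35, 40, 50]
buys = ["no", "no", "yes", "yes", "yes", "no"]
threshold, g_a, reduction = best_threshold(ages, buys)
print(threshold, round(g_a, 3), round(reduction, 3))
```

On this toy data gini(D) is 0.5 (three examples of each class), and the split at 27.5 gives the smallest gini_A(D), about 0.25, so the reduction in impurity is also about 0.25. To choose among several attributes, the same search would be run per attribute and the attribute with the smallest gini_A(D) kept. Categorical attributes would need enumeration of value subsets instead of thresholds, which this sketch does not cover.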

