Gain Ratio for Attribute Selection (C4.5)

  • Information gain measure is biased towards attributes with a large number of values
  • C4.5 (a successor of ID3) uses gain ratio to overcome this bias: information gain is normalized by the split information of the attribute

\[SplitInfo_{A}(D)=-\sum_{j=1}^{v} \frac{|D_{j}|}{|D|}\times \log_{2}\left(\frac{|D_{j}|}{|D|}\right)\]
\[GainRatio(A) = \frac{Gain(A)}{SplitInfo_{A}(D)}\]

  • Ex.
    • GainRatio(income) = Gain(income) / SplitInfo_income(D) = 0.029/1.557 = 0.019
  • The attribute with the maximum gain ratio is selected as the splitting attribute
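
The computation above can be sketched in a few lines of Python. This is a minimal illustration, not C4.5's actual implementation. The function names `split_info` and `gain_ratio` are hypothetical, and the partition sizes 4, 6, 4 (14 tuples split by income into three groups) are an assumption, since the slide gives only the resulting value 1.557. Returning 0.0 when the split information is zero is also an assumed convention for the case where the attribute puts every tuple in one partition.

```python
import math


def split_info(partition_sizes):
    """SplitInfo_A(D) for partitions D_1..D_v, given their sizes |D_j|."""
    total = sum(partition_sizes)
    # Empty partitions contribute nothing, so they are skipped.
    return -sum(
        (n / total) * math.log2(n / total) for n in partition_sizes if n > 0
    )


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    si = split_info(partition_sizes)
    if si == 0:
        # Assumed convention: a single non-empty partition has zero split
        # information, so the ratio is undefined; report 0.0 instead.
        return 0.0
    return gain / si


# Assumed partition sizes for income (not stated on the slide).
income_sizes = [4, 6, 4]

print(round(split_info(income_sizes), 3))         # 1.557
print(round(gain_ratio(0.029, income_sizes), 3))  # 0.019
```

Selecting the splitting attribute then amounts to evaluating `gain_ratio` for each candidate attribute and taking the maximum.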

