Summary: Representation of Text Categorization Attributes

  • Representations of text are usually very high dimensional (one feature for each word)

  • High-bias algorithms that prevent overfitting in high-dimensional space should generally work best*

  • For most text categorization tasks, there are many relevant features and many irrelevant ones

  • Methods that combine evidence from many or all features (e.g. naive Bayes, kNN) often work better than ones that try to isolate just a few relevant features*

  *Although the results are a bit more mixed than often thought
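
As a rough illustration of the points above, the sketch below builds a bag-of-words vocabulary (one feature per distinct word) and scores a document with a multinomial naive Bayes classifier that sums log-evidence over every known word in the document. The toy corpus, the labels, and the function names (`train_naive_bayes`, `classify`) are made up for this example, and add-one smoothing is an assumed choice, not something the slide specifies.

```python
import math
from collections import Counter

# Toy training data, invented purely for illustration.
TRAIN = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]

def train_naive_bayes(docs):
    """Collect the vocabulary, per-class document counts, and per-class word counts."""
    vocab = set()
    class_docs = Counter()
    word_counts = {}
    for text, label in docs:
        words = text.split()
        vocab.update(words)  # one feature per distinct word
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    return vocab, class_docs, word_counts

def classify(text, vocab, class_docs, word_counts):
    """Return the class with the highest log-posterior score."""
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing; every known word contributes evidence.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

vocab, class_docs, word_counts = train_naive_bayes(TRAIN)
```

Even this four-document corpus yields 11 features; a realistic collection yields tens of thousands, which is why the dimensionality remark matters. Calling `classify("cheap meeting monday", vocab, class_docs, word_counts)` shows evidence being combined: one word points toward spam and two toward ham, and the summed log-probabilities decide the label.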

