  • Sampling: obtaining a small sample s to represent the whole data set N
  • Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
  • Key principle: Choose a representative subset of the data
    • Simple random sampling may have very poor performance in the presence of skew
    • Develop adaptive sampling methods, e.g., stratified sampling (partition the data into strata and sample from each one)
  • Note: Sampling may not reduce database I/Os, since data is read from disk a page at a time and a scattered sample can still touch most pages
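
The contrast between simple random sampling and stratified sampling on skewed data can be sketched as follows. This is a minimal illustration, not a prescribed method: the function names `simple_random_sample` and `stratified_sample`, the toy data, and the choice of proportional allocation with at least one row per stratum are all assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=0):
    """Draw n rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly the same fraction from every stratum.

    Assumption for this sketch: each stratum contributes at least one row,
    so rare groups are never dropped entirely.
    """
    rng = random.Random(seed)

    # Group rows by their stratum label.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Skewed toy data (hypothetical): 990 "common" rows and 10 "rare" rows.
rows = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]

srs = simple_random_sample(rows, 20)
strat = stratified_sample(rows, key=lambda r: r[0], fraction=0.02)

print("rare rows in simple random sample:", sum(1 for r in srs if r[0] == "rare"))
print("rare rows in stratified sample:  ", sum(1 for r in strat if r[0] == "rare"))
```

With a 2% sample, the simple random sample can easily contain no "rare" rows at all, whereas the stratified version always keeps at least one because each stratum is sampled separately.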

