Data Cleaning

  • Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors
    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
      • e.g., Occupation=“ ” (missing data)
    • noisy: containing noise, errors, or outliers
      • e.g., Salary=“−10” (an error)
    • inconsistent: containing discrepancies in codes or names, e.g.,
      • Age=“42”, Birthday=“03/07/2010”
      • Rating was “1, 2, 3”, now “A, B, C”
      • discrepancy between duplicate records
    • intentional: e.g., disguised missing data
      • Jan. 1 as everyone’s birthday?
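
The four categories above can be sketched as simple rule-based checks. This is a minimal illustration, not a complete cleaning pipeline: the records, the field names, the reference date `AS_OF`, and the 50% threshold are all assumptions made up for the example, and the birthday “03/07/2010” is read as March 7, 2010.

```python
from collections import Counter
from datetime import date

# Hypothetical records, invented for illustration only.
records = [
    {"id": 1, "occupation": "", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)},
    {"id": 2, "occupation": "engineer", "salary": 5200, "age": 30, "birthday": date(1994, 1, 1)},
    {"id": 3, "occupation": "teacher", "salary": 4100, "age": 51, "birthday": date(1973, 1, 1)},
    {"id": 4, "occupation": "nurse", "salary": 3900, "age": 28, "birthday": date(1996, 1, 1)},
]

AS_OF = date(2024, 6, 30)  # assumed reference date for the age check


def age_on(birthday, as_of):
    """Age in whole years on the date as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
    return as_of.year - birthday.year - (0 if had_birthday else 1)


def find_issues(rows, as_of=AS_OF):
    """Flag incomplete, noisy, and inconsistent values record by record."""
    issues = []
    for r in rows:
        if not r["occupation"].strip():
            issues.append((r["id"], "incomplete: occupation missing"))
        if r["salary"] < 0:
            issues.append((r["id"], "noisy: negative salary"))
        if age_on(r["birthday"], as_of) != r["age"]:
            issues.append((r["id"], "inconsistent: age vs. birthday"))
    return issues


def suspicious_default_birthdays(rows, threshold=0.5):
    """Return (month, day) pairs shared by more than `threshold` of rows.

    A heavily over-represented day such as Jan. 1 hints at disguised
    missing data, but it is only a hint and needs human review.
    """
    counts = Counter((r["birthday"].month, r["birthday"].day) for r in rows)
    return [md for md, n in counts.items() if n / len(rows) > threshold]


for issue in find_issues(records):
    print(issue)
print(suspicious_default_birthdays(records))
```

In this toy data, record 1 trips all three per-record checks, while the Jan. 1 pattern only shows up when looking across records. That is why disguised missing data is harder to catch than an empty field or a negative salary.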

