Data Cleaning as a Process

  • Data discrepancy detection
    • Use metadata (e.g., domain, range, dependency, distribution)
    • Check field overloading
    • Check uniqueness rule, consecutive rule and null rule
    • Use commercial tools
      • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
      • Data auditing: analyze the data to discover rules and relationships, then detect violators (e.g., correlation and clustering to find outliers)
  • Data migration and integration
    • Data migration tools: allow transformations to be specified
    • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
  • Integration of the two processes
    • Iterative and interactive (e.g., Potter’s Wheel)
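
The discrepancy checks listed above (range metadata, uniqueness rule, consecutive rule, null rule) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the behavior of any particular commercial tool: the function names, the null markers, the age range of 0 to 120, and the sample records are all hypothetical.

```python
# Hypothetical sample records, invented for illustration only.
rows = [
    {"check_no": 101, "customer_id": "C1", "age": 34},
    {"check_no": 102, "customer_id": "C2", "age": None},
    {"check_no": 104, "customer_id": "C2", "age": 210},
]

def check_range(rows, field, low, high):
    """Metadata check: indices of rows whose value lies outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r.get(field) is not None and not (low <= r[field] <= high)]

def check_unique_rule(rows, field):
    """Uniqueness rule: indices of rows that repeat an earlier value."""
    seen = set()
    duplicates = []
    for i, r in enumerate(rows):
        value = r.get(field)
        if value in seen:
            duplicates.append(i)
        else:
            seen.add(value)
    return duplicates

def check_consecutive_rule(rows, field):
    """Consecutive rule: integer values missing between the min and max."""
    values = {r[field] for r in rows if isinstance(r.get(field), int)}
    if not values:
        return []
    return sorted(set(range(min(values), max(values) + 1)) - values)

def check_null_rule(rows, field, null_markers=("", None, "?", "N/A")):
    """Null rule: indices of rows whose value is one of the assumed null markers."""
    return [i for i, r in enumerate(rows) if r.get(field) in null_markers]

out_of_range = check_range(rows, "age", 0, 120)
duplicate_ids = check_unique_rule(rows, "customer_id")
missing_checks = check_consecutive_rule(rows, "check_no")
null_ages = check_null_rule(rows, "age")
print(out_of_range, duplicate_ids, missing_checks, null_ages)
```

Each function only reports suspect rows or missing values. Deciding how to correct them is left to the iterative, interactive step described in the last bullet.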
