
Agenda
- Data Warehouse: Basic Concepts
- Data Warehouse Modeling: Data Cube and OLAP
- Data Warehouse Design and Usage
- Data Warehouse Implementation
- Data Generalization by Attribute-Oriented Induction
- Summary
What is a Data Warehouse?
- Defined in many different ways, but not rigorously.
- A decision support database that is maintained separately from the organization’s operational database
- Support information processing by providing a solid platform of consolidated, historical data for analysis.
- “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
- Data warehousing:
- The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
- Organized around major subjects, such as customer, product, sales
- Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated
- Constructed by integrating multiple, heterogeneous data sources
- relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied.
- Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
- E.g., Hotel price: currency, tax, breakfast covered, etc.
- When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
- Operational database: current value data
- Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse
- Contains an element of time, explicitly or implicitly
- But the key of operational data may or may not contain “time element”
Data Warehouse—Nonvolatile
- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
- Does not require transaction processing, recovery, and concurrency control mechanisms
- Requires only two operations in data accessing:
- initial loading of data and access of data
OLTP vs. OLAP
OLTP | OLAP | |
users | clerk, IT professional | knowledge worker |
function DB design | day to day operations application-oriented | decision support subject-oriented |
data | surrent, up-to-date detailed, flat relational isolated | historical, summarized, multidimensional integrated, sonsolidated |
usage access | repetetive read/write index/hash on prim key | ad-hoc lots of scans |
unit of work #records accessed | short, simple transaction tens | complex query millions |
#users DB size metric | thousands 100MB-GB transaction throughput | hundreds 100GB-TB query throughput, response |
Why a Separate Data Warehouse?
- High performance for both systems
- DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
- missing data: Decision support requires historical data which operational DBs do not typically maintain
- data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
- data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
- Note: There are more and more systems which perform OLAP analysis directly on relational databases
Data Warehouse: A Multi-Tiered ArchitectureUntitled

Three Data Warehouse Models
- Enterprise warehouse
- collects all of the information about subjects spanning the entire organization
- Data Mart
- a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart
- Independent vs. dependent (directly from warehouse) data mart
- Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be materialized
Extraction, Transformation, and Loading (ETL)
- Data extraction
- get data from multiple, heterogeneous, and external sources
- Data cleaning
- detect errors in the data and rectify them when possible
- Data transformation
- convert data from legacy or host format to warehouse format
- Load
- sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions
- Refresh
- propagate the updates from the data sources to the warehouse
Metadata Repository
- Meta data is the data defining warehouse objects. It stores:
- Description of the structure of the data warehouse
- schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents
- Operational meta-data
- data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
- The algorithms used for summarization
- The mapping from operational environment to the data warehouse
- Data related to system performance
- warehouse schema, view and derived data definitions
- Business data
- business terms and definitions, ownership of data, charging policies
From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model which views data in the form of a data cube
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
- Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
- Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
- In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions & measures
- Star schema: A fact table in the middle connected to a set of dimension tables
- Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
- Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
Example of Snowflake Schema
Example of Fact Constellation
A Concept Hierarchy: Dimension (location)
Data Cube Measures: Three Categories
- Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
- E.g., count(), sum(), min(), max()
- Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
- E.g., avg(), min_N(), standard_deviation()
- Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
- E.g., median(), mode(), rank()
View of Warehouses and Hierarchies

Multidimensional Data
- Sales volume as a function of product, month, and region
A Sample Data Cube

Cuboids Corresponding to the Cube

Typical OLAP Operations
- Roll up (drill-up): summarize data
- by climbing up hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
- from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate):
- reorient the cube, visualization, 3D to series of 2D planes
- Other operations
- drill across: involving (across) more than one fact table
- drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
Typical OLAP Operations
A Star-Net Query Model
Browsing a Data Cube
Design of Data Warehouse: A Business Analysis Framework
- Four views regarding the design of a data warehouse
- Top-down view
- allows selection of the relevant information necessary for the data warehouse
- Data source view
- exposes the information being captured, stored, and managed by operational systems
- Data warehouse view
- consists of fact tables and dimension tables
- Business query view
- sees the perspectives of data in the warehouse from the view of end-user
Data Warehouse Design Process
- Top-down, bottom-up approaches or a combination of both
- Top-down: Starts with overall design and planning (mature)
- Bottom-up: Starts with experiments and prototypes (rapid)
- From software engineering point of view
- Waterfall: structured and systematic analysis at each step before proceeding to the next
- Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around
- Typical data warehouse design process
- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measure that will populate each fact table record
Data Warehouse Development: A Recommended Approach
Data Warehouse Usage
- Three kinds of data warehouse applications
- Information processing
- supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
- Analytical processing
- multidimensional analysis of data warehouse data
- supports basic OLAP operations, slice-dice, drilling, pivoting
- Data mining
- knowledge discovery from hidden patterns
- supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)
- Why online analytical mining?
- High quality of data in data warehouses
- DW contains integrated, consistent, cleaned data
- Available information processing structure surrounding data warehouses
- ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
- OLAP-based exploratory data analysis
- Mining with drilling, dicing, pivoting, etc.
- On-line selection of data mining functions
- Integration and swapping of multiple mining functions, algorithms, and tasks
Efficient Data Cube Computation
- Data cube can be viewed as a lattice of cuboids
- The bottom-most cuboid is the base cuboid
- The top-most cuboid (apex) contains only one cell
- How many cuboids in an n-dimensional cube with L levels?
\[ T = \prod_{i=1}^{n}(L_{i}+1) \]
- Materialization of data cube
- Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
- Selection of which cuboids to materialize
- Based on size, sharing, access frequency, etc.
The “Compute Cube” Operator
- Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales - Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year - Need compute the following Group-Bys
(date, product, customer),
(date,product),(date, customer), (product, customer),
(date), (product), (customer)
()
Indexing OLAP Data: Bitmap Index
- Index on a particular column
- Each value in the column has a bit vector: bit-op is fast
- The length of the bit vector: # of records in the base table
- The i-th bit is set if the i-th row of the base table has the value for the indexed column
- not suitable for high cardinality domains
- A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well [Wu, et al. TODS’06]
Indexing OLAP Data: Join Indices
- Join index: JI(R-id, S-id) where R (R-id, …) join S (S-id, …)
- Traditional indices map the values to a list of record ids
- It materializes relational join in JI file and speeds up relational join
- In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table.
- E.g. fact table: Sales and two dimensions city and product
- A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
- Join indices can span multiple dimensions
Efficient Processing OLAP Queries
- Determine which operations should be performed on the available cuboids
- Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
- Determine which materialized cuboid(s) should be selected for OLAP op.
- Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
- 1) {year, item_name, city}
- 2) {year, brand, country}
- 3) {year, brand, province_or_state}
- 4) {item_name, province_or_state} where year = 2004
- Which should be selected to process the query?
- Explore indexing structures and compressed vs. dense array structs in MOLAP
OLAP Server Architectures
- Relational OLAP (ROLAP)
- Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware
- Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
- Greater scalability
- Multidimensional OLAP (MOLAP)
- Sparse array-based multidimensional storage engine
- Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
- Flexibility, e.g., low level: relational, high-level: array
- Specialized SQL servers (e.g., Redbricks)
- Specialized support for SQL queries over star/snowflake schemas
Attribute-Oriented Induction
- Proposed in 1989 (KDD ‘89 workshop)
- Not confined to categorical data nor particular measures
- How it is done?
- Collect the task-relevant data (initial relation) using a relational database query
- Perform generalization by attribute removal or attribute generalization
- Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
- Interaction with users for knowledge presentation
Attribute-Oriented Induction: An Example
- Example: Describe general characteristics of graduate students in the University database
- Step 1. Fetch relevant set of data using an SQL statement, e.g.,
- Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa)
- from student
- where student_status in {“Msc”, “MBA”, “PhD” }
- Step 2. Perform attribute-oriented induction
- Step 3. Present results in generalized relation, cross-tab, or rule forms
Class Characterization: An Example
Basic Principles of Attribute-Oriented Induction
- Data focusing: task-relevant data, including dimensions, and the result is the initial relation
- Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes
- Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
- Attribute-threshold control: typical 2-8, specified/default
- Generalized relation threshold control: control the final relation/rule size
Attribute-Oriented Induction: Basic Algorithm
- InitialRel: Query processing of task-relevant data, deriving the initial relation.
- PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?
- PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.
- Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
Presentation of Generalized Results
- Generalized relation:
- Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
- Cross tabulation:
- Mapping results into cross tabulation form (similar to contingency tables).
- Visualization techniques:
- Pie charts, bar charts, curves, cubes, and other visual forms.
- Quantitative characteristic rules:
- Mapping generalized result into characteristic rules with quantitative information associated with it, e.g.,
grad(x) Λ male(x) ⇒ birth_region(x) = “Canadd[t:53%] ∨ birth_region(x) = “foreign[t:47%]
Mining Class Comparisons
- Comparison: Comparing two or more classes
- Method:
- Partition the set of relevant data into the target class and the contrasting class(es)
- Generalize both classes to the same high level concepts
- Compare tuples with the same high level descriptions
- Present for every tuple its description and two measures
- support - distribution within single class
- comparison - distribution between classes
- Highlight the tuples with strong discriminant features
- Relevance Analysis:
- Find attributes (features) which best distinguish different classes
Concept Description vs. Cube-Based OLAP
- Similarity:
- Data generalization
- Presentation of data summarization at multiple levels of abstraction
- Interactive drilling, pivoting, slicing and dicing
- Differences:
- OLAP has systematic preprocessing, query independent, and can drill down to rather low level
- AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions
- AOI works on the data which are not in relational forms
Summary
- Data warehousing: A multi-dimensional model of a data warehouse
- A data cube consists of dimensions & measures
- Star schema, snowflake schema, fact constellations
- OLAP operations: drilling, rolling, slicing, dicing and pivoting
- Data Warehouse Architecture, Design, and Usage
- Multi-tiered architecture
- Business analysis design framework
- Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)
- Implementation: Efficient computation of data cubes
- Partial vs. full vs. no materialization
- Indexing OALP data: Bitmap index and join index
- OLAP query processing
- OLAP servers: ROLAP, MOLAP, HOLAP
- Data generalization: Attribute-oriented induction
References
- S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96
- D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97
- R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
- S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997
- E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.
- J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
- A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.
- J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998.
- V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96
References (cont')
- C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003
- W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
- R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002
- P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
- Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998
- A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
- S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
- P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
- J. Widom. Research problems in data warehousing. CIKM’95.
- K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1), 2006, pp. 1-38.
- March 13, 2013
