### Data Cube: A Lattice of Cuboids

### Data Cube: A Lattice of Cuboids

### Cube Materialization: Full Cube vs. Iceberg Cube

- Full cube vs. iceberg cube

- compute cube sales iceberg as
- select month, city, customer group, count(*)
- from salesInfo
- cube by month, city, customer group
- iceberg condition: having count(*) >= min support

- Computing
*only*the cuboid cells whose measure satisfies the iceberg condition - Only a small portion of cells may be “above the water’’ in a sparse cube
- Avoid explosive growth: A cube with 100 dimensions
- 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100)
- How many aggregate cells if “having count >= 1”?
- What about “having count >= 2”?

### Iceberg Cube, Closed Cube & Cube Shell

- Is iceberg cube good enough?
- 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
- How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number!
- Close cube:
- Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c.
- Closed cube: a cube consisting of only closed cells
- What is the closed cube of the above base cuboid? Hint: only 3 cells
- Cube Shell
- Precompute only the cuboids involving a small # of dimensions, e.g., 3
- More dimension combinations will need to be computed on the fly

- For (A1, A2, … A10), how many combinations to compute?

### Roadmap for Efficient Computation

- General cube computation heuristics (Agarwal et al.’96)

- Computing full/iceberg cubes: 3 methodologies
- Bottom-Up: Multi-Way array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97)
- Top-down:
- BUC (Beyer & Ramarkrishnan, SIGMOD’99)
- H-cubing technique (Han, Pei, Dong & Wang: SIGMOD’01)
- Integrating Top-Down and Bottom-Up:
- Star-cubing algorithm (Xin, Han, Li & Wah: VLDB’03)
- High-dimensional OLAP: A Minimal Cubing Approach (Li, et al. VLDB’04)
- Computing alternative kinds of cubes:
- Partial cube, closed cube, approximate cube, etc.

### General Heuristics (Agarwal et al. VLDB’96)

- Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples

- Aggregates may be computed from previously computed aggregates, rather than from the base fact table
**Smallest-child:**computing a cuboid from the smallest, previously computed cuboid**Cache-results:**caching results of a cuboid from which other cuboids are computed to reduce disk I/Os**Amortize-scans:**computing as many as possible cuboids at the same time to amortize disk reads**Share-sorts:**sharing sorting costs cross multiple cuboids when sort-based method is used**Share-partitions:**sharing the partitioning cost across multiple cuboids when hash-based algorithms are used