Data Cube: A Lattice of Cuboids
Data Cube: A Lattice of Cuboids
Cube Materialization: Full Cube vs. Iceberg Cube
 Full cube vs. iceberg cube
 compute cube sales iceberg as
 select month, city, customer group, count(*)
 from salesInfo
 cube by month, city, customer group
 iceberg condition: having count(*) >= min support
 Computing only the cuboid cells whose measure satisfies the iceberg condition
 Only a small portion of cells may be “above the water’’ in a sparse cube
 Avoid explosive growth: A cube with 100 dimensions
 2 base cells: (a1, a2, …., a100), (b1, b2, …, b100)
 How many aggregate cells if “having count >= 1”?
 What about “having count >= 2”?
Iceberg Cube, Closed Cube & Cube Shell
 Is iceberg cube good enough?
 2 base cells: {(a1, a2, a3 . . . , a100):10, (a1, a2, b3, . . . , b100):10}
 How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number!
 Close cube:
 Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c.
 Closed cube: a cube consisting of only closed cells
 What is the closed cube of the above base cuboid? Hint: only 3 cells
 Cube Shell
 Precompute only the cuboids involving a small # of dimensions, e.g., 3
 More dimension combinations will need to be computed on the fly
 For (A1, A2, … A10), how many combinations to compute?
Roadmap for Efficient Computation
 General cube computation heuristics (Agarwal et al.’96)
 Computing full/iceberg cubes: 3 methodologies
 BottomUp: MultiWay array aggregation (Zhao, Deshpande & Naughton, SIGMOD’97)
 Topdown:
 BUC (Beyer & Ramarkrishnan, SIGMOD’99)
 Hcubing technique (Han, Pei, Dong & Wang: SIGMOD’01)
 Integrating TopDown and BottomUp:
 Starcubing algorithm (Xin, Han, Li & Wah: VLDB’03)
 Highdimensional OLAP: A Minimal Cubing Approach (Li, et al. VLDB’04)
 Computing alternative kinds of cubes:
 Partial cube, closed cube, approximate cube, etc.
General Heuristics (Agarwal et al. VLDB’96)
 Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
 Aggregates may be computed from previously computed aggregates, rather than from the base fact table
 Smallestchild: computing a cuboid from the smallest, previously computed cuboid
 Cacheresults: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os
 Amortizescans: computing as many as possible cuboids at the same time to amortize disk reads
 Sharesorts: sharing sorting costs cross multiple cuboids when sortbased method is used
 Sharepartitions: sharing the partitioning cost across multiple cuboids when hashbased algorithms are used
Creator: sidraaslam
Contributors:

Licensed under the Creative Commons
Attribution ShareAlike CCBYSA license
This deck was created using SlideWiki.