Data Cube Implementation
pm jat @ daiict
21-09-2023 Data Cube Implementation 1
Summaries in “Data Cube”
Origin: ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) 𝑓𝑎𝑐𝑡
Item Axis: 𝐼𝑡𝑒𝑚𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)
Store Axis: 𝑆𝑡𝑜𝑟𝑒𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)
Customer Axis: 𝐶𝑢𝑠𝑡𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)
Floor: 𝐼𝑡𝑒𝑚𝐼𝐷,𝑆𝑡𝑜𝑟𝑒𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)
Front Wall: 𝐼𝑡𝑒𝑚𝐼𝐷,𝐶𝑢𝑠𝑡𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) 𝑓𝑎𝑐𝑡
Right Side Wall: 𝑆𝑡𝑜𝑟𝑒𝐼𝐷,𝐶𝑢𝑠𝑡𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)
Rest: 𝐶𝑢𝑠𝑡,𝑆𝑡𝑜𝑟𝑒,𝐼𝑡𝑒𝑚 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) 𝑓𝑎𝑐𝑡
Data Cube is a set of “cuboids” for “each
possible subset of the given
21-09-2023 dimensions”. Correct?
Data Cube Implementation 2
“Data Cube” as
Lattice of Cuboids
GROUP BY time
GROUP BY time, item
GROUP BY time, item, location
GROUP BY time, item, location, Supplier
21-09-2023
Figure Source: Data Mining Textbook [9] Data Cube Implementation 3
Data Cube as “lattice of Cuboids”
• Data Cube is represented as a “lattice of Cuboids”, where
• Node: Cuboid
– that contains “aggregated values” (called as measures) for “a
dimension(attribute) combination”
• Edge: Parent-Child Relationship, where
– child node has one (exactly one) extra attribute (dimension) than its parent
– child node cuboid can be computed from the parent cuboid aggregations
• Data cube is a set of “cuboids” for “each possible subset of the given dimensions”.
Correct?
21-09-2023 Data Cube Implementation 4
Data Cube as “4D cuboid” 4D representation of Lattice
shown here
21-09-2023
Figure Source: Data Mining Textbook [9] Data Cube Implementation 5
Querying Lattice
• Each node is
a view.
• “dimension
hierarchies”
• Rollup, and
Drill down?
A query can be reading full
node or part of a node from the lattice
21-09-2023
Figure Source: Data Mining Textbook[9] Data Cube Implementation 6
“Cube” “location” “Data Hierarchy”
Lattices can even be maintained
for different specific attributes –
making the cube further lower
granularity
21-09-2023 Data Cube Implementation 7
Figure Source: Data Mining Textbook[9]
“Cube” “Price” “Data Hierarchy”
Lattices can even be maintained for
different specific attribute values –
making the cube further lower
granularity
Figure Source: Data Mining Textbook[9]
21-09-2023 Data Cube Implementation 8
A Simple MR approach for Lattice Computation
Questions Remain: How efficient it would be? Can the resulting cube be materialized?
If yes, where do we store it? Can we
21-09-2023 Datahave an index on dimensions?
Cube Implementation 9
A Simple Spark solution for Lattice Computation
How do we materialize, index, search?
21-09-2023 BTW, What is major limitation of Spark-SQL wrt “SQL”
Data Cube Implementation 10 ?
OLAP Systems
• Materialized “Pre-Computation” and “Pre-Aggregation” of Lattice Cuboids remain
the key of OLAP systems for response time for interactive data analysis.
• Many systems offer query optimization based on pre-computed aggregates and
automatic maintenance of stored aggregates during updating of base data.
• Materializing all combinations of aggregates may become infeasible because it takes
too much storage and initial computation time.
• Instead, modern OLAP systems adopt the practical pre-aggregation approach of
materializing only select combinations of aggregates and then reusing these to
efficiently compute other aggregates.
21-09-2023 Data Cube Implementation 11
Types of OLAP Systems
• Relational OLAP (ROLAP) – Lattice Cuboids are stored in a Relational Tables
• Multidimensional OLAP (MOLAP) [3][4] – Lattices are stored as multi-dimensional
arrays, or so?
• Hybrid OLAP (HOLAP)
21-09-2023 Data Cube Implementation 12
ROLAP: Cube Lattice as a Relation
Identify:
• Base Cuboid? what is the dimension?
• Identify 2-D, 1-D, and 0-D cuboids?
21-09-2023 Data Cube Implementation 13
Relational OLAP (ROLAP)
• “Relational OLAP (ROLAP) systems use relational database technology for storing
data, and they also employ specialized index structures, such as bit-mapped indices,
to achieve good query performance” [3]
– Here we have schemas like “Star” and “Snowflake”
– Indexes based on Dimension attributes - may span to dimension tables
• Materialized Cubes are also stored in Relational tables.
– Cube may not be stored in Full. Say only base cubes are materialized, or say up to
N-M levels are stored, and so on.
21-09-2023 Data Cube Implementation 14
Relational OLAP (ROLAP)
• Strength is “Already established and are Robust Systems”
– Large data values could be efficiently processed using robust relational systems.
• With the help of indexes, particularly “bitmap indexes”
• Still, the disadvantages are:
– could be inefficient on ad hoc queries where we do not have materialized
“Cube”.
– May not scale, unsuited for “Big Data”
21-09-2023 Data Cube Implementation 15
Multidimensional OLAP (MOLAP) systems
• Cube lattices are represented in some other structure like multi-dimensional arrays.
• “Multidimensional data cubes are stored on disks in specialized multidimensional
structures”. [3]
• “MOLAP systems typically include provisions for handling sparse arrays and apply advanced
indexing and hashing to locate the data when performing queries” [3]
• MOLAP systems generally provide more space-efficient storage as well as faster query
response times.
• Options here are
– Proprietary MOLAP systems
– Multi-dimensional databases (MDDB). Some Relational Systems also support it, for
instance, Oracle [11]
– Key value stores becoming more common in “Big Data”
21-09-2023 Data Cube Implementation 16
Multidimensional OLAP (MOLAP)
• Though Multidimensional OLAP systems are a natural choice for data cubes and all
• BUT, traditionally they have been considered unsuitable for larger data volumes
– May require high main memory
21-09-2023 Data Cube Implementation 17
[3] Pedersen, Torben Bach, and Christian S. Jensen.
21-09-2023
"Multidimensional database technology." Computer 34.12 Data Cube Implementation
(2001) 18
Multidimensional Database History
[3] Pedersen, Torben Bach, and Christian S. Jensen.
21-09-2023
"Multidimensional database technology." Computer 34.12 Data Cube Implementation
(2001) 19
OLAP Systems
• Traditionally:
– Most DBMS systems “do support”
– Special planning is required in terms of “building”, “indexing”, “updating” data
cubes
– Special extensions for extract, integrated, load or third party tools!
• Large scale OLAP systems?
– Apache Kylin on Hadoop,
– Apache Druid,
– Presto,
– Cassandra
21-09-2023 Data Cube Implementation 20
Apache Kylin
• The Kylin project was started in 2013, in eBay's R&D in China.
• In November 2014, Kylin joined the Apache Software
https://en.wikipedia.org/wiki/Apache_Kylin
• Keylin 4 is the latest stable version:
https://kylin.apache.org/docs/index.html
21-09-2023 Data Cube Implementation 21
Apache Kylin Rationale
• Kylin’s core idea is the precomputation of result sets
• It calculates all possible query results in advance according to the specified
dimensions and indicators and speed up OLAP queries with fixed query patterns. .
https://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-architecture/
21-09-2023 Data Cube Implementation 22
Apache Kylin Rationale
• Kylin’s core idea is the precomputation of result sets
• It calculates all possible query results in advance according to the specified
dimensions and indicators and speed up OLAP queries with fixed query patterns. .
https://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-architecture/
21-09-2023 Data Cube Implementation 23
Apache Kylin - Architecture
21-09-2023 Data Cube Implementation 24
https://kylin.apache.org/index.html
Apache Kylin - Architecture
21-09-2023 Data Cube Implementation 25
Apache Kylin
21-09-2023 Data Cube Implementation
https://kylin.apache.org/docs/index.html 26
21-09-2023 Data Cube Implementation
https://kylin.apache.org/docs/index.html 27
Related Readings
[1] Lee, Suan, et al. "Scalable distributed data cube computation for large-scale multidimensional data
analysis on a Spark cluster." Cluster Computing 22 (2019): 2063-2087.
[2] Chen, Wenhao, et al. "An optimized distributed OLAP system for big data." 2017 2nd IEEE
International Conference on Computational Intelligence and Applications (ICCIA). IEEE, 2017.
[3] Pedersen, Torben Bach, and Christian S. Jensen. "Multidimensional database technology." Computer
34.12 (2001): 40-46.
[4] Thomsen, Erik. OLAP solutions: building multidimensional information systems. John Wiley & Sons,
2002.
[5] Jiatao Tao, Apache Kylin4 — A new storage and compute architecture, Apache Kylin Technical Blog,
2021. https://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-
architecture/
[6] Pedersen, Torben Bach, and Christian S. Jensen. "Multidimensional database technology." Computer
34.12 (2001): 40-46.
21-09-2023 Data Cube Implementation 28
Related Readings
[7] Yang Li, The future of Apache Kylin: More powerful and easy-to-use OLAP,
https://kylin.apache.org/blog/2022/01/12/The-Future-Of-Kylin/
[8] https://cwiki.apache.org/confluence/display/KYLIN/Global+Dictionary+on+Spark
[9] (Book) Han, Jiawei, Micheline Kamber, and Jian Pei. "Data mining concepts and techniques third
edition." University of Illinois at Urbana-Champaign Micheline Kamber Jian Pei Simon Fraser University
(2012).
[10] (Book) Thomsen, Erik. OLAP solutions: building multidimensional information systems. John Wiley &
Sons, 2002.
[11] Oracle, “Understanding Multidimensional Databases”
https://docs.oracle.com/cd/E12825_01/epm.111/esb_dbag/frameset.htm?dinconc.htm
21-09-2023 Data Cube Implementation 29