Apache Kylin’s Performance Boost from Apache HBase
Hongbin Ma, Luke Han
Kyligence Inc.
Apache Kylin’s
Performance Boost from
Apache HBase
About us
Hongbin Ma| 马洪宾
 PMC member of Apache Kylin
 Technical partner of Kyligence Inc.
 mahongbin@apache.org
Kyligence Inc.
 Kyligence is a leading data intelligence company focusing on Big Data technologies and
innovation, offering an intelligent platform and products powered by Apache Kylin™ for
enterprise-ready business analytics solutions.
Luke Han | 韩卿
 Co-creator & VP of Apache Kylin
 ASF Member
 Co-founder & CEO at Kyligence Inc.
 lukehan@apache.org
Apache Kylin aerial view
MapReduce/Spark
Kylin
BI Tools, Web App…
ANSI SQL
What is Apache Kylin
 Apache Kylin is an open source distributed analytics engine that
provides a SQL interface for multi-dimensional analysis on Hadoop
 Works well with extremely large datasets
 Provides REST API, ODBC and JDBC interfaces for users
 Widely adopted by many companies such as eBay, JD, Baidu, NetEase, VIP.com, etc.
Apache Kylin Global Adoptions
What is Apache Kylin (continued)
 Apache Kylin pre-calculates OLAP cubes with a horizontally scalable
computation framework (MapReduce, Spark, etc.) and stores the cubes
in a reliable and scalable data store (HBase, Cassandra, etc.)
Architecture Design
[Diagram: Kylin architecture]
 Clients: 3rd party apps (web app, mobile…) via REST API; SQL-based tools (BI tools: Tableau…) via JDBC/ODBC
 REST Server → Query Engine → Routing → OLAP Cubes (HBase), with low latency (seconds)
 Cube Builder (MapReduce, Spark, etc.) turns star schema data in Hadoop/Hive into key-value data stored as OLAP cubes in HBase
 Metadata
 Pluggable layers: DataSource Abstraction, Engine Abstraction, Storage Abstraction
 Two flows: online analysis data flow and offline data flow
 Clients/users interact with Kylin via SQL
 OLAP cube is transparent to users
Cube data explained
dimensions cuboid cuboid lattice
Cubes stored in HBase
Let’s take a look at
cuboid (D1,D3,D5)
where all dimensions are:
(D1,D2,D3,D4,D5)
This cuboid is denoted as “cuboid 00010101”
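The cuboid naming above can be read as a bitmask with one bit per dimension. The following is a minimal sketch, assuming the first dimension maps to the highest of the used bits and the ID is printed zero-padded to 8 bits; the function name `cuboid_id` is hypothetical and not taken from Kylin's code.

```python
def cuboid_id(all_dims, cuboid_dims):
    """Return a bitmask with one bit per dimension.

    Assumption: the first dimension in all_dims maps to the highest used bit.
    """
    n = len(all_dims)
    mask = 0
    for pos, dim in enumerate(all_dims):
        if dim in cuboid_dims:
            mask |= 1 << (n - 1 - pos)
    return mask


all_dims = ["D1", "D2", "D3", "D4", "D5"]
mask = cuboid_id(all_dims, {"D1", "D3", "D5"})
print(format(mask, "08b"))  # 00010101
```

Under this reading, D1, D3 and D5 set bits 4, 2 and 0, which gives 10101 and matches the slide's “00010101” once padded.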
Why HBase as the first choice?
 Well integrated with Hadoop
 Block encoding to reduce storage footprint
 Good at both seeking and scanning
 Coprocessors to move computation to data
 Scalable and flexible as a data store
How Kylin queries HBase
[Diagram: Kylin Query Server sends requests to the coprocessor of each region on the region servers]
Row layout: CuboidID | SellerID | Date | Country | Metrics…
1. Filter/Aggregation push down
2. Scan with Fuzzy Key Filter
3. Half-baked results
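The fuzzy key scan in step 2 can be illustrated with a small in-memory sketch. It assumes fixed-width row key fields (1-byte cuboid ID, 4-byte seller ID, 8-character date, 2-character country) and a mask in which 1 means "this byte must match"; these widths, the mask convention and the helper names are assumptions for illustration, not HBase's or Kylin's actual encoding.

```python
def make_key(cuboid: int, seller: int, date: str, country: str) -> bytes:
    """Build a fixed-width row key: CuboidID | SellerID | Date | Country."""
    return bytes([cuboid]) + seller.to_bytes(4, "big") + date.encode() + country.encode()


def fuzzy_match(row_key: bytes, pattern: bytes, fixed: bytes) -> bool:
    """True if row_key equals pattern at every position where fixed is 1."""
    if len(row_key) != len(pattern):
        return False
    return all(f == 0 or r == p for r, p, f in zip(row_key, pattern, fixed))


rows = [
    make_key(7, 1, "20160101", "US"),
    make_key(7, 2, "20160101", "CN"),
    make_key(7, 3, "20160102", "US"),
    make_key(7, 1, "20160102", "US"),
]

# Query: cuboid 7, any seller, any date, country = US.
pattern = make_key(7, 0, "00000000", "US")
fixed = bytes([1]) + bytes(4) + bytes(8) + bytes([1, 1])

matched = [r for r in rows if fuzzy_match(r, pattern, fixed)]
print(len(matched))  # 3
```

Because the filter constrains only the cuboid ID and the trailing country bytes, rows for every seller and date still have to be visited, which is why the layout of the row key matters for scan cost.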
May still be slow when
 The cuboid is large because it contains many dimension-value combinations
 The cuboid layout is not query-friendly, e.g. filtering on suffix dimensions while
grouping by prefix dimensions
 The filter in the query is huge and complex
 Regions return too many half-baked results
Solution: Cube + MPP
Kylin Query
Server
Novelty
 Compared with “pure” MPP solutions
 Cube data is more query-friendly because it is pre-aggregated and sorted.
 Faster speed
 Less CPU consumption
 Less storage read
 Able to leverage column storage and inverted index just like typical MPP
 Compared with “pure” Cubing technologies
 Overcome the bottleneck in cube size
 Overcome the bottleneck in cube visiting speed
Problem
 The sizes of different cuboids in the same cube may vary
 Too much parallelism for small cuboids is harmful
 An RPC is required for each shard, so we don’t want to waste network/CPU
resources
Solution: Shard Circle
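[Figure: shards numbered 0 to 9 arranged in a circle]
Given the estimated size of each cuboid $S_i$
and the expected size of each region $S_r$ (specified by the modeler):
\[ \mathit{regionNum} = \frac{\sum_i S_i}{S_r} \]
\[ \mathit{cuboidRegionNum}_i = \frac{S_i \cdot \mathit{factor}}{S_r} \]
\[ \mathit{cuboidCircleStart}_i = \mathit{hash}(i) \bmod \mathit{regionNum} \]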
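The shard circle can be sketched as follows. This is an illustration under several assumptions that the slide does not state: regionNum is taken as the total estimated cube size divided by the region size, counts are rounded up and clamped to at least 1, CRC32 stands in for the hash, and each cuboid occupies consecutive shards on the circle starting from its start position. The function name `plan_shards` is hypothetical.

```python
import math
import zlib


def plan_shards(cuboid_sizes, region_size, factor=1.0):
    """Map each cuboid ID to the list of shard numbers it occupies.

    cuboid_sizes: dict of cuboid ID -> estimated size (same unit as region_size).
    Rounding up and clamping are assumptions, not taken from the slide.
    """
    region_num = max(1, math.ceil(sum(cuboid_sizes.values()) / region_size))
    plan = {}
    for cid, size in cuboid_sizes.items():
        count = math.ceil(size * factor / region_size)
        count = max(1, min(region_num, count))
        # CRC32 is a stand-in hash; the start position wraps around the circle.
        start = zlib.crc32(str(cid).encode()) % region_num
        plan[cid] = [(start + k) % region_num for k in range(count)]
    return plan


sizes = {31: 6000, 21: 500, 1: 40}  # hypothetical estimated sizes in MB
plan = plan_shards(sizes, region_size=1000)
for cid, shards in plan.items():
    print(cid, shards)
```

With these numbers the large cuboid spreads over six of the seven shards, while the two small cuboids each land on a single shard, so a query against a small cuboid needs only one RPC.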
Salted Cuboid Rows
 ShardID at the beginning of the row key
 Configurable policies for computing ShardID
 From the hash of the remaining row key: facilitates randomization
 From specific dimension values: facilitates runtime performance
Row layout: ShardID | CuboidID | SellerID | Date | Country | Metrics…
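The two ShardID policies can be sketched side by side. This is a minimal illustration, assuming a 1-byte ShardID prefix, CRC32 as the hash and a fixed shard count of 4; the names `shard_by_row_hash`, `shard_by_dimension` and `salted_row_key` are hypothetical.

```python
import zlib

SHARD_NUM = 4  # hypothetical number of shards for this cuboid


def shard_by_row_hash(rest_of_key: bytes) -> int:
    """Policy 1: hash the remaining row key, which spreads rows evenly."""
    return zlib.crc32(rest_of_key) % SHARD_NUM


def shard_by_dimension(dim_value: bytes) -> int:
    """Policy 2: hash one dimension value, so equal values share a shard."""
    return zlib.crc32(dim_value) % SHARD_NUM


def salted_row_key(shard_id: int, rest_of_key: bytes) -> bytes:
    """Prefix the row key with a 1-byte ShardID."""
    return bytes([shard_id]) + rest_of_key


# Two rows for the same seller on different dates.
rest_a = b"\x15" + b"seller42" + b"20160101" + b"US"
rest_b = b"\x15" + b"seller42" + b"20160102" + b"US"

key_a = salted_row_key(shard_by_row_hash(rest_a), rest_a)
same_shard = shard_by_dimension(b"seller42") == shard_by_dimension(b"seller42")
print(key_a[0], same_shard)
```

The first policy may place `rest_a` and `rest_b` on different shards, while the second always keeps every row of a given seller together.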
Compute ShardID from SellerID
 For queries that group by SellerID
 Each shard aggregates a disjoint subset of SellerIDs
 No further aggregation is needed on the merge side
 For queries that filter by SellerID
 The pushed-down SellerID filter can be trimmed to contain only the SellerIDs
that belong to each shard
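Both effects of sharding by SellerID can be shown in a small sketch. It assumes CRC32 as the hash and 4 shards, and models each shard's coprocessor as a per-shard dictionary; all names here are hypothetical.

```python
import zlib
from collections import defaultdict

SHARD_NUM = 4  # hypothetical shard count


def shard_of(seller_id: str) -> int:
    """ShardID derived from the SellerID only."""
    return zlib.crc32(seller_id.encode()) % SHARD_NUM


def aggregate_per_shard(rows):
    """Simulate each shard summing amounts per seller for its own rows."""
    shards = defaultdict(lambda: defaultdict(int))
    for seller, amount in rows:
        shards[shard_of(seller)][seller] += amount
    return shards


def merge(shards):
    """Merge side: shard results are disjoint, so a plain union is enough."""
    merged = {}
    for partial in shards.values():
        merged.update(partial)
    return merged


def trim_filter(seller_ids, shard_id):
    """Keep only the SellerIDs that can exist on the given shard."""
    return [s for s in seller_ids if shard_of(s) == shard_id]


rows = [("s1", 10), ("s2", 5), ("s1", 7), ("s3", 1)]
result = merge(aggregate_per_shard(rows))
print(result == {"s1": 17, "s2": 5, "s3": 1})  # True

wanted = ["s1", "s3", "s9"]
per_shard_filters = {sid: trim_filter(wanted, sid) for sid in range(SHARD_NUM)}
```

Each seller appears in exactly one shard's result, so the merge never has to add two partial sums together, and each shard receives only its own slice of the filter list.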
Experimental results
Small cuboids getting fewer shards
[Chart: one bar per query, y-axis from 0 to 1.2; legend: 13 regions, 23 regions]
SQL 1: 1.005586592
SQL 2: 0.625
SQL 3: 0.625
SQL 4: 0.678571429
SQL 5: 0.794117647
Q & A
To get more information about Apache Kylin:
 Apache Kylin Website: http://kylin.apache.org
 Kyligence Website: http://kyligence.io
 Twitter: @ApacheKylin
 Mail list: dev@kylin.apache.org
