Apache Kylin’s Performance Boost from Apache HBase

Hongbin Ma, Luke Han
Kyligence Inc.

About us
Hongbin Ma | 马洪宾
 PMC member of Apache Kylin
 Technical partner of Kyligence Inc.
 mahongbin@apache.org
Kyligence Inc.
 Kyligence is a leading data intelligence company focused on Big Data technologies and
innovation, offering an intelligent platform and products powered by Apache Kylin™ for
enterprise-ready business analytics.
Luke Han | 韩卿
 Co-creator & VP of Apache Kylin
 ASF Member
 Co-founder & CEO at Kyligence Inc.
 lukehan@apache.org

Apache Kylin aerial view
[Diagram labels: BI Tools, Web App…; ANSI SQL; Kylin; MapReduce/Spark]

What is Apache Kylin
 Apache Kylin is an open source distributed analytics engine that
provides a SQL interface for multi-dimensional analysis on Hadoop
 Works well with extremely large datasets
 Provides REST API, ODBC, and JDBC interfaces
 Widely adopted by many companies like eBay, JD, Baidu, NetEase, VIP.com,
etc.
 Apache Kylin pre-calculates OLAP cubes with a horizontally scalable
computation framework (MapReduce, Spark, etc.) and stores the cubes
in a reliable and scalable data store (HBase, Cassandra, etc.)
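
To make the idea of pre-calculation concrete, here is a minimal in-memory sketch, not Kylin's actual cube builder. It assumes a tiny fact table held as a list of dicts and a single SUM measure; the names `build_cube`, `rows`, and the column names are hypothetical. Every subset of the dimensions is aggregated ahead of time, so a later query becomes a lookup instead of a scan over raw rows.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for cuboid in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                # Group key: the row's values for the dimensions in this cuboid.
                key = tuple(row[d] for d in cuboid)
                totals[key] += row[measure]
            cube[cuboid] = dict(totals)
    return cube


# Hypothetical fact table, used only for illustration.
rows = [
    {"country": "US", "seller": 1, "price": 10},
    {"country": "US", "seller": 2, "price": 5},
    {"country": "CN", "seller": 1, "price": 7},
]
cube = build_cube(rows, ["country", "seller"], "price")

# "SELECT country, SUM(price) ... GROUP BY country" is now a dictionary lookup.
by_country = cube[("country",)]
```

With N dimensions this produces 2^N groupings, which is why the build step runs on a scalable framework and the result is kept in a scalable store.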

Architecture Design
[Architecture diagram. Components: 3rd Party App (Web App, Mobile…); SQL-Based Tool (BI Tools: Tableau…); REST API; JDBC/ODBC; REST Server; Query Engine; Routing; Metadata; Cube Builder (MapReduce, Spark, etc…); DataSource Abstraction; Engine Abstraction; Storage Abstraction; Hadoop; Hive; OLAP Cubes (HBase). Data labels: Star Schema Data; Key Value Data. Latency label: Low Latency - Seconds. Legend: Online Analysis Data Flow; Offline Data Flow.]
 Clients/users interact with Kylin via SQL
 The OLAP cube is transparent to users

Cube data explained
[Figure labels: dimensions; cuboid; cuboid lattice]

Cubes stored in HBase
Let’s take a look at
cuboid (D1,D3,D5)
where all dimensions are:
(D1,D2,D3,D4,D5)
This cuboid is denoted as “cuboid 00010101”
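
The naming above can be read as a bitmask over the ordered dimension list. The sketch below is illustrative only: it assumes the first dimension maps to the highest bit and that the label is zero-padded to 8 digits, as in the slide's example. The function names are hypothetical, not Kylin APIs.

```python
ALL_DIMENSIONS = ["D1", "D2", "D3", "D4", "D5"]


def cuboid_id(selected, all_dims=ALL_DIMENSIONS):
    """Return a bitmask with one bit per dimension; the first dimension is the highest bit."""
    n = len(all_dims)
    mask = 0
    for pos, dim in enumerate(all_dims):
        if dim in selected:
            mask |= 1 << (n - 1 - pos)
    return mask


def cuboid_label(mask, width=8):
    """Render the bitmask as a zero-padded binary string (width 8 is an assumption)."""
    return format(mask, "0{}b".format(width))


label = cuboid_label(cuboid_id({"D1", "D3", "D5"}))
```

Here D1, D3 and D5 set bits 4, 2 and 0, giving binary 10101, which pads to the label used on the slide.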

Why HBase as the first choice?
 Well integrated with Hadoop
 Block encoding to reduce storage footprint
 Good at both seeking and scanning
 Coprocessors to move computation to data
 Scalable and flexible as a data store

How Kylin queries HBase
[Diagram: Kylin Query Server; region server; region; coprocessor]
1. Filter/Aggregation push down
2. Scan with Fuzzy Key Filter
3. Half baked results
Row layout: CuboidID | SellerID | Date | Country | Metrics…
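
The three steps can be simulated in plain Python. This is a sketch under stated assumptions, not HBase or Kylin code: a row key is modelled as a tuple (CuboidID, SellerID, Date, Country), a fuzzy pattern uses `None` for positions that may hold any value, and `region_scan` stands in for the work a coprocessor would do inside one region. All names and data are hypothetical.

```python
from collections import Counter


def fuzzy_match(key, pattern):
    """True when every non-None pattern field equals the key field at that position."""
    return all(p is None or p == k for k, p in zip(key, pattern))


def region_scan(region_rows, pattern, group_index):
    """Region side: filter by fuzzy key, then aggregate partially (a half-baked result)."""
    partial = Counter()
    for key, metric in region_rows:
        if fuzzy_match(key, pattern):
            partial[key[group_index]] += metric
    return partial


def merge(partials):
    """Query-server side: combine the partial aggregates from all regions."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


# Two hypothetical regions; keys are (CuboidID, SellerID, Date, Country).
region_a = [((7, 1, "2016-05-01", "US"), 10), ((7, 2, "2016-05-01", "CN"), 4)]
region_b = [((7, 1, "2016-05-02", "US"), 3), ((7, 3, "2016-05-01", "US"), 8),
            ((6, 1, "2016-05-01", "US"), 100)]

# Cuboid 7, any seller, any date, Country = US; group by SellerID (index 1).
pattern = (7, None, None, "US")
result = merge(region_scan(r, pattern, 1) for r in (region_a, region_b))
```

Seller 1 appears in both regions, so its two partial sums still have to be merged at the query server. That merge cost is what grows when regions return many half-baked rows.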

May still be slow when
 The cuboid is large because it contains a very large number of dimension-value combinations
 The cuboid layout does not suit the query, e.g. the query filters on suffix dimensions while
grouping by prefix dimensions
 The filter in the query is huge and complex
 Regions return too many half-baked results

Solution: Cube + MPP
[Diagram: Kylin Query Server]

Novelty
 Compared with “pure” MPP solutions
 Cube data is more query-friendly because it is pre-aggregated and sorted.
 Faster speed
 Less CPU consumption
 Less storage read
 Able to leverage column storage and inverted index just like typical MPP
 Compared with “pure” Cubing technologies
 Overcome the bottleneck in cube size
 Overcome the bottleneck in cube visiting speed

Problem
 The sizes of different cuboids in the same cube may vary
 Too much parallelism for small cuboids is harmful
 An RPC is required for each shard, and we don’t want to waste network/CPU
resources

Solution: Shard Circle
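[Figure: shards 0 to 9 arranged in a circle]
Given the estimated size $S_i$ of each cuboid $i$
and the expected size $S_r$ of each region (specified by the modeler):
$$\mathit{regionNum} = \frac{\sum_i S_i}{S_r}$$
$$\mathit{cuboidRegionNum} = \frac{S_i \cdot \mathit{factor}}{S_r}$$
$$\mathit{cuboidCircleStart} = \mathit{hash}(i) \bmod \mathit{regionNum}$$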
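
The formulas can be turned into a small sketch of how cuboids might be laid out on the circle. Several details here are assumptions that the slide does not state: the total region count is taken over the summed cuboid sizes, counts are rounded up, each cuboid gets at least one shard and at most `regionNum`, shards are consecutive positions on the circle starting at `cuboidCircleStart`, and CRC32 stands in for the hash. `shard_circle` is a hypothetical name.

```python
import math
import zlib


def shard_circle(cuboid_sizes, region_size, factor=1.0):
    """Return (region_num, {cuboid: [shard ids]}) for the given size estimates."""
    total = sum(cuboid_sizes.values())
    region_num = max(1, math.ceil(total / region_size))
    layout = {}
    for cuboid, size in cuboid_sizes.items():
        # Small cuboids get few shards; large ones are capped at the circle size.
        count = max(1, math.ceil(size * factor / region_size))
        count = min(count, region_num)
        start = zlib.crc32(str(cuboid).encode()) % region_num
        # Walk the circle from the start position, wrapping around at region_num.
        layout[cuboid] = [(start + k) % region_num for k in range(count)]
    return region_num, layout


# Hypothetical size estimates, all in the same unit as region_size.
sizes = {"cuboid_A": 50, "cuboid_B": 5, "cuboid_C": 45}
region_num, layout = shard_circle(sizes, region_size=10)
```

Hashing the start position spreads small cuboids around the circle, so they do not all pile onto shard 0.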

Salted Cuboid Rows
 ShardID at the beginning of row key
 Configurable policies for computing ShardID
 From the hash of the remaining row key: gives a randomized, even distribution
 From specific dimension values: improves runtime performance
Row layout: ShardID | CuboidID | SellerID | Date | Country | Metrics…
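
A minimal sketch of the two ShardID policies follows. The byte layout is an assumption made for illustration (a one-byte ShardID and a one-byte CuboidID, followed by the concatenated dimension values), CRC32 stands in for whatever hash is really used, and `salted_row_key` is a hypothetical helper rather than a Kylin function.

```python
import zlib


def salted_row_key(cuboid_id, dims, shard_count, shard_by=None):
    """Build ShardID + CuboidID + dimension values.

    dims is an ordered list of (name, value-as-bytes) pairs.
    shard_by=None hashes the whole remaining key; otherwise only the named dimension.
    """
    rest = bytes([cuboid_id]) + b"".join(value for _, value in dims)
    if shard_by is None:
        shard = zlib.crc32(rest) % shard_count
    else:
        shard = zlib.crc32(dict(dims)[shard_by]) % shard_count
    return bytes([shard]) + rest


day1 = [("SellerID", b"0042"), ("Date", b"20160501"), ("Country", b"US")]
day2 = [("SellerID", b"0042"), ("Date", b"20160502"), ("Country", b"US")]

# Policy 2: both rows of seller 0042 receive the same ShardID prefix.
key1 = salted_row_key(7, day1, shard_count=10, shard_by="SellerID")
key2 = salted_row_key(7, day2, shard_count=10, shard_by="SellerID")

# Policy 1: the ShardID depends on the full remaining key.
key3 = salted_row_key(7, day1, shard_count=10)
```

Under the dimension-based policy all rows of one seller stay in one shard. The hash-of-key policy gives up that locality in exchange for a more even spread.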

Compute ShardID from SellerID
 For queries that group by SellerID
 Each shard aggregates a disjoint subset of SellerIDs
 No further aggregation is needed at the merge side
 For queries that filter by SellerID
 The pushed-down SellerID filter can be trimmed per shard to contain only the
SellerIDs of interest to that shard
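
Both effects can be sketched in a few lines. The sharding rule `seller_id % shard_count` is a hypothetical stand-in for the real policy, and the function names are illustrative. Because each SellerID lives in exactly one shard, an IN-list can be split per shard, and per-shard GROUP BY results can be concatenated without re-aggregation.

```python
def shard_of(seller_id, shard_count):
    """Hypothetical policy: the shard is derived from SellerID alone."""
    return seller_id % shard_count


def trim_filter(seller_ids, shard_count):
    """Split a SellerID IN-list into per-shard lists; shards with no match are absent."""
    per_shard = {}
    for sid in seller_ids:
        per_shard.setdefault(shard_of(sid, shard_count), []).append(sid)
    return per_shard


def group_by_seller(shards):
    """Aggregate inside each shard, then concatenate the per-shard results."""
    result = {}
    for rows in shards:
        partial = {}
        for sid, metric in rows:
            partial[sid] = partial.get(sid, 0) + metric
        # Key sets are disjoint across shards, so update() never overwrites a value.
        result.update(partial)
    return result


# Hypothetical (SellerID, metric) rows, placed into 4 shards by SellerID.
rows = [(1, 10), (5, 2), (2, 7), (1, 3), (8, 4)]
shards = [[r for r in rows if shard_of(r[0], 4) == s] for s in range(4)]

totals = group_by_seller(shards)
trimmed = trim_filter([1, 2, 5, 8], shard_count=4)
```

In this example shard 3 holds none of the requested sellers, so it receives no filter entries and can be skipped.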

Small cuboids get fewer shards
[Bar chart. Series: 13 regions, 23 regions. Y-axis ticks from 0 to 1.2.]
SQL 1: 1.005586592
SQL 2: 0.625
SQL 3: 0.625
SQL 4: 0.678571429
SQL 5: 0.794117647

Q & A
To get more information about Apache Kylin:
 Apache Kylin Website: http://kylin.apache.org
 Kyligence Website: http://kyligence.io
 Twitter: @ApacheKylin
 Mailing list: dev@kylin.apache.org
