August 6, 2015 www.ExigenServices.com
Apache Cassandra – Future without Boundaries
Part 1
I. RDBMS Pros and Cons
Pros
1. Good balance between functionality and usability; powerful tool support.
2. SQL has a feature-rich syntax.
3. A set of widely accepted standards.
4. ACID transactions.
Scalability
RDBMSs were the mainstream for decades, until:
• requirements for scalability increased dramatically;
• the complexity of the processed data structures increased dramatically.
Scaling
Two ways of scaling:
– Vertical scaling
– Horizontal scaling
CAP Theorem
Cons
Cost of distributed transactions
a) Lower availability. Two DBs with 99.9% availability each (about 43 min. downtime per month) give a combined availability of 99.9% * 99.9% ≈ 99.8% (about 86 min. downtime per month).
b) Additional synchronization overhead.
c) As slow as the slowest DB node plus network latency.
d) 2PC is a blocking protocol.
e) Resources can stay locked indefinitely.
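The availability arithmetic in (a) can be checked with a short sketch. The function names and the 30-day month are assumptions made for illustration, and the two databases are assumed to fail independently.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def combined_availability(*availabilities):
    """Availability of a transaction that needs every listed DB to be up.

    Assumes the databases fail independently of each other.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_minutes(availability):
    """Expected downtime per 30-day month for a given availability."""
    return (1.0 - availability) * MINUTES_PER_MONTH


single = 0.999
both = combined_availability(single, single)
print(f"one DB:  {single:.4%}, {downtime_minutes(single):.1f} min/month")
print(f"two DBs: {both:.4%}, {downtime_minutes(both):.1f} min/month")
```

Under these assumptions, requiring both databases roughly doubles the expected downtime.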
Cons
Use of master-slave replication.
• Makes the write side (master) a performance bottleneck and requires additional CPU/IO resources.
• There is no partition tolerance.
Sharding
a) Vertical sharding
b) Horizontal sharding
Vertical sharding
DB instances are divided by function.
Horizontal sharding
One table is divided across several resources.
Hash code sharding
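As a minimal sketch of hash code sharding, the snippet below picks a shard by hashing the row key and taking the result modulo the number of shards. The function name `shard_for`, the choice of SHA-256, and the shard count are assumptions for illustration, not something the slides prescribe.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


SHARDS = 4  # hypothetical number of DB instances
for user_id in ["alice", "bob", "carol"]:
    print(user_id, "-> shard", shard_for(user_id, SHARDS))
```

The same key always maps to the same shard, but changing the shard count remaps most keys, which is the problem the keyspace partitioning scheme described later is meant to reduce.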
Cassandra sharding
• Cassandra uses hash code load balancing.
• Cassandra is a better fit for reporting than for business logic processing.
• Cassandra + Hadoop == OLAP server with high performance and availability.
II. Apache Cassandra. Overview
Cassandra
Amazon Dynamo (architecture)
• DHT
• Eventual consistency
• Tunable trade-offs, tunable consistency
+
Google BigTable (data model)
• Values are structured and indexed
• Column families and columns
Distributed and decentralized
• No master/slave nodes (server symmetry)
• No single point of failure
DHT
Distributed hash table
• A lookup service similar to a hash table: it stores (key, value) pairs
• Any participating node can efficiently retrieve the value associated with a given key
Keyspace
An abstract keyspace, such as the set of 128-bit or 160-bit strings.
Partitioning
• A keyspace partitioning scheme splits ownership of this keyspace among the participating nodes.
Keyspace partitioning
• Keyspace distance function δ(k_1, k_2)
• A node with ID i_x owns all the keys k_m for which i_x is the closest ID, measured according to δ(k_m, i_x).
Keyspace partitioning
• Imagine mapping the range from 0 to 2^128 onto a circle so that the values wrap around.
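The ownership rule and the wrap-around can be sketched as follows. The clockwise distance used here is only one possible choice of δ, and the node IDs are hypothetical.

```python
RING_SIZE = 2 ** 128  # keyspace 0 .. 2^128 - 1, wrapping around


def distance(key, node_id):
    """Clockwise distance from key to node_id on the ring (one possible δ)."""
    return (node_id - key) % RING_SIZE


def owner(key, node_ids):
    """The node whose ID is closest to the key according to distance()."""
    return min(node_ids, key=lambda node_id: distance(key, node_id))


nodes = [10, 2 ** 100, 2 ** 127]  # hypothetical node IDs
print(owner(5, nodes))            # the next ID clockwise from 5
print(owner(2 ** 127 + 1, nodes)) # wraps past the top of the ring
```

With this choice of δ, a key just above the largest node ID wraps around and is owned by the node with the smallest ID.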
Keyspace partitioning
• Consider what happens if node C is removed: only the keys that C owned are reassigned, to a neighboring node.
Keyspace partitioning
• Consider what happens if node D is added: only the keys in the range that D takes over move, from a neighboring node.
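A small experiment can illustrate the effect of adding a node. This is a sketch, not Cassandra's implementation: positions on the ring come from a truncated SHA-256 hash, each key is owned by the next node clockwise, and the node names A to D follow the slide.

```python
import hashlib
from bisect import bisect_left


def token(name):
    """Hash a string to a 128-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:16], "big")


def build_ring(node_names):
    """Return the nodes as a list of (token, name) sorted by token."""
    return sorted((token(n), n) for n in node_names)


def owner(key_token, ring):
    """The owner is the first node at or after the key, wrapping around."""
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, key_token) % len(ring)
    return ring[index][1]


keys = [f"key-{i}" for i in range(1000)]
before = build_ring(["A", "B", "C"])
after = build_ring(["A", "B", "C", "D"])

moved = [k for k in keys if owner(token(k), before) != owner(token(k), after)]
print(len(moved), "of", len(keys), "keys changed owner")
# Every key that changed owner now belongs to the new node D.
print(all(owner(token(k), after) == "D" for k in moved))
```

Removing a node is the mirror image: the only keys that change owner are the ones the removed node used to own.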
Overlay network
• For any key k, each node either has a node ID that owns k or has a link to a node whose node ID is closer to k.
• Greedy algorithm: at each step, forward the message to the neighbor whose ID is closest to k.
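The greedy algorithm can be sketched on a tiny ring. The link layout below (both ring neighbors plus one long-range shortcut per node), the ring size, and the node IDs are assumptions chosen to keep the example small; real overlays pick their links differently.

```python
RING_SIZE = 64  # tiny keyspace so the example is easy to follow


def distance(a, b):
    """Symmetric distance between two points on the ring."""
    d = abs(a - b) % RING_SIZE
    return min(d, RING_SIZE - d)


def build_links(node_ids):
    """Link every node to its ring neighbors plus one long-range shortcut."""
    ordered = sorted(node_ids)
    count = len(ordered)
    links = {}
    for i, node in enumerate(ordered):
        links[node] = {
            ordered[i - 1],                      # predecessor
            ordered[(i + 1) % count],            # successor
            ordered[(i + count // 2) % count],   # shortcut across the ring
        }
        links[node].discard(node)
    return links


def route(start, key, links):
    """Hop to the neighbor closest to key until no neighbor is closer."""
    path = [start]
    current = start
    while True:
        best = min(links[current], key=lambda node: distance(node, key))
        if distance(best, key) >= distance(current, key):
            return path  # the current node is the closest one, so it owns key
        current = best
        path.append(current)


nodes = [3, 12, 20, 29, 41, 50, 58]  # hypothetical node IDs
links = build_links(nodes)
print(route(3, 40, links))  # [3, 29, 41]
```

Because every hop strictly reduces the distance to k, the walk always terminates, and with links to both ring neighbors it ends at a node whose ID is closest to k.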
Elastic scalability
• Adding or removing a node doesn’t require reconfiguring Cassandra, changing application queries, or restarting the system.
High availability and fault tolerance
• Cassandra picks A and P from CAP
• Eventual consistency
Tunable consistency
• Replication factor N (number of copies of each piece of data)
• Consistency level (number of replicas to access on every read/write operation)
Consistency level   Replicas accessed (read / write)
ONE                 1
QUORUM              N/2 + 1 (integer division)
ALL                 N
Quorum consistency level
R = N/2 + 1
W = N/2 + 1
R + W > N
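The quorum inequality can be checked for a range of replication factors. The helper name `replicas_required` is hypothetical; it only encodes the ONE, QUORUM and ALL levels from these slides, with N/2 taken as integer division.

```python
def replicas_required(level, n):
    """Replicas contacted for a consistency level, given replication factor n."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n // 2 + 1  # integer division
    if level == "ALL":
        return n
    raise ValueError(f"unknown consistency level: {level}")


for n in range(1, 8):
    r = w = replicas_required("QUORUM", n)
    # R + W > N means every read set shares at least one replica
    # with every write set, so a read sees the latest write.
    print(f"N={n}: R={r}, W={w}, R+W>N is {r + w > n}")
```

By contrast, reading and writing at ONE with N = 3 gives R + W = 2, which does not exceed N, so a read may miss the latest write until the replicas converge.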
Hybrid orientation
• Column orientation
– columns aren’t fixed
– columns can be sorted
– columns can be queried for a certain range
• Row orientation
– each row is uniquely identifiable by key
– columns are grouped into rows
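A minimal sketch of the two orientations together: a row is found by its key, and the columns inside it are kept in sorted order so that a range of column names can be sliced out. The row contents and the helper `column_slice` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# One row, identified by its key; column names here are dates.
rows = {
    "sensor-1": {
        "2015-08-01": "a",
        "2015-08-03": "b",
        "2015-08-06": "c",
        "2015-09-01": "d",
    }
}


def column_slice(row, start, end):
    """Return the (name, value) pairs whose names fall in [start, end]."""
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return [(name, row[name]) for name in names[lo:hi]]


print(column_slice(rows["sensor-1"], "2015-08-02", "2015-08-31"))
```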
Schema-free
• You don’t have to define columns when you create a data model.
• You think of the queries you will use and then model the data around them.
III. Data Model
Relational data model
Database
Table1
        Column1  Column2
Row1    value    value
Row2    null     value
…
Table2
        Column1  Column2  Column3
Row1    value    value    value
Row2    null     value    null
…
Cassandra data model
Keyspace
Column Family
RowKey1: Column1 = Value1, Column2 = Value2, Column3 = Value3
RowKey2: Column1 = Value1, Column4 = Value4
Keyspace
• A keyspace is roughly equivalent to a relational database.
• Basic attributes:
– replication factor
– replica placement strategy
– column families (tables in the relational model)
• It is possible to create several keyspaces per application (for example, if you need a different replica placement strategy or replication factor).
Column family
• A container for a collection of rows
• A column family is roughly equivalent to a table in the relational data model
Column Family
Row
RowKey: Column1 = Value1, Column2 = Value2, Column3 = Value3
Key-value store
Four-dimensional hash map
[Keyspace][ColumnFamily][RowKey][Column]
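The four-dimensional hash map can be pictured as nested dictionaries. This is only a mental model, not how Cassandra stores data; the keyspace, column family, and column names below are invented.

```python
from collections import defaultdict


def nested(depth):
    """Build a dictionary that auto-creates `depth` levels of nesting."""
    if depth == 1:
        return dict()
    return defaultdict(lambda: nested(depth - 1))


store = nested(4)  # [Keyspace][ColumnFamily][RowKey][Column]
store["Blog"]["Users"]["user-1"]["email"] = "alice@example.com"
store["Blog"]["Users"]["user-1"]["name"] = "Alice"
store["Blog"]["Users"]["user-2"]["name"] = "Bob"  # rows need not share columns

print(store["Blog"]["Users"]["user-1"]["email"])
print(sorted(store["Blog"]["Users"]["user-2"]))
```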
Column family vs. Table
• The columns are not strictly defined.
• A column family can hold columns or super columns (collections of subcolumns).
Column family vs. Table
• A column family has a comparator attribute.
• Each column family is stored in a separate file on disk.
Column
• The basic unit of the data structure
Column
name: byte[]
value: byte[]
clock: long
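The three fields can be written down as a small record type. The `newest` helper is an assumption added for illustration: it shows the common use of the clock, where the version with the higher value wins when two versions of the same column are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes   # byte[] in the slide
    value: bytes  # byte[] in the slide
    clock: int    # long in the slide


def newest(a, b):
    """Assumed conflict rule: of two versions of a column, the higher clock wins."""
    return a if a.clock >= b.clock else b


old = Column(b"email", b"alice@old.example", clock=1)
new = Column(b"email", b"alice@new.example", clock=2)
print(newest(old, new).value)
```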
Skinny and wide rows
• Wide rows: a huge number of columns and only a few rows (used to store lists of things)
• Skinny rows: a small number of columns and many different rows (close to the relational model)
Disadvantages of wide rows
• They work poorly with the row cache.
• If you have many rows and many columns, you end up with larger indexes (for example, ~40 GB of data and a 10 GB index).
Column sorting
• Column sorting is typically important only with the wide-row model.
• The comparator is an attribute of a column family that specifies how column names are compared for sort order.
Comparator types
• Cassandra has the following predefined types:
– AsciiType
– BytesType
– LexicalUUIDType
– IntegerType
– LongType
– TimeUUIDType
– UTF8Type
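To see why the choice of comparator matters, the sketch below sorts the same three column names two ways: by raw bytes, in the spirit of BytesType, and as signed 64-bit integers, in the spirit of LongType. It imitates the idea in plain Python and does not call any Cassandra API.

```python
import struct


def long_key(name):
    """Interpret an 8-byte column name as a signed big-endian integer."""
    return struct.unpack(">q", name)[0]


# Column names are stored as bytes; here they encode 256, 1 and -1.
names = [struct.pack(">q", n) for n in (256, 1, -1)]

by_bytes = sorted(names)               # raw byte order
by_long = sorted(names, key=long_key)  # numeric order

print([long_key(n) for n in by_bytes])  # [1, 256, -1]
print([long_key(n) for n in by_long])   # [-1, 1, 256]
```

The byte-wise order places -1 last because its encoding starts with 0xff, while the numeric order places it first.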
Super column
name: byte[]
cols: Map<byte[], Column>
• Stores a map of subcolumns
• Cannot store a map of super columns (only one level deep)
• Five-dimensional hash:
[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
Super column family
Column families:
– Standard (default)
  • Can combine columns and super columns
– Super
  • Stricter schema constraints
  • Can store only super columns
  • A subcomparator can be specified for subcolumns
Note that there are no joins in Cassandra, so you can:
– join data on the client side
– create a denormalized second column family
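Both options can be sketched with plain dictionaries standing in for column families. The column family names and fields are invented, and the lookups stand in for whatever client calls an application would actually make.

```python
users = {  # column family keyed by user ID
    "u1": {"name": "Alice"},
    "u2": {"name": "Bob"},
}
posts = {  # column family keyed by post ID
    "p1": {"author": "u1", "title": "Hello"},
    "p2": {"author": "u2", "title": "Rings"},
}

# Option 1: join on the client side with a second lookup per row.
joined = [
    {"title": post["title"], "author_name": users[post["author"]]["name"]}
    for post in posts.values()
]

# Option 2: a denormalized column family that already holds the joined
# data, written together with the post so that a read needs one lookup.
posts_with_author = {
    post_id: {**post, "author_name": users[post["author"]]["name"]}
    for post_id, post in posts.items()
}

print(joined)
print(posts_with_author["p1"])
```

The first option costs extra reads per query; the second costs extra writes and storage, and the copies have to be kept in sync by the application.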
