Apache Cassandra - Future without Boundaries (Part 1)

August 6, 2015 www.ExigenServices.com

I. RDBMS Pros and Cons

Pros
1. Good balance between functionality and usability. Powerful tool support.
2. SQL has a feature-rich syntax.
3. A set of widely accepted standards.
4. ACID transactions.

Scalability
RDBMSs were mainstream for decades, until
• requirements for scalability increased dramatically;
• the complexity of processed data structures increased dramatically.

Scaling
Two ways of scaling:
– Vertical scaling
– Horizontal scaling

CAP Theorem

Cons
Cost of distributed transactions
a) Lower availability. Two DBs with 99.9% availability each have a combined availability of
99.9% * 99.9% ≈ 99.8% (about 86 min. of downtime per month, versus about 43 min. for a single DB).
b) Additional synchronization overhead.
c) As slow as the slowest DB node plus network latency.
d) 2PC is a blocking protocol.
e) It is possible to lock resources forever.
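
The availability arithmetic in point a) can be checked with a short sketch. It assumes the two databases fail independently and that a month has 30 days; the function names are illustrative only.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of a transaction that needs every listed database to be up.

    Assumes failures are independent, so the probabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Expected downtime in minutes over a month of `days` days."""
    return (1.0 - availability) * days * 24 * 60


single = 0.999
both = combined_availability(single, single)

print(f"combined availability: {both:.4%}")
print(f"downtime, one DB:  {downtime_minutes_per_month(single):.1f} min/month")
print(f"downtime, two DBs: {downtime_minutes_per_month(both):.1f} min/month")
```

A single 99.9% database is down about 43 minutes per month; a transaction that needs two such databases roughly doubles that.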

Cons
Use of master-slave replication.
• Makes the write side (master) a performance bottleneck and requires additional CPU/IO resources.
• There is no partition tolerance.

Sharding
a) Vertical sharding
b) Horizontal sharding

Vertical sharding
DB instances are divided by DB function.

Horizontal sharding
One table is divided across several resources.
Hashcode sharding
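
A minimal sketch of hash-code sharding, assuming the shard is chosen as hash(key) modulo the number of shards. The shard count, the key format, and the use of SHA-256 are assumptions made for illustration, not details taken from the slides.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard by hashing the row key and taking it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


SHARD_COUNT = 4  # hypothetical number of database servers
user_ids = [f"user-{i}" for i in range(1000)]  # hypothetical row keys

placement = {uid: shard_for(uid, SHARD_COUNT) for uid in user_ids}
rows_per_shard = [list(placement.values()).count(s) for s in range(SHARD_COUNT)]
print("rows per shard:", rows_per_shard)

# Adding a fifth shard changes the modulus, so most keys land on a different shard.
moved = sum(1 for uid in user_ids if shard_for(uid, SHARD_COUNT + 1) != placement[uid])
print("keys that move when a shard is added:", moved, "of", len(user_ids))
```

The hash spreads rows evenly, but with plain modulo placement a change in the number of shards relocates most of the data.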

Cassandra sharding
• Cassandra uses hash-code load balancing.
• Cassandra is a better fit for reporting than for business logic processing.
• Cassandra + Hadoop == OLAP server with high performance and availability.

II. Apache Cassandra. Overview

Cassandra
Amazon Dynamo (architecture)
• DHT
• Eventual consistency
• Tunable trade-offs, tunable consistency
+
Google BigTable (data model)
• Values are structured and indexed
• Column families and columns

Distributed and decentralized
• No master/slave nodes (server symmetry)
• No single point of failure

DHT
Distributed hash table
• A lookup service similar to a hash table: it stores (key, value) pairs
• Any participating node can efficiently retrieve the value associated with a given key

Keyspace
An abstract keyspace, such as the set of 128-bit or 160-bit strings.

Partitioning
• A keyspace partitioning scheme splits ownership of this keyspace among the participating nodes.

Keyspace partitioning
• Keyspace distance function δ(k1, k2)
• A node with ID i_x owns all the keys k_m for which i_x is the closest ID, measured according to δ(k_m, i_x).

• Imagine mapping the range from 0 to 2^128 onto a circle so the values wrap around.

• Consider what happens if node C is removed

• Consider what happens if node D is added
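
The effect of removing node C or adding node D can be sketched with a small ring. This sketch assumes δ is the clockwise distance from a key to a node, so a key belongs to the first node at or after its position; the node names follow the slides, while the key names and the truncated SHA-256 hash are assumptions.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a position on the ring [0, 2**128)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:16], "big")  # keep 128 bits


class Ring:
    def __init__(self, nodes):
        # Sorted (token, node) pairs represent node positions on the circle.
        self._entries = sorted((token(node), node) for node in nodes)

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's token, wrapping around."""
        idx = bisect.bisect_left(self._entries, (token(key), ""))
        return self._entries[idx % len(self._entries)][1]


keys = [f"key-{i}" for i in range(1000)]  # hypothetical keys

base = Ring(["A", "B", "C"])
without_c = Ring(["A", "B"])
with_d = Ring(["A", "B", "C", "D"])

before = {k: base.owner(k) for k in keys}
after_removal = {k: without_c.owner(k) for k in keys}
after_add = {k: with_d.owner(k) for k in keys}

# Keys whose owner changed in each scenario.
moved_on_removal = [k for k in keys if before[k] != after_removal[k]]
moved_on_add = [k for k in keys if before[k] != after_add[k]]

print("moved when C is removed:", len(moved_on_removal))
print("moved when D is added:  ", len(moved_on_add))
```

In both cases only the keys of the affected node change owner: keys that C owned go to its clockwise neighbor, and D takes over part of one neighbor's range. All other keys stay where they were.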

Overlay network
• For any key k, each node either has a node ID that owns k or has a link to a node whose node ID is closer to k
• Greedy algorithm: at each step, forward the message to the neighbor whose ID is closest to k
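
The greedy forwarding rule can be illustrated on a toy ring. Everything concrete here is an assumption made for the sketch: a 64-position ring, six node IDs, a symmetric circular distance for δ, and three links per node (predecessor, successor, and the node halfway around the ring).

```python
RING_SIZE = 64
NODE_IDS = [3, 12, 25, 40, 51, 60]  # hypothetical node IDs


def distance(a: int, b: int) -> int:
    """Shortest distance between two positions on the circle."""
    d = abs(a - b) % RING_SIZE
    return min(d, RING_SIZE - d)


def build_links(node_ids):
    """Give every node links to its predecessor, its successor, and one far node."""
    ordered = sorted(node_ids)
    n = len(ordered)
    return {
        node: (ordered[(i - 1) % n], ordered[(i + 1) % n], ordered[(i + n // 2) % n])
        for i, node in enumerate(ordered)
    }


def route(key: int, start: int, links) -> list:
    """Forward greedily until no neighbor is closer to the key than the current node."""
    path = [start]
    current = start
    while True:
        best = min(links[current], key=lambda node_id: distance(key, node_id))
        if distance(key, best) >= distance(key, current):
            return path  # the current node owns the key
        current = best
        path.append(current)


links = build_links(NODE_IDS)
print(route(30, 3, links))
```

Each hop strictly reduces the distance to the key, so the walk always terminates. Because every node links to both ring neighbors, it terminates at the node whose ID is closest to the key.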

Elastic scalability
• Adding or removing a node doesn’t require reconfiguring Cassandra, changing application queries, or restarting the system

High availability and fault tolerance
• Cassandra picks A and P from CAP
• Eventual consistency

Tunable consistency
• Replication factor (number of copies of each piece of data)
• Consistency level (number of replicas to access on every read/write operation)

Consistency level   Replicas accessed per read / write
ONE                 1 replica
QUORUM              N/2 + 1 replicas (integer division)
ALL                 N replicas

N is the replication factor.

Quorum consistency level
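R = N/2 + 1 (replicas read, integer division)
W = N/2 + 1 (replicas written, integer division)
R + W > N, so every read overlaps the latest write in at least one replica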
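
The R + W > N rule can be checked by brute force for small replication factors. This is a sketch: the function names are illustrative, and replicas are modeled simply as the integers 0..N-1.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Quorum size for replication factor n: N/2 + 1 with integer division."""
    return n // 2 + 1


def replicas_for(level: str, n: int) -> int:
    """Number of replicas touched at a given consistency level."""
    return {"ONE": 1, "QUORUM": quorum(n), "ALL": n}[level]


def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


for n in (3, 5):
    q = replicas_for("QUORUM", n)
    print(f"N={n}: quorum={q}, R+W>N is {q + q > n}, overlap={always_overlap(n, q, q)}")

# Reading and writing at ONE with N=3 gives R + W = 2, so a read can miss the write.
print("ONE/ONE, N=3:", always_overlap(3, 1, 1))
```

Whenever R + W > N, any read set and any write set share at least one replica, which is why quorum reads see quorum writes.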

Hybrid orientation
• Column orientation
– columns aren’t fixed
– columns can be sorted
– columns can be queried for a certain range
• Row orientation
– each row is uniquely identifiable by key
– columns are grouped into rows

Schema-free
• You don’t have to define columns when you create a data model
• You think of the queries you will use and then organize the data around them

III. Data Model

Relational data model
Database: Table1, Table2

Table1
        Column1  Column2
Row1    value    value
Row2    null     value
…

Table2
        Column1  Column2  Column3
Row1    value    value    value
Row2    null     value    null
…

Cassandra data model
[Figure: a keyspace contains a column family. Row RowKey1 holds Value1, Value2, Value3; row RowKey2 holds Column1 = Value1 and Column4 = Value4.]

Keyspace
• A keyspace is close to a relational database
• Basic attributes:
– replication factor
– replica placement strategy
– column families (tables in the relational model)
• It is possible to create several keyspaces per application (for example, if you need a different replica placement strategy or replication factor)

Column family
• A container for a collection of rows
• A column family is close to a table in the relational data model
[Figure: a column family contains rows; each row has a RowKey and the column values Value1, Value2, Value3.]

Key-value store
Four-dimensional hash map
[Keyspace][ColumnFamily][RowKey][Column]
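
The four-dimensional hash map view can be sketched with nested dictionaries. The keyspace, column family, row keys, and column names below are hypothetical, and the sketch ignores replication, sorting, and storage.

```python
from typing import Dict

# [Keyspace][ColumnFamily][RowKey][Column] -> value
Store = Dict[str, Dict[str, Dict[str, Dict[str, str]]]]
store: Store = {}


def put(keyspace: str, column_family: str, row_key: str, column: str, value: str) -> None:
    """Create the intermediate maps on demand, then set the column value."""
    row = store.setdefault(keyspace, {}).setdefault(column_family, {}).setdefault(row_key, {})
    row[column] = value


def get(keyspace: str, column_family: str, row_key: str, column: str) -> str:
    """Look up a value by walking the four levels."""
    return store[keyspace][column_family][row_key][column]


# Rows in the same column family do not need the same set of columns.
put("Shop", "Users", "alice", "email", "alice@example.com")
put("Shop", "Users", "alice", "city", "Riga")
put("Shop", "Users", "bob", "phone", "555-0100")

print(get("Shop", "Users", "alice", "city"))
print(sorted(store["Shop"]["Users"]["bob"]))
```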

Column family vs. Table
• The columns are not strictly defined
• A column family can hold columns or super columns (collections of subcolumns)
• A column family has a comparator attribute
• Each column family is stored in a separate file on disk

Column
• The basic unit of the data structure
Column: name: byte[], value: byte[], clock: long
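
The column triple can be sketched as a small record type. The `reconcile` helper is an assumption added for illustration: it shows the common last-write-wins use of the clock, where the version with the higher timestamp is kept, and it resolves ties arbitrarily in favor of the first argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    clock: int  # write timestamp supplied by the client


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version with the higher clock (ties keep `a` in this sketch)."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return b if b.clock > a.clock else a


# Two versions of the same column written at different times (hypothetical values).
old = Column(b"email", b"ann@old.example", clock=1_438_819_200_000_000)
new = Column(b"email", b"ann@new.example", clock=1_438_819_260_000_000)

winner = reconcile(old, new)
print(winner.value)
```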

Skinny and wide rows
• Wide rows: a huge number of columns and only a few rows (used to store lists of things)
• Skinny rows: a small number of columns and many different rows (close to the relational model)

Disadvantages of wide rows
• Work poorly with the row cache
• If you have many rows and many columns, you end up with larger indexes (for example, ~40 GB of data and a 10 GB index)

Column sorting
• Column sorting is typically important only with the wide-row model
• Comparator: an attribute of a column family that specifies how column names are compared for sort order

Comparator types
• Cassandra has the following predefined types:
– AsciiType
– BytesType
– LexicalUUIDType
– IntegerType
– LongType
– TimeUUIDType
– UTF8Type
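
The choice of comparator changes the order in which the same column names come back. The sketch below uses simplified stand-ins rather than Cassandra's own classes: raw byte order stands in for BytesType, and decoding each name as an 8-byte signed big-endian integer stands in for LongType.

```python
def long_name(n: int) -> bytes:
    """Encode an integer column name as 8 bytes, big-endian, signed."""
    return n.to_bytes(8, "big", signed=True)


def as_long(name: bytes) -> int:
    """Decode a column name written by long_name()."""
    return int.from_bytes(name, "big", signed=True)


names = [long_name(10), long_name(2), long_name(-1)]

# Stand-in for BytesType: compare the raw bytes lexicographically.
by_bytes = sorted(names)

# Stand-in for LongType: compare the decoded numeric values.
by_long = sorted(names, key=as_long)

print("byte order:", [as_long(n) for n in by_bytes])
print("long order:", [as_long(n) for n in by_long])
```

In raw byte order the negative name sorts last because its leading byte is 0xFF. The numeric comparison puts it first, which is what a range query over numeric column names would expect.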

Super column
• Stores a map of subcolumns
Super column: name: byte[], cols: Map<byte[], Column>
• Cannot store a map of super columns (only one level deep)
• Five-dimensional hash:
[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]

Super column family
Column families:
– Standard (default)
  • Can combine columns and super columns
– Super
  • Stricter schema constraints
  • Can store only super columns
  • A subcomparator can be specified for subcolumns

Note that
There are no joins in Cassandra, so you can
– join data on the client side
– create a denormalized second column family
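
Both options can be sketched with plain dictionaries standing in for column families. The `users` and `orders` data and the `orders_by_user` layout are hypothetical examples, not taken from the slides.

```python
# Two column families modeled as {row_key: {column: value}}.
users = {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}}
orders = {
    "o1": {"user": "u1", "item": "book"},
    "o2": {"user": "u2", "item": "lamp"},
    "o3": {"user": "u1", "item": "pen"},
}


def join_on_client(orders, users):
    """Option 1: read both column families and combine the rows in application code."""
    return [
        {"order": order_id, "item": order["item"], "user_name": users[order["user"]]["name"]}
        for order_id, order in sorted(orders.items())
    ]


def build_orders_by_user(orders, users):
    """Option 2: maintain a second, denormalized column family keyed by user.

    Each row is a wide row: one column per order, with the user name copied in.
    """
    by_user = {}
    for order_id, order in sorted(orders.items()):
        row = by_user.setdefault(order["user"], {})
        row[order_id] = f'{users[order["user"]]["name"]}:{order["item"]}'
    return by_user


joined = join_on_client(orders, users)
orders_by_user = build_orders_by_user(orders, users)

print(joined[0])
print(orders_by_user["u1"])
```

The client-side join costs extra reads on every query. The denormalized column family answers the query with a single row lookup but has to be updated on every write.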
