2: Data Model
Creating an Efficient Data Model
for highly-loaded Applications
~ 10,300 Registrations https://lime.link/blog/visualizing-crowd-sizes/
DataStax Developer Advocacy Special Unit
Your hosts David Jones-Gilardi
Developer Advocate
Bettina Swynnerton Jack Fryer
Community Engineer Community Manager
Cédrick Lunven Aleksandr Volochnev
Developer Advocate Developer Advocate
Your hosts Aleksandr Volochnev
Developer Advocate
Bettina Swynnerton Jack Fryer
Community Engineer Community Manager
Cédrick Lunven Erick Ramirez
Developer Advocate Lord of the Cassandra Rings
Cassandra Workshop SERIES
PART 1: How to build Applications with Cassandra
● Week 1 (7/01-7/08) : Getting Started with Cassandra
● Week 2 (7/08-7/15) : Data Modelling for Apache Cassandra™
● Week 3 (7/15-7/22) : Application Development, Backend Services and CRUD
● Week 4 (7/22-7/29) : Application Development, Microservices and REST API
PART 2: Test, Deploy and monitor your clusters and applications
● Week 5 (7/29-8/05): Operating your Cassandra clusters
● Week 6 (8/05-8/12): Running Cassandra performance tests
● Week 7 (8/12-8/19): Testing your deployments and troubleshooting
● Week 8 (8/19-8/26): Deploying Cassandra with Kubernetes
How does it work ?
Attend one of the 2 LIVE STREAMED workshop
1 (Wednesday or Thursday) Choose the one what matches your
timezone.
Courses: youtube.com/DataStaxDevs Runtime: dtsx.io/workshop
YouTube
Twitch
Questions: bit.ly/cassandra-workshop Materials: github.com/DataStax-Academy
Discord
YouTube
Quizz: menti.com Coding (starting week #3)
How does it work ?
Attend one of the 2 LIVE STREAMED workshop
1 (Wednesday or Thursday) -> They are recorded
2 Do Homeworks
Forum: community.datastax.com Training: academy.datastax.com
Chat: bit.ly/cassandra-workshop Validate your week:
Discord
How does it work ?
Attend one of the 2 LIVE STREAMED workshop
1 (Wednesday or Thursday) -> They are recorded
2 Homeworks
3 Relax.
Developer Workshop Series Week II
➔ Keyspaces, Tables, Partitions
➔ The Art of Data Modelling
➔ Data Types
➔ What’s NEXT?
Cassandra Cloud-Native Workshop Series
menti.com
89 49 47
Developer Workshop Series Week II
➔ Keyspaces, Tables, Partitions
➔ The Art of Data Modelling
➔ Data Types
➔ What’s NEXT?
Cassandra Cloud-Native Workshop Series
Infrastructure: a Node
A single bare-metal
server, a virtual instance
or a docker container.
Infrastructure: a Datacenter (Ring)
A group of nodes located
in the same physical
location, a cloud
datacenter or an
availability zone.
Infrastructure: a Cluster
A group of
datacenters configured to
work together.
Data Structure: a Cell
An intersect of a row and
a column, stores data.
Data Structure: a Row
A single, structured data
item in a table.
Data Structure: a Partition
A group of rows having the ID First Name Last Name Department
same partition token, a base unit
of access in Cassandra. 1 John Doe Wizardry
IMPORTANT: stored together, all 399 Marisha Chapez Wizardry
the rows are guaranteed to be
neighbours. 415 Maximus Flavius Wizardry
Data Structure: a Table
ID First Name Last Name Department
A group of columns and 1 John Doe Wizardry
rows storing partitions.
2 Mary Smith Dark Magic
3 Patrick McFadin DevRel
Data Structure: a Keyspace
A group of tables sharing
replication strategy,
replication factor and other
properties
Data Structure: Overall
Keyspace columns
● Tabular data model, with one twist
Table
● Keyspaces contain tables
- - - -
● Tables are organized in rows and columns
- - -
x ● Groups of related rows called partitions are stored
partitions - - - together on the same node (or nodes)
● Each row contains a partition key
- - -
○ One or more columns that are hashed to
y - - -
determine which node(s) store that data
- - -
z - - -
rows
Partition key
Example Data: Users organized by city
Keyspace users
Table users_by_city
Last First
City Address Email
Name Name
Hellson Kevin 23 Jackson St. kevin@gmail.com
Phoenix Lastfall Norda 3 Stone St dino@gmail.com
partitions Smith Jana 3 Stone St wilm@gmail.com
Franklin George 2 Star St gjet@gmail.com
rows
Seattle Jackson Jane 2 Star St jane@gmail.com
Jasons Judy 2 StarSt judy@gmail.com
Partition key column Clustering columns Data columns
Creating a Keyspace in CQL
keyspace replication strategy
CREATE KEYSPACE users
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'us-west-1' : 3,
'eu-central-1' : 3
};
Replication factor by data center
Creating a Table in CQL
keyspace table
CREATE TABLE users.users_by_city (
city text,
column last_name text,
definitions first_name text,
address text,
email text,
PRIMARY KEY ((city), last_name, first_name, email));
Primary key Partition key Clustering columns
Primary Key
CREATE TABLE users.users_by_city (
city text,
An identifier for a row. Consists
last_name text,
of at least one Partition Key and first_name text,
zero or more Clustering address text,
Columns. email text,
PRIMARY KEY ((city), last_name, first_name, email));
MUST ENSURE UNIQUENESS.
MAY DEFINE SORTING.
Partition key Clustering columns
Good Examples:
PRIMARY KEY ((city), last_name, first_name, email);
PRIMARY KEY (user_id);
Bad Example:
PRIMARY KEY ((city), last_name, first_name);
Partition Key
CREATE TABLE users.users_by_city (
city text,
An identifier for a partition. last_name text,
Consists of at least one column, first_name text,
address text,
may have more if needed
email text,
PRIMARY KEY ((city), last_name, first_name, email));
PARTITIONS ROWS.
Partition key Clustering columns
Good Examples:
PRIMARY KEY (user_id);
PRIMARY KEY ((video_id), comment_id);
Bad Example:
PRIMARY KEY ((sensor_id), logged_at);
Clustering Column(s)
CREATE TABLE users.users_by_city (
city text,
Used to ensure uniqueness last_name text,
and sorting order. Optional. first_name text,
address text,
email text,
PRIMARY KEY ((city), last_name, first_name, email));
Partition key Clustering columns
PRIMARY KEY ((city), last_name, first_name); Not Unique
PRIMARY KEY ((city), last_name, first_name, email);
PRIMARY KEY ((video_id), comment_id); Not Sorted
PRIMARY KEY ((video_id), created_at, comment_id);
Partition: The Beginning
CREATE TABLE users.users_by_city (
city text,
last_name text,
first_name text,
address text,
email text,
PRIMARY KEY ((city), last_name, first_name, email));
● Every node is responsible for a range of tokens (0-100500, 100501-201000...)
● INSERT a new row, we get the value of its Partition Key (can’t be null!)
● We hash this value using MurMur3 hasher http://murmurhash.shorelabs.com/
“Seattle” becomes 2466717130 Partition Key = Seattle, Partition Token = 2466717130
● This partition belongs to the node[s] responsible for this token
● The INSERT query goes to the nodes storing this partition (Notice Replication Factor)
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
PRIMARY KEY (user_id);
PRIMARY KEY ((video_id), comment_id);
PRIMARY KEY ((country), user_id);
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
Example: open a video? Get the comments in a single query!
PRIMARY KEY ((video_id), created_at, comment_id);
PRIMARY KEY ((comment_id), created_at);
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
PRIMARY KEY ((video_id), created_at, comment_id);
PRIMARY KEY ((country), user_id);
● No technical limitations, but…
● Up to ~100k rows in a partition
● Up to ~100MB in a Partition
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions?
● Avoid hot partitions
Example: a huge IoT infrastructure, hardware all
● Sensor ID: UUID
over the world, different sensors reporting their state
● Timestamp: Timestamp
every 10 seconds. Every sensor reports its UUID,
● Value: float
timestamp of the report, sensor’s value.
PRIMARY KEY ((sensor_id), reported_at);
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big and constantly growing partitions!
● Avoid hot partitions
Example: a huge IoT infrastructure, hardware all over the
● Sensor ID: UUID
world, different sensors reporting their state every 10
● Timestamp: Timestamp
seconds. Every sensor reports its UUID, timestamp of the
● Value: float
report, sensor’s value.
PRIMARY KEY ((sensor_id), reported_at);
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big and constantly growing partitions!
● Avoid hot partitions
Example: a huge IoT infrastructure, hardware all over the
BUCKETING
● Sensor ID - UUID
world, different sensors reporting their state every 10
● Timestamp:
seconds. Every sensor reports its UUID, timestamp of the
● Value: float
report, sensor’s value.
PRIMARY KEY ((sensor_id), reported_at);
PRIMARY KEY ((sensor_id, _____), reported_at);
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big and constantly growing partitions!
● Avoid hot partitions BUCKETING
Example: a huge IoT infrastructure, hardware all over the
● Sensor ID: UUID
world, different sensors reporting their state every 10
● MonthYear: Integer or String
seconds. Every sensor reports its UUID, timestamp of the
● Timestamp: Timestamp
report, sensor’s value.
● Value: float
PRIMARY KEY ((sensor_id), reported_at);
PRIMARY KEY ((sensor_id, _____), reported_at);
PRIMARY KEY ((sensor_id, month_year), reported_at);
Rules of a Good Partition The Slide of the Year Award!
● Store together what you retrieve together
● Avoid big partitions
● Avoid hot partitions
PRIMARY KEY (user_id);
PRIMARY KEY ((video_id), created_at, comment_id);
PRIMARY KEY ((country), user_id);
Always specify the partition key!
This?
If there is no partition key in
a query, which node you will
ask?
Or this?
Always specify the partition key! Or maybe that?!
If there is no partition key in
a query, which node you will
ask?
Always specify the partition key!
If there is no partition key in
a query, which node you will This?
ask?
CREATE TABLE users.users_by_city (
city text,
last_name text,
first_name text,
address text,
email text,
PRIMARY KEY ((city), last_name, first_name, email)); Or this?
SELECT address FROM users_by_city WHERE first_name = “Anna”;
SELECT address FROM users_by_city WHERE city = “Otterberg” AND last_name = “Koshkina”;
Developer Workshop Series Week II
➔ Keyspaces, Tables, Partitions
➔ The Art of Data Modelling
➔ Data Types
➔ What’s NEXT?
Cassandra Cloud-Native Workshop Series
Relational Data
Modelling Data
1. Analyze raw data
2. Identify entities, their
properties and relations Models
3. Design tables, using
normalization and foreign
keys.
4. Use JOIN when doing
queries to join denormalized
data from multiple tables Application
Normalization
“Database normalization is the
process of structuring a relational
database in accordance with a series
of so-called normal forms in order to
reduce data redundancy and improve
data integrity. It was first proposed by
Edgar F. Codd as part of his relational
model.”
PROS: Simple write, Data Integrity
CONS: Slow read, Complex Queries
Denormalization
“Denormalization is a strategy used on
a database to increase performance. In
computing, denormalization is the
process of trying to improve the read
performance of a database, at the
expense of losing some write
performance, by adding redundant
copies of data”
PROS: Quick Read, Simple Queries
CONS: Multiple Writes, Manual Integrity
NoSQL Data Modelling
Application
1. Analyze user behaviour
(customer first!)
2. Identify workflows, their
dependencies and needs
3. Define Queries to fulfill these Models
workflows
4. Knowing the queries, design
tables, using denormalization.
5. Use BATCH when inserting or
updating denormalized data of Data
multiple tables
Let’s go practical!
Designing Process: Step by Step
Entities & Relationships
Queries
Designing Process:
Conceptual Data Model
Designing Process:
Application Workflow
Use-Case I:
● A User opens a Video Page
WF1: Find comments related to target video using its identifier, most recent first
Use-Case II + III:
● A User opens a Profile
● A Moderator verifies a User if spammer or not
WF2: Find comments related to target user using its identifier, get most recent first
Designing Process:
Mapping
Query I: Find comments posted for a user
comments_by_user
with a known id (show most recent first)
Query II: Find comments for a video with a comments_by_video
known id (show most recent first)
Designing Process:
Mapping
SELECT * FROM comments_by_user comments_by_user
WHERE userid = <some UUID>
SELECT * FROM comments_by_video comments_by_video
WHERE videoid = <some UUID>
Designing Process:
Logical Data Model
comments_by_user comments_by_video
↑ ↑
↑ ↑
Designing Process:
Physical Data Model
comments_by_user comments_by_video
UUID UUID
↑ ↑
TIMEUUID TIMEUUID
UUID UUID
TEXT TEXT
Designing Process:
Schema DDL
CREATE TABLE IF NOT EXISTS comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY ((userid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
CREATE TABLE IF NOT EXISTS comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY ((videoid), commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
menti.com
89 49 47
Developer Workshop Series Week II
➔ Keyspaces, Tables, Partitions
➔ The Art of Data Modelling
➔ Data Types
➔ What’s NEXT?
Cassandra Cloud-Native Workshop Series
Basic Data Types
Collections
I’m an ordered LIST
I’m a SET with a bunch
of unordered things 0 1 2 3 4 6
g...
somethin
a thing.. I’m a MAP of
.
an key/value pairs Key Value
r...
oth
er yet anothe K1 V1
t hin
g K2 V2
...
in g... K3 V3
sm all th
K4 V4
K5 V5
Collection: Set
{‘Family’, ‘Disney’, ‘Princess’}
{‘Thriller’, ‘Short’}
{‘Tragicomedy’, ‘Western’}
Collection: Set
Insert
Replace entire set
Add to set
Collection: List
List
Collection: List
Insert
Replace entire list
Append
Collection: List
Replace an element
Note: replacing an element requires a read-before-write, which
implies performance penalty.
Collection: Map
Collection: Map
Insert
Replace entire map
Add to map
User Defined Types
Cool
Thing!
User Defined Types
Insert
Replace entire UDT
Replace one UDT field
User Defined Types
Select field
Counters
• 64-bit signed integer
• Use-case:
• Imprecise values such as likes, views, etc.
• Two operations:
• Increment
• Decrement
• First op assumes the value is zero
Counters
• Cannot be part of primary key
• Counters not mixed with other types in table
• Value cannot be set
• Rows with counters cannot be inserted
• Updates are not idempotent
• Counters should not be used for precise
values
Counters
Counters
This format must be
observed
Incrementing a counter:
This can be an
integer value
Just change the
sign
Decrementing a counter:
menti.com
89 49 47
Developer Workshop Series Week II
➔ Keyspaces, Tables, Partitions
➔ The Art of Data Modelling
➔ Data Types
➔ What’s NEXT?
Cassandra Cloud-Native Workshop Series
📚📕 Homework Week II 📕📚
1. Do notebooks: Data Modelling and Advanced Data Types
https://github.com/DataStax-Academy/cassandra-workshop-series/tree/master/week2-DataModel
2. Watch the DS220 Course (videos only)
https://academy.datastax.com/resources/ds220
3. Validate your participation, fill the form (5 questions)
https://forms.gle/YYZcZti4r3JgCLtV9
📚📕 Bonus Week II 📕📚
● Single-Row Partitions
https://katacoda.com/datastax/courses/cassandra-intro/tables-single-row-partitions
● Multi-Row Partitions
https://katacoda.com/datastax/courses/cassandra-intro/tables-multi-row-partitions
● Advanced Data Types
https://katacoda.com/datastax/courses/cassandra-intro/advanced-data-types
Training Courses at DataStax Academy
○ DS201: Foundations of Apache Cassandra™
○ DS220: Practical Application Data Modeling
with Apache Cassandra™
Developer Resources
● Join academy.datastax.com
LEARN
● Browse www.datastax.com/dev
Join community.datastax.com
ASK/SHARE
Ask/answer community user questions - share your expertise
Follow us @DataStaxDevs
CONNECT
We are on Youtube - Twitter - Twitch!
Slides and code for this course are available at
MATERIALS https://github.com/DataStax-Academy/cassandra-workshop-series
Part 1 - How to build Applications with Cassandra
Every Wednesday (NAM/LATAM) - 9am PDT / 12pm EDT / 5pm BST / 6pm CEST / 21:30pm IST
Every Thursday (APAC/EMEA) - 8am BST / 9am CEST / 12:30 pm IST / 15 pm SGT / 17pm ACT
July 1 Getting Started with Cassandra
July 8 Data Modelling for Apache Cassandra™
July 15 Application Development with Cassandra part 1
July 22 Application Development with Cassandra part 2
Part 2 - Test, Deploy and Monitor
Every Wednesday (NAM/LATAM) - 9am PDT / 12pm EDT / 5pm BST / 6pm CEST / 21:30pm IST
Every Thursday (APAC/EMEA) - 8am BST / 9am CEST / 12:30 pm IST / 15 pm SGT / 17pm ACT
July 29 Operating your Cassandra clusters
Aug 5 Running Cassandra performance tests
Aug 12 Testing your deployments and troubleshooting
Aug 19 Deploying Cassandra with Kubernetes
menti.com
89 49 47
Thank You