KEMBAR78
Apache Cassandra & Data Modeling | PDF
Apache  Cassandra  
and  data  modeling
©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
About  Me
Massimiliano  Tomassi
Software  Engineer  at  DataStax
max.tomassi@datastax.com
@max_tomassi
2
CONTEXTUAL
Characteristics  of  cloud  applications
ALWAYS-­ON DISTRIBUTED SCALABLEREAL-­TIME
©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
DataStax  provides  data  management  
for  cloud  applications.
©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
From  validation  to  momentum.
400+
Employees
$190M
Funding
500+
Customers
Founded  in  April  2010
Santa  Clara  •  San  Francisco  •  Austin  •  
London  •  Paris  •  Berlin  •  Tokyo  •  Sydney
(Series  E  – Sept.  2014) 30%    +
2016  World’s  Best
100  Cloud  Companies  
Ranked  #1  in  multiple  operational
database  categories
©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
Products:  DataStax Enterprise  (DSE)  
6
Why  Apache  Cassandra?
7
Distributed
8
Distributed  architecture
9
Masterless architecture
(easier  to  scale)
10
CREATE TABLE myapp.measurements (
sensor_id uuid,
time timestamp,
value double
PRIMARY KEY ((sensor_id), time)
);
PARTITION  
KEY
CLUSTERNING  
KEY
Distributed  architecture
11
INSERT INTO myapp.measurements
(sensor_id, time, value) VALUES
(100, '2017-01-30 11:44:42', 980.50);
Cassandra  Query  Language  (CQL)
Distributed  architecture
12
INSERT INTO myapp.measurements
(sensor_id, time, value) VALUES
(100, '2017-01-30 11:44:42', 980.50);
Coordinator
Distributed  architecture
13
INSERT INTO myapp.measurements
(sensor_id, time, value) VALUES
(100, '2017-01-30 11:44:42', 980.50);
Coordinator
Distributed  architecture
14
INSERT INTO myapp.measurements
(sensor_id, time, value) VALUES
(100, '2017-01-30 11:44:42', 980.50);
Coordinator
Hash  function
59 token
Distributed  architecture
15
INSERT INTO myapp.measurements
(sensor_id, time, value) VALUES
(100, '2017-01-30 11:44:42', 980.50);
Coordinator
Hash  function
59 token
Distributed  architecture
16
Coordinator
Replicated
17
Replication
18
RF  =  1
19
Replication
RF  =  3
Each  token  range  is  replicated  
into  RF numbers  of  nodes.
Replication
20
RF  =  3
Replication
21
RF  =  3
Replication
22
RF  =  3
Consistency?
CL  =  ONE
CL  =  QUORUM
CL  =  ALL
Multi  data  center
23
Geographically  distributed
24
New  York London
Separate  different  workloads
25
OLTP Analytics
Hybrid  on  premise/cloud  deployment  
26
Automatic  replication  across  datacenters
27
Data  Center  1
Data  Center  2
Automatic  replication  across  datacenters
28
Data  Center  1
Data  Center  2
Use  Cases
29
Use  Cases
30
31
https://www.meetup.com/Rome-­Cassandra-­Users/
32
Data  Modeling
33
The  relational  way
34
Authors Books
1
*
CREATE TABLE books (
book_id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(50),
release_date date,
author_id int,
PRIMARY KEY (book_id),
FOREIGN KEY (author_id) REFERENCES authors(id)
)
The  relational  way
SELECT *
FROM books
WHERE id = 9876;
35
QUERY1:  Find  a  book  by  its  id
CREATE TABLE books (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(50),
release_date date,
author_id int,
PRIMARY KEY (book_id),
FOREIGN KEY (author_id) REFERENCES authors(id)
)
The  relational  way
36
SELECT id, name, release_date
FROM books
WHERE author_id = 12345
ORDER BY release_date DESC;
QUERY2:  Find  all  books  for  an  author  sorted  by  release  date  descending  
CREATE TABLE books (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(50),
release_date date,
author_id int,
PRIMARY KEY (book_id),
FOREIGN KEY (author_id) REFERENCES authors(id)
)
The  Cassandra  way
37
• No  foreign  keys
• No  JOINS
• No  filtering  allowed  on  non  primary  key  columns  (could  use  secondary  indexes  but…)
So  what?
• Denormalize!
• Write  your  data  the  way  you  want  to  read  it,  even  if  it  means  duplication
• Data  should  be  read  from  a  SINGLE node  and  in  a  SEQUENTIAL way  (choose  your  
partition  key  and  clustering  key  wisely)
The  Cassandra  way
38
QUERY  1:  Find  a  book  by  its  id
CREATE TABLE books (
book_id uuid,
release_date timestamp,
name text,
PRIMARY KEY (book_id)
);
SELECT *
FROM books
WHERE book_id = 9876;
The  Cassandra  way
39
QUERY  2:  Find  all  books  for  an  author  sorted  by  release  date  descending
CREATE TABLE books_by_author (
author_id uuid,
release_date timestamp,
book_id uuid,
name text
PRIMARY KEY ((author_id), release_date, book_id)
)
WITH CLUSTERING ORDER BY (release_date DESC);
SELECT book_id, name, release_date
FROM books_by_author
WHERE author_id = 12345;
40
CREATE TABLE myapp.measurements (
sensor_id uuid,
time timestamp,
value double
PRIMARY KEY ((sensor_id), time)
);
PARTITION  
KEY
CLUSTERNING  
KEY
How’s  data  stored?
41
sensor_id 100
timestamp value
2017-01-30 11:00:00 75.90
2017-01-30 11:30:00 112.50
2017-01-30 12:00:00 45
2017-01-30 12:30:00 92.30
2017-01-30 13:00:00 67.15
2017-01-30 13:30:00 32.20
SINGLE  LARGE  PARTITION
Large  partitions  causing  issues  before  C*  3.6
• Slow  reads
• Compaction  issues
• Repair  issues
42
C*  3.6  mitigated  this  issue
• But  still  be  aware  of  that  and  always  TEST  your  data  model  against  your  workload
43
CREATE TABLE myapp.measurements (
sensor_id uuid,
time_bucket timestamp,
time timestamp,
value double
PRIMARY KEY ((sensor_id, time_bucket), time)
);
PARTITION  
KEY
CLUSTERNING  
KEY
Possible  alternative
sensor_id 100
timestamp value
time_bucket 2017-01-30 12:00:00
44
sensor_id 100
timestamp value
2017-01-30 11:00:00 75.90
2017-01-30 11:30:00 112.50
2017-01-30 12:00:00 45
2017-01-30 12:30:00 92.30
time_bucket 2017-01-30 11:00:00
MULTIPLE  PARTITIONS
Caveat:  choose  your  time_bucket wisely!
• Different  partitions  will  be  stored  into  different  nodes
• Use  a  time_bucket that  can  satisfies  the  way  you  want  to  read  your  data  
• (e.g.  don’t  use  an  hourly  time  bucket  if  you  want  to  extract  data  by  day)
45
A  real(ish)  example!
KillrVideo
killrvideo.github.io
46
47
Find  video  by  id
CREATE TABLE videos (
videoid uuid,
userid uuid,
name text,
description text,
location text,
location_type int,
preview_image_location text,
tags set<text>,
added_date timestamp,
PRIMARY KEY (videoid)
);
48
49
Get  number  of  views  by  video  id
First  attempt
CREATE TABLE video_playback_stats (
videoid uuid,
views int,
PRIMARY KEY (videoid)
);
SELECT views
FROM video_playback_stats where videoid = 12345;
UPDATE video_playback_stats SET views = 101
WHERE videoid = 12345;
50
100
Get  number  of  views  by  video  id
Problem:  concurrent  access
51
User  1:              Read(views:  100)                                                                                                  Write(views:  101)
User  2:                                          Read(views:  100)            Write(views:  101)
We  lost  an  update
Get  number  of  views  by  video  id
Better  solution
CREATE TABLE video_playback_stats (
videoid uuid,
views counter,
PRIMARY KEY (videoid)
);
UPDATE video_playback_stats
SET views = views + 1 WHERE videoid = 12345;
52
Counter  type
53
• Special  column  used  to  store  a  number  that  is  changed  in  
increments
• All  non-­counter  columns  in  the  table  must  be  defined  as  part  of  the  
primary  key
• NON  IDEMPOTENT  OPERATIONS!  Might  lead  to  a  non  100%  
accurate  count
54
Get  average  ratings  by  video  id
CREATE TABLE video_ratings (
videoid uuid,
rating_counter counter,
rating_total counter,
PRIMARY KEY (videoid)
);
UPDATE video_ratings
SET rating_counter = rating_counter + 1,
rating_total = rating_total + 5
WHERE videoid = 12345;
55
• Cassandra  doesn’t  provide  aggregation  operations  
(SUM/MIN/MAX)
56
Find  latest  comments  by  video  id
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY ((videoid), commentid)
)
WITH CLUSTERING ORDER BY (commentid DESC);
57
Where  do  we  get  this  from?
Client  side  JOINs
58
videoid K
commentid C ↓
userid
comment
comments_by_video
userid K
(… details …)
users
SELECT *
FROM comments_by_video
WHERE videoid = 12345;
For each comment in result:
SELECT *
FROM users
WHERE userid = <comment.userid>
(Chebotko diagrams)
59
Find  latest  videos  order  by  added_date desc
CREATE TABLE latest_videos (
yyyymmdd text,
added_date timestamp,
videoid uuid,
userid uuid,
name text,
preview_image_location text,
PRIMARY KEY ((yyyymmdd), added_date, videoid)
)
WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
60
• How  can  be  the  data  partitioned?
Mind  the  hotspots!
• All  the  videos  added  the  same  day  will  be  stored  on  the  same  node.  
This  means  that  all  the  writes  into  that  table  will  go  to  the  same  node  
for  24  hours.
• Can  be  mitigated  by  splitting  the  row  using  an  arbitrary  group  
number,  making  the  partition  key  (yyyymmdd, group_number)
61
62
TAG
Find  videos  by  tag
CREATE TABLE videos_by_tag (
tag text,
videoid uuid,
added_date timestamp,
name text,
preview_image_location text,
tagged_date timestamp,
userid uuid,
PRIMARY KEY ((tag), videoid)
);
63
Attention:  unbounded  duplication
• If  a  user  adds  50  tags  to  a  video,  the  same  data  is  duplicated  50  
times  for  a  single  video.  
• The  duplication  factor  is  not  under  our  control:  potential  risk  for  data  
growing  very  quickly
• Consider  limiting  the  number  of  tags  the  user  can  use  on  the  
application  side
64
65
First  attempt
66
CREATE TABLE IF NOT EXISTS users (
userid uuid,
firstname text,
lastname text,
email text,
password, text
created_date timestamp,
PRIMARY KEY (userid)
);
We  have  to  support  lookup  by  email
Better  solution
67
CREATE TABLE user_credentials (
email text,
password text,
userid uuid,
PRIMARY KEY (email)
);
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
Mind  the  overwrites!
68
INSERT INTO user_credentials (email, password, userid)
VALUES (’max.tomassi@datastax.com’, ‘xxx’, 123456);
INSERT INTO user_credentials (email, password, userid)
VALUES (’max.tomassi@datastax.com’, ‘yyy’, 98765);
email password userid
max.tomassi@datastax.com xxx 123456
email password userid
max.tomassi@datastax.com yyy 98765
Lightweight  Transactions
69
INSERT INTO user_credentials (email, password, userid)
VALUES (’max.tomassi@datastax.com’, ‘xxx’, 123456)
IF NOT EXISTS
70
Find  videos  by  user  order  by  added_date desc
CREATE TABLE user_videos (
userid uuid,
added_date timestamp,
videoid uuid,
name text,
preview_image_location text,
PRIMARY KEY ((userid), added_date, videoid))
WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
71
Data  duplication
72
• Tables  videos, user_videos, latest_videos,
videos_by_tag all  store  similar  information  about  videos
• The  same  data  is  duplicated  4  times:  disk  space  consumed  more  
quickly
Alternative
• Store  the  video  information  in  the  videos table  only
• The  other  3  tables  only  store  the  video_id,  the  data  is  joined  on  the  
client  side
• We  save  disk  space  but  we  lose  performance
Data  consistency
73
• Every  time  a  new  video  is  added  we  need  to  insert  a  new  record  into  
4  tables.  Those  records  must  be  kept  in  sync
• We  can  use  batches  for  this,  Cassandra  will  ensure  that  all  the  
statements  will  succeed  
BEGIN BATCH
INSERT INTO videos (…) VALUES (…);
INSERT INTO user_videos (…) VALUES (…);
INSERT INTO latest_videos (…) VALUES (…);
INSERT INTO videos_by_tag (…) VALUES (…);
APPLY BATCH;
74
Find  latest  comments  by  user
CREATE TABLE comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY ((userid), commentid))
WITH CLUSTERING ORDER BY (commentid DESC);
75
Q&A
76
Thank  you!
For  more  information  and  training
www.datastax.com
academy.datastax.com
77

Apache Cassandra & Data Modeling

  • 1.
    Apache  Cassandra   and data  modeling ©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
  • 2.
    About  Me Massimiliano  Tomassi Software Engineer  at  DataStax max.tomassi@datastax.com @max_tomassi 2
  • 3.
    CONTEXTUAL Characteristics  of  cloud applications ALWAYS-­ON DISTRIBUTED SCALABLEREAL-­TIME ©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
  • 4.
    DataStax  provides  data management   for  cloud  applications. ©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
  • 5.
    From  validation  to momentum. 400+ Employees $190M Funding 500+ Customers Founded  in  April  2010 Santa  Clara  •  San  Francisco  •  Austin  •   London  •  Paris  •  Berlin  •  Tokyo  •  Sydney (Series  E  – Sept.  2014) 30%    + 2016  World’s  Best 100  Cloud  Companies   Ranked  #1  in  multiple  operational database  categories ©  2017  DataStax,  All  Rights  Reserved.   Company  Confidential
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
    10 CREATE TABLE myapp.measurements( sensor_id uuid, time timestamp, value double PRIMARY KEY ((sensor_id), time) ); PARTITION   KEY CLUSTERNING   KEY
  • 11.
    Distributed  architecture 11 INSERT INTOmyapp.measurements (sensor_id, time, value) VALUES (100, '2017-01-30 11:44:42', 980.50); Cassandra  Query  Language  (CQL)
  • 12.
    Distributed  architecture 12 INSERT INTOmyapp.measurements (sensor_id, time, value) VALUES (100, '2017-01-30 11:44:42', 980.50); Coordinator
  • 13.
    Distributed  architecture 13 INSERT INTOmyapp.measurements (sensor_id, time, value) VALUES (100, '2017-01-30 11:44:42', 980.50); Coordinator
  • 14.
    Distributed  architecture 14 INSERT INTOmyapp.measurements (sensor_id, time, value) VALUES (100, '2017-01-30 11:44:42', 980.50); Coordinator Hash  function 59 token
  • 15.
    Distributed  architecture 15 INSERT INTOmyapp.measurements (sensor_id, time, value) VALUES (100, '2017-01-30 11:44:42', 980.50); Coordinator Hash  function 59 token
  • 16.
  • 17.
  • 18.
  • 19.
    19 Replication RF  =  3 Each token  range  is  replicated   into  RF numbers  of  nodes.
  • 20.
  • 21.
  • 22.
    Replication 22 RF  =  3 Consistency? CL =  ONE CL  =  QUORUM CL  =  ALL
  • 23.
  • 24.
  • 25.
  • 26.
    Hybrid  on  premise/cloud deployment   26
  • 27.
    Automatic  replication  across datacenters 27 Data  Center  1 Data  Center  2
  • 28.
    Automatic  replication  across datacenters 28 Data  Center  1 Data  Center  2
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
    The  relational  way 34 AuthorsBooks 1 * CREATE TABLE books ( book_id INT NOT NULL AUTO_INCREMENT, name VARCHAR(50), release_date date, author_id int, PRIMARY KEY (book_id), FOREIGN KEY (author_id) REFERENCES authors(id) )
  • 35.
    The  relational  way SELECT* FROM books WHERE id = 9876; 35 QUERY1:  Find  a  book  by  its  id CREATE TABLE books ( id INT NOT NULL AUTO_INCREMENT, name VARCHAR(50), release_date date, author_id int, PRIMARY KEY (book_id), FOREIGN KEY (author_id) REFERENCES authors(id) )
  • 36.
    The  relational  way 36 SELECTid, name, release_date FROM books WHERE author_id = 12345 ORDER BY release_date DESC; QUERY2:  Find  all  books  for  an  author  sorted  by  release  date  descending   CREATE TABLE books ( id INT NOT NULL AUTO_INCREMENT, name VARCHAR(50), release_date date, author_id int, PRIMARY KEY (book_id), FOREIGN KEY (author_id) REFERENCES authors(id) )
  • 37.
    The  Cassandra  way 37 •No  foreign  keys • No  JOINS • No  filtering  allowed  on  non  primary  key  columns  (could  use  secondary  indexes  but…) So  what? • Denormalize! • Write  your  data  the  way  you  want  to  read  it,  even  if  it  means  duplication • Data  should  be  read  from  a  SINGLE node  and  in  a  SEQUENTIAL way  (choose  your   partition  key  and  clustering  key  wisely)
  • 38.
    The  Cassandra  way 38 QUERY 1:  Find  a  book  by  its  id CREATE TABLE books ( book_id uuid, release_date timestamp, name text, PRIMARY KEY (book_id) ); SELECT * FROM books WHERE book_id = 9876;
  • 39.
    The  Cassandra  way 39 QUERY 2:  Find  all  books  for  an  author  sorted  by  release  date  descending CREATE TABLE books_by_author ( author_id uuid, release_date timestamp, book_id uuid, name text PRIMARY KEY ((author_id), release_date, book_id) ) WITH CLUSTERING ORDER BY (release_date DESC); SELECT book_id, name, release_date FROM books_by_author WHERE author_id = 12345;
  • 40.
    40 CREATE TABLE myapp.measurements( sensor_id uuid, time timestamp, value double PRIMARY KEY ((sensor_id), time) ); PARTITION   KEY CLUSTERNING   KEY How’s  data  stored?
  • 41.
    41 sensor_id 100 timestamp value 2017-01-3011:00:00 75.90 2017-01-30 11:30:00 112.50 2017-01-30 12:00:00 45 2017-01-30 12:30:00 92.30 2017-01-30 13:00:00 67.15 2017-01-30 13:30:00 32.20 SINGLE  LARGE  PARTITION
  • 42.
    Large  partitions  causing issues  before  C*  3.6 • Slow  reads • Compaction  issues • Repair  issues 42 C*  3.6  mitigated  this  issue • But  still  be  aware  of  that  and  always  TEST  your  data  model  against  your  workload
  • 43.
    43 CREATE TABLE myapp.measurements( sensor_id uuid, time_bucket timestamp, time timestamp, value double PRIMARY KEY ((sensor_id, time_bucket), time) ); PARTITION   KEY CLUSTERNING   KEY Possible  alternative
  • 44.
    sensor_id 100 timestamp value time_bucket2017-01-30 12:00:00 44 sensor_id 100 timestamp value 2017-01-30 11:00:00 75.90 2017-01-30 11:30:00 112.50 2017-01-30 12:00:00 45 2017-01-30 12:30:00 92.30 time_bucket 2017-01-30 11:00:00 MULTIPLE  PARTITIONS
  • 45.
    Caveat:  choose  your time_bucket wisely! • Different  partitions  will  be  stored  into  different  nodes • Use  a  time_bucket that  can  satisfies  the  way  you  want  to  read  your  data   • (e.g.  don’t  use  an  hourly  time  bucket  if  you  want  to  extract  data  by  day) 45
  • 46.
  • 47.
  • 48.
    Find  video  by id CREATE TABLE videos ( videoid uuid, userid uuid, name text, description text, location text, location_type int, preview_image_location text, tags set<text>, added_date timestamp, PRIMARY KEY (videoid) ); 48
  • 49.
  • 50.
    Get  number  of views  by  video  id First  attempt CREATE TABLE video_playback_stats ( videoid uuid, views int, PRIMARY KEY (videoid) ); SELECT views FROM video_playback_stats where videoid = 12345; UPDATE video_playback_stats SET views = 101 WHERE videoid = 12345; 50 100
  • 51.
    Get  number  of views  by  video  id Problem:  concurrent  access 51 User  1:              Read(views:  100)                                                                                                  Write(views:  101) User  2:                                          Read(views:  100)            Write(views:  101) We  lost  an  update
  • 52.
    Get  number  of views  by  video  id Better  solution CREATE TABLE video_playback_stats ( videoid uuid, views counter, PRIMARY KEY (videoid) ); UPDATE video_playback_stats SET views = views + 1 WHERE videoid = 12345; 52
  • 53.
    Counter  type 53 • Special column  used  to  store  a  number  that  is  changed  in   increments • All  non-­counter  columns  in  the  table  must  be  defined  as  part  of  the   primary  key • NON  IDEMPOTENT  OPERATIONS!  Might  lead  to  a  non  100%   accurate  count
  • 54.
  • 55.
    Get  average  ratings by  video  id CREATE TABLE video_ratings ( videoid uuid, rating_counter counter, rating_total counter, PRIMARY KEY (videoid) ); UPDATE video_ratings SET rating_counter = rating_counter + 1, rating_total = rating_total + 5 WHERE videoid = 12345; 55 • Cassandra  doesn’t  provide  aggregation  operations   (SUM/MIN/MAX)
  • 56.
  • 57.
    Find  latest  comments by  video  id CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY ((videoid), commentid) ) WITH CLUSTERING ORDER BY (commentid DESC); 57 Where  do  we  get  this  from?
  • 58.
    Client  side  JOINs 58 videoidK commentid C ↓ userid comment comments_by_video userid K (… details …) users SELECT * FROM comments_by_video WHERE videoid = 12345; For each comment in result: SELECT * FROM users WHERE userid = <comment.userid> (Chebotko diagrams)
  • 59.
  • 60.
    Find  latest  videos order  by  added_date desc CREATE TABLE latest_videos ( yyyymmdd text, added_date timestamp, videoid uuid, userid uuid, name text, preview_image_location text, PRIMARY KEY ((yyyymmdd), added_date, videoid) ) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC); 60 • How  can  be  the  data  partitioned?
  • 61.
    Mind  the  hotspots! •All  the  videos  added  the  same  day  will  be  stored  on  the  same  node.   This  means  that  all  the  writes  into  that  table  will  go  to  the  same  node   for  24  hours. • Can  be  mitigated  by  splitting  the  row  using  an  arbitrary  group   number,  making  the  partition  key  (yyyymmdd, group_number) 61
  • 62.
  • 63.
    Find  videos  by tag CREATE TABLE videos_by_tag ( tag text, videoid uuid, added_date timestamp, name text, preview_image_location text, tagged_date timestamp, userid uuid, PRIMARY KEY ((tag), videoid) ); 63
  • 64.
    Attention:  unbounded  duplication •If  a  user  adds  50  tags  to  a  video,  the  same  data  is  duplicated  50   times  for  a  single  video.   • The  duplication  factor  is  not  under  our  control:  potential  risk  for  data   growing  very  quickly • Consider  limiting  the  number  of  tags  the  user  can  use  on  the   application  side 64
  • 65.
  • 66.
    First  attempt 66 CREATE TABLEIF NOT EXISTS users ( userid uuid, firstname text, lastname text, email text, password, text created_date timestamp, PRIMARY KEY (userid) ); We  have  to  support  lookup  by  email
  • 67.
    Better  solution 67 CREATE TABLEuser_credentials ( email text, password text, userid uuid, PRIMARY KEY (email) ); CREATE TABLE users ( userid uuid, firstname text, lastname text, email text, created_date timestamp, PRIMARY KEY (userid) );
  • 68.
    Mind  the  overwrites! 68 INSERTINTO user_credentials (email, password, userid) VALUES (’max.tomassi@datastax.com’, ‘xxx’, 123456); INSERT INTO user_credentials (email, password, userid) VALUES (’max.tomassi@datastax.com’, ‘yyy’, 98765); email password userid max.tomassi@datastax.com xxx 123456 email password userid max.tomassi@datastax.com yyy 98765
  • 69.
    Lightweight  Transactions 69 INSERT INTOuser_credentials (email, password, userid) VALUES (’max.tomassi@datastax.com’, ‘xxx’, 123456) IF NOT EXISTS
  • 70.
  • 71.
    Find  videos  by user  order  by  added_date desc CREATE TABLE user_videos ( userid uuid, added_date timestamp, videoid uuid, name text, preview_image_location text, PRIMARY KEY ((userid), added_date, videoid)) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC); 71
  • 72.
    Data  duplication 72 • Tables videos, user_videos, latest_videos, videos_by_tag all  store  similar  information  about  videos • The  same  data  is  duplicated  4  times:  disk  space  consumed  more   quickly Alternative • Store  the  video  information  in  the  videos table  only • The  other  3  tables  only  store  the  video_id,  the  data  is  joined  on  the   client  side • We  save  disk  space  but  we  lose  performance
  • 73.
    Data  consistency 73 • Every time  a  new  video  is  added  we  need  to  insert  a  new  record  into   4  tables.  Those  records  must  be  kept  in  sync • We  can  use  batches  for  this,  Cassandra  will  ensure  that  all  the   statements  will  succeed   BEGIN BATCH INSERT INTO videos (…) VALUES (…); INSERT INTO user_videos (…) VALUES (…); INSERT INTO latest_videos (…) VALUES (…); INSERT INTO videos_by_tag (…) VALUES (…); APPLY BATCH;
  • 74.
  • 75.
    Find  latest  comments by  user CREATE TABLE comments_by_user ( userid uuid, commentid timeuuid, videoid uuid, comment text, PRIMARY KEY ((userid), commentid)) WITH CLUSTERING ORDER BY (commentid DESC); 75
  • 76.
  • 77.
    Thank  you! For  more information  and  training www.datastax.com academy.datastax.com 77