Cassandra data modelling best practices:
1. Composite Type use through an API client is not recommended.
2. Super column family use is not recommended, as it deserializes all the columns on usage, as
against deserialization of a single column.
3. We can create wide rows (a huge number of columns and relatively few rows) and skinny rows
(a small number of columns and a huge number of rows).
4. Valueless column: if Row id = {City + uid} and we want to write/read only City, then uid can be an
empty or valueless column.
5. Columns can expire based on a TTL set in seconds.
6. Counter columns are maintained to store a number that incrementally counts the occurrences of a
particular event or process. For example, you might use a counter column to count the number
of times a page is viewed.
7. Keyspace: a cluster has one keyspace per application.
Top-level container for Column Families.
Column Family: a container for Row Keys and Columns.
Row Key: the unique identifier for data stored within a Column Family.
Column: a name-value pair with an additional field: timestamp.
SuperColumn: a dictionary of Columns identified by a Row Key.
8. RandomPartitioner is the recommended partitioning scheme. It has the following advantages
over ordered partitioning as in BOP.
Random partitioner: it uses a hash on the Row Key to determine which node in the cluster will be
responsible for the data. The hash value is generated by doing MD5 on the Row Key. Each node
in the cluster in a data center is assigned sections of this hash range (tokens) and is responsible for
storing the data whose Row Key's hash value falls within this range.
Token range = 2^127 / (number of nodes in the cluster)
If the cluster is spanned across multiple data centers, the tokens are created for individual data
centers, which is preferable.
Byte Ordered Partitioner (BOP): it allows you to calculate your own tokens and assign them to
nodes yourself, as opposed to RandomPartitioner doing this for you automatically.
9. Partitioning => picking out one node to store the first copy of data on.
Replication => picking out additional nodes to store more copies of data.
Storage: commit log (durability) -> flush to memtables (in-memory structures) -> SSTables,
which compact data using compaction to remove stale data and tombstones (markers that data has
been deleted).
10. The binary protocol (CQL native protocol) is faster than Thrift.
11. Why RP?
1. RP ensures that the data is evenly distributed across all nodes in the cluster and does not create
data hotspots as BOP can.
2. When a new node is added to the cluster, RP can quickly assign it a new token range and
move the minimum amount of data from other nodes to the new node, which it is now responsible
for. With BOP, this has to be done manually.
3. Multiple Column Families issue: BOP can cause uneven distribution of data if you have
multiple column families.
4. The only benefit that BOP has over RP is that it allows you to do row slices. You can obtain a
cursor, like in an RDBMS, and move over your rows.
12. Think of a column family as a map of a map:
SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we
can use row keys and column keys to do efficient lookups and range scans.
13. The number of column keys is unbounded. In other words, you can have wide rows.
A key can itself hold a value. In other words, you can have a valueless column.
14. You need to pass the timestamp with each column value, for Cassandra to use internally for
conflict resolution. However, the timestamp can be safely ignored during modeling.
15. Start with query patterns and create an ER model. Then start denormalizing and duplicating. This
helps to identify the most frequent query patterns and isolate the less frequent ones.
Query patterns:
Get user by user id
Get item by item id
Get all the items that a particular user likes
Get all the users who like a particular item
Option 1: Exact replica of the relational model.
Option 2: Normalized entities with custom indexes.
Option 3: Normalized entities with de-normalization into custom indexes.
Option 4: Partially de-normalized entities.
(A CQL sketch of Option 3 for these query patterns follows.)
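As an illustration only, here is a minimal CQL sketch of Option 3 for the four query patterns above.
The table and column names (users, items, items_liked_by_user, users_who_like_item, liked_at) are
hypothetical; the point is that the two "likes" lookups are each served by their own denormalized
table, partitioned by the key that the query filters on.

-- Entity lookups by id
CREATE TABLE users (
  user_id text PRIMARY KEY,
  name text,
  email text
);

CREATE TABLE items (
  item_id text PRIMARY KEY,
  title text
);

-- "Get all the items that a particular user likes": partitioned by user_id
CREATE TABLE items_liked_by_user (
  user_id text,
  item_id text,
  item_title text,        -- duplicated from items to avoid a second lookup
  liked_at timestamp,
  PRIMARY KEY (user_id, item_id)
);

-- "Get all the users who like a particular item": partitioned by item_id
CREATE TABLE users_who_like_item (
  item_id text,
  user_id text,
  user_name text,         -- duplicated from users
  liked_at timestamp,
  PRIMARY KEY (item_id, user_id)
);

Every "like" is written to both index tables, so each query pattern is answered by a single-partition read.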
Keyspaces: the container for column families; a cluster has one keyspace per application.
CREATE KEYSPACE keyspace_name WITH
strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor='2';
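The statement above is the older CQL 2 / Cassandra 1.x form. As a minimal sketch (assuming a newer
Cassandra version), the equivalent CQL 3 statement expresses the same thing with a replication map:

CREATE KEYSPACE keyspace_name
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};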
Single device per row - Time Series Pattern 1
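A minimal CQL sketch of Pattern 1 (the table and column names are illustrative, chosen to mirror the
latest_temperatures example below): one weather station per row, with each reading stored as a
clustered column, so a device's row grows wide over time.

CREATE TABLE temperature (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
);

-- All readings for one device live in a single (wide) row and can be sliced by time:
SELECT temperature FROM temperature
  WHERE weatherstation_id = '1234ABCD'
  AND event_time > '2013-04-03 07:01:00'
  AND event_time < '2013-04-03 07:04:00';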
Partitioning to limit row size - Time Series Pattern 2
The solution is to use a pattern called row partitioning: add data to the row key to limit the
number of columns you get per device, as sketched below.
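A sketch of Pattern 2 under the same illustrative naming, assuming we bucket by day: the date becomes
part of the partition key, so no single row can accumulate more than one day's worth of columns per
device.

CREATE TABLE temperature_by_day (
  weatherstation_id text,
  date text,               -- e.g. '2013-04-03'; part of the partition key
  event_time timestamp,
  temperature text,
  PRIMARY KEY ((weatherstation_id, date), event_time)
);

-- Reads now supply both parts of the partition key:
SELECT temperature FROM temperature_by_day
  WHERE weatherstation_id = '1234ABCD' AND date = '2013-04-03';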
Reverse order time series with expiring columns -
Time Series Pattern 3
Consider data for a dashboard application where we only want to show the last 10 temperature
readings. With a TTL (time to live) on each value, this is possible.
CREATE TABLE latest_temperatures (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES
('1234ABCD','2013-04-03 07:03:00','72F') USING TTL 20;
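An illustrative read for the dashboard (not part of the original notes): because event_time is
clustered in descending order and expired columns disappear via the TTL, a simple LIMIT returns the
most recent readings.

SELECT event_time, temperature
  FROM latest_temperatures
  WHERE weatherstation_id = '1234ABCD'
  LIMIT 10;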
-- Relational (MySQL) version of an Inbound messages table:
create table Inbound (
  InboundID int not null primary key auto_increment,
  ParticipantID int not null,
  FromParticipantID int not null,
  Occurred date not null,
  Subject varchar(50) not null,
  Story text not null,
  foreign key (ParticipantID) references Participant(ParticipantID),
  foreign key (FromParticipantID) references Participant(ParticipantID)
);
-- Equivalent Cassandra (CQL) version: messages cluster by time within each participant's partition,
-- so the surrogate key and foreign keys are dropped.
create table Inbound (
  ParticipantID int,
  Occurred timeuuid,
  FromParticipantID int,
  Subject text,
  Story text,
  primary key (ParticipantID, Occurred)
);
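A query sketch (not in the original) for reading one participant's inbox; because Occurred is the
clustering column, rows within a partition come back in time order and can be reversed at read time.
The literal 42 and the LIMIT are placeholders.

SELECT FromParticipantID, Subject, Story
  FROM Inbound
  WHERE ParticipantID = 42
  ORDER BY Occurred DESC
  LIMIT 20;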
Data modelling steps:
1. Define the User Scenarios: this ensures User participation and commitment.
2. Define the Steps in each Scenario: clarify the User Interaction.
3. Derive the Data Model: use a Modelling Tool, such as Data Architect or ERwin, to generate SQL.
4. Relate Data Entities to each Step: create a cross-reference matrix to check results.
5. Identify Transactions for each Entity: confirm that each Entity has Transactions to load and read Data.
6. Prepare sample Data: in collaboration with the Users.
7. Prepare Test Scripts: agree sign-off with the Users.
8. Define a Load Sequence: Reference Data, basics such as Products, any existing Users or Customers, etc.
9. Run the Test Scripts: get User Sign-off to record progress.