Planet Cassandra 2014

Bulk-Loading Data into Cassandra
Patricia Gorla

@patriciagorla

Cassandra Consultant

www.thelastpickle.com
About Us
• Work with clients to deliver and improve Apache Cassandra services
• Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer
• Based in New Zealand & USA
Why is bulk loading useful?
• Performance tests
• Migrating historical data
• Changing topologies
• How Data is Stored
• Case Studies
	 - Generating Dummy Data
	 - Backfilling Historical Data
	 - Changing Topologies
• Conclusion
Cassandra Write Path
• Writes are written to both the commit log and the memtable.
• The memtable is sorted.
• The memtable is flushed out to sstables.
• Compaction helps keep the read latency low.

[Diagram: write[0] -> commitlog and memtable -> sstable[0], sstable[1], sstable[2] -> sstable[n]]
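
The write path above can be pictured with a small toy model. This is only a sketch under simplifying assumptions, not Cassandra's implementation: the class and method names are hypothetical, values are plain strings, and conflicts are resolved by flush order rather than by cell timestamps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path: commit log + sorted memtable + flushed sstables. */
final class WritePathSketch {
    // Append-only record of every write, kept for crash recovery.
    private final List<String> commitLog = new ArrayList<>();
    // In-memory table, kept sorted by key.
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Flushed tables, oldest first; never modified after flush.
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();

    /** A write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
    }

    /** Flush turns the sorted memtable into a new sstable. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(memtable);
        memtable = new TreeMap<>();
        commitLog.clear(); // flushed data no longer needs the log
    }

    /** Compaction merges all sstables into one; newer values overwrite older ones. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> sstable : sstables) {
            merged.putAll(sstable);
        }
        sstables.clear();
        sstables.add(merged);
    }

    /** A read checks the memtable, then sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }
}
```

In this model a read may have to look at every sstable, which is why fewer sstables after compaction means lower read latency.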
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
mykeyspace-mycf-jb-1-Index.db
mykeyspace-mycf-jb-1-Statistics.db
mykeyspace-mycf-jb-1-Summary.db
mykeyspace-mycf-jb-1-TOC.txt

• Data.db: contains all data needed to regenerate the other components
• Index.db: index of row keys
• Summary.db: index summary from the Index.db file
• Filter.db: Bloom filter over the sstable
• TOC.txt: table of contents of all components
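
The file names in the listing follow a keyspace-columnfamily-version-generation-Component pattern. The sketch below splits a name into those parts. It assumes exactly the five-part layout shown on the slide (other Cassandra versions name files differently), and the class and field names are hypothetical.

```java
/** Splits an sstable component file name such as mykeyspace-mycf-jb-1-Data.db. */
final class SSTableComponentName {
    final String keyspace;
    final String columnFamily;
    final String version;     // on-disk format version, e.g. "jb"
    final int generation;     // counter that distinguishes sstables of one column family
    final String component;   // e.g. "Data.db", "Index.db", "TOC.txt"

    private SSTableComponentName(String keyspace, String columnFamily,
                                 String version, int generation, String component) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    /** Assumes the five dash-separated parts seen in the listing. */
    static SSTableComponentName parse(String fileName) {
        String[] parts = fileName.split("-");
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected sstable file name: " + fileName);
        }
        return new SSTableComponentName(
                parts[0], parts[1], parts[2], Integer.parseInt(parts[3]), parts[4]);
    }

    /** True for the component that holds the row data itself. */
    boolean isData() {
        return component.equals("Data.db");
    }
}
```

All seven files of one sstable share the same keyspace, column family, version and generation, and differ only in the component suffix.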
create keyspace test
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};

create column family test
  with comparator = 'AsciiType'
  and default_validation_class = 'AsciiType'
  and key_validation_class = 'AsciiType';

Set up keyspace and column family
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
    directory,
    partitioner,
    keyspace,
    columnFamily,
    AsciiType.instance,
    null, // subcomparator for super columns
    size_per_sstable_mb
);

SStableGen.java
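// Pool of random payload bytes, reused cyclically for column values.
ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize = 0;
writer = new SSTableSimpleUnsortedWriter(…);
while (dataSize < max_data_bytes) {
    // `key` is the next row key, presumably drawn from keyGen (the call is not shown on the slide).
    writer.newRow(key);
    for (int j = 0; j < num_cols; j++) {
        ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
        ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
        // Copy the next 20 random bytes into the column value.
        randomBytes.get(colValue.array());
        colValue.position(0);
        writer.addColumn(colName, colValue, timestamp);
        // Assumption: the slide never updates dataSize, so count the bytes written here to end the loop.
        dataSize += colName.remaining() + colValue.remaining();
        // Rewind the pool when fewer than 20 bytes remain; otherwise skip ahead.
        if (randomBytes.remaining() < colValue.limit()) {
            randomBytes.position(0);
        } else {
            randomBytes.position(randomBytes.position() + colValue.limit());
        }
    }
}
// Flush any buffered rows to disk.
writer.close();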
patricia@dev:~/../data$ ls -lh mykeyspace/mycf
total 64
-rw-r--r-- 1 patricia staff  43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
-rw-r--r-- 1 patricia staff  79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
-rw-r--r-- 1 patricia staff  16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
-rw-r--r-- 1 patricia staff  36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
-rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
-rw-r--r-- 1 patricia staff  80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
-rw-r--r-- 1 patricia staff  79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt

Examining sstable output
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
• Run command on separate server
• Throttle command
• Parallelise processes
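
One way to combine throttling and parallelism is to launch several throttled loader processes from a small driver program. The sketch below is hypothetical: the class and method names are invented, and it assumes sstableloader accepts -d for the initial hosts (as in the output above) and -t for a throttle in Mbits. Check bin/sstableloader --help for the version in use before relying on either flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Builds and runs throttled sstableloader commands, a few at a time. */
final class LoaderCommands {

    /** Command line for one sstable directory; the -t flag is an assumption to verify. */
    static List<String> buildCommand(String sstableDir, String hosts, int throttleMbits) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/sstableloader");
        cmd.add("-d");
        cmd.add(hosts);
        cmd.add("-t");
        cmd.add(String.valueOf(throttleMbits));
        cmd.add(sstableDir);
        return cmd;
    }

    /** Runs the commands with at most `parallelism` at once; returns the number that failed. */
    static int runAll(List<List<String>> commands, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> exitCodes = new ArrayList<>();
            for (List<String> cmd : commands) {
                // Each task starts one loader process and waits for its exit code.
                Callable<Integer> task =
                        () -> new ProcessBuilder(cmd).inheritIO().start().waitFor();
                exitCodes.add(pool.submit(task));
            }
            int failures = 0;
            for (Future<Integer> exitCode : exitCodes) {
                if (exitCode.get() != 0) {
                    failures++;
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }
}
```

Running the driver on a machine that is not a cluster node keeps the loader's CPU and disk usage away from the servers receiving the streams.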
// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableSimpleUnsortedWriter(…);

// assume orders are in date order
for (Order order : oldOrders) {
    // one row per customer, one (valueless) column per order id
    customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
    customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

    // one row per order, keyed by order id to match the "orders by order id" layout
    orders.newRow(ByteBufferUtil.bytes(order.orderId));
    orders.addColumn(ByteBufferUtil.bytes("customer_id"), ByteBufferUtil.bytes(order.customerId), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("date"), ByteBufferUtil.bytes(order.date), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("total"), ByteBufferUtil.bytes(order.total), timestamp);
}

// close() flushes the remaining buffered rows to sstables
customerOrders.close();
orders.close();
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3

Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]

progress: [/cass1 3/3 (100)] [/cass2 0/4 (0)] [/cass3 0/0 (0)] [/cass4 0/0 (0)] [/cass5 0/0 (0)] [/cass6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
Active compaction remaining time : n/a
cqlsh> CREATE KEYSPACE "test"
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };

cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ;

CQL: Keep schema consistent
CQL3 Considerations
• Uses CompositeType comparator
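
A composite comparator matters for bulk loading because column names written with a plain AsciiType comparator do not have the layout a CQL3 table expects. As an illustration, the sketch below builds a composite name by hand, assuming the usual CompositeType layout of a 2-byte big-endian length, the component bytes, and a trailing end-of-component byte for each component. The class and method names are hypothetical; real code would pass a CompositeType comparator to the writer rather than encode bytes itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hand-rolled encoding of a composite column name, for illustration only. */
final class CompositeNameSketch {

    /** Encodes each component as: 2-byte length, value bytes, end-of-component byte (0). */
    static byte[] encode(String... components) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String component : components) {
            byte[] value = component.getBytes(StandardCharsets.UTF_8);
            if (value.length > 0xFFFF) {
                throw new IllegalArgumentException("component longer than 65535 bytes");
            }
            out.write((value.length >> 8) & 0xFF); // length, high byte
            out.write(value.length & 0xFF);        // length, low byte
            out.write(value, 0, value.length);     // component bytes
            out.write(0);                          // end-of-component marker
        }
        return out.toByteArray();
    }
}
```

For example, encode("col") yields six bytes: 0, 3, the three bytes of "col", and a closing 0.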
Planet Cassandra 2014

Q&A
Patricia Gorla

@patriciagorla

Cassandra Consultant

www.thelastpickle.com
