SCYLLA: NoSQL at Ludicrous Speed
Duarte Nunes
@duarte_nunes
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
ScyllaDB
● Clustered NoSQL database compatible with
Apache Cassandra
● ~10x the performance on the same hardware
● Low latency, especially in the higher percentiles
● Self-tuning
● Mechanically sympathetic C++14
YCSB benchmark: a 3-node Scylla cluster vs. Cassandra clusters of 3, 9, 15, and 30 machines.
Scylla vs. Cassandra, CL:LOCAL_QUORUM (Outbrain case study): with Scylla and Cassandra both handling the full load (a peak of ~12M RPM), latency drops from ~200 ms to ~10 ms: 20x lower.
Scylla benchmark by Samsung (throughput in op/s). Full report: http://tinyurl.com/msl-scylladb
Dynamo-based system
Data model
A partition key identifies a set of rows; within a partition, rows are ordered by the clustering key(s). (Diagram: Partition Key1 → Clustering Key1, Clustering Key2, ...)
CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id));
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
Sorted by primary key.
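Because rows within a partition are stored sorted by the clustering key, scans over song_id are cheap; for example (a hypothetical query against the table above):

SELECT song_id, title FROM playlists WHERE id = 62;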
Log-Structured Merge Tree
(Diagram: SSTables 1, 2, and 3 accumulate over time; compaction merges them into SSTable 1+2+3 while new SSTables 4 and 5 keep arriving. Writing new SSTables is a foreground job; compaction is a background job.)
Request path
● Writes: appended to the commit log, applied to the memtable
● Reads: served from the memtable merged with the SSTables
A toy version of the write path follows.
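As a sketch (names and structures are illustrative; Scylla's real classes are far more elaborate):

#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy write path: append to the commit log for durability, then apply
// to the memtable; SSTables are only produced when a memtable flushes.
std::vector<std::pair<std::string, std::string>> commit_log; // append-only
std::map<std::string, std::string> memtable;                 // sorted in memory

void handle_write(const std::string& key, const std::string& value) {
    commit_log.emplace_back(key, value); // 1. sequential append (durable)
    memtable[key] = value;               // 2. sorted in-memory update
    // 3. a full memtable is sealed and flushed to disk as a new
    //    immutable SSTable; that flush is a background job.
}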
Implementation Goals
● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control:
○ Spend the cycles on what we want, when we want
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Enter Seastar
www.seastar-project.org
● Thread-per-core design (shard)
○ No blocking. Ever.
● Asynchronous networking, file I/O, multicore
● Future/promise based APIs
● Usermode TCP/IP stack included in the box
A minimal program sketch follows.
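A minimal Seastar program, sketched from the public API (include paths and helpers vary across Seastar versions):

#include <seastar/core/app-template.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    // app.run() starts one reactor thread per core and runs the lambda
    // on shard 0; the reactor exits when the returned future resolves.
    return app.run(argc, argv, [] {
        std::cout << "Hello from Seastar\n";
        return seastar::sleep(std::chrono::seconds(1));
    });
}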
Seastar task scheduler
Traditional stack vs. Seastar stack:
● Traditional stack: each unit of concurrency is a thread (a function pointer) plus a stack (a byte array from 64 KB to megabytes), multiplexed onto the CPU by the kernel scheduler. Context switch cost is high, and large stacks pollute the caches.
● Seastar stack: each unit of concurrency is a promise (a pointer to an eventually computed value) and a task (a pointer to a lambda function), multiplexed onto the CPU by Seastar's scheduler. No sharing, and millions of parallel events.
Built on Seastar: Seastar memcached, Pedis (https://github.com/fastio/pedis)
Futures
future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);           // decode the 4-byte request id
    unsigned core = id % smp::count;   // the shard owning this id
    return smp::submit_to(core, [id] { // hop to the owning shard...
        return lookup(id);             // ...and do the lookup there
    }).then([this] (sstring result) {  // back on the originating shard
        return _conn->write(result);   // reply on the connection
    });
});
No escaping the monad
future<> f = …;
f.get(); // not allowed
Unless...
future<> f = seastar::async([&] {
    future<> f = …;
    f.get();
});
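seastar::async runs the lambda inside a seastar::thread, a cooperatively scheduled userspace thread with its own stack, so calling get() there suspends only that thread rather than blocking the whole shard.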
Seastar memory allocator
● Not thread-safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ The allocator calls a callback when low on memory
○ Scylla evicts cache in response
● Inter-core free() through message passing (sketched below)
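Conceptually, a cross-shard free looks like the following sketch (owner_of() is hypothetical; Seastar's allocator does this routing internally inside free()):

#include <seastar/core/smp.hh>
#include <cstdlib>

unsigned owner_of(void* p); // hypothetical: derive owning shard from address

// Sketch: if the memory belongs to another shard, ship the pointer back
// to its owner via Seastar's inter-core message queues and free it there.
void free_cross_shard(void* p) {
    unsigned owner = owner_of(p);
    if (owner == seastar::this_shard_id()) {
        std::free(p); // fast path: local free
    } else {
        (void)seastar::smp::submit_to(owner, [p] { std::free(p); });
    }
}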
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Usermode I/O scheduler
Traditionally, I/O descends from the filesystem through the block layer to storage, and the kernel's disk I/O scheduler arbitrates between classes of requests (Class A, Class B, ...). Scylla moves that arbitration into userspace: queries, the commitlog, and compaction each feed their own queue, and Seastar's userspace I/O scheduler dispatches from those queues to the disk. A conceptual sketch follows.
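A conceptual sketch of the idea (this is not Seastar's actual API; the class names and accounting are illustrative):

#include <cstddef>
#include <deque>
#include <vector>

struct io_class {
    std::deque<std::size_t> queue; // pending request sizes, in bytes
    float shares;                  // relative priority weight
    float consumed = 0;            // normalized work done so far
};

// Pick the next request: among classes with pending work, serve the one
// that has consumed the least work relative to its shares (approximate
// weighted fair queueing).
std::size_t dispatch(std::vector<io_class>& classes) {
    io_class* best = nullptr;
    for (auto& c : classes) {
        if (!c.queue.empty() && (best == nullptr || c.consumed < best->consumed)) {
            best = &c;
        }
    }
    if (best == nullptr) {
        return 0; // nothing pending
    }
    std::size_t req = best->queue.front();
    best->queue.pop_front();
    best->consumed += req / best->shares;
    return req;
}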
Figuring out optimal disk concurrency
(Figure: measured throughput and latency vs. concurrency, locating the maximum useful disk concurrency.)
Cassandra cache: Linux page cache over SSTables
● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control
● ...on the other hand
○ Exists
○ Hundreds of man-years
○ Handling lots of edge cases
Cassandra cache: Linux page cache over SSTables
● Parasitic rows: a whole 4k SSTable page is cached to hold just 300 bytes of your data
Cassandra cache: Linux page cache over SSTables
● Page faults: when the app thread touches an unmapped page, the kernel suspends it, initiates the I/O, and context-switches away; when the SSD completes, an interrupt fires, the kernel maps the page, and another context switch resumes the thread.
Cassandra cache: key cache + row cache (on-heap / off-heap) + Linux page cache over SSTables
● Complex tuning
Scylla cache: one unified cache over the SSTables
Probabilistic Cache Warmup
● A replica with a cold cache should be sent fewer requests (a toy sketch follows)
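A toy version of that idea (the warm-up fraction and its source are hypothetical; this is not Scylla's actual heuristic):

#include <random>

// Route a request to a cold replica with probability equal to its
// warm-up fraction, which ramps from 0.0 to 1.0 as its cache fills.
bool send_to_cold_replica(double warmup_fraction) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    return coin(rng) < warmup_fraction;
}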
Yet another allocator
(Problems with malloc/free)
● Memory gets fragmented over time
○ If the workload changes sizes of allocated objects
○ Allocating a large contiguous block requires evicting most of the cache
(Figure: memory fragments over time; eventually a large contiguous allocation cannot be satisfied: OOM.)
Log-structured memory allocation
● Bump-pointer allocation to current segment
● Frees leave holes in segments
● Compaction will try to solve this (a toy bump allocator is sketched below)
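A minimal sketch of bump-pointer allocation into fixed-size segments (the segment size and alignment are illustrative, not Scylla's actual LSA code):

#include <cstddef>

constexpr std::size_t segment_size = 256 * 1024; // e.g. 256 KiB segments

struct segment {
    char data[segment_size];
    std::size_t offset = 0; // the bump pointer

    // Allocate n bytes by bumping the pointer; return nullptr when full,
    // at which point the allocator moves on to a fresh segment.
    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7); // round up to 8-byte alignment
        if (offset + n > segment_size) {
            return nullptr;
        }
        void* p = data + offset;
        offset += n;
        return p;
    }
};

Frees simply leave holes; compaction later migrates the live objects out of sparse segments so whole segments can be reused.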
Compacting LSA
● Teach allocator how to move objects around
○ Updating references
● Garbage collect: compact!
○ Starting with the most sparse segments
○ Lock to pin objects
● Used mostly for the cache
○ Large majority of memory allocated
○ Small subset of allocation sites
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Workload Conditioning
● Internal feedback loops to balance competing loads
○ Consume what you export
Workload Conditioning
Memtable flushes, compaction, queries, repair, and the commitlog all flow through the Seastar scheduler onto the SSD, WAN, and CPU. Feedback loops close the circuit: a compaction backlog monitor and a memory monitor watch these resources and adjust the scheduler's priorities. A toy version of such a feedback loop follows.
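A toy feedback loop (all names and constants are hypothetical, not Scylla's actual controller):

// Grow compaction's scheduler shares as the backlog of uncompacted
// data grows, so compaction keeps up without starving queries.
float compaction_shares(float backlog_bytes, float max_backlog_bytes) {
    float pressure = backlog_bytes / max_backlog_bytes; // 0.0 idle .. 1.0 full
    if (pressure > 1.0f) {
        pressure = 1.0f;
    }
    const float min_shares = 50.0f;
    const float max_shares = 1000.0f;
    return min_shares + pressure * (max_shares - min_shares);
}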
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Conclusions
● Careful system design and control of the software stack can maximize throughput
● Without sacrificing latency
● Without requiring complex end-user tuning
● While having a lot of fun
How to interact
● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Slack: ScyllaDB-Users
● Blog: http://www.scylladb.com/blog
● Join: http://www.scylladb.com/company/careers
● Me: duarte@scylladb.com
SCYLLA, NoSQL at Ludicrous Speed
Thank you.
@duarte_nunes
