SCYLLA: NoSQL at Ludicrous Speed
Duarte Nunes
@duarte_nunes
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
ScyllaDB
● Clustered NoSQL database compatible with
Apache Cassandra
● ~10x the performance on the same hardware
● Low latency, especially in the higher percentiles
● Self-tuning
● Mechanically sympathetic C++14
YCSB benchmark: a 3-node Scylla cluster vs. Cassandra clusters of 3, 9, 15, and 30 machines.
Scylla vs. Cassandra, CL:LOCAL_QUORUM (Outbrain case study): with Scylla and Cassandra both handling the full load (a peak of ~12M RPM), latency drops from ~200 ms to ~10 ms: 20x lower.
Scylla benchmark by Samsung (throughput in op/s). Full report: http://tinyurl.com/msl-scylladb
Dynamo-based system
Data model
A partition key identifies a set of rows; within a partition, rows are ordered by the clustering key(s). (Diagram: Partition Key1 → Clustering Key1, Clustering Key2, ...)
CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id));
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
Sorted by primary key.
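Because rows within a partition are stored sorted by the clustering key, scans over song_id are cheap; for example (a hypothetical query against the table above):

SELECT song_id, title FROM playlists WHERE id = 62;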
Log-Structured Merge Tree
(Diagram: SSTables 1, 2, and 3 accumulate over time; compaction merges them into SSTable 1+2+3 while new SSTables 4 and 5 keep arriving. Writing new SSTables is a foreground job; compaction is a background job.)
Request path
● Writes: appended to the commit log, applied to the memtable
● Reads: served from the memtable merged with the SSTables
A toy version of the write path follows.
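As a sketch (names and structures are illustrative; Scylla's real classes are far more elaborate):

#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy write path: append to the commit log for durability, then apply
// to the memtable; SSTables are only produced when a memtable flushes.
std::vector<std::pair<std::string, std::string>> commit_log; // append-only
std::map<std::string, std::string> memtable;                 // sorted in memory

void handle_write(const std::string& key, const std::string& value) {
    commit_log.emplace_back(key, value); // 1. sequential append (durable)
    memtable[key] = value;               // 2. sorted in-memory update
    // 3. a full memtable is sealed and flushed to disk as a new
    //    immutable SSTable; that flush is a background job.
}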
Implementation Goals
● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control:
○ Spend the cycles on what we want, when we want
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Enter Seastar
www.seastar-project.org
● Thread-per-core design (shard)
○ No blocking. Ever.
● Asynchronous networking, file I/O, multicore
● Future/promise based APIs
● Usermode TCP/IP stack included in the box
A minimal program sketch follows.
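A minimal Seastar program, sketched from the public API (include paths and helpers vary across Seastar versions):

#include <seastar/core/app-template.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    // app.run() starts one reactor thread per core and runs the lambda
    // on shard 0; the reactor exits when the returned future resolves.
    return app.run(argc, argv, [] {
        std::cout << "Hello from Seastar\n";
        return seastar::sleep(std::chrono::seconds(1));
    });
}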
Seastar task scheduler
Traditional stack vs. Seastar stack:
● Traditional stack: each unit of concurrency is a thread (a function pointer) plus a stack (a byte array from 64 KB to megabytes), multiplexed onto the CPU by the kernel scheduler. Context switch cost is high, and large stacks pollute the caches.
● Seastar stack: each unit of concurrency is a promise (a pointer to an eventually computed value) and a task (a pointer to a lambda function), multiplexed onto the CPU by Seastar's scheduler. No sharing, and millions of parallel events.
Built on Seastar: Seastar memcached, Pedis (https://github.com/fastio/pedis)
Futures
future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);           // decode the 4-byte request id
    unsigned core = id % smp::count;   // the shard owning this id
    return smp::submit_to(core, [id] { // hop to the owning shard...
        return lookup(id);             // ...and do the lookup there
    }).then([this] (sstring result) {  // back on the originating shard
        return _conn->write(result);   // reply on the connection
    });
});
No escaping the monad
future<> f = …;
f.get(); // not allowed
Unless...
future<> f = seastar::async([&] {
    future<> f = …;
    f.get();
});
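seastar::async runs the lambda inside a seastar::thread, a cooperatively scheduled userspace thread with its own stack, so calling get() there suspends only that thread rather than blocking the whole shard.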
Seastar memory allocator
● Not thread-safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ The allocator calls a callback when low on memory
○ Scylla evicts cache in response
● Inter-core free() through message passing (sketched below)
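Conceptually, a cross-shard free looks like the following sketch (owner_of() is hypothetical; Seastar's allocator does this routing internally inside free()):

#include <seastar/core/smp.hh>
#include <cstdlib>

unsigned owner_of(void* p); // hypothetical: derive owning shard from address

// Sketch: if the memory belongs to another shard, ship the pointer back
// to its owner via Seastar's inter-core message queues and free it there.
void free_cross_shard(void* p) {
    unsigned owner = owner_of(p);
    if (owner == seastar::this_shard_id()) {
        std::free(p); // fast path: local free
    } else {
        (void)seastar::smp::submit_to(owner, [p] { std::free(p); });
    }
}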
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Usermode I/O scheduler
Traditionally, I/O descends from the filesystem through the block layer to storage, and the kernel's disk I/O scheduler arbitrates between classes of requests (Class A, Class B, ...). Scylla moves that arbitration into userspace: queries, the commitlog, and compaction each feed their own queue, and Seastar's userspace I/O scheduler dispatches from those queues to the disk. A conceptual sketch follows.
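A conceptual sketch of the idea (this is not Seastar's actual API; the class names and accounting are illustrative):

#include <cstddef>
#include <deque>
#include <vector>

struct io_class {
    std::deque<std::size_t> queue; // pending request sizes, in bytes
    float shares;                  // relative priority weight
    float consumed = 0;            // normalized work done so far
};

// Pick the next request: among classes with pending work, serve the one
// that has consumed the least work relative to its shares (approximate
// weighted fair queueing).
std::size_t dispatch(std::vector<io_class>& classes) {
    io_class* best = nullptr;
    for (auto& c : classes) {
        if (!c.queue.empty() && (best == nullptr || c.consumed < best->consumed)) {
            best = &c;
        }
    }
    if (best == nullptr) {
        return 0; // nothing pending
    }
    std::size_t req = best->queue.front();
    best->queue.pop_front();
    best->consumed += req / best->shares;
    return req;
}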
Figuring out optimal disk concurrency
(Figure: measured throughput and latency vs. concurrency, locating the maximum useful disk concurrency.)
Cassandra cache: Linux page cache over SSTables
● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control
● ...on the other hand
○ Exists
○ Hundreds of man-years
○ Handling lots of edge cases
Cassandra cache: Linux page cache over SSTables
● Parasitic rows: a whole 4k SSTable page is cached to hold just 300 bytes of your data
Cassandra cache: Linux page cache over SSTables
● Page faults: when the app thread touches an unmapped page, the kernel suspends it, initiates the I/O, and context-switches away; when the SSD completes, an interrupt fires, the kernel maps the page, and another context switch resumes the thread.
Cassandra cache: key cache + row cache (on-heap / off-heap) + Linux page cache over SSTables
● Complex tuning
Scylla cache: one unified cache over the SSTables
Probabilistic Cache Warmup
● A replica with a cold cache should be sent fewer requests (a toy sketch follows)
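A toy version of that idea (the warm-up fraction and its source are hypothetical; this is not Scylla's actual heuristic):

#include <random>

// Route a request to a cold replica with probability equal to its
// warm-up fraction, which ramps from 0.0 to 1.0 as its cache fills.
bool send_to_cold_replica(double warmup_fraction) {
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    return coin(rng) < warmup_fraction;
}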
Yet another allocator
(Problems with malloc/free)
● Memory gets fragmented over time
○ If the workload changes sizes of allocated objects
○ Allocating a large contiguous block requires evicting most of the cache
(Figure: memory fragments over time; eventually a large contiguous allocation cannot be satisfied: OOM.)
Log-structured memory allocation
● Bump-pointer allocation to current segment
● Frees leave holes in segments
● Compaction will try to solve this (a toy bump allocator is sketched below)
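A minimal sketch of bump-pointer allocation into fixed-size segments (the segment size and alignment are illustrative, not Scylla's actual LSA code):

#include <cstddef>

constexpr std::size_t segment_size = 256 * 1024; // e.g. 256 KiB segments

struct segment {
    char data[segment_size];
    std::size_t offset = 0; // the bump pointer

    // Allocate n bytes by bumping the pointer; return nullptr when full,
    // at which point the allocator moves on to a fresh segment.
    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7); // round up to 8-byte alignment
        if (offset + n > segment_size) {
            return nullptr;
        }
        void* p = data + offset;
        offset += n;
        return p;
    }
};

Frees simply leave holes; compaction later migrates the live objects out of sparse segments so whole segments can be reused.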
Compacting LSA
● Teach allocator how to move objects around
○ Updating references
● Garbage collect: compact!
○ Starting with the most sparse segments
○ Lock to pin objects
● Used mostly for the cache
○ Large majority of memory allocated
○ Small subset of allocation sites
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Workload Conditioning
● Internal feedback loops to balance competing loads
○ Consume what you export
Workload Conditioning
Memtable flushes, compaction, queries, repair, and the commitlog all flow through the Seastar scheduler onto the SSD, WAN, and CPU. Feedback loops close the circuit: a compaction backlog monitor and a memory monitor watch these resources and adjust the scheduler's priorities. A toy version of such a feedback loop follows.
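A toy feedback loop (all names and constants are hypothetical, not Scylla's actual controller):

// Grow compaction's scheduler shares as the backlog of uncompacted
// data grows, so compaction keeps up without starving queries.
float compaction_shares(float backlog_bytes, float max_backlog_bytes) {
    float pressure = backlog_bytes / max_backlog_bytes; // 0.0 idle .. 1.0 full
    if (pressure > 1.0f) {
        pressure = 1.0f;
    }
    const float min_shares = 50.0f;
    const float max_shares = 1000.0f;
    return min_shares + pressure * (max_shares - min_shares);
}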
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Conclusions
● Careful system design and control of the software stack can maximize throughput
● Without sacrificing latency
● Without requiring complex end-user tuning
● While having a lot of fun
How to interact
● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Slack: ScyllaDB-Users
● Blog: http://www.scylladb.com/blog
● Join: http://www.scylladb.com/company/careers
● Me: duarte@scylladb.com
SCYLLA, NoSQL at Ludicrous Speed
Thank you.
@duarte_nunes
