When Bad Things Happen to Good Data
Understanding Anti-Entropy in Cassandra
#cassandra13
Jason Brown
@jasobrown jasedbrown@gmail.com
About me
• Senior Software Engineer, Netflix
• Apache Cassandra committer
• E-commerce Architect, Major League Baseball Advanced Media
• Wireless Developer (J2ME and BREW)
#cassandra13
Maintaining consistent state is hard in a distributed system
CAP theorem is working against you
#cassandra13
Inconsistencies creep in
• Node is down
• Network partition
• Dropped Mutations
• Process crash before flush
• File corruption
#cassandra13
Anti-Entropy Overview
• Write time
• Tunable consistency
• Atomic batches
• Hinted handoff
• Read time
• Consistent reads
• Read repair
• Maintenance time
• Node repair
#cassandra13
Write Time
#cassandra13
C* Write Basics
• Determine all replica nodes, in all DCs
• Send to all replicas in the local DC
• Send to one replica in each remote DC
• That replica forwards to its local peers
• All replicas respond back to the coordinator
#cassandra13
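The per-DC fan-out above can be sketched roughly as follows. This is illustrative Java with made-up names (replicas modelled as "dc/host" strings, delivery faked with println); Cassandra's actual write path lives in its StorageProxy code and is considerably more involved.

    import java.util.*;

    // Hypothetical sketch of the coordinator's write fan-out described above.
    public class WriteFanoutSketch {
        static void fanOut(String mutation, List<String> replicas, String localDc) {
            // group replicas by datacenter
            Map<String, List<String>> byDc = new HashMap<>();
            for (String replica : replicas) {
                String dc = replica.split("/")[0];
                byDc.computeIfAbsent(dc, k -> new ArrayList<>()).add(replica);
            }
            for (Map.Entry<String, List<String>> e : byDc.entrySet()) {
                if (e.getKey().equals(localDc)) {
                    // local DC: the coordinator writes to every replica directly
                    for (String replica : e.getValue())
                        System.out.println("send " + mutation + " to " + replica);
                } else {
                    // remote DC: write to one replica, which forwards to its local peers
                    List<String> remote = e.getValue();
                    System.out.println("send " + mutation + " to " + remote.get(0)
                            + " (forward to " + remote.subList(1, remote.size()) + ")");
                }
            }
        }

        public static void main(String[] args) {
            fanOut("INSERT ...",
                   Arrays.asList("dc1/10.0.0.1", "dc1/10.0.0.2", "dc1/10.0.0.3",
                                 "dc2/10.1.0.1", "dc2/10.1.0.2", "dc2/10.1.0.3"),
                   "dc1");
        }
    }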
Writes – request path
#cassandra13
Writes – response path
#cassandra13
Tunable consistency
The coordinator blocks until the specified count of replicas has responded
Consistency levels (a client-side sketch follows this slide):
• ANY
• ONE / TWO / THREE
• LOCAL_QUORUM
• EACH_QUORUM
• ALL
#cassandra13
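A minimal illustration of choosing a consistency level per request from a client, assuming the DataStax Java driver (2.x/3.x era API); the contact point, keyspace, and table are placeholders, not part of the talk.

    import com.datastax.driver.core.*;

    // Per-request consistency level on a write with the DataStax Java driver.
    // LOCAL_QUORUM makes the coordinator block until a quorum of local-DC
    // replicas has acknowledged the mutation.
    public class TunableConsistencyWrite {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");

            Statement insert = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (42, 'jason')")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(insert);

            cluster.close();
        }
    }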
Hinted Handoff
Save a copy of the write for down nodes, and replay later
Hint = target replica ID + mutation data
#cassandra13
Hinted Handoff - storing
• On coordinator, store hint for nodes not up
• Also, if a replica doesn’t respond within
write_request_timeout_in_ms, store a hint
• max_hint_window_in_ms – max time a node will create
hints for a dead node
#cassandra13
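Conceptually a hint is just the pairing from the earlier slide: the target replica's ID plus the mutation to replay, tagged with a creation time so the max_hint_window_in_ms cutoff can be applied. A minimal, hypothetical sketch, not Cassandra's internal representation (which 1.2 keeps in a system table):

    import java.util.UUID;

    // Hypothetical sketch of what a stored hint conceptually contains.
    public class Hint {
        final UUID targetReplicaId;      // host ID of the down/unresponsive replica
        final byte[] serializedMutation; // the write to replay later
        final long createdAtMillis;      // compared against max_hint_window_in_ms

        Hint(UUID targetReplicaId, byte[] serializedMutation) {
            this.targetReplicaId = targetReplicaId;
            this.serializedMutation = serializedMutation;
            this.createdAtMillis = System.currentTimeMillis();
        }
    }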
Hinted Handoff - replay
• Attempts to deliver stored hints to their target nodes
• Runs every ten minutes
• Multithreaded (C* 1.2)
• Throttleable (KB per second)
#cassandra13
Hinted Handoff – down node
#cassandra13
Hinted Handoff – replay
#cassandra13
What if coordinator dies?
#cassandra13
Atomic Batches
• Coordinator stores incoming mutation to two peers in same DC
• Deletes batch from peers on successful completion
• Peers will replay batch if not deleted
• Runs every 60 seconds
• With C* 1.2, all mutations use atomic batches
#cassandra13
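A hedged example of issuing a logged (atomic) batch from a client: the batchlog mechanism above is what guarantees the batch is eventually applied in full even if the coordinator dies partway through. Driver usage and table names are placeholders.

    import com.datastax.driver.core.*;

    // A logged ("atomic") batch expressed in CQL and executed via the driver.
    public class AtomicBatchExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");

            session.execute(
                "BEGIN BATCH " +
                "  INSERT INTO users (id, name) VALUES (42, 'jason'); " +
                "  INSERT INTO users_by_name (name, id) VALUES ('jason', 42); " +
                "APPLY BATCH");

            cluster.close();
        }
    }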
Read time
#cassandra13
Cassandra reads - setup
• Determine replicas to invoke
• consistency level vs. read repair
• First data node responds with the full data set, others send digests
• Coordinator waits for consistency_level nodes to respond
#cassandra13
LOCAL_QUORUM read
#cassandra13
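The read side from the client's point of view, again assuming the DataStax Java driver with placeholder contact point, keyspace, and table: a LOCAL_QUORUM read makes the coordinator wait for a quorum of local-DC replicas (one full data set plus digests) before answering.

    import com.datastax.driver.core.*;

    // A LOCAL_QUORUM read with the DataStax Java driver.
    public class LocalQuorumReadExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");

            Statement read = new SimpleStatement("SELECT name FROM users WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            Row row = session.execute(read).one();   // blocks for a local-DC quorum
            System.out.println(row == null ? "not found" : row.getString("name"));

            cluster.close();
        }
    }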
Consistent reads
• Compare digests
• If any mismatch:
• re-request full data sets from the same nodes
• compare full data sets, send updates to out-of-date replicas
• block until the out-of-date replicas respond successfully
• Return merged data set to client
#cassandra13
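A toy sketch of the digest check above, using MD5 over a textual row purely for illustration; real digests are computed over the serialized response, and the merge is per-column by timestamp.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    // Simplified sketch: one replica returns the full row, the others return a
    // hash of theirs; any mismatch triggers a full-data re-read and repair.
    public class DigestCheckSketch {
        static byte[] digest(String rowAsText) throws Exception {
            return MessageDigest.getInstance("MD5")
                    .digest(rowAsText.getBytes(StandardCharsets.UTF_8));
        }

        public static void main(String[] args) throws Exception {
            String fullDataFromReplica1 = "id=42,name=jason,ts=100";
            byte[] digestFromReplica2   = digest("id=42,name=jason,ts=100");
            byte[] digestFromReplica3   = digest("id=42,name=jason,ts=90"); // stale replica

            boolean allMatch = Arrays.equals(digest(fullDataFromReplica1), digestFromReplica2)
                            && Arrays.equals(digest(fullDataFromReplica1), digestFromReplica3);

            if (!allMatch) {
                // On mismatch the coordinator re-requests full data, merges by
                // timestamp, and pushes updates to any out-of-date replica.
                System.out.println("digest mismatch: perform full-data read and repair");
            } else {
                System.out.println("digests match: return data to client");
            }
        }
    }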
Read repair
• Synchronizes the client-requested data amongst all replicas
• Piggy-backs on normal reads, but waits for all replicas to respond (asynchronously)
• Compares the digests and follows the same algorithm as a consistent read
#cassandra13
Read Repair
Green lines = LOCAL_QUORUM nodes
Blue lines = nodes for read repair
#cassandra13
Read repair configuration
• Configured per column family
• Expressed as a percentage of all reads to the CF
• Separate settings for local-DC-only vs. global repair
#cassandra13
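In CQL these knobs are the read_repair_chance (global) and dclocal_read_repair_chance (local DC) table properties; a hedged example of setting them, with placeholder keyspace/table and example values only:

    import com.datastax.driver.core.*;

    // Setting per-column-family read repair chances via CQL.
    public class ReadRepairConfigExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo");

            session.execute("ALTER TABLE users WITH read_repair_chance = 0.0 " +
                            "AND dclocal_read_repair_chance = 0.1");

            cluster.close();
        }
    }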
Read repair fixes data that is actually requested…
but what about data that isn’t requested?
#cassandra13
Node repair - introduction
• Repairs inconsistencies across all replicas for a given range
• nodetool repair
• repairs the ranges the node contains
• one or more column families (within the same keyspace)
• can choose local datacenter only (C* 1.2)
#cassandra13
Node Repair - cautions
• Should be part of standard c* operations
• Especially if you delete data
• Repair is IO and CPU intensive
#cassandra13
Node Repair – details, 1
• Determine peer nodes with matching ranges
• Triggers a major (validation) compaction on peer nodes
• read and generate hash for every row in CF
• add result to a Merkle Tree
• return tree to initiator
#cassandra13
Node Repair – details, 2
• Initiator awaits trees from participating nodes
• Compares every tree to every other tree
• If any differences detected, the differing nodes exchange conflicting range(s)
• Written out as new, local SSTables
#cassandra13
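A toy illustration of the Merkle-tree idea from the two slides above: each replica hashes its rows into a fixed number of leaf buckets over its range (the bottom level of a Merkle tree), and the initiator compares buckets to find the ranges whose data must be exchanged. Bucket count, hashing, and names are arbitrary; Cassandra's real validation compaction, tree depth, and streaming are far more involved.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    public class MerkleSketch {
        static final int LEAVES = 8;

        static byte[] hash(byte[]... parts) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        }

        // Assign each row to a leaf bucket and fold its hash into that bucket.
        static byte[][] leafHashes(Map<String, String> rows) throws Exception {
            byte[][] leaves = new byte[LEAVES][];
            Arrays.fill(leaves, new byte[0]);
            // iterate keys in sorted order so both replicas fold rows identically
            for (String key : new TreeMap<>(rows).keySet()) {
                int bucket = Math.floorMod(key.hashCode(), LEAVES);
                byte[] rowHash = hash((key + "=" + rows.get(key)).getBytes(StandardCharsets.UTF_8));
                leaves[bucket] = hash(leaves[bucket], rowHash);
            }
            return leaves;
        }

        public static void main(String[] args) throws Exception {
            Map<String, String> replicaA = new HashMap<>(Map.of("k1", "v1", "k2", "v2", "k3", "v3"));
            Map<String, String> replicaB = new HashMap<>(replicaA);
            replicaB.put("k2", "stale");   // one divergent row

            byte[][] treeA = leafHashes(replicaA);
            byte[][] treeB = leafHashes(replicaB);

            // Mismatching leaves identify the sub-ranges the replicas must stream.
            for (int i = 0; i < LEAVES; i++)
                if (!Arrays.equals(treeA[i], treeB[i]))
                    System.out.println("range bucket " + i + " differs -> exchange data");
        }
    }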
Read Repair – example
#cassandra13
Anti-Entropy – Wrap Up
• CAP theorem lives; tradeoffs must be understood and made
• C* contains processes to make diverging data sets consistent
• Tunable controls exist at write and read times, as well as on demand
#cassandra13
Thank you!
Q & A time
@jasobrown
#cassandra13
