CS 542 -- Concurrency Control, Distributed Commit

CS 542 Database Management SystemsConcurrency ControlCommit in Distributed SystemsJ Singh April 11, 2011

Today’s MeetingConcurrency ControlIntention LocksIndex LockingOptimistic CCValidationTimestamp OrderingMulti-version CCCommit in Distributed DatabasesTwo Phase CommitPaxos AlgorithmConcluding thoughtsReferences (aside from textbook): Concurrency Control and Recovery in Database Systems, Philip A. Bernstein, VassosHadzilacos, Nathan Goodman, Microsoft Research.Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998 Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007

Scheduler Architecture for CCScheduler has two partsAccepts read/write requests from transactionsAssures serializationKeeps track of active and pending transactions Controls commit, abort, delayToday’s lecture discusses Part 2 functionality

The Lock TableA relation that associates database elements with locking information about that elementImplemented as a hash tableSize is proportional to the number of lock elements, not to the size of the entire databaseDB element ALock information for A

Scheduler Priority LogicWhen a transaction releases a lock that other transactions are waiting for, what policy to use?First-Come-First-Served: Grant the lock to the longest waiting request. No starvation (waiting forever for lock)Priority to Shared Locks: Grant all S locks waiting, then one X lock. Grant X lock if no others waitingPriority to Upgrading: If there is a U lock waiting to upgrade to an X lock, grant that first.Each has its advantages and disadvantagesConfigurable for a database instance

Motivation for intention locksBesides scanning through the table, if we need to modify a few tuples. What kind of lock to put on the table?Have to be X (if we only have S or X).But, blocks all other read requests!

Intention LocksAllow intention locks IS, IX.Before S locking an item, must IS lock the root.Before X locking an item, must IX lock the root.Should make sure:If Ti S locks a node, no Tj can X lock an ancestor.Achieved if S conflicts with IXIf TjX locks a node, no Tican S or X lock an ancestor.Achieved if X conflicts with IS and IX.

Allowed Lock SharingsLock RequesterIXSSIXXIS ÖÖÖÖÖISIXÖÖLock HolderSÖÖSIXÖX

Multiple Granularity Lock ProtocolEach txn starts from the root of the hierarchy.To get a lock on any node, must hold an intentional lock on its parent node!E.g. to get S lock on a node, must hold IS or IX on parent.E.g. to get X lock on a node, must hold IX or SIX on parent.Full table of rules:Must release locks in bottom-up order.

Example 1T1(IS)T1(S)T1 needs a shared lock on t2T2 needs a shared lock on R1, T2(S)R1t1t4t2t3

Example 2T1(IS), T2(IX)T2(IX)T1(S)T1 needs a shared lock on t2T2 needs an exclusive lock on t4No conflictR1t1t4t2t3

Examples 3, 4, 5T1 scans R, and updates a few tuples:T1 gets an SIX lock on R, and occasionally upgrades to X on the tuples.T2 uses an index to read only part of R:T2 gets an IS lock on R, and repeatedly gets an S lock on tuples of R.T3 reads all of R:T3 gets an S lock on R. OR, T3 could behave like T2; can use lock escalationas it goes.Lock RequesterIXSSIXXIS ÖÖÖÖÖISIXÖÖLock HolderSÖÖSIXÖX

Insert and DeleteTransactionsT1:SELECT MAX(Price) WHERE Rating = 1;SELECT MAX(Price) WHERE Rating = 2;T2:INSERT <Apple, Arkansas Black, 1, 96>;DELETE WHERE Rating = 2 AND Price = (SELECT MAX(Price) WHERE Rating = 2);ExecutionT1 locks all records w/Rating=1 and gets 80.T2 inserts <Arkansas Black, 96>T2 deletes <Fuji, 75>T1 locks all records w/Rating=2 and gets 65.Result:

T2 then T1: 96, 65Insert and Delete RulesWhen T1 inserts t1 into R,Give X lock on t1 to T1When T2 deletes t2 from R,It must obtain an X lock on t2This will fix the Fuji delete problem (how so?)But there is still a problem: Phantom Reads. Seen with Arkansas Black in the exampleSolution: use multiple granularity treeBefore inserting Q, obtain an X lock for parent(Q)

Did Insert/Delete expose a flaw in 2PL?The flaw was with the assumption that by locking all tuples, T1 had locked the set!We needed to lock the setWould we bottleneck on the relation if the workload were insert- and delete-heavy?There is another way to solve the problem:Lock at the index (if one exists)Since B+ trees are not 100% full, we can maintain multiple locks in different sections of the tree.IndexPut a lock here.r=1

Index Locking (p1)Higher levels of the tree only direct searches for leaf pages.For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course), only if a split can propagate up to it from the modified leaf. (Similar point holds w.r.t. deletes.)We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.

Index Locking (p2)Search: Start at root and go down; repeatedly, S lock child then unlock parent.Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe:If child is safe, release all locks on ancestors.Safe node: Node such that changes will not propagate up beyond this node.Inserts: Node is not full.Deletes: Node is not half-empty.

ExampleROOTWhere to lock?1) Delete 38*2) Insert 45*3) Insert 25*A20B35CF384423HDEGI20*22*23*24*35*36*38*41*44*

Optimistic CCLocking is a conservative approach in which conflicts are prevented. Disadvantages:Lock management overhead.Deadlock detection/resolution.Not discussed in CS-542 lectures, expecting that you are familiar with itIf conflicts are rare, we may be able to gain performance by not locking, and instead checking for conflicts before txns commit.Two approachesKung-Robinson ModelDivides every transaction into three phases: read, validate, writeMakes commit/abort decision based on what’s being read and writtenTimestamp Ordering AlgorithmsClever use of timestamps to determine which operations are conflict-free and which must be aborted

Kung-Robinson ModelKey idea:Let transactions work in isolationValidate reads and writes when ready to commitMake Validation AtomicValidated ≡ CommittedTransactions have three phases:READ: txns read from the database, make changes to private copies of objects.VALIDATE: Check if schedule so far is serializable.WRITE: Make local copies of changes public.oldROOTmodifiedobjectsnew

ValidationTest conditions that are sufficient to ensure that no conflict occurred.Each txn is assigned a numeric id.Just use a timestamp.Transaction ids assigned at end of READ phase, just before validation begins. ReadSet(Ti): Set of objects read by txn Ti.WriteSet(Ti): Set of objects modified by Ti.Validation is atomicDone in a critical section

Validation TestsTestFIN(Ti) < START(Tj)FIN(Ti) < VAL(Tj) ANDWriteSet(Ti ) ∩ReadSet(Tj ) is empty.VAL(Ti) < VAL(Tj) ANDWriteSet(Ti ) ∩ReadSet(Tj ) is empty ANDWriteSet(Ti ) ∩WriteSet(Tj ) is empty.TiTjTiTiRVWRVWRVWTjRVWTjRVWRVWSituation

Overheads in Kung-Robinson CCMust record read/write activity in ReadSet and WriteSet per txn.Must create and destroy these sets as needed.Must check for conflicts during validation, and must make validated writes “global”.Critical section can reduce concurrency.Scheme for making writes global can reduce clustering of objects.Optimistic CC restarts transactions that fail validation.Work done so far is wasted; requires clean-up.

Timestamp Ordering CCMain idea:Put a timestamp on the last read and write action on every objectUse this timestamp to detect if a transaction attempts an illegal operationAbort the offending transaction if it doesAlgorithm: Give each object a read-timestamp (RTS) and a write-timestamp (WTS), Give each txn a timestamp (TS) when it beginsAction ai of txn Ti must occur before action aj of txn Tj ifIf action ai of txn Ti conflicts with action aj of txn Tj, and TS(Ti) < TS(Tj), then ai must occur before aj. Otherwise, restart the violating txn.

Rules for Timestamps-Based schedulingAlgorithm setupRT(X)The read time of X, the highest timestamp of transaction that has read X.WT(X)The write time of X, the highest timestamp of transaction that has write X.C(X)The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.Scheduler receives a request from T to operate on XThe request is realizable under some conditions and not under others

Physically UnrealizableRead too lateA transaction U that started after transaction T but wrote a value for X before T reads XIn other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.U writes XT reads XT startU start

Physically UnrealizableWrite too lateA transaction U that started after T, but read X before T got a chance to write X.In other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.U reads XT writes XT startU start

Dirty ReadAfter T reads the value of X written by U, U could abortIn other words, if TS(T) = RT(X) but TS(T) < WT(X), then the write is physically realizable, but there is already a later value in X. If C(X) is true, then the previous writer of X is committed, all is good.If C(X) is false, we must delay T.U writes XT reads XU startT startU aborts

Write after WriteT tries to write X after a later transaction (U) has written itOK to ignore the write by T because it will get overwritten anywayExcept if U aborts And the new value of T is lost foreverSolve this problem by introducing the concept of a “tentative write”U writes XT writes XU abortU startT startT commit

Rules for Timestamps-based SchedulingScheduler receives a request to commit T. It must find all the database elements X written by T and set C(X)=true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed.Scheduler receives a request to abort T or decides to rollback T, Any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.

Multiversion TimestampsMultiversion schemes keep old versions of data item to increase concurrency.Each successful write results in the creation of a new version of the data item written.Use timestamps to label versions.When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.

Timestamps vs LockingGenerally, timestamping performs better than locking in situations where:Most transactions are read-only.It is rare that concurrent transaction will try to read and write the same element.This is generally the case for Web ApplicationsIn high-conflict situation, locking performs better than timestamps

Practical Use2-Phase Locks (or variants)Used by most relational databasesMulti-level granularitySupport for table, page and tuple-level locksUsed by most relational databasesMulti-version concurrency controlOracle 8 forward: Divide transactions into read-only and read-writeRead-only transactions use multi-version concurrency and never waitRead-write transactions use 2PLPostgres, others as well, offer some level of MVCC

Distributed Commit MotivationFruitCo hasIts main Sales office in OregonFarms and Warehouse are in WashingtonFinance is in UtahAll three sites have local data centers with their own systemsWhen an order is placed, the Sales system must send the billing information to Utah and shipping information to Washington.When an order is placed, all three databases must be updated, or none should be.

Two Phase CommitThe Basic Idea

Two-Phase Commit (2PC)Phase 1 : The TM gets the RMs ready to write the results into the databasePhase 2 : Everybody writes the results into the databaseTM :The process at the site where the transaction originates and which controls the executionRM :The process at the other sites that participate in executing the transactionGlobal Commit Rule:The TM aborts a transaction if and only if at least one RM votes to abort it.The TM commits a transaction if and only if all of the RMs vote to commit it.

Centralized 2PCPPPPCCCPPPPready?yes/nocommit/abort?commited/abortedPhase 1Phase 2

State Transitions in 2PCINITIALINITIALREADY Prepare Commit commandVote-commitPrepare Prepare Vote-abortWAITGlobal-abortGlobal-commitVote-commit (all) Vote-abort AckAckGlobal-commitGlobal-abortABORTCOMMITCOMMITABORTTMRMs

When TM Fails…Timeout in INITIALWho caresTimeout in WAITCannot unilaterally commitCan unilaterally abortTimeout in ABORT or COMMITStay blocked and wait for the acksTMINITIALCommit commandPrepareWAIT Vote-abort Vote-commit Global-commitGlobal-abortABORTCOMMIT

When an RM Fails…INITIALTimeout in INITIALTM must have failed in INITIAL stateUnilaterally abortTimeout in READYStay blockedRMs Prepare Vote-commit Prepare Vote-abortREADYGlobal-abortGlobal-commitAckAckABORTCOMMIT

When TM Recovers…Failure in INITIALStart the commit process upon recoveryFailure in WAITRestart the commit process upon recoveryFailure in ABORT or COMMITNothing special if all the acks have been receivedOtherwise the termination protocol is involvedINITIALTMCommit commandPrepareWAIT Vote-commit Vote-abort Global-commitGlobal-abortABORTCOMMIT

When an RM Recovers…Failure in INITIALUnilaterally abort upon recoveryFailure in READYThe TM has been informed about the local decisionTreat as timeout in READY state and invoke the termination protocolFailure in ABORT or COMMITNothing special needs to be doneINITIALRMs Prepare Vote-commit Prepare Vote-abortREADYGlobal-abortGlobal-commitAckAckCOMMITABORT

2PC Protocol ActionsRM TM INITIALINITIALPREPAREwritebegin_commitin logwrite abortin logNoReady toCommit?VOTE-ABORTYesVOTE-COMMITwrite readyin logWAITYesGLOBAL-ABORTwrite abortin logREADYAny No?NoVOTE-COMMITwrite commitin logAbortType ofmsgACKCommitwrite abortin logABORTCOMMITACKwrite commitin logwriteend_of_transactionin logABORTCOMMIT

Two-phase commit commentaryTwo-phase commit protocol limitation: it is a blocking protocol. The failure of the TM can cause the protocol to block until the TM is repaired. If the TM fails right after every RM has sent a Prepared message, then the other RMs have no way of knowing whether the TM committed or aborted.RMs will block resource processes while waiting for a message from the TM. A TM will also block resources while waiting for replies from RMs. A TM can also block indefinitely if no acknowledgement is received from the RM. “Federated” two-phase commit protocols, aka three-phase protocols, have been proposed but are still unproven.Paxos Consensus Algorithm. Consensus on Transaction Commit, Jim Gray and Leslie Lamport, Microsoft Research, 2005, MSR-TR-2003-96

Fault-Tolerant Two Phase CommitPreparedclientTMRMRequestCommitPreparePreparedPrepareTMRMRequestCommitPreparePreparedIf the 2PC Transaction Manager (TM) Fails, transaction blocks.Solution: Add a “spare” transaction manager (non blocking commit, 3 phase commit)

Fault-Tolerant Two Phase CommitclientTMRMabortPreparedPreparecommitcommitTMRMTMPreparedcommitPrepareRequestCommitPreparePreparedInconsistent! Now What?PreparePreparedcommitcommitabortIf the 2PC Transaction Manager (TM) Fails, transaction blocks.Solution: Add a “spare” transaction manager (non blocking commit, 3 phase commit)The complexity is a mess.But… What if….?

Fault Tolerant 2PC Several workarounds proposed in database community:Often called "3-phase" or "non-blocking" commit.None with complete algorithm and correctness proof.

Propose XconsensusboxclientW ChosenPropose WclientW ChosenclientW ChosenConsensuscollects proposed valuesPicks one proposed valueremembers it forever

Consensus for Commit – The Obvious ApproachconsensusboxRMclientTMPropose PreparedPrepared ChosenRequest CommitPreparedPrepareCommitCommitPrepareCommitTMRMPrepared ChosenPreparedRequestCommitPreparePreparedPropose PreparedPrepared ChosenCommitCommitGet consensus on TM’s decision.TM just learns consensus value.TM is “stateless”

Consensus for Commit – The Paxos Commit ApproachRMclientTMRequest CommitconsensusboxPropose RM1 PreparedPrepareRM1 Prepared ChosenCommitCommitPrepareconsensusboxCommitRMTMPropose RM2 PreparedRM2 Prepared ChosenRequestCommitPreparePropose RM1 PreparedPropose RM2 PreparedRM1 Prepared ChosenRM2 Prepared ChosenCommitCommitGet consensus on each RM’s choice.TM just combines consensus values.TM is “stateless”

The Obvious ApproachPaxos CommitOne fewer message delayPreparePreparePreparedPropose RM1 PreparedPropose RM2 PreparedPropose PreparedRM1 Prepared ChosenPrepared ChosenRM2 Prepared ChosenCommitCommit

RMConsensus boxPropose RM PreparedacceptorTMacceptorTMacceptorConsensus in ActionPropose RM PreparedVote RM PreparedPropose RM PreparedRM PreparedChosenVote RM PreparedVote RM PreparedThe normal (failure-free) caseTwo message delaysCan optimize

RMConsensus boxacceptorTMacceptorTMTMacceptorConsensus in ActionTM can always learn what was chosen,or get Aborted chosen if nothing chosen yet;if majority of acceptors working .

The Complete AlgorithmSubtle.More weird cases than most people imagine.Proved correct.

PaxosCommit in a NutshellAcceptors0…2FClient TMRM1…Nrequestcommitpreparepreparedall preparedcommitN RMs2F+1 acceptors (~2F+1 TMs)If F+1 acceptors see all RMs prepared, then transaction committed.2F(N+1) + 3N + 1 messages5 message delays 2 stable write delays.

Paxos Commit EvaluationTwo-Phase Commit3N+1 messagesN+1 stable writes4 message delays2 stable-write delaysAvailability is compromisedPaxos Commit3N+ 2F(N+1) +1 messagesN+2F+1 stable writes5 message delays2 stable-write delaysTolerates F FaultsPaxos≡ 2PC for F = 0Paxos Algorithm is the basis of Google’s Global Distributed Lock ManagerChubby has F=2 (5 Acceptors)

OLTP Through the Looking Glass (p1)WorkloadTPC-C BenchmarkQuote:Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. … Substantial time is spent in logging, latching, locking, Btree, and buffer management.OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008Took out components of a DBMS and measured its performance impact

OLTP Through the Looking Glass (p2)Concurrency ControlLook for applications where it can be turned offSome sort of optimistic concurrency controlMulti-core SupportLatching (inter-thread communication) remains a significant bottleneckCache-conscious B-TreesReplication ManagementLoss of transactional consistency if log shippingRecovery is not instantaneousMaintaining transactional consistencyWeak ConsistencyStarbucks doesn’t need two phase commitHow to achieve eventual consistency without transactional consistencyAreas for Research that may yield dividends

End of an Era?The Relational Model is not necessarily the answerIt was excellent for data processingNot a natural fit forData WarehousesWeb-oriented searchReal-time analytics, andSemi-structured datai.e., Semantic WebSQL is not the answerCoupling between modern programming languages and SQL are “ugly beyond belief”Programming languages have evolved while SQL has remained staticPascalC/C++JavaThe little languages: Python, Perl, PHP, RubyThe end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007A critique of the “one size fits all” assumption in DBMS

CS 542 -- Concurrency Control, Distributed Commit

More Related Content

What's hot

Viewers also liked

Similar to CS 542 -- Concurrency Control, Distributed Commit

More from J Singh

Recently uploaded

CS 542 -- Concurrency Control, Distributed Commit