vsr: clients hedge request, backups send replies #3206
Conversation
I've implemented #3206 (comment) as part of the latest commit; the primary now also hedges the reply. With that, I believe this PR is ready for review. Do let me know what your thoughts are, @matklad!
Implemented the relaxed peer identification logic in 71fb24d. Will think about ways to test this out tomorrow; perhaps the way to go here is Vortex!
In local testing we found an interesting scenario that was leading to lots of spurious connection terminations and reconnections.

To tackle this, 199ee7e adds a `Peer` union:

```zig
pub const Peer = union(enum) {
    unknown,
    replica: u8,
    client: u128,
    client_likely: u128,

    pub fn transition(old: Peer, new: Peer) enum { retain, update, reject } {
        // ...
    }
};
```

This is used as follows when a header arrives on a connection:

```zig
switch (vsr.Peer.transition(buffer.peer, header.peer_type())) {
    .reject => buffer.invalidate(.misdirected),
    .update => buffer.peer = header.peer_type(),
    .retain => {},
}
```
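For illustration, here is a minimal sketch of what such a transition matrix could look like. This is an assumed reconstruction (a `client_likely` peer can be confirmed as a `client` with the same id, but no connection may ever change identity), not the code from 199ee7e; the actual rules may differ:

```zig
// Sketch only: assumed transition semantics, not the real implementation.
const Peer = union(enum) {
    unknown,
    replica: u8,
    client: u128,
    client_likely: u128,

    pub fn transition(old: Peer, new: Peer) enum { retain, update, reject } {
        return switch (old) {
            // First message on a connection: adopt whatever identity it claims.
            .unknown => .update,
            // A replica connection must keep the same replica index.
            .replica => |index| switch (new) {
                .replica => |index_new| if (index == index_new) .retain else .reject,
                else => .reject,
            },
            // A confirmed client must keep the same client id.
            .client => |id| switch (new) {
                .client, .client_likely => |id_new| if (id == id_new) .retain else .reject,
                else => .reject,
            },
            // A likely client may be confirmed with the same id, but never changed.
            .client_likely => |id| switch (new) {
                .client => |id_new| if (id == id_new) .update else .reject,
                .client_likely => |id_new| if (id == id_new) .retain else .reject,
                else => .reject,
            },
        };
    }
};
```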
b8b2fd6 tightens the state transitions as suggested in the comments!
Finally fixed the CI issue with 0422fa7. The test was getting stuck during the phase when we send a …; the issue was that this PR added the check for dropping …
Problem
Before #2821, clients would contact backups upon request timeout (currently a static 600ms) if their attempt to reach the primary failed. Backups would forward requests to the primary, but only if the request's view was smaller than the backup's view. Consequently, a single message drop on the client → primary link would cause the client to round-robin the request through all backups before retrying the primary (e.g., with a six-replica cluster, up to 5 × 600ms = 3s of backup retries), leading to multi-second latency before the cluster processes the request.
To resolve this, and to remove the bimodality wherein forwarding happens only in the failure path, #2821 made clients send requests only to the primary. While this solves both the brittleness of the client → cluster protocol in the face of message drops and the bimodality, it compromises the cluster's logical availability. Specifically, if the client → primary link is down, the client simply keeps retrying its request against the primary, even if it can reach a backup which can in turn reach the primary!
Solution
This PR improves the cluster's logical availability by making it so that a client partitioned from the primary can still submit a request to the cluster and receive a reply from it. It also resolves #444!
Now, the client always sends each request to two replicas. This hedging avoids the bimodality wherein we only forward after the request timeout fires.
Sending requests to two replicas increases egress network bandwidth utilization on the client by 2x, but we decided that it's a worthwhile price to pay for predictable performance, even in the face of client → primary network partitions!
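To make the client side concrete, here is a minimal sketch of the hedged send. The types and names (`Client`, `Message`, `send_message_to_replica`, and the way the second replica is chosen) are stand-ins assumed for illustration, not the actual client implementation; the only property that matters is that every request reaches the presumed primary plus one other replica, so a single client → primary drop cannot stall the request.

```zig
const std = @import("std");

// Stand-in types for illustration only; TigerBeetle's real client differs.
const Message = struct { request: u32 };

const Client = struct {
    view: u32,
    replica_count: u8,

    // The primary for a view, as in VSR: round-robin over the replicas.
    fn primary_index(client: *const Client, view: u32) u8 {
        return @intCast(view % client.replica_count);
    }

    fn send_message_to_replica(client: *const Client, replica: u8, message: *const Message) void {
        _ = client;
        std.debug.print("request {} -> replica {}\n", .{ message.request, replica });
    }

    /// Hedge: send each request to the presumed primary plus one other replica.
    fn send_request_hedged(client: *const Client, message: *const Message) void {
        const primary = client.primary_index(client.view);
        client.send_message_to_replica(primary, message);

        // One way to pick the second target: derive it from the request number,
        // skipping the primary, so retries of the same request hedge consistently.
        const offset: u8 = @intCast(message.request % (client.replica_count - 1));
        const backup = (primary + 1 + offset) % client.replica_count;
        client.send_message_to_replica(backup, message);
    }
};

pub fn main() void {
    const client = Client{ .view = 7, .replica_count = 6 };
    const request = Message{ .request = 42 };
    client.send_request_hedged(&request);
}
```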
Additionally, backups can now send replies directly to clients if they receive a duplicate request for which they already have a reply in their client table. Moreover, for each op, the primary and a randomly selected backup both send a reply to the client, avoiding the bimodality wherein clients would otherwise wait for the request timeout to fire before resending a duplicate request to a backup.
Sending replies from both the primary and a random backup increases ingress network bandwidth utilization on the client by 2x, but again we decided that it's a worthwhile price to pay for predictable performance, even in the face of client → primary network partitions!
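A corresponding sketch of the reply side, again with stand-in names rather than the actual replica code. The assumption here is that the hedging backup is derived deterministically from the op number, so that all replicas agree on who hedges without coordination; the client still hears back even when the primary → client link is down.

```zig
const std = @import("std");

// Stand-in types for illustration only; TigerBeetle's real replica differs.
const Reply = struct { client: u128, op: u64 };

const Replica = struct {
    replica: u8, // this replica's index
    replica_count: u8,
    view: u32,

    fn primary_index(self: *const Replica, view: u32) u8 {
        return @intCast(view % self.replica_count);
    }

    fn send_reply_to_client(self: *const Replica, reply: *const Reply) void {
        std.debug.print("replica {} replies to client {} for op {}\n", .{ self.replica, reply.client, reply.op });
    }

    /// Hedge replies: the primary always replies; exactly one backup,
    /// derived from the op number, replies as well.
    fn on_commit_reply(self: *const Replica, reply: *const Reply) void {
        const primary = self.primary_index(self.view);
        if (self.replica == primary) {
            self.send_reply_to_client(reply);
            return;
        }
        // Backups: map the op onto one of the (replica_count - 1) backups.
        const offset: u8 = @intCast(reply.op % (self.replica_count - 1));
        const hedging_backup = (primary + 1 + offset) % self.replica_count;
        if (self.replica == hedging_backup) self.send_reply_to_client(reply);
    }
};

pub fn main() void {
    // With 6 replicas and view 7, replica 1 is the primary; op 3 selects one backup.
    const reply = Reply{ .client = 1001, .op = 3 };
    var replica_index: u8 = 0;
    while (replica_index < 6) : (replica_index += 1) {
        const replica = Replica{ .replica = replica_index, .replica_count = 6, .view = 7 };
        replica.on_commit_reply(&reply);
    }
}
```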