vsr: clients hedge request, backups send replies · Pull Request #3206 · tigerbeetle/tigerbeetle

Conversation

@chaitanyabhandari (Contributor) commented Aug 26, 2025

Problem

Before #2821, clients would contact backups upon request timeout (currently a static 600ms) if their attempt at reaching the primary failed. Backups would forward requests to the primary, but only if the request's view was smaller than the backup's view. Consequently, a single message drop on the client → primary link would cause the client to round-robin the request through all backups before trying the primary again; at 600ms per attempt, that adds up to a multi-second latency before the cluster processes the request.

To resolve this, and to remove the bimodality wherein forwarding happens only in the failure path, #2821 made clients send requests only to the primary. While this solves the brittleness of the client → cluster protocol in the face of message drops (and the bimodality), it compromises the logical availability of the cluster. Specifically, if the client → primary link is down, the client simply keeps retrying its request to the primary, even if it can connect to a backup which can in turn reach the primary!

Solution

This PR improves the cluster's logical availability by making it so that a client partitioned from the primary can still submit a request to the cluster and receive a reply from it. It also resolves #444!

Now, the client always sends each request to two replicas to hedge it, avoiding the bimodality wherein forwarding happens only after the request timeout fires (a sketch follows the list below):

  1. Primary (based on the view that it knows about): This assumes the client → primary path provides the smallest request-response latency. We could track request-response latencies precisely and select a replica based on those measurements, but we settled on the 90% solution of always forwarding to the primary.
  2. Randomly selected backup: It is crucial that this replica is randomly selected, as opposed to selecting the replica with the second-best request-response latency. The latter would make us more prone to correlated faults, wherein both requests are forwarded to replicas within an AZ to which the client has no connectivity.
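
To make the hedging concrete, here is a minimal sketch in Zig (hypothetical helper names, not the actual client code; it assumes a recent Zig std where std.Random is the RNG interface, and the usual VSR mapping of primary = view % replica_count):

const std = @import("std");

/// Sketch only: choose the two replicas that a request is hedged to.
fn request_targets(view: u32, replica_count: u8, random: std.Random) [2]u8 {
    std.debug.assert(replica_count >= 2);
    // 1. The presumed primary for the view the client currently knows about.
    const primary: u8 = @intCast(view % replica_count);
    // 2. A uniformly random backup. Randomness (rather than "second-fastest")
    //    avoids correlated faults, e.g. both targets landing in one unreachable AZ.
    var backup = random.uintLessThan(u8, replica_count - 1);
    if (backup >= primary) backup += 1;
    return .{ primary, backup };
}

test "request_targets: primary and backup are distinct" {
    var prng = std.Random.DefaultPrng.init(42);
    const targets = request_targets(7, 6, prng.random());
    try std.testing.expect(targets[0] != targets[1]);
    try std.testing.expectEqual(@as(u8, 1), targets[0]); // 7 % 6 == 1
}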

Sending requests to two replicas increases egress network bandwidth utilization on the client by 2x, but we decided that it's a worthwhile price to pay for predictable performance, even in the face of client → primary network partitions!

Additionally, backups can now send replies directly to clients: if a backup receives a duplicate request from a client for which it already has a reply in its client table, it answers from that table. Furthermore, for each op, the primary and a randomly selected backup both send a reply to the client, to avoid the bimodality wherein a client must wait for the request timeout to fire before resending a duplicate request to a backup.
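
As a toy illustration of the backup-side check (hypothetical names, not TigerBeetle's actual types), a backup only answers directly when its client table already holds the reply for exactly that request number:

const std = @import("std");

const CachedReply = struct { request: u32, body: []const u8 };

const Backup = struct {
    client_table: std.AutoHashMap(u128, CachedReply),

    // Returns the cached reply if this exact request was already executed;
    // otherwise null, and the request is handled as before.
    fn reply_for_duplicate(backup: *const Backup, client: u128, request: u32) ?CachedReply {
        const entry = backup.client_table.get(client) orelse return null;
        return if (entry.request == request) entry else null;
    }
};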

Sending replies via both the primary and a random backup increases ingress network bandwidth utilization on the client by 2x, but we decided that it's a worthwhile price to pay for predictable performance, even in the face of client → primary network partitions!

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 2 times, most recently from a89187c to 478e50d on August 26, 2025 02:57
@chaitanyabhandari changed the title from "vsr: clients hedge requests on retry, backups send replies" to "vsr: clients hedge request, backups send replies" on Aug 26, 2025
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from 478e50d to 0960944 on August 26, 2025 15:49
@chaitanyabhandari (Contributor, Author) commented:

I've implemented #3206 (comment) as part of the latest commit: the primary now also hedges the Reply, sending it to a random backup as well to increase availability. This does increase ingress network bandwidth utilization on the client by ~2x, since in the normal case the client receives each reply twice. However, it also reduces the chance of bimodal latencies, since clients don't have to wait out a request timeout and resend a duplicate request to a backup in order to get a reply.

With that, I believe this PR is ready for review; do let me know your thoughts, @matklad!

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 3 times, most recently from d327dfd to ad97a06 on August 27, 2025 12:55
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 2 times, most recently from f71477b to 91f4ebd on August 27, 2025 18:59
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 4 times, most recently from 3d6a936 to 23eeea8 on August 28, 2025 02:36
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 3 times, most recently from 3e7725d to 82b0229 on August 28, 2025 21:02
@chaitanyabhandari (Contributor, Author) commented:

Implemented relaxed peer identification logic in 71fb24d. I'll think about ways to test this tomorrow; perhaps Vortex is the way to go here!

@chaitanyabhandari (Contributor, Author) commented Sep 1, 2025

In local testing we found an interesting scenario which was leading to lots of spurious terminations & reconnections:

  1. R0 connects to R1, peer type for R0 on R1 is unknown
  2. R0 sends ping to R1, peer type for R0 on R1 is replica
  3. R0 forwards a request to R1; R1 classifies this as a misdirected message due to the replica → client transition and terminates the connection!

To tackle this, 199ee7e adds a client_likely status to vsr.Peer, and explicitly encodes the transition between client_likely, client, and replica:

pub const Peer = union(enum) {
    unknown,
    replica: u8,
    client: u128,
    // Tentative client classification (e.g. from a forwarded request): as in the
    // scenario above, the connection itself may actually belong to a replica.
    client_likely: u128,

    pub fn transition(old: Peer, new: Peer) enum { retain, update, reject } {
        // (implementation elided in this comment)
    }
};

This is used in MessageBuffer to update the peer:

        switch (vsr.Peer.transition(buffer.peer, header.peer_type())) {
            .reject => buffer.invalidate(.misdirected),
            .update => buffer.peer = header.peer_type(),
            .retain => {},
        }

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from f8e50d4 to b8b2fd6 on September 2, 2025 15:24
@chaitanyabhandari (Contributor, Author) commented:

b8b2fd6 tightens the state transitions as suggested in the comments!

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from 801f943 to 768e9af on September 2, 2025 16:34
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from 3713b68 to 392764e on September 2, 2025 19:22
@chaitanyabhandari (Contributor, Author) commented Sep 3, 2025

Finally fixed the CI issue with 0422fa7.

The test was getting stuck during the phase when we send a ping_client with a low client release, and expect an eviction message to be returned by the cluster.

2025-09-02 21:56:18.666Z warning(replica): 0N: on_ping_client: ignoring unsupported client version; too low (client=1 version=0.0.1<0.16.4)
2025-09-02 21:56:18.666Z warning(replica): 0N: sending eviction message to client=1 reason=client_release_too_low
2025-09-02 21:56:18.666Z debug(replica): 0N: send_message_to_client_base: dropped eviction (log_view_durable=0 log_view=1)

The issue was that this PR added a check that drops eviction messages while the log_view isn't durable... which is overly strict! We now only do this for pong_client and reply, which are the messages that actually externalize the view.
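
A sketch of the narrowed rule (toy enum and function name, not the real vsr.Command or replica code): only the messages that externalize the view are withheld while the log_view is not yet durable, and eviction goes out regardless:

const Command = enum { ping_client, pong_client, reply, eviction };

fn withhold_until_log_view_durable(command: Command) bool {
    return switch (command) {
        .pong_client, .reply => true,
        .ping_client, .eviction => false,
    };
}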

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from c92b921 to 0422fa7 on September 3, 2025 01:58
@chaitanyabhandari added this pull request to the merge queue on Sep 3, 2025
Merged via the queue into main with commit d5a22b7 on Sep 3, 2025
38 checks passed
@chaitanyabhandari deleted the cbb/cluster_logical_availablity branch on September 3, 2025 13:47

Successfully merging this pull request may close these issues: VSR: Allow backup to reply to partitioned client
