vsr: clients hedge request, backups send replies · Pull Request #3206 · tigerbeetle/tigerbeetle

Conversation

@chaitanyabhandari (Contributor) commented Aug 26, 2025

Problem

Before #2821, clients would contact backups upon request timeout (currently a static 600ms) if their attempt at reaching the primary failed. Backups would forward requests to the primary, but only if the request's view was smaller than the backup's view. Consequently, a single message drop on the client → primary link would cause the client to round-robin the request through all backups before trying the primary again; at 600ms per attempt, that adds up to a multi-second latency before the cluster processes the request.

To resolve this, and to remove the bimodality wherein forwarding happens only in the failure path, #2821 made clients send requests only to the primary. While this solves the brittleness of the client → cluster protocol in the face of message drops (and the bimodality), it compromises the logical availability of the cluster. Specifically, if the client → primary link is down, the client simply keeps retrying its request to the primary, even if it can connect to a backup which can in turn reach the primary!

Solution

This PR improves the cluster's logical availability by making it so that a client partitioned from the primary can still submit a request to the cluster and receive a reply from it. It also resolves #444!

Now, the client always sends each request to two replicas to hedge it, avoiding the bimodality wherein forwarding happens only after the request timeout fires (a sketch follows the list below):

  1. Primary (based on the view that it knows about): This assumes the client → primary path provides the smallest request-response latency. We could track request-response latencies precisely and select a replica based on those measurements, but we settled on the 90% solution of always forwarding to the primary.
  2. Randomly selected backup: It is crucial that this replica is randomly selected, as opposed to selecting the replica with the second-best request-response latency. The latter would make us more prone to correlated faults, wherein both requests are forwarded to replicas within an AZ to which the client has no connectivity.
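
To make the hedging concrete, here is a minimal sketch in Zig (hypothetical helper names, not the actual client code; it assumes a recent Zig std where std.Random is the RNG interface, and the usual VSR mapping of primary = view % replica_count):

const std = @import("std");

/// Sketch only: choose the two replicas that a request is hedged to.
fn request_targets(view: u32, replica_count: u8, random: std.Random) [2]u8 {
    std.debug.assert(replica_count >= 2);
    // 1. The presumed primary for the view the client currently knows about.
    const primary: u8 = @intCast(view % replica_count);
    // 2. A uniformly random backup. Randomness (rather than "second-fastest")
    //    avoids correlated faults, e.g. both targets landing in one unreachable AZ.
    var backup = random.uintLessThan(u8, replica_count - 1);
    if (backup >= primary) backup += 1;
    return .{ primary, backup };
}

test "request_targets: primary and backup are distinct" {
    var prng = std.Random.DefaultPrng.init(42);
    const targets = request_targets(7, 6, prng.random());
    try std.testing.expect(targets[0] != targets[1]);
    try std.testing.expectEqual(@as(u8, 1), targets[0]); // 7 % 6 == 1
}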

Sending requests to two replicas increases egress network bandwidth utilization on the client by 2x, but we decided that it's a worthwhile price to pay for predictable performance, even in the face of client → primary network partitions!

Additionally, backups can now send replies directly to clients: if a backup receives a duplicate request from a client for which it already has a reply in its client table, it answers from that table. Furthermore, for each op, the primary and a randomly selected backup both send a reply to the client, to avoid the bimodality wherein a client must wait for the request timeout to fire before resending a duplicate request to a backup.
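
As a toy illustration of the backup-side check (hypothetical names, not TigerBeetle's actual types), a backup only answers directly when its client table already holds the reply for exactly that request number:

const std = @import("std");

const CachedReply = struct { request: u32, body: []const u8 };

const Backup = struct {
    client_table: std.AutoHashMap(u128, CachedReply),

    // Returns the cached reply if this exact request was already executed;
    // otherwise null, and the request is handled as before.
    fn reply_for_duplicate(backup: *const Backup, client: u128, request: u32) ?CachedReply {
        const entry = backup.client_table.get(client) orelse return null;
        return if (entry.request == request) entry else null;
    }
};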

Sending replies via both the primary and a random backup increases ingress network bandwidth utilization on the client by 2x, but we decided that it's a worthwhile price to pay for predictable performance, even in the face of client → primary network partitions!

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 2 times, most recently from a89187c to 478e50d on August 26, 2025 02:57
@chaitanyabhandari changed the title from "vsr: clients hedge requests on retry, backups send replies" to "vsr: clients hedge request, backups send replies" on Aug 26, 2025
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from 478e50d to 0960944 on August 26, 2025 15:49
@chaitanyabhandari (Contributor, Author) commented:

I've implemented #3206 (comment) as part of the latest commit: the primary now also hedges the Reply, sending it to a random backup as well to increase availability. This does increase ingress network bandwidth utilization on the client by ~2x, since in the normal case the client receives each reply twice. However, it also reduces the chance of bimodal latencies, since clients don't have to wait out a request timeout and resend a duplicate request to a backup in order to get a reply.

With that, I believe this PR is ready for review; do let me know your thoughts, @matklad!

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 3 times, most recently from d327dfd to ad97a06 on August 27, 2025 12:55
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 2 times, most recently from f71477b to 91f4ebd on August 27, 2025 18:59
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 4 times, most recently from 3d6a936 to 23eeea8 on August 28, 2025 02:36
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch 3 times, most recently from 3e7725d to 82b0229 on August 28, 2025 21:02
@chaitanyabhandari (Contributor, Author) commented:

Implemented relaxed peer identification logic in 71fb24d. I'll think about ways to test this tomorrow; perhaps Vortex is the way to go here!

@chaitanyabhandari (Contributor, Author) commented Sep 1, 2025

In local testing we found an interesting scenario which was leading to lots of spurious terminations & reconnections:

  1. R0 connects to R1, peer type for R0 on R1 is unknown
  2. R0 sends ping to R1, peer type for R0 on R1 is replica
  3. R0 forwards a request to R1; R1 classifies this as a misdirected message due to the replica → client transition and terminates the connection!

To tackle this, 199ee7e adds a client_likely status to vsr.Peer, and explicitly encodes the transition between client_likely, client, and replica:

pub const Peer = union(enum) {
    unknown,
    replica: u8,
    client: u128,
    // Tentative client classification (e.g. from a forwarded request): as in the
    // scenario above, the connection itself may actually belong to a replica.
    client_likely: u128,

    pub fn transition(old: Peer, new: Peer) enum { retain, update, reject } {
        // (implementation elided in this comment)
    }
};

This is used in MessageBuffer to update the peer:

        switch (vsr.Peer.transition(buffer.peer, header.peer_type())) {
            .reject => buffer.invalidate(.misdirected),
            .update => buffer.peer = header.peer_type(),
            .retain => {},
        }

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from f8e50d4 to b8b2fd6 on September 2, 2025 15:24
@chaitanyabhandari (Contributor, Author) commented:

b8b2fd6 tightens the state transitions as suggested in the comments!

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from 801f943 to 768e9af on September 2, 2025 16:34
@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from 3713b68 to 392764e on September 2, 2025 19:22
@chaitanyabhandari (Contributor, Author) commented Sep 3, 2025

Finally fixed the CI issue with 0422fa7.

The test was getting stuck during the phase when we send a ping_client with a low client release, and expect an eviction message to be returned by the cluster.

2025-09-02 21:56:18.666Z warning(replica): 0N: on_ping_client: ignoring unsupported client version; too low (client=1 version=0.0.1<0.16.4)
2025-09-02 21:56:18.666Z warning(replica): 0N: sending eviction message to client=1 reason=client_release_too_low
2025-09-02 21:56:18.666Z debug(replica): 0N: send_message_to_client_base: dropped eviction (log_view_durable=0 log_view=1)

The issue was that this PR added a check that drops eviction messages while the log_view isn't durable... which is overly strict! We now only do this for pong_client and reply, which are the messages that actually externalize the view.
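
A sketch of the narrowed rule (toy enum and function name, not the real vsr.Command or replica code): only the messages that externalize the view are withheld while the log_view is not yet durable, and eviction goes out regardless:

const Command = enum { ping_client, pong_client, reply, eviction };

fn withhold_until_log_view_durable(command: Command) bool {
    return switch (command) {
        .pong_client, .reply => true,
        .ping_client, .eviction => false,
    };
}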

@chaitanyabhandari force-pushed the cbb/cluster_logical_availablity branch from c92b921 to 0422fa7 on September 3, 2025 01:58
@chaitanyabhandari added this pull request to the merge queue on Sep 3, 2025
Merged via the queue into main with commit d5a22b7 on Sep 3, 2025
38 checks passed
@chaitanyabhandari deleted the cbb/cluster_logical_availablity branch on September 3, 2025 13:47

Successfully merging this pull request may close these issues: VSR: Allow backup to reply to partitioned client
