Add store support for ProxySQL by timvaillancourt · Pull Request #129 · github/freno · GitHub

Conversation

@timvaillancourt
Contributor

@timvaillancourt timvaillancourt commented Aug 7, 2020

This PR adds support for ProxySQL as a Freno "store" for retrieving MySQL inventory.

Freno connects to ProxySQL's admin port (MySQL protocol) and queries the stats.stats_mysql_connection_pool table to gather MySQL inventory. This table can be queried using the read-only ProxySQL stats user, limiting any security exposure on the ProxySQL admin port.
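
For illustration, here is a minimal Go sketch of that kind of query (not the actual store code in this PR; the user, host, port and hostgroup id are taken from the testbed log output further down, the password is a placeholder, and error handling is simplified):

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL protocol driver, also used to speak to the ProxySQL admin port
)

func main() {
	// Connect to the ProxySQL admin/stats interface (placeholder password)
	db, err := sql.Open("mysql", "freno:password@tcp(proxysql:6032)/stats")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// List the backends of one hostgroup along with the health status ProxySQL reports for them
	rows, err := db.Query("SELECT srv_host, srv_port, status FROM stats.stats_mysql_connection_pool WHERE hostgroup = 20")
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var host, status string
		var port int
		if err := rows.Scan(&host, &port, &status); err != nil {
			panic(err)
		}
		fmt.Printf("%s:%d status=%s\n", host, port, status)
	}
}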

ProxySQL backends with a status of ONLINE or SHUNNED_REPLICATION_LAG are considered for Freno probes; all other nodes are considered unhealthy and are skipped.

A gotcha worth noting: stats.stats_mysql_connection_pool can return a false-positive ONLINE state for servers immediately after a LOAD MYSQL SERVERS TO RUNTIME, but importantly before the first healthcheck of the server. This is because servers in main.mysql_servers are assumed to be healthy after being loaded to runtime.

To mitigate this gotcha, nodes that are considered unhealthy are cached briefly in the ignoreServerCache and excluded from polling, so that we don't mistakenly consider a node healthy before ProxySQL has actually health-checked it (and potentially marked it unhealthy).
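
As a rough sketch of that mitigation, assuming Go and hypothetical names and TTL handling (the actual ignoreServerCache in this PR may well be implemented differently):

package proxysqlsketch

import (
	"sync"
	"time"
)

// ignoreServerCache is a hypothetical sketch: unhealthy backends are remembered
// for a short TTL so they aren't probed again before ProxySQL has health-checked them.
type ignoreServerCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]time.Time // "host:port" -> time the node was marked unhealthy
}

func newIgnoreServerCache(ttl time.Duration) *ignoreServerCache {
	return &ignoreServerCache{ttl: ttl, entries: make(map[string]time.Time)}
}

// MarkUnhealthy records a backend that failed (or has not yet passed) ProxySQL healthchecks.
func (c *ignoreServerCache) MarkUnhealthy(hostPort string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[hostPort] = time.Now()
}

// ShouldIgnore reports whether a backend was recently marked unhealthy and should be skipped.
func (c *ignoreServerCache) ShouldIgnore(hostPort string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	markedAt, found := c.entries[hostPort]
	if !found {
		return false
	}
	if time.Since(markedAt) > c.ttl {
		delete(c.entries, hostPort) // entry expired; the node may be probed again
		return false
	}
	return true
}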

cc @shlomi-noach here (because it won't let me tag you as an official reviewer 😢)

@timvaillancourt
Contributor Author

timvaillancourt commented Aug 7, 2020

This repo provides a Docker-based playground for MySQL+ProxySQL+Freno, using the branch from this PR to build Freno: https://github.com/timvaillancourt/proxysql-freno

To test it out:

$ git clone https://github.com/timvaillancourt/proxysql-freno
$ cd proxysql-freno
$ make
$ make logs

You should get the following output once the setup completes:

freno_1         | 2020-08-07 21:36:07 DEBUG getting ProxySQL data from mysql://freno:*****@proxysql:6032/stats, hostgroup id: 20 (testbed)
freno_1         | 2020-08-07 21:36:07 DEBUG Read 3 hosts from ProxySQL mysql://freno:*****@proxysql:6032/stats, hostgroup id: 20 (testbed)
freno_1         | 2020-08-07 21:36:07 DEBUG read instance key: replica1:3306
freno_1         | 2020-08-07 21:36:07 DEBUG read instance key: replica2:3306
freno_1         | 2020-08-07 21:36:07 DEBUG read instance key: replica3:3306
freno_1         | 2020-08-07 21:36:07 DEBUG updating MySQLClusterProbes: testbed
...

@shlomi-noach

ProxySQL backends with the status of ONLINE will be considered for Freno probes, all other nodes are considered unhealthy and are skipped

Since I'm not very familiar with ProxySQL: what does ONLINE stand for? It's critical that it means "the server should be up and running". Looking at the docs, I'm not entirely sure.

Assuming ONLINE actually means "healthy": (skip next section if that's not the case)

Watch out for this trap. What you want to collect are servers that should be healthy, not servers that the proxy reports as healthy.

To explain, let's look at HAProxy: suppose you have 10 replicas, all listed in HAProxy and all healthy. Great! Now one of them starts lagging. Within a few seconds, HAProxy will notice it's lagging and report it as DOWN. freno must keep collecting this server. Let's see what happens if not. Think of the following scenario: a high-load transition pushes replication lag to the limits. The first replica begins lagging. HAProxy marks it as DOWN. freno now ignores that replica, and only checks the other 9 replicas -- which are healthy. So freno reports OK to the transition process. The transition doubles down. Now 9 replicas serve the traffic 10 replicas used to. As a result, another replica starts lagging. Within a few seconds HAProxy marks it as DOWN. So freno ignores it. So freno now only looks at the remaining 8 replicas, and says OK to the transition, which doubles down. Now only 8 replicas serve a 10-replica job. You can see where this is going: this logic will exhaust all replicas.

Instead, freno should include the DOWN replicas. It will double down to run its own xinetd check on those replicas to see if they are expected to be down (HTTP 404). And it will check for replication lag.

@timvaillancourt
Contributor Author

timvaillancourt commented Aug 9, 2020

To explain, let's look at HAProxy: suppose you have 10 replicas, all listed in HAProxy and all healthy. Great! Now one of them starts lagging. Within a few seconds, HAProxy will notice it's lagging and report it as DOWN. freno must keep collecting this server. Let's see what happens if not. Think of the following scenario: a high-load transition pushes replication lag to the limits. The first replica begins lagging. HAProxy marks it as DOWN. freno now ignores that replica, and only checks the other 9 replicas -- which are healthy. So freno reports OK to the transition process. The transition doubles down. Now 9 replicas serve the traffic 10 replicas used to. As a result, another replica starts lagging. Within a few seconds HAProxy marks it as DOWN. So freno ignores it. So freno now only looks at the remaining 8 replicas, and says OK to the transition, which doubles down. Now only 8 replicas serve a 10-replica job. You can see where this is going: this logic will exhaust all replicas.

@shlomi-noach hello 👋, I hope you are well! This cascading lag scenario is one I didn't think of, thanks for commenting! 👍

I built this PR by copying the Vitess store in pkg/vitess, so I think that store has the same risk - something I'll follow up on separately in an upcoming PR for tablet health support. It would be great to bounce ideas around with you/PlanetScale about this offline.

Your question has made me realize I need to add requirements for this ProxySQL support in a pkg/proxysql/README.md - I've developed this assuming the ProxySQL monitor module is enabled and max_replication_lag is defined for backends. The monitor module checks the health of the backends and dynamically updates the status column in stats.stats_mysql_connection_pool based on the check results.

Currently this PR considers ONLINE nodes only, meaning nodes that pass all monitor-module healthchecks (e.g. connect + ping + replication lag). As you pointed out, this has the risk of allowing cascading replication lag cluster-wide, as nodes that lag will disappear from the list of servers and Freno will report things are ✅.

Thankfully, the stats.stats_mysql_connection_pool table has a state called SHUNNED_REPLICATION_LAG, meaning a node that is reachable but has replication lag above the max_replication_lag field set in mysql_servers. ProxySQL won't send reads to nodes in this state, but they are still pingable/reachable, similar to a lagging DOWN node under HAProxy:

max_replication_lag - if greater than 0, ProxySQL will regularly monitor replication lag and, if it goes beyond the threshold, it will temporarily shun the host until replication catches up

I believe that changing the logic to probe servers with either ONLINE or SHUNNED_REPLICATION_LAG status would address the scenario you're describing. We would continue to ignore OFFLINE_SOFT (a node that is draining traffic and won't accept new connections - set manually) and SHUNNED (a node ProxySQL cannot connect to or ping).
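
A minimal sketch of that filter (a hypothetical helper, not code from this PR, assuming the status strings reported by stats_mysql_connection_pool):

// isProbeable is a hypothetical helper: only backends that ProxySQL reports as
// serving (ONLINE) or shunned purely for replication lag are probed by Freno.
func isProbeable(status string) bool {
	switch status {
	case "ONLINE", "SHUNNED_REPLICATION_LAG":
		return true // reachable; Freno should keep measuring replication lag here
	case "OFFLINE_SOFT", "OFFLINE_HARD", "SHUNNED":
		return false // draining, removed, or unreachable: skip probing
	default:
		return false
	}
}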

This solution avoids deploying HTTP-based checks, something that isn't common in ProxySQL-based deployments.

Does this solution check out @shlomi-noach?

@shlomi-noach

I believe that changing the logic to probe servers with either ONLINE or SHUNNED_REPLICATION_LAG status would address the scenario you're describing. We would continue to ignore OFFLINE_SOFT (a node that is draining traffic and won't accept new connections - set manually) and SHUNNED (a node ProxySQL cannot connect to or ping).

I'm again not very familiar with ProxySQL probe statuses, but this seems to make sense. If a server that is supposed to be serving can be in either ONLINE or SHUNNED_REPLICATION_LAG then yeah.

This solution avoids deploying HTTP-based checks, something that isn't common in ProxySQL-based deployments.

I'm merely guessing freno will continue to run HTTP checks on the backends' xinetd, regardless of ProxySQL, right? I think that's expected by some infra scripts.

@timvaillancourt
Contributor Author

timvaillancourt commented Aug 12, 2020

I'm merely guessing freno will continue to run HTTP checks on the backends' xinetd, regardless of ProxySQL, right? I think that's expected by some infra scripts.

@shlomi-noach thanks for all the reviews!

In my testbed environment I am not deploying any HTTP healthcheck server on MySQL replicas and it all just works ™️. I'll follow up on whether or not Freno is sending HTTP checks that are potentially failing

My understanding was that HTTP checks were opt-in, or used by the HAProxy support only. Let's see! In ProxySQL/Vitess use cases the HTTP check would probably be unnecessary.

@timvaillancourt
Contributor Author

timvaillancourt commented Aug 15, 2020

My understanding was that HTTP checks were opt-in, or used by the HAProxy support only. Let's see!

@shlomi-noach it turns out that if the HttpCheckPort is <= 0, the HTTP checks are "skipped" 🎉, see: https://github.com/github/freno/blob/master/pkg/mysql/mysql_http_check.go#L44-L47

Example:

tim@Tims-MacBook-Pro proxysql-freno % curl -s http://localhost:8111/debug/vars | grep httpcheck
"httpcheck.skip": 90,

This explains why everything just works on my ProxySQL/Freno testbed, where there is no HttpCheckPort defined. This is great because it lets the user decide whether or not they want HTTP checks on top of the ProxySQL-level checks.
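
For reference, the behavior described is roughly the following (a simplified sketch, not the actual freno code at the linked lines; function and parameter names are hypothetical):

package sketch

import (
	"fmt"
	"net/http"
)

// checkHTTP sketches the behavior described above: when no HttpCheckPort is
// configured (<= 0), the HTTP check is skipped entirely, which is what the
// "httpcheck.skip" counter above reflects.
func checkHTTP(host string, httpCheckPort int, httpCheckPath string) string {
	if httpCheckPort <= 0 {
		return "skip"
	}
	url := fmt.Sprintf("http://%s:%d/%s", host, httpCheckPort, httpCheckPath)
	resp, err := http.Get(url)
	if err != nil {
		return "error"
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return "ok"
	}
	return "fail"
}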


@shlomi-noach shlomi-noach left a comment


LGTM

@timvaillancourt timvaillancourt merged commit de4f804 into master Aug 20, 2020
@timvaillancourt timvaillancourt deleted the proxysql_v0 branch August 20, 2020 12:22