2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ In the event the first proxy fails, the second proxy will still be able to route

Both proxies will route traffic to the lowest-indexed healthy galera node, according to the galera index (not the bosh index).

-Traffic to the MySQL cluster is routed through one or more proxy nodes. The current proxy implementation is [Switchboard](https://github.com/cloudfoundry-incubator/switchboard). This proxy acts as an intermediary between the client and the MySQL server, providing failover between MySQL nodes. The number of nodes is configured by the proxy job instance count in the deployment manifest.
+Traffic to the MySQL cluster is routed through one or more proxy nodes. The current proxy implementation is [Switchboard](https://github.com/cloudfoundry/pxc-release/tree/main/src/github.com/cloudfoundry-incubator/switchboard). This proxy acts as an intermediary between the client and the MySQL server, providing failover between MySQL nodes. The number of nodes is configured by the proxy job instance count in the deployment manifest.

**NOTE:** If the number of proxy nodes is set to zero, apps will be bound to the IP address of the first MySQL node in the cluster. If that IP address changes for any reason (e.g. loss of a VM), or if a proxy is subsequently added, one would need to re-bind all apps to the IP address of the new node.

112 changes: 112 additions & 0 deletions docs/cluster-behavior.md
@@ -0,0 +1,112 @@
# Cluster scaling, node failure, and quorum

Documented here are scenarios in which the size of a cluster may change, how the cluster behaves in each, and how to restore service function when it is impacted. [Percona XtraDB Cluster](https://www.percona.com/mysql/software/percona-xtradb-cluster) is packaged within pxc-release and is the MySQL distribution running on the "mysql nodes" discussed below.

### Healthy Cluster

Galera documentation refers to nodes in a healthy cluster as being part of a [primary component](http://galeracluster.com/documentation-webpages/glossary.html#term-primary-component). These nodes respond normally to all queries: reads, writes, and database modifications.

If an individual node is unable to connect to the rest of the cluster (e.g. due to a network partition), it becomes non-primary and stops accepting writes and database modifications. In this case, the rest of the cluster should continue to function normally. A non-primary node may eventually regain connectivity and rejoin the primary component.

If more than half of the nodes in a cluster are no longer able to connect to each other, all of the remaining nodes lose quorum and become non-primary. In this case, the cluster must be manually restarted, as documented in the [bootstrapping docs](https://github.com/cloudfoundry/cf-mysql-release/blob/release-candidate/docs/bootstrapping.md).
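
To check whether a given node is currently part of the primary component, you can query Galera's status variables (an illustrative check; `wsrep_cluster_status` is a standard Galera status variable, and credentials depend on your deployment):

```sh
# Run on a mysql node; a node in the primary component reports "Primary".
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_cluster_status'"
# Variable_name          Value
# wsrep_cluster_status   Primary
```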

### Graceful removal of a node
- Shutting down a node with monit (or decreasing the cluster size by one) will cause the node to gracefully leave the cluster.
- The cluster size is reduced by one and the cluster maintains a healthy state. The cluster will continue to operate, even with a single node, as long as the other nodes left gracefully. A sketch of this is shown below.
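
For example, gracefully stopping and later restarting a node might look like this (run as root on the node; `galera-init` is the bosh job that controls the mysqld process, as noted in the rejoin section below):

```sh
# Gracefully stop the mysql process; the node leaves the cluster cleanly.
monit stop galera-init

# Later, restart it; the node rejoins the primary component.
monit start galera-init
```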

### Adding new nodes

When new nodes are added to or removed from a MySQL service, a top-level property is updated with the new nodes' IP addresses. As BOSH deploys, it updates the configuration and restarts all of the mysql nodes **and** the proxy nodes (to inform them of the new IP addresses as well). Restarting a node drops all existing connections to that node while it restarts.

### Scaling the cluster

#### Scaling up from 1 to N nodes
When a new mysql node comes online, it replicates data from the existing node in the cluster. Once replication is complete, the node will join the cluster. The proxy will continue to route all incoming connections to the primary node while it remains healthy.

If the proxy detects that this node becomes [unhealthy](proxy.md#unhealthy), it will sever existing connections, and route all new connections to a different, healthy node. If there are no healthy mysql nodes, the proxy will reject all subsequent connections.

While transitioning from one node to a cluster, there will be a period of performance degradation, of undetermined length, while the new node syncs all data from the original node.
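
One way to observe the sync from the new node (an illustrative check; `wsrep_local_state_comment` is a standard Galera status variable):

```sh
# Reports "Joining"/"Joined" while state transfer is in progress, "Synced" once caught up.
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
```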

Note: If you are planning to scale up mysql nodes, it is recommended to spread them across Availability Zones to maximize cluster availability. An Availability Zone is a network-distinct section of a given Region. Further details are available in [Amazon's documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).

#### Scaling down from N to 1 node
When scaling from multiple nodes to a single mysql node, the proxy will determine that the sole remaining node is the primary node (provided it remains healthy). The proxy routes incoming connections to the remaining mysql node.

### Rejoining the cluster (existing nodes)
Existing nodes restarted with monit should automatically join the cluster. If an existing node fails to join the cluster, it may be because its transaction record (`seqno`) is higher than that of the nodes in the cluster with quorum (aka the primary component).

- If the node has a higher `seqno`, it will be apparent in the error log `/var/vcap/sys/log/mysql/mysql.err.log` (see the `grastate.dat` example after this list for comparing `seqno` values across nodes).
- If the healthy nodes of a cluster have a lower transaction record number than the failing node, it might be desirable to shut down the healthy nodes and bootstrap from the node with the more recent transaction record number. See the [bootstrapping docs](https://github.com/cloudfoundry/cf-mysql-release/blob/release-candidate/docs/bootstrapping.md) for more details.
- Manual recovery may be possible, but it is error-prone and involves dumping transactions and applying them to the running cluster (out of scope for this doc).
- Abandoning the data is also an option, if you are ok with losing the unsynced transactions. Follow these steps to abandon the data (as root):
- Stop the mysql process with `monit stop galera-init`. (`galera-init` is the bosh job controlling the mysqld process.)
- Delete the galera state (`/var/vcap/store/mysql/grastate.dat`) and cache (`/var/vcap/store/mysql/galera.cache`) files from the persistent disk.
- Restart the node with `monit start galera-init`.
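
To compare `seqno` values across nodes, you can inspect the Galera state file directly (illustrative; this is the same state file referenced in the steps above, and the exact contents vary by deployment):

```sh
# On each node: the last committed transaction sequence number.
cat /var/vcap/store/mysql/grastate.dat
# # GALERA saved state
# version: 2.1
# uuid:    1d4f0a42-...
# seqno:   42
```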

### Quorum
- In order for the cluster to continue accepting requests, a quorum must be maintained via peer-to-peer communication: more than half of the nodes must be responsive to each other.
- If more than half of the nodes are unresponsive for a period of time, the remaining nodes stop responding to queries, the cluster fails, and bootstrapping is required to re-enable functionality. A quick way to check quorum state is shown below.
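
For example, checking the cluster size and quorum state as seen from one node (illustrative; both are standard Galera status variables):

```sh
mysql -uroot -p -e "SHOW STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_cluster_status')"
# A healthy 3-node cluster reports:
# wsrep_cluster_size     3
# wsrep_cluster_status   Primary
```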

### Avoid an even number of nodes
- It is generally recommended to avoid an even number of nodes. This is because a partition could cause the entire cluster to lose quorum, as neither remaining component has more than half of the total nodes.
- A 2-node cluster cannot tolerate the failure of a single node, as this would cause loss of quorum. As such, the minimum number of nodes required to tolerate a single node failure is 3.

### Unresponsive node(s)
- A node can become unresponsive for a number of reasons:
- network latency
- mysql process failure
- firewall rule changes
- vm failure
- Unresponsive nodes will stop responding to queries and, after a timeout, leave the cluster.
- Nodes will be marked as unresponsive (inactive) either:
  - if they fail to respond to one node within 15 seconds, OR
  - if they fail to respond to all other nodes within 5 seconds.
- Unresponsive nodes that become responsive again will rejoin the cluster, as long as they are on the same IP that is pre-configured in the gcomm address on all the other running nodes, and quorum was maintained by the remaining nodes.
- All nodes suspend writes once they notice something is wrong with the cluster (write requests hang). After a timeout period of 5 seconds, requests to non-quorum nodes will fail. Most clients return the error: `WSREP has not yet prepared this node for application use`. Some clients may instead return `unknown error`. Nodes that have maintained quorum will continue fulfilling write requests.
- If deployed using a proxy, a continually inactive node will cause the proxy to fail over, selecting a different mysql node to route new queries to.

### Functioning node marked as failing
- There are certain situations in which a node will be running successfully but `monit`, and therefore bosh, will erroneously report the job as failing. These situations include:
- Operator reboots the VM with `sudo reboot`
- IAAS reboots the VM
- mysql process crashes: monit automatically restarts it
- In the event this happens, monit will report `Execution Failed`, but the node successfully joins the cluster.
- To validate this:
  - Observe the monit status (as root) with `monit summary`
  - Connect to the mysql process on that node (e.g. `mysql -uroot -hlocalhost -p`)
- To fix this, run `monit reload` as root and observe that monit and bosh report the process and jobs as healthy. A sketch of this sequence follows.
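
A sketch of the validation and fix described above (run as root on the affected node; mysql credentials depend on your deployment):

```sh
# Check what monit thinks of the job.
monit summary

# Verify the mysql process is actually serving connections.
mysql -uroot -hlocalhost -p -e "SELECT 1"

# Reload monit so its state matches reality; monit and bosh should then report healthy.
monit reload
```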

### Re-bootstrapping the cluster after quorum is lost
- The start script will currently bootstrap node 0 only on initial deploy. If bootstrapping is necessary at a later date, it must be done manually. For more information on bootstrapping a cluster, see [Bootstrapping Galera](https://github.com/cloudfoundry/cf-mysql-release/blob/release-candidate/docs/bootstrapping.md).
- If the single node is bootstrapped, it will create a new one-node cluster that other nodes can join.

### Forcing a Node to Rejoin the Cluster (Unsafe Procedure)
**Note**: This errand resets all nodes to the state of the node with the highest sequence number. This will cause you to lose any data that is not on that node.

- In the event that a cluster becomes unhealthy (i.e. nodes are out of sync), as of v27 the release ships with an errand called `rejoin-unsafe` (see the example after this list). This errand will:
- Check the nodes for the highest sequence number
- Start that node in Bootstrap mode
- Force the other nodes to SST, restoring the cluster to healthy state
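
Running the errand might look like this (a sketch using the BOSH v2 CLI; `my-pxc-deployment` is a placeholder deployment name):

```sh
bosh -d my-pxc-deployment run-errand rejoin-unsafe
```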


### Simulating node failure
- To simulate a temporary single-node failure, use `kill -9` on the pid of the mysql process. This only temporarily disables the node, because the process is monitored by monit, which will restart the process if it is not running.
- To disable the process more permanently, execute `monit unmonitor galera-init` before `kill -9`.
- To simulate multi-node failure without killing a node's process, communication can be severed by changing the iptables config to disallow communication:

```
# Optional: flush existing rules first.
iptables -F

# Drop Galera group communication (4567), IST (4568), SST (4444), and MySQL client (3306) traffic.
iptables -A INPUT -p tcp --destination-port 4567 -j DROP
iptables -A INPUT -p tcp --destination-port 4568 -j DROP
iptables -A INPUT -p tcp --destination-port 4444 -j DROP
iptables -A INPUT -p tcp --destination-port 3306 -j DROP
iptables -A OUTPUT -p tcp --destination-port 4567 -j DROP
iptables -A OUTPUT -p tcp --destination-port 4568 -j DROP
iptables -A OUTPUT -p tcp --destination-port 4444 -j DROP
iptables -A OUTPUT -p tcp --destination-port 3306 -j DROP
```

To recover from this, drop the partition by flushing all rules:
```
iptables -F
```
140 changes: 140 additions & 0 deletions docs/proxy.md
@@ -0,0 +1,140 @@
# Proxy

In pxc-release, [Switchboard](https://github.com/cloudfoundry/pxc-release/tree/main/src/github.com/cloudfoundry-incubator/switchboard) is used to proxy TCP connections to healthy Percona XtraDB Cluster nodes.

A proxy is used to gracefully handle failure of Percona XtraDB Cluster nodes. Use of a proxy permits very fast, unambiguous failover to other nodes within the cluster in the event of a node failure.

When a node becomes unhealthy, the proxy re-routes all subsequent connections to a healthy node. All existing connections to the unhealthy node are closed.

## Consistent Routing

At any given time, each deployed proxy will only route to one active node. The proxy will select the node with the lowest `wsrep_local_index`. The `wsrep_local_index` is a Galera status variable indicating Galera's internal indexing of nodes. The index can change, and there is no guarantee that it corresponds to the BOSH index. The chosen active node will continue to be the only active node until it becomes unhealthy.

If multiple proxies are used in parallel (e.g. behind a load balancer), the proxies behave independently, with no proxy-to-proxy coordination. However, the logic for choosing a node is identical in each proxy, so the proxies will route connections to the same active cluster node.
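
To see which node the proxies would prefer, you can inspect the index on each node (illustrative; `wsrep_local_index` is a standard Galera status variable):

```sh
# Run on each mysql node; proxies route to the healthy node with the lowest value.
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_local_index'"
```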

## Node Health

### Healthy

The proxy queries an HTTP healthcheck process, co-located on the database node, when determining where to route traffic.

If the healthcheck process returns an HTTP status code of 200, the node is added to the pool of healthy nodes.
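
For example, probing the healthcheck directly (a sketch; port 9200 is assumed here as the healthcheck port — verify the port and endpoint in your deployment manifest):

```sh
# HTTP 200 => healthy; HTTP 503 => unhealthy (see below).
curl -i http://<mysql-node-ip>:9200/
```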

### Unhealthy

If the healthcheck returns HTTP status code 503, the node is considered unhealthy.

This happens when a node becomes non-primary, as described in the [cluster-behavior docs](cluster-behavior.md).

The proxy will sever all existing connections to newly unhealthy nodes. Clients are expected to handle reconnecting on connection failure. The proxy will route new connections to a healthy node, assuming such a node exists.

### Unresponsive

If node health cannot be determined due to an unreachable or unresponsive healthcheck endpoint, the proxy will consider the node unhealthy. This may happen if there is a network partition, or if the VM containing the healthcheck and the Percona XtraDB Cluster node has died.

## Proxy Health

### Healthy
The proxy currently has no healthcheck. Under normal operation, connections can be requested from all available proxies.

### Unhealthy
If a proxy becomes unresponsive for any reason, the other deployed proxies are able to accept all client connections.

## State Snapshot Transfer (SST)

When a new node is added to the cluster or rejoins the cluster, it synchronizes state with the primary component via a process called SST. A single node from the primary component is chosen to act as a state donor.
pxc-release is configured to use [Xtrabackup](https://docs.percona.com/percona-xtradb-cluster/8.4/state-snapshot-transfer.html#use-percona-xtrabackup), which allows the donor node to continue to accept reads and writes.
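
You can confirm the configured SST method on a node (illustrative; `wsrep_sst_method` is a standard Galera system variable, and the exact value depends on the release version):

```sh
# Expected to report an xtrabackup-based method, e.g. "xtrabackup-v2".
mysql -uroot -p -e "SHOW VARIABLES LIKE 'wsrep_sst_method'"
```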

## Proxy count

If the operator sets the total number of proxies to 0 in the manifest, applications will route connections directly to one healthy Percona XtraDB Cluster node, making that node a single point of failure for the cluster.

The recommended number of proxies is 2; this provides redundancy should one of the proxies fail.

## Setting a load balancer in front of the proxies

The proxy tier is responsible for routing connections from applications to healthy Percona XtraDB Cluster nodes, even in the event of node failure.

Multiple proxies are recommended for uninterrupted operation. Using a load balancer in front of the proxies ensures distributed connection requests and minimal disruption in the event of the unavailability of any proxy.

Configure a load balancer<sup>[[2]](#configuring-load-balancer)</sup> to route client connections to all proxy IPs, and configure the MySQL service<sup>[[3]](#configuring-pxc-release-to-give-applications-the-address-of-the-load-balancer)</sup> to give bound applications a hostname or IP address that resolves to the load balancer.

### Configuring load balancer

Configure the load balancer to route traffic for TCP port 3306 to the IPs of all proxy instances on TCP port 3306.

Next, configure the load balancer's healthcheck to use the proxy health port.
The proxies have an HTTP server listening on the health port. It returns 200 in all cases and for all endpoints, which can be used to configure a load balancer that requires HTTP healthchecks.

Because HTTP runs over TCP, the port also accepts plain TCP connections, which is useful for configuring a load balancer with a TCP healthcheck.

By default, the health port is 1936 to maintain backwards compatibility with previous releases, but this port can be configured by adding the `cf_mysql.proxy.health_port` manifest property to the proxy job and deploying.
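
For example, a healthcheck probe against a proxy (assuming the default health port of 1936, as noted above):

```sh
# Returns 200 for all endpoints while the proxy is up.
curl -i http://<proxy-ip>:1936/
```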

Use the operations file `operations/add-proxy-load-balancer.yml` to add a load balancer to the proxies.
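
Applying the operations file might look like this (a sketch using the BOSH v2 CLI; the deployment and manifest names are placeholders):

```sh
bosh -d my-pxc-deployment deploy pxc-deployment.yml \
  -o operations/add-proxy-load-balancer.yml
```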

### Configuring pxc-release to give applications the address of the load balancer
To ensure that bound applications will use the load balancer to reach bound databases, set `cf_mysql.host` in the cf-mysql-broker job to your load balancer's IP.

### AWS Route 53

To set up Round Robin DNS across multiple proxy IPs using AWS Route 53, follow these steps:

1. Log in to AWS.
2. Click Route 53.
3. Click Hosted Zones.
4. Select the hosted zone that contains the domain name to apply round robin routing to.
5. Click 'Go to Record Sets'.
6. Select the record set containing the desired domain name.
7. In the value input, enter the IP addresses of each proxy VM, separated by a newline.

Finally, update the manifest property `properties.mysql_node.host` for the cf-mysql-broker job, as described above.

## API

The proxy hosts a JSON API at `<bosh job index>-proxy-p-mysql.<system domain>/v0/`.

The API provides the following route:

Request:
* Method: GET
* Path: `/v0/backends`
* Params: ~
* Headers: Basic Auth

Response:

```
[
{
"name": "mysql-0",
"ip": "1.2.3.4",
"healthy": true,
"active": true,
"currentSessionCount": 2
},
{
"name": "mysql-1",
"ip": "5.6.7.8",
"healthy": false,
"active": false,
"currentSessionCount": 0
},
{
"name": "mysql-2",
"ip": "9.9.9.9",
"healthy": true,
"active": false,
"currentSessionCount": 0
}
]
```
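
A request might look like this (illustrative; the hostname and basic-auth credentials depend on your deployment):

```sh
# Credentials are configured in the deployment manifest.
curl -u admin:secret https://0-proxy-p-mysql.example.com/v0/backends
```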

## Dashboard

The proxy also provides a Dashboard UI to view the current status of the database nodes. This is hosted at `<bosh job index>-proxy-p-mysql.<system domain>`.

The Proxy Springboard page at `proxy-p-mysql.<system domain>` contains links to each of the Proxy Dashboard pages.

6 changes: 3 additions & 3 deletions src/github.com/cloudfoundry-incubator/switchboard/README.md
@@ -3,7 +3,7 @@ switchboard

A TCP router written in Go.

-Developed to replace HAProxy as the proxy tier enabling high availability for the [MySQL dbaas for Cloud Foundry](https://github.com/cloudfoundry/cf-mysql-release). Responsible for routing of client connections to a one node at a time of a backend cluster, and failover on cluster node failure. For more information, see the develop branch of [cf-mysql-release/docs/proxy.md](https://github.com/cloudfoundry/cf-mysql-release/blob/release-candidate/docs/proxy.md).
+Developed to replace HAProxy as the proxy tier enabling high availability for the [MySQL dbaas for Cloud Foundry](https://github.com/cloudfoundry/cf-mysql-release). Responsible for routing client connections to one node of a backend cluster at a time, and for failover on cluster node failure. For more information, see [pxc-release/docs/proxy.md](https://github.com/cloudfoundry/pxc-release/blob/main/docs/proxy.md).

### Why switchboard?

@@ -16,10 +16,10 @@ There are several other proxies out there: Nginx, HAProxy and even MariaDB's Max

Install **Go** by following the directions found [here](http://golang.org/doc/install)

-Running the tests requires [Ginkgo](http://onsi.github.io/ginkgo/):
+Running the tests requires [Ginkgo v2](http://onsi.github.io/ginkgo/):

```sh
-go get github.com/onsi/ginkgo/ginkgo
+go get github.com/onsi/ginkgo/v2/ginkgo
```

Run the tests using the following command: