Great News: Galera Cluster v3.1 GA for MySQL 5.6 was released at Percona Live London (PLUK) 2013. The information is still a bit hidden...
You can find it here:
- The Galera plug-in: https://launchpad.net/galera/3.x/25.3.1
- The patched MySQL server: https://launchpad.net/codership-mysql/5.6/5.6.14-25.1
Careful: An online upgrade from 5.5 to 5.6 does not work yet. We have to find a work-around...
We had a Galera Cluster support case recently. The customer was drenched in tears because his Galera Cluster did not work any more and he could not bring it back up.
Oops! What had happened?
A bit of background on this case: the customer wanted to do a rolling restart of the Galera Cluster under load because of an operating system upgrade which required a reboot of the systems.
Let's have a look at the MySQL error log to see what was going on. The customer started the rolling restart with NodeC:

12:20:42 NodeC: normal shutdown --> Group 2/2
12:20:46 NodeC: shutdown complete
12:22:09 NodeC: started
12:22:15 NodeC: start replication
12:22:16 NodeC: CLOSED -> OPEN
12:22:16 all  : Group 2/3 component all PRIMARY
12:22:17 NodeC: Gap in state sequence. Need state transfer.
12:22:18 all  : Node 1 (NodeC) requested state transfer from '*any*'. Selected 0 (NodeB)(SYNCED) as donor.
12:22:18 NodeB: Shifting SYNCED -> DONOR/DESYNCED (TO: 660966498)
12:22:19 NodeC: Shifting PRIMARY -> JOINER (TO: 660966498)
12:22:19 NodeC: Receiving IST: 14761 writesets, seqnos 660951493-660966254
12:22:21 NodeC: 0 (NodeB): State transfer to 1 (NodeC) complete.
Everything went fine so far: NodeC came up again and did an IST as expected. But then the first operational error happened: the customer did not wait with the reboot of NodeB until NodeC had completely recovered. It seems that NodeC needed some time to apply the IST. The recovery state should be checked on all nodes with SHOW GLOBAL STATUS LIKE 'wsrep%'; ...

12:22:21 NodeC: Member 0 (NodeB) synced with group.
12:22:21 NodeB: Shifting JOINED -> SYNCED (TO: 660966845)
12:22:21 NodeB: Synchronized with group, ready for connections
--> NodeC seems not to be ready yet!
12:23:21 NodeB: Normal shutdown
12:23:21 all  : Group 1/2
12:23:21 NodeC: Aborted (core dumped)
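To verify that a joining node has fully caught up before restarting the next one, the wsrep status variables can be checked on every node. A minimal check (illustrative output, values will differ) could look like this:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+

Only when every node reports Synced (and wsrep_cluster_size shows the expected number of nodes) is it safe to take down the next node.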
And now Murphy struck for the first time: we hit a situation in the Galera Cluster which is not handled as expected. Now we have 2 nodes out of 3 not working. As a result the Cluster loses its quorum (non-Primary, more than 50% of the nodes disappeared) and does not reply to any SQL queries any more. This is a bug, because both nodes left the cluster gracefully. The third node should have stayed Primary:

12:23:21 NodeB: Received SELF-LEAVE. Closing connection.
12:23:23 NodeB: Shifting CLOSED -> DESTROYED (TO: 660973981)
12:23:25 NodeB: Shutdown complete
12:23:29 NodeC: mysqld_safe WSREP: sleeping 15 seconds before restart
12:23:37 NodeA: Received NON-PRIMARY.
12:23:44 NodeC: mysqld_safe mysqld restarted
12:23:48 NodeC: Shifting CLOSED -> OPEN
12:23:48 NodeC: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
12:23:48 NodeC: Received NON-PRIMARY.
12:23:48 NodeA: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
12:23:48 NodeA: Received NON-PRIMARY.
12:24:30 NodeB: mysqld_safe Starting mysqld daemon
12:24:36 NodeB: Start replication
12:24:37 NodeB: Received NON-PRIMARY.
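In such a situation the surviving nodes report a non-Primary cluster status. A quick way to confirm the quorum loss (illustrative output) is:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+

While a node is in non-Primary state, normal SQL statements are refused, but SHOW STATUS still works.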
As a result the customer decided to shut down the whole cluster. Which was not necessary, but is an acceptable approach:

12:27:55 NodeB: /usr/sbin/mysqld: Normal shutdown
12:27:58 NodeB: /usr/sbin/mysqld: Shutdown complete
12:28:14 NodeA: /usr/sbin/mysqld: Normal shutdown
12:28:19 NodeA: /usr/sbin/mysqld: Shutdown complete
12:31:45 NodeC: /usr/sbin/mysqld: Normal shutdown
12:31:49 NodeC: /usr/sbin/mysqld: Shutdown complete
We are now experiencing a complete cluster outage. And then the next operational error happened: the customer chose the node (NodeC) with the worst (= oldest) data as the starting node for the new Cluster:

12:31:55 NodeC: Starting mysqld daemon
12:31:58 NodeC: PRIMARY, 1/1
12:31:58 NodeC: /usr/sbin/mysqld: ready for connections.
12:33:29 NodeB: mysqld_safe Starting mysqld daemon
12:33:33 NodeB: PRIMARY, 1/2
An alternative approach would have been to run the command SET GLOBAL wsrep_provider_options='pc.bootstrap=yes'; on the node (NodeA) with the most recent data...
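How do you find out which node has the most recent data? On a running node the committed sequence number can be read from the wsrep status; on a cleanly stopped node it is recorded in the grastate.dat file. A sketch, assuming the default datadir /var/lib/mysql (output shown with placeholder values):

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';

# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    <cluster state UUID>
seqno:   <last committed sequence number>

The node with the highest seqno (or highest wsrep_last_committed) holds the most recent data and is the one to bootstrap from.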
After connecting, NodeB (with the newer state) requested a state transfer from the older NodeC:
And now Mister Murphy struck a second time: we hit another situation: the newer node requests an IST from the older node, which in the meanwhile has progressed to an even newer state. So the newer joiner node receives data from the older donor node, which causes an AUTO_INCREMENT Primary Key violation. As a consequence the node crashes:

12:33:36 NodeB: receiving IST failed, node restart required: Failed to apply app buffer: äJR#, seqno: 660974010, status: WSREP_FATAL
12:33:36 NodeB: Closed send monitor.
12:33:37 NodeB: Closing slave action queue.
12:33:37 NodeB: Aborted (core dumped)
12:33:37 NodeC: PRIMARY 1/1
12:33:44 NodeC: Shifting DONOR/DESYNCED -> JOINED (TO: 660983204)
12:33:59 NodeB: mysqld_safe mysqld restarted
12:34:04 NodeB: Shifting CLOSED -> OPEN
12:34:07 NodeB: Aborted (core dumped)
... Loop
This situation now keeps NodeB in a crash loop: it is restarted by the mysqld_safe process and requests an IST again. This is another bug, which is fixed in a newer Galera MySQL release (5.5.33). And now the next operational error happened: instead of killing NodeB and forcing an SST by deleting the grastate.dat file, they started the third node as well...

12:37:12 NodeA: mysqld_safe Starting mysqld daemon
...
--> core dumped ... Loop
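Breaking such a crash loop means forcing a full SST: stop the node, remove its Galera state file so it forgets its local state, and start it again. A minimal sketch, assuming the default datadir /var/lib/mysql and an init script called mysql:

# on the broken node (e.g. NodeB)
# killall -9 mysqld_safe mysqld        <-- ends the restart loop
# rm /var/lib/mysql/grastate.dat       <-- forget the local state, forces an SST
# service mysql start                  <-- rejoin the cluster via SST

On start the node will then request an SST from a healthy donor instead of the failing IST.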
NodeB and NodeA both have the same problem now...
As a result: NodeA and NodeB are now looping in a crash. But at least NodeC was up and running all the time.

Learnings
- Most important: have an ntpd service running on all Cluster nodes so the clocks of the different nodes do not drift apart while you are investigating errors. This makes problem solving much easier...
- In case of split-brain or quorum loss choose the node with the most recent data as your initial Cluster node.
- If you have a Cluster in split-brain you do not have to restart it. You can bring a node out of split-brain with the pc.bootstrap=yes provider option once you have found out which node has the most recent data.
- Analyse error log files carefully to understand what went wrong. Forcing an SST only takes a few seconds.
- Upgrade your software regularly to avoid hitting old, already-fixed bugs. The rule "Do not touch a running system!" does not apply here, because we are touching the running system anyway. So a regular upgrade from time to time can be very helpful!
- Be familiar with the operational issues of your Cluster software. A Cluster does not only mean high availability. It also means you have to train your staff to handle it.
- It is always valuable to have support for your business critical systems.
With MySQL we typically see connections being opened and closed very often and rapidly, so we have many very short-lived connections to the server. In extreme cases this can lead to the situation where the available TCP ports are exhausted.
The local TCP port range can be found with:

# cat /proc/sys/net/ipv4/ip_local_port_range
32768   61000
In this example we can have at most (61000 - 32768 = 28232) outgoing connections open concurrently.
When a TCP connection closes, the port cannot be reused immediately afterwards, because the operating system has to wait for the duration of the TIME_WAIT interval (maximum segment lifetime, MSL). This we can see with the following command:

# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:10050           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:10051           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:10051         127.0.0.1:60756         TIME_WAIT
tcp        0      0 127.0.0.1:10050         127.0.0.1:50191         TIME_WAIT
tcp        0      0 127.0.0.1:10050         127.0.0.1:52186         ESTABLISHED
tcp        0      0 127.0.0.1:10051         127.0.0.1:34445         TIME_WAIT
The reason for waiting is that packets may arrive out of order or be retransmitted after the connection has been closed. CLOSE_WAIT indicates that the other side of the connection has closed the connection. TIME_WAIT indicates that this side has closed the connection. The connection is being kept around so that any delayed packets can be matched to the connection and handled appropriately.
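To get a quick overview of how many connections are in which state, the netstat output can be summarized, for example like this (a simple shell sketch; the counts shown are made up):

# netstat -nat | awk 'NR > 2 {print $6}' | sort | uniq -c
      2 ESTABLISHED
      5 LISTEN
    412 TIME_WAIT

A large and growing number of TIME_WAIT entries towards the MySQL port is the typical symptom of the problem described here.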
The Maximum Segment Lifetime can be found as follows:

# cat /proc/sys/net/ipv4/tcp_fin_timeout
60
This basically means your system cannot sustain more than ((61000 - 32768) / 60 ≈ 470) new connections per second over a longer period of time.

Solutions
There are several strategies to get out of this problem:
- Open connections to your MySQL database less frequently and put more payload into one connection. Connection pooling is often used to achieve this.
- Increase the port range. Setting the range to 15000 - 61000 is pretty common these days (extreme tuning: 1024 - 65535).
- Increase the availability of ports by decreasing the FIN timeout.
Those values can be changed online with:

# echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
# echo 15000 65000 > /proc/sys/net/ipv4/ip_local_port_range
Or permanently by adding them to /etc/sysctl.conf.
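The corresponding entries could look like this (a sketch using the same values as above; they can be applied with sysctl -p):

# /etc/sysctl.conf
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 15000 65000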
Another possibility to change this behaviour is to use tcp_tw_recycle and tcp_tw_reuse. By default they are disabled:

# cat /proc/sys/net/ipv4/tcp_tw_recycle
0
# cat /proc/sys/net/ipv4/tcp_tw_reuse
0
These parameters allow fast recycling of sockets in TIME_WAIT state and re-using them. But before you make this change, make sure that it does not conflict with the protocols used by the application that needs these ports.
tcp_tw_recycle in particular can cause problems when load balancers are in use. The kernel documentation describes the two parameters as follows:

tcp_tw_reuse
    Allow to reuse TIME_WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0.
    It should not be changed without advice/request of technical experts.

tcp_tw_recycle
    Enable fast recycling TIME_WAIT sockets. Default value is 0.
    It should not be changed without advice/request of technical experts.
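If, after checking your setup, you decide that reusing TIME_WAIT sockets is safe for your application, enabling it works the same way as the other settings (a sketch; tcp_tw_reuse is generally considered the less risky of the two):

# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

or permanently in /etc/sysctl.conf:

net.ipv4.tcp_tw_reuse = 1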
It took me quite a while to find out how the beast Galera Arbitrator (garbd) works. To save your time, here is a short summary:

How to start Galera Arbitrator (garbd):

shell> ./garbd --address gcomm://192.168.13.1,192.168.13.2 --group "Our Galera Cluster" --log /tmp/garbd.log --daemon
How to stop Galera Arbitrator (garbd):

shell> killall garbd
How to start Galera Arbitrator (garbd) with a configuration file:

shell> ./garbd --cfg /tmp/garb.cnf --daemon
The configuration file looks as follows:

#
# /etc/mysql/garb.cnf
#
address = gcomm://127.0.0.1:5671,127.0.0.1:5672,127.0.0.1:5673
group   = Our Galera Cluster
options = gmcast.listen_addr=tcp://127.0.0.1:5674
log     = /tmp/garbd.log
A service start/stop script and a sample configuration can be found in the Galera source tree at galera-src/garb/files/garb.sh and galera-src/garb/files/garb.cnf.
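Whether the arbitrator has joined the cluster can be verified on any MySQL node: garbd counts as a regular member, so the cluster size increases by one (illustrative output for a two-node cluster plus arbitrator):

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+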