You are here

MariaDBs parallele Replikation zum Aufholen

Aufgrund eines applikatorischen Fehlers ist unsere Replikation während 5 Tagen stehen geblieben (über Ostern). Nachdem das Problem gelöst war, sollte die Replikation aufholen, was sich als sehr zäh herausstellte. All die üblichen Tricks (innodb_flush_log_at_trx_commit, sync_binlog, etc.) wurden bereits ausgereizt. Also haben wir uns an der parallelen Replikation des MariaDB Servers versucht.

Per default ist die parallel Replikation deaktiviert:

SQL> SHOW GLOBAL VARIABLES LIKE '%parallel%';
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| slave_domain_parallel_threads | 0          |
| slave_parallel_max_queued     | 131072     |
| slave_parallel_mode           | optimistic |
| slave_parallel_threads        | 0          |
| slave_parallel_workers        | 0          |
+-------------------------------+------------+

Durch setzen der Server Variablen slave_parallel_threads wird die parallele Replikation aktiviert:

SQL> SET GLOBAL slave_parallel_threads = 8;
ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first

Dies muss aber bei gestoppter Replikation erfolgen:

SQL> STOP SLAVE;
SQL> SET GLOBAL slave_parallel_threads = 8;
SQL> START SLAVE;

Anschliessend hat die Replikation etwas schneller aufgeholt. Da wir aber ungeduldig waren, haben wir versucht, das ganze noch schneller hin zu kriegen. Mit dem Befehl:

SQL> SHOW SLAVE STATUS\G
...
Slave_SQL_Running_State: Waiting for room in worker thread event queue
...

haben wir die folgende Meldung gefunden. Man würde sie auch über den Befehl SHOW PROCESSLIST sehen:

SQL> SHOW PROCESSLIST;
+--------+-------------+- ... -+-----------+------+-----------------------------------------------+- ... -+
| Id     | User        |  ...  | Command   | Time | State                                         |  ...  |
+--------+-------------+- ... -+-----------+------+-----------------------------------------------+- ... -+
...                       ...                                                                        ...
| 212496 | system user |  ...  | Slave_SQL |   16 | Waiting for room in worker thread event queue |  ...  |
+--------+-------------+- ... -+-----------+------+-----------------------------------------------+- ... -+

Gemäss Doku kann es in diesem Fall helfen, die Variable slave_parallel_max_queued etwas zu vergrössern (Achtung: Oom!).

SQL> STOP SLAVE;
SQL> SET GLOBAL slave_parallel_max_queued = 1*1024*1024;
SQL> SHOW GLOBAL VARIABLES LIKE '%parallel%';
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| slave_domain_parallel_threads | 0          |
| slave_parallel_max_queued     | 1048576    |
| slave_parallel_mode           | optimistic |
| slave_parallel_threads        | 8          |
| slave_parallel_workers        | 8          |
+-------------------------------+------------+
SQL> START SLAVE;

Wir haben mit den Werten slave_parallel_threads im Bereich von 4 bis 32 (bei 8 vCores) und bei slave_parallel_max_queued im Bereich von 128 kbyte bis 32 Mbyte rumgespielt.
Achtung: Nicht übertreiben: 32 Threads x 32 Mbyte = 1 Gbyte RAM (Oom)!

Um herauszufinden, welche Werte das Optimum sind, müsste man ausführlicher testen und messen. Jedenfalls hat die Replikation nach circa einer Stunde die 5 Tage Rückstand aufgeholt, gegen Schluss etwas mehr als zu Beginn, was hoffentlich durch unsere Konfigurationsanpassungen verursacht wurde.

Je nach dem, was gerade an DML Statements läuft sieht man, dass alle Threads genutzt werden können oder die einen Threads auf andere Threads warten müssen:

SQL> SHOW PROCESSLIST;
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
| Id     | User            | Command      | Time   | State                                                         | Info       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
|      2 | event_scheduler | Daemon       | 506179 | Waiting for next activation                                   | NULL       |
| 191154 | root            | Query        |      0 | starting                                                      | show pr... |
| 208669 | replication     | Binlog Dump  |    297 | Master has sent all binlog to slave; waiting for more updates | NULL       |
| 212495 | system user     | Slave_IO     |     20 | Waiting for master to send event                              | NULL       |
| 212497 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212498 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212499 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212500 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212501 | system user     | Slave_worker |      0 | Write_rows_log_event::write_row(-1) on table `history_uint`   | insert ... |
| 212502 | system user     | Slave_worker |      0 | Write_rows_log_event::write_row(-1) on table `history_uint`   | insert ... |
| 212503 | system user     | Slave_worker |      0 | Write_rows_log_event::write_row(-1) on table `history_str`    | insert ... |
| 212504 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212505 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212506 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212507 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212510 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212509 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212508 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212511 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212512 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212496 | system user     | Slave_SQL    |     16 | Waiting for room in worker thread event queue                 | NULL       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+

SQL> SHOW PROCESSLIST;
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
| Id     | User            | Command      | Time   | State                                                         | Info       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
|      2 | event_scheduler | Daemon       | 506197 | Waiting for next activation                                   | NULL       |
| 191154 | root            | Query        |      0 | starting                                                      | show pr... |
| 208669 | replication     | Binlog Dump  |    315 | Master has sent all binlog to slave; waiting for more updates | NULL       |
| 212495 | system user     | Slave_IO     |     37 | Waiting for master to send event                              | NULL       |
| 212497 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212498 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212499 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212500 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212501 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212502 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212503 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212504 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212505 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212506 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212507 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212510 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212509 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212508 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212511 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212512 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212496 | system user     | Slave_SQL    |     11 | Waiting for room in worker thread event queue                 | NULL       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+

Unser Monitoring hat uns auch schön angezeigt, dass die CPU Load hoch ging, das I/O-System mehr zu tun kriegte und mehr Rows modifiziert wurden...

Was ebenfalls auffiel ist, dass mit parallel Replication plötzlich Foreing Key Fehler auftraten, ein Phänomen, dass wir so vorher noch nicht beobachtet hatten:

FromDual.maas2.prod2 - Warning: InnoDB Foreign Key error detected

Trigger: InnoDB Foreign Key error detected
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL: https://fromdual.com/innodb-foreign-key-error-detected

Item values: 1

1. InnoDB new Foreign Key error (FromDual.maas2.prod2:FromDual.MySQL.innodb.ForeignKey_new): 1

Mit dem Befehl SHOW ENGINE INNODB STATUS\G kann man diese entsprechend inspizieren oder im Monitoring anschauen:

------------------------
LATEST FOREIGN KEY ERROR
------------------------
2024-04-02 10:36:39 0x7f36088ff640 Transaction:
TRANSACTION 7199599266, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 1
MariaDB thread id 228555, OS thread handle 139870048613952, query id 28453893 Write_rows_log_event::write_row(-1) on table `alerts`
insert into alerts (alertid,actionid,eventid,userid,clock,mediatypeid,sendto,subject,message,status,error,esc_step,alerttype,acknowledgeid,parameters) values (203687,4,471733,3,1712044003,1,'xxx@fromdual.com','Zabbix server - High: Too many processes on Zabbix server','Trigger: Too many processes on Zabbix server
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:

Item values: 309

1. Number of processes (Zabbix server:proc.num[]): 309',3,'',1,0,null,'{}')
Foreign key constraint fails for table `zabbix`.`alerts`:
,
  CONSTRAINT `c_alerts_2` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE in parent table, in index alerts_3 tuple:
DATA TUPLE: 2 fields;
 ...

But in parent table `zabbix`.`events`, in index PRIMARY,
the closest match we can find is record:
PHYSICAL RECORD: n_fields 12; compact format; info bits 0
 ...

Literatur/Quellen