Repair MySQL master-master replication

This is an howto that explains how to manage MySQL multimaster replication. Setting it up is fairly easy but managing it when something goes wrong can be a nightmare. Here is information that you need to know if you have a MySQL database in master master replication.

Introduction
Let assume we have two webservers that have databases in replication, web1.example.com and web2.example.com

In this example IP are the following, please change them according to your config :

web1.example.com : 192.168.0.104
web2.example.com : 192.168.0.105

From my experience, two problem can occur with a master-master replication setup :

1)Replication is out of sync for whatever reason (see chapter 1 to 6)
2)Replication is not working because an error happened in MySQL such as duplicate keys, error 1062 (see chapter 7 to 10)


1. Installing maatkit

First install maatkit :

apt-get install maatkit


2. Creating MySQL users

Then create a user that will have access from the other webserver to compare his table with the other webserver table (that are supposed to be exactly the same) :

mysql -u root -p

At the MySQL shell :

***On web1.example.com***

CREATE USER 'checksum'@'192.168.0.105' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
GRANT SELECT ON * . * TO 'checksum'@'192.168.0.105' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
CREATE USER 'checksum'@'localhost' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
GRANT SELECT ON * . * TO 'checksum'@'localhost' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
quit;

***On web2.example.com***

CREATE USER 'checksum'@'192.168.0.104' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
GRANT SELECT ON * . * TO 'checksum'@'192.168.0.104' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
CREATE USER 'checksum'@'localhost' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
GRANT SELECT ON * . * TO 'checksum'@'localhost' IDENTIFIED BY 'USER_CHECKSUM_PASSWORD';
quit;


3. Testing replication synchronization

Now we can check if replication is in sync with the following command (on either webserver, not both) :

***On web1.example.com***

 mk-table-checksum h=localhost,u=checksum,p=USER_CHECKSUM_PASSWORD h=192.168.0.105,u=checksum,p=USER_CHECKSUM_PASSWORD | mk-checksum-filter

***On web2.example.com***

 mk-table-checksum h=localhost,u=checksum,p=USER_CHECKSUM_PASSWORD h=192.168.0.104,u=checksum,p=USER_CHECKSUM_PASSWORD | mk-checksum-filter

It will tell you that some tables in the database mysql are out of sync which is normal :) You can ignore this with "--ignoredb mysql"


5. Bash script to verify replication checksum periodically

Now we can create a bash script (on web1.example.com) that will notify us by email when databases are out of sync and run it as a cron job (every x minutes) :

vi /root/mysql_checksum

And make it look like this (change passwords!)

#!/bin/bash
# MySQL replication consistency check
# Copyright (c) 2008 blogama.org
# This script is licensed under GNU GPL version 2.0 or above
# ---------------------------------------------------------------------
 
### This script does 1 verification ###
### 1) Check for replication consistency between 2 MySQL server in replication ###
 
### To be modified ###
EMAIL="admin@example.com"
MYSQLU="checksum"
MYSQLP="CHECK_PASSWORD"
MYSQLH="localhost"
MYSQLH2="192.168.0.105"
 
# Ignore this comma-separated list of databases #
IGNOREDB="mysql"
 
###### Do not make modifications below ######
 
### To restore to original when problem fixed ###
if [ $1 ]; then
  if [ $1=="fix" ]; then
    rm /root/checksum_problem.txt
    exit 1;
  fi
fi
 
### Binaries ###
MAIL=$(which mail)
CHECKSUMFILTER=$(which mk-checksum-filter)
TABLECHECKSUM=$(which mk-table-checksum)
 
###check if already notified###
cd /root
if [ -f checksum_problem.txt ]; then
  exit 1;
fi
 
### Check if host1 and host2 have the same checksum on tables
$TABLECHECKSUM --ignoredb $IGNOREDB h=$MYSQLH,u=$MYSQLU,p=$MYSQLP h=$MYSQLH2,u=$MYSQLU,p=$MYSQLP | $CHECKSUMFILTER | grep $MYSQLH > /dev/null 2>&1
 
if [ "$?" -ne "1" ]; then
  echo "Checksum failed" > /root/checksum_problem.txt
  $MAIL -s "Checksum problem" $EMAIL < /root/checksum_problem.txt
fi

chmod +x /root/mysql_checksum

When you get a notification, the script will stop running (wrote it like that because I get SMS notification). You must run :

/root/mysql_checksum fix

in order to make the script running again.

Add it to your crontab


6. Repairing databases that are out of sync

Let say we found that table1 on database1 was out of sync.

We can repair that with mk-table-sync by doing the following :

mk-table-sync --synctomaster h=localhost,u=root,p=ROOT_PASSWORD,D=database1,t=table1 --execute


7. Verifying replication is working

On both web1.example.com and web2.example.com do :

mysql -u root -p

at the MySQL shell type :

show slave status \G;

There is two lines that are very important, you should see :

[...]
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
[...]


8. Slave_SQL_Running: No

It can happen that the output of "show slave status" has :

[...]
Slave_SQL_Running: No
[...]

along with something like :

[...]
Last_Errno: 1062
Last_Error: Error 'Duplicate entry 'dekq5g820avnfdmar5qi9dkhv3' for key 1' on query. Default database:'session_sql'. Query: 'INSERT INTO sessi ons5 VALUES ('dekq5g820avnfdmar5qi9dkhv3', UNIX_TIMESTAMP(NOW()) + 18000, 'redir ect|i:1;')'
[...]

You can fix that by doing the following at the mysql shell :

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

This tells the slave to skip one query (which is the invalid one that caused the replication to stop). If you'd like to skip two queries, you'd use SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 2; instead and so on.

In some cases you can get a recurrent error (such as 1062). I suggest reading this page about MySQL error code : http://dev.mysql.com/doc/refman/5.0/en/error-messages-server.html

If you are 100% sure it doesnt affect the integrity of important data in your database, for example when its about page caching, you can skip this kind of error by adding the following in my.cnf .

vi /etc/mysql/my.cnf

[...]
slave-skip-errors = 1062
[...]


9. Slave_IO_Running: No

Something is definetely wrong ;)

Check for "Slave_IO_State" in the output of "show slave status"

This is probably due to the other server not responding.

If both servers are online and it still dont work you can try :

stop slave;
change master to master_log_file='mysql-bin.000003', master_log_pos=227;
start slave;

On both servers according to "show master status" of the other


10. Script for check replication status

First we must add a MySQL local user (because you dont want to see the root password in scripts!) :

mysql -u root -p

At the mysql script do the following :

CREATE USER 'check'@'localhost' IDENTIFIED BY 'USER_CHECK_PASSWORD';
GRANT SUPER , REPLICATION CLIENT ON * . * TO 'check'@'localhost' IDENTIFIED BY 'USER_CHECK_PASSWORD';

Now create this script :

vi /root/check_replication

Dont forget to change passwords... :

#!/bin/bash
# MySQL replication check
# Copyright (c) 2008 blogama.org
# This script is licensed under GNU GPL version 2.0 or above
# ---------------------------------------------------------------------
 
### This script does 2 verifications ###
### 1) Check replication slave IO running ###
### 2) Check replication slave SQL running ###
 
### To be modified ###
EMAIL="admin@example.com"
MYSQLU="check"
MYSQLP="check_password"
 
###### Do not make modifications below ######
 
### To restore to original when problem fixed ###
if [ $1 ]; then
  if [ $1=="fix" ]; then
    rm /root/slave_problem.txt
    exit 1;
  fi
fi
 
### Binaries ###
MAIL=$(which mail)
MYSQL=$(which mysql)
 
### To restore to original when problem fixed ###
if [ $1 ]; then
  if [ $1=="fix" ]; then
    rm /root/slave_problem.txt
  fi
fi
 
### Check if already notified ###
cd /root
if [ -f slave_problem.txt ]; then
  exit 1;
fi
 
### Check if slave running ###
(
echo "show slave status \G;"
) |  $MYSQL -u $MYSQLU -p$MYSQLP 2>&1 | grep "Slave_IO_Running: No"
if [ "$?" -ne "1" ]; then
        echo "Replication failed slave IO not running" > /root/slave_problem.txt
fi
 
(
echo "show slave status \G;"
) |  $MYSQL -u $MYSQLU -p$MYSQLP 2>&1 | grep "Slave_SQL_Running: No"
if [ "$?" -ne "1" ]; then
        echo "Replication failed slave SQL not running" >> /root/slave_problem.txt
fi
 
### Send notification if replication down ###
cd /root
if [ -f slave_problem.txt ]; then
  $MAIL -s "Replication problem #1" $EMAIL < /root/slave_problem.txt
fi

and make the script executable :

chmod +x /root/check_replication

When you get a notification, the script will stop running. You must run the following command for the script can run again :

/root/check_replication fix

Now it can be added to your crontab