Recently we had a small hickup in our service; it was caused by our mysql replication configuration and a small mistake which was made long time before. The lesson was learned, and now I show that to you.
MySQL replication is simple. The replica set’s primary server (aka. replication master) writes the events which affects the database to a binary log file, and slaves will be apply that log on their datasets. The binary log can be statement based, row based or in mixed mode, the events can be either DML or DDL queries. Simple enough.
Of course we can filter out some statements from replication, for example when there are tables which are only read or write on master, and didn’t needed to have on slaves. On kinja, we have certain tables where this rule applies, for example a sessions table, which contains session related data. We have to ensure that the data will be there asap when it was written, and because of that, there is no reason to replicate that data to the slaves – there won’t be any threads which try to read from that table, but the table is big enough to consider filter its data from the slaves. To filter out some tables from replication, there are many ways, you can filter them on master, or filter them on slaves, both ways have their pros and cons. The binlog-ignore-[table|db|wild] or replicate-ignore-[table|db|wild] configuration options can be used to ensure this behaviour, the binlog-* refers on master (will it be write the statements to binlog, or not) and the replicate-* has to be configured on slaves (to step over certain statements on slave while applying relay logs). At kinja, we have to use replicate-ignore statements on MySQL slaves, because the tables which are filtered out from the replication on slaves still have to replicated to standby master, and to backup servers. There need be one thing to share with you too: we have two different datacenters to ensure reliability.
So, when the second datacenter was populated with servers, the mistake was made: when the servers was built up, I made it from the master databases backup file, and after the hosts were built, I dropped the tables which are filtered out from replication, because the tables are big enough to eat up too many precious diskspace.
The site went nice for weeks without any hickups, but there came a day, when some maintenace stuff needed to be done. The main reason of this periodic maintenance can be read here, but in a nutshell: if data deleted from innodb, the space will be not freed up sometimes manually need to do this. This can be performed whith ‘OPTIMIZE TABLE’ command, which will drop the free space from table, and set the rows order by the primary key. So far, so good.
The only problem is with this way, that the table will be locked during the operation. Err… we don’t want to lock a session related table for long time on a high traffic website, so we have to find an anoter way to solve our problem. Of course, there is a solution, named pt-online-schema-change (pt-osc). This tool is desinged for our needs, you can alter the database online only with short period of locks. Its documentation talks by itself, but in a few words: it creates and empty table, copies over the data from the original table, make the changes on the new table, (keeps track of changes in original table, and applies them during this), and at the end, swaps the tables. Voilà.
So, I was run my pt-online-schema change command on db, and whoops, there was broke something. At this point I got no idea, what happened, but on charts I’ve seen that the replication lags are crawling up slowly, (I’ll write a few words about the mambo collector on a future post) but only in some servers. For luck, we could isolate the problem, it was happened in the new datacenter, the old was unaffected. This was the point when the site was in the hickup state, but after we realized, that the problem persists only in one dc, we could disable that servers from loadbalancers. The site was up and running, but the half of the sql slaves had broken replication. Hm.
I started to investigate what was happened, and it was clean enough to find: the ‘SHOW SLAVE STATUS’ command showed that the replication was stopped because there was an error: the pt-osc command tried to put some triggers on nonexistent table…the table, what was filtered out from replication… So, there was a need to run a ‘SHOW CREATE TABLE’ command on table where it existed, and create it on the slaves empty. After this I could issue on the slaves a ‘START SLAVE’ command, what was run without error, and the replication was fixed. After the slaves reached the masters position, the servers could be put back to service. (In fact there is an another way to solve this: ‘SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1 ; START SLAVE’, or run pt-slave-restart in background, which does the same, but neither way is future proof, next time they will break again.)
So the lesson to learn is the following: you can filter out tables from replication, but then the replication can break if something tries to use that table anyhow, because the replication filters only applies when the tables are called directly. So, I you decide to filter out tables from replication, you have to create them on slaves empty, or truncate them instead of dropping. (This applies to filter out entire databases too. For example, if you filter out database1 from replication, but you try to run a query in database2 which are pointing to the database1, it will be break. (For example: USE database2, INSERT INTO database1.table1 …))
Managing MySQL is always fun, but always be cautious, it is very simple. Just think twice.