Tuesday, March 2, 2010

Semi-sync Replication Testing

I have recently been trying out semisynchronous replication. This is a new feature added in MySQL 5.5, based on the original Google patch.

Installation was simple and has been covered in detail elsewhere, so I won't repeat it here.

While testing, I was a bit surprised by some behavior that turned out to be correct. I wanted to examine what semisynchronous replication actually does and what the use cases for it are.

The manual defines this feature correctly in very careful language:
If semisynchronous replication is enabled on the master side and there is at least one semisynchronous slave, a thread that performs a transaction commit on the master blocks after the commit is done and waits until at least one semisynchronous slave acknowledges that it has received all events for the transaction, or until a timeout occurs.


There is a subtle difference from how this is described in other places, for example in Giuseppe's blog (not picking on him, it is just an example, since I have seen this in many places):
That is, before committing, the master waits until at least one slave has acknowledged that it has received the portion of binary log necessary to reproduce the transaction.


The difference is that the blocking and the relay to the remote slave occur after the transaction is actually committed to disk on the master.

This means that if the master crashes, it may have committed transactions on disk that are not on the slave. However, the client will get back an error for that commit rather than a success.
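The ordering can be illustrated with a toy model. This is only a sketch of the sequence described above, not MySQL's implementation: the Master and Slave classes, the crash_before_send flag, and the in-memory lists standing in for disk and relay log are all hypothetical, and the timeout path is left out.

```python
class Slave:
    def __init__(self):
        self.relay_log = []  # events received (not necessarily applied)

    def receive(self, txn):
        # The slave acknowledges once it has received the events.
        self.relay_log.append(txn)
        return True


class Master:
    def __init__(self, slave):
        self.slave = slave
        self.committed = []  # stands in for transactions durable on disk

    def commit(self, txn, crash_before_send=False):
        # Step 1: the commit is made durable locally first.
        self.committed.append(txn)
        # Step 2: a crash here leaves the transaction on the master only.
        if crash_before_send:
            raise ConnectionError("master crashed before slave ACK")
        # Step 3: block until the slave acknowledges receipt.
        self.slave.receive(txn)
        return "ok"


slave = Slave()
master = Master(slave)

print(master.commit("txn-1"))
try:
    master.commit("txn-2", crash_before_send=True)
except ConnectionError as exc:
    print("client sees error:", exc)

print("master:", master.committed)
print("slave: ", slave.relay_log)
```

After the simulated crash, the master holds both transactions while the slave holds only the first, which is exactly the window the manual's wording allows.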

What is the use case for this then?

For failover purposes, this is generally exactly what you need. The client knows the commit failed and can redo the transaction on the slave. Every transaction the client saw acknowledged is already on the slave, so no acknowledged data is lost.

Where doesn't it help?

Where it isn't useful is recovery after the crash. When you finally get your master restarted, it may already contain some transactions that were later replayed on the slave. This will naturally cause replication to break.
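A rough sketch of that breakage, under some assumptions of mine: tables are modeled as plain dicts keyed by primary key, the old master is assumed to rejoin by applying events from the promoted slave, and DuplicateKeyError and apply_insert are hypothetical helpers rather than anything from MySQL.

```python
class DuplicateKeyError(Exception):
    pass


def apply_insert(table, row_id, value):
    # Inserting an existing primary key fails, as it would on a real table.
    if row_id in table:
        raise DuplicateKeyError(f"duplicate primary key {row_id}")
    table[row_id] = value


old_master = {1: "a"}
slave = {1: "a"}

# The master commits row 2 locally, then crashes before the slave receives it.
apply_insert(old_master, 2, "b")

# The client saw an error, failed over, and redid the transaction on the slave.
apply_insert(slave, 2, "b")
replayed_events = [(2, "b")]

# The old master restarts and tries to apply what the slave executed meanwhile.
try:
    for row_id, value in replayed_events:
        apply_insert(old_master, row_id, value)
    status = "replication ok"
except DuplicateKeyError as exc:
    status = f"replication stopped: {exc}"

print(status)
```

The restarted master already has the row from its own pre-crash commit, so the replayed event collides with it and replication stops.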

3 comments:

  1. Harrison,
    Thanks for checking on my description. I got it wrong, and checking with TCPdump confirms what you say. I added a note in my blog.

    Giuseppe

  2. What if the crash of the master happened after the acknowledgment was sent by the slave, but before the client received a response back?

    The client will not know the slave has received it and might retry the transaction but it's already there, no?

  3. Kenny, that is correct.

    With transactions (even on the local node), getting an ACK back means that the transaction completed successfully. Not getting an ACK back is not a guarantee that it failed, however. This can happen on the local server as well.
