Wednesday, December 2, 2009

Ext4 with MySQL binary logs oddity

I was working with a customer recently that kept seeing 10-12 second hangs in MySQL. Everything would be working well and suddenly all data changing statements would stop working.

Once the 10-12 seconds passed, the system would recover and everything would be fine for a while. This would repeat every 30-45 minutes normally, and would occur more frequently when the system was doing some heavy ETL type activity (data loads, big updates, etc...), with the ETL ones being longer and worse generally.

One of the steps taken while investigating was to check out vmstat/iostat. From this we noticed that during the spike there was very intense disk activity. I suspected something wrong with InnoDB such as the famous purge hiccup or something with the log files checkpointing, etc... However we then found out that it was the disk system where the binary logs were stored and not the InnoDB files.

While continuing to investigate, it was noticed that the times it occurred happened to exactly match up with the times that the binary log rotations were occurring. Using strace, we were able to notice that the fdatasync() call that accompanied the rotation was taking a very long time (the majority of the delay time).

Finally we were able to pin it down to ext4 and how it delays data writes for a very long time (30 minutes). Compare this to ext3 which will flush things every 5 seconds or so. I am told that ext4 recently changed this behavior (this was in 2.6.30), so hopefully this won't hit more people.

What was happening was as follows:

  • Binary log data gets written to over time

  • ext4 in an attempt to increase performance does not write the data to disk

  • ext4 continues to not write data to disk, even as hundreds of MB of binary log sits in memory and the disk is mostly idle

  • Binary log gets full (1024MB) and rotates while holding the log mutex

  • Rotation calls an fdatasync prior to closing the file

  • ext4 now has to write the data out and takes 10-12 seconds to do so

  • log mutex prevents any commits while it writes

  • Write finishes, log rotates, things repeat



To alleviate this, we ended up setting sync_binlog=100. This will force the binary log to fsync periodically and hence not allow ext4 to wait forever to write the data out to the file. There is also the commit mount option for ext4 which should give similar benefits by forcing it to write more often but I didn't test it.