Wednesday, December 2, 2009

Ext4 with MySQL binary logs oddity

I was working with a customer recently who kept seeing 10-12 second hangs in MySQL. Everything would be working well, and then suddenly all data-changing statements would stop working.

Once the 10-12 seconds passed, the system would recover and everything would be fine for a while. This would repeat every 30-45 minutes normally, and would occur more frequently when the system was doing heavy ETL-type activity (data loads, big updates, etc.), with the hangs during ETL generally being longer and worse.

One of the steps taken while investigating was to check out vmstat/iostat. From this we noticed that during the spike there was very intense disk activity. I suspected something wrong with InnoDB such as the famous purge hiccup or something with the log files checkpointing, etc... However we then found out that it was the disk system where the binary logs were stored and not the InnoDB files.

While continuing to investigate, we noticed that the hangs lined up exactly with the times that the binary log rotations were occurring. Using strace, we saw that the fdatasync() call that accompanied the rotation was taking a very long time (the majority of the delay).

Finally we were able to pin it down to ext4 and how it delays data writes for a very long time (30 minutes). Compare this to ext3 which will flush things every 5 seconds or so. I am told that ext4 recently changed this behavior (this was in 2.6.30), so hopefully this won't hit more people.

What was happening was as follows:

  • Binary log data gets written to over time

  • ext4 in an attempt to increase performance does not write the data to disk

  • ext4 continues to not write data to disk, even as hundreds of MB of binary log sits in memory and the disk is mostly idle

  • Binary log gets full (1024MB) and rotates while holding the log mutex

  • Rotation calls an fdatasync prior to closing the file

  • ext4 now has to write the data out and takes 10-12 seconds to do so

  • log mutex prevents any commits while it writes

  • Write finishes, log rotates, things repeat

To alleviate this, we ended up setting sync_binlog=100. This forces MySQL to sync the binary log to disk after every 100 writes to it, so ext4 never gets the chance to build up a large backlog of unwritten data for the file. There is also the commit mount option for ext4, which should give similar benefits by forcing it to write more often, but I didn't test it.
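One way to watch this kind of build-up is to track how much dirty (modified but not yet written) data the kernel is holding. The following is only a minimal sketch: dirty_kb is a made-up helper name, it assumes a Linux /proc/meminfo with the usual Dirty: line, and the figure it prints is system-wide rather than specific to the binary log.

```shell
# Print the amount of dirty (modified but not yet written) page cache in kB.
# Reads /proc/meminfo by default; another file can be passed in for testing.
dirty_kb() {
    awk '/^Dirty:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Example: sample once per second while the binary log is being written.
# while sleep 1; do echo "$(date +%T) dirty=$(dirty_kb) kB"; done
```

If the number climbs into the hundreds of MB while the disk is mostly idle and then collapses around a binary log rotation, that would match the pattern described above. With sync_binlog=100 in place it would be expected to stay small.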

Friday, May 15, 2009

Small features being pushed

So back in 2004 I created a small feature for mysqladmin allowing it to prompt for a password to set, rather than needing to specify it on the command line. It was a feature that came up in a training course I was teaching, so I wrote it up while at the hotel one day. I had filed this as Bug #5724.

The reason it got filed as a feature request bug is due to the fact that being a non-developer at MySQL AB meant I followed the normal community contributions process. I had several patches accepted via the community model (mysql prompt, some mysql readline enhancements, etc...) previous to this patch. However, for some reason around the time that this patch was submitted, patch acceptance slowed down.

Apparently the community acceptance process is actually working again, because today it got queued and pushed to a team tree which will be merged into MySQL 6.0. Only took 4.5 years to get in. Hopefully, my next patch will be included in a much shorter time now with the new process.

Note that I haven't done anything with this bug recently; jimw apparently found it himself or was prodded by someone to handle it.

Wednesday, May 13, 2009

Production example of Dtrace + MySQL (Part 1)

I had read a lot about using dtrace with MySQL, but recently had the chance to use it and thought it worked extremely well, so I wanted to share. Domas mentioned some of this in the presentation he did on dtrace + MySQL at the MySQL conference.

The basics of this are true; some things have been changed to help protect the identity of the customer.

I am going to spread this out over a few postings to show how dtrace can be used to incrementally find out things that aren't available otherwise.

The customer was reporting that queries were extremely fast, taking a few milliseconds normally, but sometimes they would get queries that took 500+ milliseconds. They wanted to know why and how this was occurring. They happened to be on Solaris 10, so dtrace to the rescue!

So the first task was to find out what the slow queries were. In MySQL 5.1, we could use the slow query log with millisecond precision, but naturally this was MySQL 5.0. The first script was designed just to capture the slow queries and see if they had any sort of pattern. The initial script looked like:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * how much memory to use maximum
 * if you start getting errors regarding dropped data from the script, then
 * you might get incorrect information
 */
#pragma D option dynvarsize=512m

/* how much of the query to copy, large takes up more memory */
#pragma D option strsize=256

/*
 * value in nanoseconds of when to log a query as slow and information
 * about it
 * 1000000 is 1 ms
 * 50000000 is 50 ms
 * 250000000 is 250 ms
 */
inline int long_query_time = 250000000;

BEGIN
{
    /* initialize our aggregates in case there aren't any */
    @total["Total queries"] = count();
    @total["Total slow queries"] = count();
    /* count() above bumps each key to 1; zero them so the totals are exact */
    clear(@total);
    self->in_query = 0;
    self->query = "";
    printf("Capturing any queries longer than %d milliseconds.\n",
        long_query_time / 1000000);
    printf("Please press Ctrl+C to exit when done...\n\n");
}

/*
 * catches the beginning of the execution of our query
 * (the probe functions here and below are assumptions for a 5.0 mysqld:
 * dispatch_command handles each client command, and mysql_parse receives
 * the query text as its second argument)
 */
pid$target::*dispatch_command*:entry
/ self->in_query==0 /
{
    @total["Total queries"] = count();
    self->start_time = timestamp;
    self->in_query = 1;
}

/*
 * Find our slow queries
 */
pid$target::*dispatch_command*:return
/ (timestamp - self->start_time) > long_query_time && self->in_query==1 /
{
    self->total_time = timestamp - self->start_time;

    @total["Total slow queries"] = count();
    printf("Date/time: %Y\n", walltimestamp);
    printf("Query: %s\n", self->query);
    printf("Query time: %d microseconds\n", self->total_time / 1000);
    printf("\n");
}

/*
 * executes for every query to reset them
 */
pid$target::*dispatch_command*:return
/ self->in_query == 1/
{
    self->in_query = 0;
    self->query = "";
    self->start_time = 0;
    self->total_time = 0;
}

/*
 * catch the query string
 */
pid$target::*mysql_parse*:entry
/ self->in_query == 1/
{
    self->query = copyinstr(arg1);
}

END
{
    printa("%s: %@d\n", @total);
}

This then results in output such as:

# ./test.d -p 27609
Capturing any queries longer than 250 milliseconds.
Please press Ctrl+C to exit when done...

Date/time: 2009 May 13 23:11:52
Query: select sleep(2)
Query time: 2004913 microseconds

Total queries: 1048
Total slow queries: 1

I did a few tests on the overhead to determine if it would slow down their system (they do several thousand requests a second), and to my surprise it had virtually no overhead once running. There was a bit of extra overhead when the script was first started and when it was ended, which caused a very slight hiccup in processing (< 1 second). It was decided this was okay, and we scheduled it to run during some lower load times so this hiccup wasn't a big deal.

This caught a few thousand slow queries over the course of an hour. Some were easy to discard due to the nature of the queries, but some others needed further investigation. The next post will discuss what the next steps were and will start getting data that we can't see in the slow query log.

Thursday, February 19, 2009

My Favorite New Feature of MySQL 5.1: Less InnoDB Locking

MySQL 5.1 has a lot of nice new features: partitioning, the event scheduler, dynamic logging, etc. Most of these require changing some application code or design logic to make use of them. However, there are a few new features that almost every user of InnoDB can take advantage of right away: the ability to set MySQL so that less locking is required for many statements in InnoDB. There are a few changes which make this possible.

The first one is the new auto_increment locking for InnoDB. There is a good article which talks about this here and a large section in the manual.

In MySQL 5.0 and previous, InnoDB used a special query-duration table level lock to acquire the AUTO-INC lock. This basically caused all inserts into the same table to serialize. With single row inserts it normally wasn't too bad but could prevent some scalability with multiple threads inserting. However, where it was quite bad was large data insertions, such as INSERT...SELECT and LOAD DATA INFILE. While these were running, all other insertions into the table would be prevented due to this table level lock.

MySQL 5.1 has changed this behavior. There is a new option innodb_autoinc_lock_mode which controls how InnoDB will handle this. The default is a value of 1 which works well if you are doing only inserts where MySQL knows the amount of rows being inserted. For example, a single row insert or a multi-row insert would be fine in this mode. In these cases, it will use a very short term mutex which will be released immediately after acquiring the needed values. It should remove all contention issues if you are using only these statements.

Statements where the number of rows being inserted is unknown, such as INSERT...SELECT and LOAD DATA INFILE, still use the AUTO-INC table level lock and will cause other single row statements to wait on it. To fix this you need to set innodb_autoinc_lock_mode=2. This will make each row in these long running statements get a new value, rather than holding the lock for the duration. The problem with this is that now the values generated won't be consecutive anymore. Other transactions could grab values to use in the middle of the sequence.

For some applications this might be a problem (but it shouldn't be if you are just using the auto_increment for uniqueness). The other issue comes from the binary log. In statement level binary logging, the auto_increment starting point is logged and then the statement. However, since the values are no longer consecutive, the slave can't recreate the appropriate values and replication will break. The solution to this also comes in 5.1 in the form of row level binary logging. You can enable row based binary logging for everything and it will then handle this without issue.

So to summarize, to pretty much completely remove auto_increment locking in InnoDB, you will need to set both innodb_autoinc_lock_mode=2 and binlog_format=row.

There is a second locking improvement in InnoDB related to row level binary logging. To enable this, you will need to use the READ COMMITTED isolation level. When you do this in 5.1, InnoDB reduces locking for DML statements a great deal: it no longer locks rows that were searched but not changed by the DML statement.

In older versions every row that was searched in a data changing statement would be locked. For example, take the following statement:
DELETE FROM tbl WHERE key < 1000 AND non_key = 500
In 5.0, assuming the key was used, every row less than 1000 would be exclusively locked, even if only a few were changed. In 5.1 and READ COMMITTED, only the rows that are actually changed will be locked now. As you can imagine, this potentially can reduce locking contention a great deal.

One final topic I want to mention is the innodb_locks_unsafe_for_binlog option. People have asked if they should enable this when binlog_format=row. It makes sense to ask, since with row level binary logging the restrictions for the binary log are gone. The answer is that this variable does not matter anymore in 5.1 if you set READ COMMITTED. In all cases where innodb_locks_unsafe_for_binlog used to reduce locking, READ COMMITTED now does as well.

From the manual:
This also means that if new items are added to the database, InnoDB does not guarantee serializability. Therefore, if innodb_locks_unsafe_for_binlog is enabled, InnoDB guarantees at most an isolation level of READ COMMITTED. (Conflict serializability is still guaranteed.)
This means that you also can not run higher than READ COMMITTED while using innodb_locks_unsafe_for_binlog, so there is no point in setting innodb_locks_unsafe_for_binlog.

So to summarize and get the best locking out of InnoDB in 5.1, you want to set the following options:

[mysqld]
innodb_autoinc_lock_mode=2
binlog_format=row
transaction-isolation=READ-COMMITTED

Keep in mind that row based binary logging does potentially have additional disk overhead compared to statement binary logging, so as always, please test these values before using them in production.

Wednesday, February 18, 2009

MySQL, Windows 32-bit, and /3GB

One question that comes up relatively commonly in support is how much memory MySQL can use on a given system. As you may know, MySQL runs in a single process with many threads. So the memory limit it has is based upon the operating system limits for a single process. Different operating systems have different limits, but the one I want to discuss is Windows 32-bit (such as 2000, XP, 2003, Vista).

The short answer is that your mysqld-nt.exe can use up to ~1.8G of RAM. The question is why. With a 32-bit address space, in theory there are 2^32 addresses (4G) available. However, Windows uses what is called a 2G/2G split of that addressable space: 2G is for the actual application to use and 2G is for the kernel mappings.

This 2G is the range of addresses the application's pointers can reference, which normally ends up with around 1.8G of memory really being usable. If you try to use more than this, then you end up getting Out of Memory errors from MySQL.

Now Windows 2000 and above has a way you can adjust this split. Not all versions of Windows within these families support this, so it depends upon what version you are using. The specific feature is called '4GT RAM Tuning' by Microsoft.

To adjust this, you will need to do a few steps, to enable it on Windows and then within MySQL.

First, you need to boot Windows with the /3GB flag given to it. You will need to edit the boot.ini, find the instance you are booting off of, and add that flag to it.

Second, you will need to enable it for any program that wants to use this split. MySQL does not ship with this done by default (it is normally a compile time option), so you will need to set it manually (or compile your own mysqld-nt.exe on Windows).

There is a program that comes with Visual Studio C++ called 'editbin'. I use the express version which is free. There is a /LARGEADDRESSAWARE flag you can pass to it to enable the option for your server. I am running this via the Visual Studio Command Prompt.

C:\Program Files\MySQL\MySQL Server 5.0\bin>editbin /LARGEADDRESSAWARE mysqld-nt.exe
Microsoft (R) COFF/PE Editor Version 8.00.50727.42
Copyright (C) Microsoft Corporation. All rights reserved.

To confirm that this works, you can use dumpbin /HEADERS; there is a line near the very top of the output which says whether it is enabled or not.
C:\Program Files\MySQL\MySQL Server 5.0\bin>dumpbin /headers mysqld-nt.exe
Microsoft (R) COFF/PE Dumper Version 8.00.50727.42
Copyright (C) Microsoft Corporation. All rights reserved.

Dump of file mysqld-nt.exe

PE signature found


14C machine (x86)
6 number of sections
487FA5AC time date stamp Thu Jul 17 16:03:56 2008
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
12F characteristics
Relocations stripped
Line numbers stripped
Symbols stripped
Application can handle large (>2GB) addresses
32 bit word machine

Notice the second to last line, "Application can handle large (>2GB) addresses". This means it is enabled for the binary. If the line is missing, then the binary does not have it enabled.

Once these two things are done you will now have a 3G address space for MySQL, and you should be able to go up to ~2.6G or so. You are still limited to smaller allocations (ie, you can't set innodb_buffer_pool_size=2.5G), but it can give you more memory for handling connections and per-connection buffers which can effectively allow you to increase the large global buffers.

For more details, I recommend reading the TechNet blog about the topic; it is quite a good explanation.

Thursday, February 5, 2009

My take on MySQL 5.1

So being that I am a support engineer with Sun/MySQL, I figured I would chime in with my opinion of MySQL 5.1 GA so far since I have been using it for a while. For history, I have been with MySQL AB/Sun since the 4.0 release cycle. This post does not represent any official views of my employer, Sun Microsystems, and all such disclaimers.

First off, I want to say that Sun assisted greatly with MySQL 5.1. I think it is a much better release than it would have been with MySQL AB doing the release. I do think this will continue in the future from everything I have experienced and seen so far. Sun knows how to make solid software.

So do I think that 5.1 was released too early? Not at all, based on my experience of previous MySQL releases. 5.1 is faring quite well, and I think everyone who has a reason to upgrade to it (such as less locking for InnoDB auto_increment, partitioning, dynamic logging, etc.) should certainly start the process of doing so.

MySQL 5.1 is in quite good shape compared to my experiences with MySQL 4.0 to 5.0. As a chart of my experiences:

Release | GA Version | My first recommended version

These versions aren't based on exact bugs fixed, but based on interactions with many customers using the versions and various issues being experienced. It has a bit of gut feeling involved too, so it certainly is not a pure scientific method for determining this.

Also by no means is MySQL 5.1 perfect. There have been some pretty big bugs found and some annoying ones too. However, overall it has been working pretty well for most users.

I still recommend that you test all new functionality to ensure it works for you, as there are many limitations. However, I would say the same about any major software upgrade with any database system; this isn't specific to MySQL or the 5.1 release.

So go ahead and give 5.1 a try-out if you have been on the fence. Have a look around and kick the tires and be sure to report any bugs you find.

Thursday, January 29, 2009

XA and Persistent InnoDB Locks

Recently a customer mentioned that they were occasionally seeing corruption when they copied InnoDB files using LVM to set up a new slave. Investigating further, it turned out that replication would start, but would then hit a lock wait timeout. This locking issue persisted across restarts, causing replication to always fail. They would solve this by taking a new LVM snapshot and setting the slave up again.

This is a classic case of an XA transaction in a prepared state persisting across restarts and the LVM sometimes taking a snapshot at the exact instant when this can occur.

Most people don't actually use XA transactions, so they may not consider this possibility. However, it can still occur even if you do not use them, because XA is used internally between the binary log and InnoDB. We will be able to identify that case later.

You can tell this is occurring due to a mention of prepared transactions in the error log:
090128 11:09:09 [Note] Starting crash recovery...
090128 11:09:09 InnoDB: Starting recovery for XA transactions...
090128 11:09:09 InnoDB: Transaction 0 18707 in prepared state after recovery
090128 11:09:09 InnoDB: Transaction contains changes to 1 rows
090128 11:09:09 InnoDB: 1 transactions in prepared state after recovery
090128 11:09:09 [Note] Found 1 prepared transaction(s) in InnoDB
090128 11:09:09 [Warning] Found 1 prepared XA transactions

In addition, you will see a transaction listed in PREPARE without any MySQL or OS thread id in the output of SHOW INNODB STATUS:
---TRANSACTION 0 18707, ACTIVE (PREPARED) 11 sec, OS thread id 0
, undo log entries 1

So how do you solve this when it occurs? The key is using the XA commands which allow you to manipulate XA transactions in the prepared state.

The first command is XA RECOVER, which will show you a list of the possible transactions to handle. The data column has the XID of the transaction to roll back. If the XID starts with MySQLXid then it is an internally generated XID. The XID can contain characters that are unprintable on your terminal, so keep in mind that it might be a bit different than it initially looks.
mysql> xa recover;
+----------+--------------+--------------+------+
| formatID | gtrid_length | bqual_length | data |
+----------+--------------+--------------+------+
|        1 |            3 |            0 | foo  |
+----------+--------------+--------------+------+
1 row in set (0.00 sec)

After that, you can then use XA ROLLBACK to remove the transaction or XA COMMIT to save the change. Both of these require that you specify the XID that you got from the previous one.

I wrote a one-line shell script which might work when you have a single XID to commit. I haven't tested it very thoroughly, so don't get mad at me if it doesn't work in all cases.
echo xa commit \'`mysql --skip-column-names -e 'xa recover' | awk '{ print $4 }'`\' | mysql
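For more than one prepared transaction, or for an XID that has a branch qualifier, the same idea can be stretched a little. This is only a sketch under assumptions: xa_statements is a made-up helper name, it expects the tab-separated rows that the mysql client prints for XA RECOVER when its output is piped, and it only copes with printable XIDs that contain no quotes or tabs. It splits the data column using gtrid_length and bqual_length and prints one statement per row.

```shell
# Turn tab-separated "xa recover" rows read from stdin into XA statements.
#   $1 = COMMIT or ROLLBACK
# Each input row is: formatID, gtrid_length, bqual_length, data
xa_statements() {
    verb=$1
    while IFS="$(printf '\t')" read -r format_id gtrid_len bqual_len data; do
        [ -n "$data" ] || continue
        # The data column is gtrid followed directly by bqual.
        gtrid=$(printf '%s' "$data" | cut -c "1-${gtrid_len}")
        if [ "$bqual_len" -gt 0 ]; then
            start=$((gtrid_len + 1))
            end=$((gtrid_len + bqual_len))
            bqual=$(printf '%s' "$data" | cut -c "${start}-${end}")
        else
            bqual=""
        fi
        printf "XA %s '%s','%s',%s;\n" "$verb" "$gtrid" "$bqual" "$format_id"
    done
}
```

It could then be used along the lines of mysql --skip-column-names -e 'xa recover' | xa_statements ROLLBACK, reviewing the printed statements before feeding them back into mysql. Internally generated MySQLXid values contain binary data, so they would most likely need to be handled by hand.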

Friday, January 16, 2009

Enabling InnoDB Large Pages on Linux

In MySQL 5.0, InnoDB gained the ability to use Linux Large Page support for allocating memory for the buffer pool and additional memory pool.

A few customers have asked about using it and there is virtually no documentation on what is required on Linux to enable it. I actually ended up having to read some of the Linux kernel source code to figure out some of this.

This uses the API as documented at:

To set this up and use it, you first need a kernel that supports it. All of the recent RHEL kernels do by default from what I can tell. On my Ubuntu systems, I'm not seeing it enabled normally.

Then on the OS level you will need to do the following procedures:

# Set the number of pages to be used
# Each page is normally 2MB, so this would be 40 MB
# This actually allocates memory, so it requires that much memory to be available
echo 20 > /proc/sys/vm/nr_hugepages

# Set the group number that is allowed to access this memory
# The mysql user must be a member of this group
echo 102 > /proc/sys/vm/hugetlb_shm_group

# Increase the amount of shmem allowed per segment
# 256MB in this case
echo 268435456 > /proc/sys/kernel/shmmax

# Increase total amount of shared memory
# This is 4KB pages, ie. 16GB below
echo 4194304 > /proc/sys/kernel/shmall

For MySQL usage, you would normally want the shmmax to be close to the shmall.

You would normally want to put these into an rc file or similar to do it at every boot sequence (early in the boot sequence normally, prior to MySQL starting).
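The nr_hugepages value has to cover whatever you want backed by large pages, so it helps to work it out from the size in bytes instead of guessing. A minimal sketch, assuming the huge page size is either passed in or read from the Hugepagesize line of /proc/meminfo; hugepages_needed is a made-up helper name:

```shell
# Number of huge pages needed to cover a given number of bytes, rounded up.
#   $1 = bytes to cover (for example the InnoDB buffer pool size)
#   $2 = huge page size in kB (optional; read from /proc/meminfo if omitted)
hugepages_needed() {
    bytes=$1
    page_kb=${2:-$(awk '/^Hugepagesize:/ { print $2 }' /proc/meminfo)}
    page_bytes=$((page_kb * 1024))
    echo $(( (bytes + page_bytes - 1) / page_bytes ))
}

hugepages_needed 41943040 2048      # 40 MB with 2 MB pages, prints 20
hugepages_needed 1073741824 2048    # 1 GB with 2 MB pages, prints 512
```

Since InnoDB also uses large pages for the additional memory pool, it is probably worth rounding the result up a bit. The four /proc/sys files above correspond to the sysctl names vm.nr_hugepages, vm.hugetlb_shm_group, kernel.shmmax and kernel.shmall, so /etc/sysctl.conf is another place the settings could live.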

To verify that it worked, run:
cat /proc/meminfo | grep -i huge
The final step: in order to make use of the hugetlb_shm_group, you need to give the mysql user an 'unlimited' value for the memlock limit. This can be done either by editing /etc/security/limits.conf or by adding the following to your mysqld_safe:
ulimit -l unlimited
This will cause the root user to set it to unlimited before switching to the mysql user.

Finally, you will want to add the large-pages option to the [mysqld] section of your my.cnf:
[mysqld]
large-pages
With this option, InnoDB will use large pages automatically for the two memory pools. If it cannot, it will fall back to traditional memory and output a warning to the error log.

You can verify it is being used by looking at:
cat /proc/meminfo | grep -i huge

Solaris also has the ability to use large pages (of different sizes as well), but MySQL doesn't support that yet.