Monday, May 17, 2010

Poor Man's Profiler using Solaris' pstack

Recently I was working with pstack output from a hung MySQL server and wanted to use Poor Man's Profiler (PMP) to aggregate the stack traces. Unfortunately, the awk magic expects output in the format produced by gdb's thread apply all bt command.

gdb output:

Thread 10 (Thread 0xa644db90 (LWP 26275)):
#0 0xb7f47410 in __kernel_vsyscall ()
#1 0xb7f33b1a in do_sigwait () from /lib/tls/i686/cmov/libpthread.so.0
#2 0xb7f33bbf in sigwait () from /lib/tls/i686/cmov/libpthread.so.0
#3 0x081cc4fc in signal_hand ()
#4 0xb7f2b4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#5 0xb7d25e5e in clone () from /lib/tls/i686/cmov/libc.so.6

Thread 9 (Thread 0xa641cb90 (LWP 26273)):
#0 0xb7f47410 in __kernel_vsyscall ()
#1 0xb7d1e881 in select () from /lib/tls/i686/cmov/libc.so.6
#2 0x081d1190 in handle_connections_sockets ()
#3 0x081d1ef3 in main ()
...

pstack output:

----------------- lwp# 56 / thread# 56 --------------------
fffffd7ffed7bb7a sigtimedwait (fffffd7ffe3aee10, fffffd7ffe3aee20, 0)
fffffd7ffed6aced sigwait () + d
fffffd7ffed62740 __posix_sigwait () + 40
0000000000712dcd signal_hand () + 12d
fffffd7ffed7704b _thr_setup () + 5b
fffffd7ffed77280 _lwp_start ()
----------------- lwp# 53 / thread# 53 --------------------
fffffd7ffed7c6ca pollsys (fffffd732fe31dd0, 0, fffffd732fe31e60, 0)
fffffd7ffed234c4 pselect () + 154
fffffd7ffed23792 select () + 72
fffffd7ffe866008 os_thread_sleep () + 50
fffffd7ffea6dafb srv_lock_timeout_and_monitor_thread () + 1b3
fffffd7ffed7704b _thr_setup () + 5b
fffffd7ffed77280 _lwp_start ()
...

However, we can see how powerful and flexible PMP is: with a small change to the awk script, poof, we get the same output:


awk '
BEGIN { s = ""; }
/thread#/ { print s; s = ""; }
/^ [0-9a-f]/ { if (s != "" ) { s = s "," $2} else { s = $2 } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1
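
For reference, here is a minimal sketch of how I drive it, assuming the awk program above (without the trailing sort pipeline) is saved to a file called pmp.awk and that 12345 stands in for the mysqld process ID; both names are just placeholders:

# aggregate stacks from a running mysqld
pstack 12345 | awk -f pmp.awk | sort | uniq -c | sort -r -n -k 1,1

# or aggregate a pstack dump that was saved to a file earlier
awk -f pmp.awk mysqld.stacks | sort | uniq -c | sort -r -n -k 1,1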


Yet another win for PMP.

Tuesday, March 2, 2010

Semi-sync Replication Testing

I have recently been trying out semisynchronous replication. This is a new feature added in MySQL 5.5, based on the original Google patch.

Installing it was very simple and has been covered elsewhere in detail, so I won't repeat the details here.
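
For reference, the setup boils down to roughly the following (a sketch from memory; the plugin file names assume a Unix build, and the timeout value is only an example):

-- on the master
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- milliseconds to wait for a slave ack

-- on the slave
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;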

While testing, I was a bit surprised by some behavior I saw that turned out to be correct. What I wanted to examine was what semi-synchronous replication actually does and what the use cases for it are.

The manual defines this feature correctly in very careful language:
If semisynchronous replication is enabled on the master side and there is at least one semisynchronous slave, a thread that performs a transaction commit on the master blocks after the commit is done and waits until at least one semisynchronous slave acknowledges that it has received all events for the transaction, or until a timeout occurs.


There is a subtle difference from how this is described in other places, for example in Giuseppe's blog (not picking on him, it's just an example since I have seen it in many places):
That is, before committing, the master waits until at least one slave has acknowledged that it has received the portion of binary log necessary to reproduce the transaction.


The difference is that the blocking and the relay to the remote slave occur after the transaction is actually committed to disk.

This means that if the master crashes, it is possible that it has committed transactions on disk that are not on the slave. However, the client will get back an error indicating that the commit did not complete.

What is the use case for this then?

For failover purposes, this is generally exactly what you need. The client knows the commit failed and can redo the transaction on the slave. There is no data loss and everything is nicely guaranteed.

What doesn't it help with?

Where it isn't useful is for recovery after the crash. When you finally get your master restarted, it may already have some transactions that were later replayed on the slave. This will naturally cause replication to break and things to fail.

Thursday, January 14, 2010

State of MySQL Search

Recently I took part in the first meeting of the MySQL Search Team. You can read more about the team here.

My task was to represent customer interests regarding fulltext search. My report made a few main points:

  • MySQL Fulltext Search (FTS) has several big issues compared to other solutions. I have listed these in the order of importance as I see it. Numbers 1 and 2 are very close in importance, however, so I can see those going either way.

    1. MyISAM only: This limits its usefulness in many very busy environments due to table-level locking and the lack of crash safety.

    2. Performance: The design of FTS makes performance suffer once the index size exceeds RAM. Combining multi-second queries with MyISAM table-level locking is a recipe for failure.

    3. Features: There are a lot of features lacking in FTS that exist in other products. Stemming, synonyms, dictionary usage, CJK support, etc... are all potential avenues of investigation.

  • Third-party solutions are commonly used for searching data stored in MySQL, but they do require extra work to deploy and implement. Tools like Lucene, Sphinx, etc... are very good at what they do, and are normally faster and more fully featured than MySQL FTS. The big issue is that implementing and maintaining them takes some effort, so their ease of use is lower than with MySQL FTS.



So the forward-looking question is how to combine the best attributes of these two approaches into one robust solution that is easy to deploy and use. Personally, I can see a few possible routes forward:
  • Improve MySQL FTS! There is a lot of existing research and many known best practices for search. This route would leverage those to make the MySQL solution more robust and featureful. Given the current state of FTS, it would be a fair amount of work, but it could have good returns.
  • Improve the glue code between the third parties and MySQL. This could take a variety of forms. One idea I had was in the form of index plugins: as rows are inserted/updated/deleted, the external datastores are automatically updated as well. This is very similar to triggers, but is a bit different since there can be more integration and easier deployment.
It's possible the best solution is a mix of the two options, to allow for more flexibility, or even something completely new that I haven't thought of!

Regardless, I am happy that people at MySQL are finally taking a look at the poor state of the MySQL fulltext search environment and trying to figure out good new solutions for this space.

Wednesday, December 2, 2009

Ext4 with MySQL binary logs oddity

I was working with a customer recently that kept seeing 10-12 second hangs in MySQL. Everything would be working well and suddenly all data changing statements would stop working.

Once the 10-12 seconds passed, the system would recover and everything would be fine for a while. This would normally repeat every 30-45 minutes, and would occur more frequently when the system was doing heavy ETL-type activity (data loads, big updates, etc...), with those hangs generally being longer and worse.

One of the steps taken while investigating was to check vmstat/iostat. From this we noticed that there was very intense disk activity during the spike. I suspected something wrong with InnoDB, such as the famous purge hiccup or something with log file checkpointing, etc... However, we then found out that the activity was on the disk system where the binary logs were stored, not the InnoDB files.

While continuing to investigate, we noticed that the hangs exactly matched up with the times that binary log rotations were occurring. Using strace, we were able to see that the fdatasync() call that accompanied the rotation was taking a very long time (the majority of the delay).
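
For the curious, the strace invocation was along these lines (a rough sketch; 12345 is a placeholder for the mysqld PID, and we only attached briefly since strace itself adds overhead):

# -f follows all mysqld threads, -T prints the time spent inside each call
strace -f -T -e trace=fsync,fdatasync -p 12345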

Finally we were able to pin it down to ext4 and how it delays data writes for a very long time (30 minutes). Compare this to ext3, which flushes things every 5 seconds or so. I am told that ext4 recently changed this behavior (this was on 2.6.30), so hopefully it won't hit more people.

What was happening was as follows:

  • The binary log gets written to over time

  • ext4, in an attempt to increase performance, does not write the data to disk

  • ext4 continues not writing the data to disk, even as hundreds of MB of binary log sit in memory and the disk is mostly idle

  • The binary log gets full (1024MB) and rotates while holding the log mutex

  • The rotation calls fdatasync() prior to closing the file

  • ext4 now has to write all the data out and takes 10-12 seconds to do so

  • The log mutex prevents any commits while this write happens

  • The write finishes, the log rotates, and things repeat



To alleviate this, we ended up setting sync_binlog=100. This forces the binary log to be fsynced after every 100 writes to it, and hence does not allow ext4 to wait forever before writing the data out to the file. There is also the commit mount option for ext4, which should give similar benefits by forcing it to write more often, but I didn't test it.
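
For reference, the change amounts to something like the following; the device and mount point in the fstab line are hypothetical, and as mentioned I did not test the commit option myself:

# my.cnf: fsync the binary log periodically instead of only at rotation
[mysqld]
sync_binlog = 100

# /etc/fstab: alternatively, ask ext4 to flush dirty data every 5 seconds (untested)
/dev/sdb1   /var/log/mysql   ext4   defaults,commit=5   0 0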

Friday, May 15, 2009

Small features being pushed

So back in 2004 I created a small feature for mysqladmin allowing it to prompt for a password to set, rather than needing to specify it on the command line. It was a feature that came up in a training course I was teaching, so I wrote it up while at the hotel one day. I had filed this as Bug #5724.

The reason it got filed as a feature-request bug is that, being a non-developer at MySQL AB, I followed the normal community contributions process. I had several patches accepted via the community model (the mysql prompt, some mysql readline enhancements, etc...) prior to this patch. However, for some reason around the time that this patch was submitted, patch acceptance slowed down.

Apparently the community acceptance process is actually working again, because today the patch got queued and pushed to a team tree which will be merged into MySQL 6.0. It only took 4.5 years to get in. Hopefully, my next patch will be included much faster now with the new process.

Note that I haven't done anything with this bug recently; jimw apparently found it himself or was prodded by someone to handle it.

Wednesday, May 13, 2009

Production example of Dtrace + MySQL (Part 1)

I had read a lot about using dtrace with MySQL, but recently had the chance to use it and thought it worked extremely well, so I wanted to share. Domas mentioned some of this in the presentation he did on dtrace + MySQL at the MySQL conference.

The basics of this are true; some things have been changed to help protect the identity of the customer.

I am going to spread this out over a few postings to show how dtrace can be used to incrementally find out things that aren't available otherwise.


The customer was reporting that queries were extremely fast, taking a few milliseconds normally, but sometimes they would get queries that took 500+ milliseconds. They wanted to know why and how this was occurring. They happened to be on Solaris 10, so dtrace to the rescue!

So the first task was to find out what the slow queries were. In MySQL 5.1, we could have used the slow query log with millisecond precision, but naturally this was MySQL 5.0. The first script was designed just to capture the slow queries and see if they followed any sort of pattern. The initial script looked like:


#!/usr/sbin/dtrace -s

#pragma D option quiet

/* how much memory to use maximum
* if you start getting errors regarding dropped data from the script, then
* you might get incorrect information
*/
#pragma D option dynvarsize=512m


/* how much of the query to copy, large takes up more memory */
#pragma D option strsize=256

/*
value in nanoseconds of when to log a query as slow and information
about it
1000000 is 1 ms
50000000 is 50 ms
250000000 is 250 ms
*/
inline int long_query_time = 250000000;

BEGIN
{
/* initialize our aggregates in case there aren't any */
@total["Total queries"] = count();
@total["Total slow queries"] = count();
clear(@total);
self->in_query = 0;
self->query = "";
printf("Capturing any queries longer than %d milliseconds.\n", long_query_time /
1000000);
printf("Please press Ctrl+C to exit when done...\n\n");
}

/*
* catches the beginning of the execution of our query
*/

pid$target::*dispatch_command*:entry
/ self->in_query==0 /
{
@total["Total queries"] = count();
self->start_time = timestamp;
self->in_query=1;
}

/*
* Find our slow queries
*/

pid$target::*dispatch_command*:return
/ (timestamp - self->start_time) > long_query_time && self->in_query==1 /
{
self->total_time = timestamp - self->start_time;

@total["Total slow queries"] = count();
printf("Date/time: %Y\n", walltimestamp);
printf("Query: %s\n",self->query);
printf("Query time: %d microseconds\n", self->total_time / 1000);

printf("\n");
}

/*
* executes for every query to reset them
*/

pid$target::*dispatch_command*:return
/ self->in_query == 1/
{
self->query="";
self->in_query=0;
self->start_time=0;
self->total_time=0;
}

/*
* catch the query string
*/

pid$target::*mysql_parse*:entry
/ self->in_query == 1/
{
self->query = copyinstr(arg1);
}


END
{
printa("%s: %@d\n", @total);
}


This then results in output such as:


# ./test.d -p 27609
Capturing any queries longer than 250 milliseconds.
Please press Ctrl+C to exit when done...

Date/time: 2009 May 13 23:11:52
Query: select sleep(2)
Query time: 2004913 microseconds

^C
Total queries: 1048
Total slow queries: 1


I did a few tests on the overhead to determine if it would slow down their system (they do several thousand requests a second), and to my surprise it had virtually no overhead once running. There was a bit of extra overhead when the script was first started and when it ended, which caused a very slight hiccup in processing (< 1 second). It was decided this was okay, and we scheduled it to run during lower-load times so the hiccup wasn't a big deal.

This caught a few thousand slow queries over the course of an hour. Some were easy to discard due to the nature of the queries, but some others needed further investigation. The next post will discuss what the next steps were and will start getting data that we can't see in the slow query log.

Thursday, February 19, 2009

My Favorite New Feature of MySQL 5.1: Less InnoDB Locking

MySQL 5.1 has a lot of nice new features: partitioning, the event scheduler, dynamic logging, etc... Most of these require changing some application code or design logic to make use of them. However, there is a set of new features that almost every user of InnoDB can take advantage of right away: the ability to configure MySQL so that less locking is required for many statements in InnoDB. There are a few changes that make this possible.

The first one is the new auto_increment locking for InnoDB. There is a good article that talks about this here, and a large section in the manual.

In MySQL 5.0 and earlier, InnoDB used a special table-level lock, the AUTO-INC lock, held for the duration of the statement. This basically caused all inserts into the same table to serialize. With single-row inserts it normally wasn't too bad, but it could limit scalability with multiple threads inserting. Where it was quite bad, however, was large data insertions, such as INSERT...SELECT and LOAD DATA INFILE. While these were running, all other inserts into the table would be blocked by this table-level lock.

MySQL 5.1 has changed this behavior. There is a new option, innodb_autoinc_lock_mode, which controls how InnoDB handles this. The default value of 1 works well if you are only doing inserts where MySQL knows the number of rows being inserted ahead of time. For example, a single-row insert or a multi-row insert with explicit values would be fine in this mode. In these cases, InnoDB uses a very short-lived mutex which is released immediately after the needed values are acquired. This should remove all contention issues if you are using only these statements.

Statements where the number of rows being inserted is unknown, such as INSERT...SELECT and LOAD DATA INFILE, still use the table-level AUTO-INC lock in this mode and will cause other single-row statements to wait on it. To fix this you need to set innodb_autoinc_lock_mode=2. This makes each row in these long-running statements get a new value as it is inserted, rather than holding the lock for the duration of the statement. The problem is that the values generated are no longer guaranteed to be consecutive; other transactions can grab values in the middle of the sequence.
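
To make the distinction concrete, here is a hypothetical example (the table and column names are made up):

-- "simple" insert: the number of rows is known up front, so with
-- innodb_autoinc_lock_mode=1 only a brief mutex is taken
INSERT INTO orders (customer_id) VALUES (1), (2), (3);

-- "bulk" insert: the row count is unknown, so mode 1 holds the table-level
-- AUTO-INC lock for the whole statement; mode 2 does not, at the cost of
-- non-consecutive auto_increment values
INSERT INTO orders (customer_id) SELECT customer_id FROM staging_orders;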

For some applications this might be a problem (but it shouldn't be if you are just using the auto_increment for uniqueness). The other issue comes from the binary log. With statement-level binary logging, the auto_increment starting point is logged and then the statement. However, since the values are no longer consecutive, the slave can't recreate the appropriate values and replication will break. The solution to this also comes in 5.1, in the form of row-based binary logging. You can enable row-based binary logging and it will then handle this without issue.

So to summarize, to pretty much completely remove auto_increment locking in InnoDB, you will need to set both innodb_autoinc_lock_mode=2 and binlog_format=row.


There is a second locking improvement in InnoDB related to row-based binary logging. To enable this, you will need to use the READ COMMITTED isolation level. When you do this in 5.1, InnoDB reduces locking for DML statements a great deal: it will not hold locks on rows that were searched but not changed by the DML statement.

In older versions every row that was searched in a data changing statement would be locked. For example, take the following statement:
DELETE FROM tbl WHERE key < 1000 AND non_key = 500
In 5.0, assuming the key was used, every row with key less than 1000 would be exclusively locked, even if only a few were changed. In 5.1 with READ COMMITTED, only the rows that are actually changed will be locked. As you can imagine, this can potentially reduce locking contention a great deal.
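
As a quick illustration, the reduced locking only kicks in when the session (or server) actually runs at READ COMMITTED, for example:

-- same statement as above, run at READ COMMITTED (with binlog_format=row) so
-- rows matching `key` < 1000 but not non_key = 500 are not left locked
-- (`key` is backquoted here since it is a reserved word)
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
DELETE FROM tbl WHERE `key` < 1000 AND non_key = 500;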


One final topic I want to mention is the innodb_locks_unsafe_for_binlog option. People have asked if they should enable this when binlog_format=row. It would seem to make sense, since with row-based binary logging the restrictions for the binary log are gone. The answer is that this variable does not matter anymore in 5.1 if you use READ COMMITTED. In all cases where innodb_locks_unsafe_for_binlog used to reduce locking, READ COMMITTED now does as well.

From the manual:
This also means that if new items are added to the database, InnoDB does not guarantee serializability. Therefore, if innodb_locks_unsafe_for_binlog is enabled, InnoDB guarantees at most an isolation level of READ COMMITTED. (Conflict serializability is still guaranteed.)
This means that you also cannot run at an isolation level higher than READ COMMITTED while using innodb_locks_unsafe_for_binlog, so there is no point in setting it.


So to summarize and get the best locking out of InnoDB in 5.1, you want to set the following options:
[mysqld]
innodb_autoinc_lock_mode=2
binlog_format=row
tx_isolation=READ-COMMITTED



Keep in mind that row-based binary logging can have additional disk overhead compared to statement-based binary logging, so as always, please test these values before using them in production.