Wednesday, May 13, 2009

Production example of Dtrace + MySQL (Part 1)

I had read a lot about using dtrace with MySQL, but recently had the chance to use it and thought it worked extremely well, so I wanted to share. Domas mentioned some of this in the presentation he did on dtrace + MySQL at the MySQL conference.

The basics of this are true, somethings have been changed to help protect the identity of the customer.

I am going to spread this out over a few postings to show how dtrace can be used to incrementally find out things that aren't available otherwise.


The customer was reporting that queries were extremely fast, taking a few milliseconds normally, but sometimes they would get queries that took 500+ milliseconds. They wanted to know why and how this was occurring. They happened to be on Solaris 10, so dtrace to the rescue!

So the first task was to find out what were the slow queries. In MySQL 5.1, we could use the slow query log with millisecond precision, but naturally this was MySQL 5.0. The first script was designed just to capture the slow queries and see if they had any sort of pattern. The initial script looked like:


#!/usr/sbin/dtrace -s

#pragma D option quiet

/* how much memory to use maximum
* if you start getting errors regarding dropped data from the script, then
* you might get incorrect information
*/
#pragma D option dynvarsize=512m


/* how much of the query to copy, large takes up more memory */
#pragma D option strsize=256

/*
value in nanoseconds of when to log a query as slow and information
about it
1000000 is 1 ms
50000000 is 50 ms
250000000 is 250 ms
*/
inline int long_query_time = 250000000;

BEGIN
{
/* initialize our aggregates in case there aren't any */
@total["Total queries"] = count();
@total["Total slow queries"] = count();
clear(@total);
self->in_query = 0;
self->query = "";
printf("Capturing any queries longer than %d milliseconds.\n", long_query_time /
1000000);
printf("Please press Ctrl+C to exit when done...\n\n");
}

/*
* catches the beginning of the execution of our query
*/

pid$target::*dispatch_command*:entry
/ self->in_query==0 /
{
@total["Total queries"] = count();
self->start_time = timestamp;
self->in_query=1;
}

/*
* Find our slow queries
*/

pid$target::*dispatch_command*:return
/ (timestamp - self->start_time) > long_query_time && self->in_query==1 /
{
self->total_time = timestamp - self->start_time;

@total["Total slow queries"] = count();
printf("Date/time: %Y\n", walltimestamp);
printf("Query: %s\n",self->query);
printf("Query time: %d microseconds\n", self->total_time / 1000);

printf("\n");
}

/*
* executes for every query to reset them
*/

pid$target::*dispatch_command*:return
/ self->in_query == 1/
{
self->query="";
self->in_query=0;
self->start_time=0;
self->total_time=0;
}

/*
* catch the query string
*/

pid$target::*mysql_parse*:entry
/ self->in_query == 1/
{
self->query = copyinstr(arg1);
}


END
{
printa("%s: %@d\n", @total);
}


This then results in output such as:


# ./test.d -p 27609
Capturing any queries longer than 250 milliseconds.
Please press Ctrl+C to exit when done...

Date/time: 2009 May 13 23:11:52
Query: select sleep(2)
Query time: 2004913 microseconds

^C
Total queries: 1048
Total slow queries: 1


I did a few tests on the overhead to determine if it would slow down there system (they do several thousand requests a second), and to my surprise it had virtually no overhead once running. There was a bit of extra overhead when the script was first initiated and when ending it and it would cause a very slight hiccup in processing (< 1 second). It was decided this was okay, and we scheduled it to run during some lower load times so this hiccup wasn't a big deal.

This caught a few thousand slow queries over the course of an hour. Some were easy to discard due to the nature of the queries, but some others needed further investigation. The next post will discuss what the next steps were and will start getting data that we can't see in the slow query log.

No comments:

Post a Comment