tag:blogger.com,1999:blog-14481434987163117832024-03-13T09:08:40.327-07:00MySQL ThoughtsHarrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.comBlogger11125tag:blogger.com,1999:blog-1448143498716311783.post-49784524432408928722010-05-17T06:37:00.000-07:002010-05-17T06:51:58.153-07:00Poor Man's Profiler using Solaris' pstackRecently I was working with the output of <a href="http://docs.sun.com/app/docs/doc/816-0210/6m6nb7mih?a=view">pstack</a> from a hung MySQL server and wanted to use <a href="http://poormansprofiler.org/">Poor Man's Profiler</a> in order to combine stack traces. Unfortunately, the awk magic expects the output from gdb's thread apply all bt output. <br /><br />gdb output:<br /><pre><br />Thread 10 (Thread 0xa644db90 (LWP 26275)):<br />#0 0xb7f47410 in __kernel_vsyscall ()<br />#1 0xb7f33b1a in do_sigwait () from /lib/tls/i686/cmov/libpthread.so.0<br />#2 0xb7f33bbf in sigwait () from /lib/tls/i686/cmov/libpthread.so.0<br />#3 0x081cc4fc in signal_hand ()<br />#4 0xb7f2b4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0<br />#5 0xb7d25e5e in clone () from /lib/tls/i686/cmov/libc.so.6<br /><br />Thread 9 (Thread 0xa641cb90 (LWP 26273)):<br />#0 0xb7f47410 in __kernel_vsyscall ()<br />#1 0xb7d1e881 in select () from /lib/tls/i686/cmov/libc.so.6<br />#2 0x081d1190 in handle_connections_sockets ()<br />#3 0x081d1ef3 in main ()<br />...<br /></pre><br />pstack output:<br /><pre><br />----------------- lwp# 56 / thread# 56 --------------------<br /> fffffd7ffed7bb7a sigtimedwait (fffffd7ffe3aee10, fffffd7ffe3aee20, 0)<br /> fffffd7ffed6aced sigwait () + d<br /> fffffd7ffed62740 __posix_sigwait () + 40<br /> 0000000000712dcd signal_hand () + 12d<br /> fffffd7ffed7704b _thr_setup () + 5b<br /> fffffd7ffed77280 _lwp_start ()<br />----------------- lwp# 53 / thread# 53 --------------------<br /> fffffd7ffed7c6ca pollsys (fffffd732fe31dd0, 0, fffffd732fe31e60, 0)<br /> fffffd7ffed234c4 pselect () + 154<br /> fffffd7ffed23792 
select () + 72<br /> fffffd7ffe866008 os_thread_sleep () + 50<br /> fffffd7ffea6dafb srv_lock_timeout_and_monitor_thread () + 1b3<br /> fffffd7ffed7704b _thr_setup () + 5b<br /> fffffd7ffed77280 _lwp_start ()<br />...<br /></pre><br />However, we can see how powerful and flexible PMP is: by making a small change to the awk script and, poof, we have the same output:<br /><br /><pre><br /> awk '<br /> BEGIN { s = ""; } <br /> /thread#/ { print s; s = ""; } <br /> /^ [0-9a-f]/ { if (s != "" ) { s = s "," $2} else { s = $2 } } <br /> END { print s }' | sort | uniq -c | sort -r -n -k 1,1<br /></pre><br /><br />Yet another win for PMP.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com3tag:blogger.com,1999:blog-1448143498716311783.post-68503072180629680592010-03-02T13:02:00.000-08:002010-03-02T14:06:24.648-08:00Semi-sync Replication TestingI have recently been trying out <a href="http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html">semisynchronous replication</a>. This is a new feature added in MySQL 5.5 based on the <a href="http://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplication">original Google patch</a>.<br /><br />Installing was very simple and has been covered <a href="http://datacharmer.blogspot.com/2009/12/getting-started-with-mysql-55.html">elsewhere</a> in detail. Suffice it to say it was relatively easy to do.<br /><br />While testing, I was a bit surprised by some behavior that turned out to be correct. What I wanted to examine was what semi-synchronous replication actually does and the use cases for it. 
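<br /><br />As a sketch of that setup, assuming the plugins were built and are in the server's plugin directory (the timeout value here is an illustrative assumption, not a recommendation), the steps on the master and slave look roughly like this:

```sql
-- On the master: load and enable the semi-sync master plugin.
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- Milliseconds to wait for a slave ack before falling back to asynchronous.
SET GLOBAL rpl_semi_sync_master_timeout = 1000;

-- On each slave: load the slave-side plugin, then restart the I/O thread
-- so it reconnects and negotiates semi-sync with the master.
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;

-- Back on the master: verify it is actually active.
SHOW STATUS LIKE 'Rpl_semi_sync_master_status';
```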
<br /><br />The manual defines this feature correctly in very careful language:<br /><blockquote>If semisynchronous replication is enabled on the master side and there is at least one semisynchronous slave, a thread that performs a transaction commit on the master blocks after the commit is done and waits until at least one semisynchronous slave acknowledges that it has received all events for the transaction, or until a timeout occurs.</blockquote><br /><br />There is a subtle difference in how this is described in other places, for example from Giuseppe's blog (not picking on him, just an example since I have seen it in many places):<br /><blockquote>That is, before committing, the master waits until at least one slave has acknowledged that it has received the portion of binary log necessary to reproduce the transaction.</blockquote><br /><br />The difference is that the blocking and the relay to the remote slave occur <span style="font-weight:bold;">after</span> the transaction is actually committed to disk.<br /><br />This means that if the master crashes, it may have committed transactions on disk that are not on the slave. However, the client will get back an error saying that the commit failed.<br /><br />What is the use case for this then?<br /><br />For failover purposes, this is generally exactly what you need. The client can know the commit failed and can redo the transaction on the slave. There is no data loss and everything is nicely guaranteed.<br /><br />What doesn't it help with?<br /><br />Where it isn't useful is for recovery after the crash. When you finally get your master restarted, it may already have some transactions that were later replayed on the slave. 
This will naturally cause replication to break and things to fail.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com2tag:blogger.com,1999:blog-1448143498716311783.post-79222317155859385932010-01-14T10:35:00.000-08:002010-01-14T17:11:36.257-08:00State of MySQL SearchRecently I took part in the first meeting of the <a href="http://www.facebook.com/#/group.php?gid=269730273824">MySQL Search Team</a>. You can read more about the team <a href="http://intuitive-search.blogspot.com/2010/01/mysql-search-team-has-launched.html">here</a>.<br /><br />My task was to represent the customer interests regarding fulltext search. My report made a few main points:<br /><br /><ul><li>MySQL Fulltext Search (FTS) has several big issues compared to other solutions. I have listed these in the order of importance as I see it. Numbers 1 and 2 are very close in importance however, so I can see those going either way. <br /><br /><ol><li>MyISAM only: Limits the usefulness in many very busy environments due to table level locking and the lack of crash safety</li><br /><li>Performance: The <a href="http://forge.mysql.com/wiki/MySQL_Internals_Algorithms#Full-text_Search">design</a> of FTS makes performance hurt as the index size exceeds RAM. Combining multi-second queries with MyISAM table level locking is a recipe for failure.</li><br /><li>Features: There are a lot of features lacking in FTS that exist in other products. Stemming, synonyms, dictionary usage, CJK support, etc... are all potential avenues of investigation.</li></ol></li><br /><li>Third-party solutions are commonly used for searching data in MySQL, but do require extra work to deploy or implement. Tools like <a href="http://lucene.apache.org/">Lucene</a>, <a href="http://www.sphinxsearch.com/">Sphinx</a>, etc... are very good at what they do, and are normally faster and more fully featured compared to MySQL FTS. 
The big issue is that normally there is some effort involved in implementing these, so the ease of implementing and maintaining them is lower than with MySQL FTS.</li><br /></ul><br /><br />So the forward looking question is how to combine the best attributes of these two approaches to get one robust solution that is easy to deploy and use. Personally, I can see a few possible routes forward:<br /><ul><li>Improve MySQL FTS! There is a lot of research and many known best practices that exist for search. This would leverage those in order to make the MySQL solution more robust and featureful. From the current state of FTS, it would be a pretty decent amount of work, but could have good returns.</li><li>Improve the glue code between the third parties and MySQL. This could take a variety of forms. One idea I had was in the form of index plugins where, as rows are inserted/updated/deleted, the external datastores can be automatically updated as well. This is very similar to triggers, but is a bit different since there can be more integration and easier deployment. </li></ul>It's possible the best solution may be a mix of the two options to allow for more flexibility, or even something completely new that I haven't thought of!<br /><br />Regardless, I am happy that people at MySQL are finally taking a look at the poor state of the MySQL fulltext search environment and trying to figure out good new solutions for this space.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com3tag:blogger.com,1999:blog-1448143498716311783.post-79222306371923259222009-12-02T12:47:00.000-08:002009-12-02T13:36:39.077-08:00Ext4 with MySQL binary logs oddityI was working with a customer recently that kept seeing 10-12 second hangs in MySQL. Everything would be working well and suddenly all data changing statements would stop working.<br /><br />Once the 10-12 seconds passed, the system would recover and everything would be fine for a while. 
This would repeat every 30-45 minutes normally, and would occur more frequently when the system was doing some heavy ETL type activity (data loads, big updates, etc...), with the ETL ones being longer and worse generally.<br /><br />One of the steps taken while investigating was to check out vmstat/iostat. From this we noticed that during the spike there was very intense disk activity. I suspected something wrong with InnoDB such as the famous <a href="http://dimitrik.free.fr/blog/archives/2009/08/entry_76.html">purge hiccup</a> or something with the log files checkpointing, etc... However we then found out that it was the disk system where the binary logs were stored and not the InnoDB files.<br /><br />While continuing to investigate, it was noticed that the times it occurred happened to exactly match up with the times that the binary log rotations were occurring. Using strace, we were able to notice that the fdatasync() call that accompanied the rotation was taking a very long time (the majority of the delay time).<br /><br />Finally we were able to pin it down to ext4 and <a href="https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781">how it delays data writes for a very long time (30 minutes)</a>. Compare this to ext3 which will flush things every 5 seconds or so. 
I am told that ext4 recently changed this behavior (this was in 2.6.30), so hopefully this won't hit more people.<br /><br />What was happening was as follows:<br /><ul><br /><li>Binary log data gets written to over time</li><br /><li>ext4 in an attempt to increase performance does not write the data to disk</li><br /><li>ext4 continues to not write data to disk, even as hundreds of MB of binary log sits in memory and the disk is mostly idle</li><br /><li>Binary log gets full (1024MB) and rotates while holding the log mutex</li><br /><li>Rotation calls an fdatasync prior to closing the file</li><br /><li>ext4 now has to write the data out and takes 10-12 seconds to do so</li><br /><li>log mutex prevents any commits while it writes</li><br /><li>Write finishes, log rotates, things repeat</li><br /></ul><br /><br />To alleviate this, we ended up setting <a href="http://dev.mysql.com/doc/refman/5.1/en/replication-options-binary-log.html#sysvar_sync_binlog">sync_binlog</a>=100. This will force the binary log to fsync periodically and hence not allow ext4 to wait forever to write the data out to the file. There is also the commit mount option for ext4 which should give similar benefits by forcing it to write more often but I didn't test it.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com6tag:blogger.com,1999:blog-1448143498716311783.post-20155671196790969272009-05-15T14:52:00.000-07:002009-05-15T15:05:26.565-07:00Small features being pushedSo back in 2004 I created a small feature for mysqladmin allowing it to prompt for a password to set, rather than needing to specify it on the command line. It was a feature that came up in a training course I was teaching, so I wrote it up while at the hotel one day. I had filed this as <a href="http://bugs.mysql.com/5724">Bug #5724</a>. 
<br /><br />It got filed as a feature request bug because, as a non-developer at MySQL AB, I followed the normal community contributions process. I had several patches accepted via the community model (mysql prompt, some mysql readline enhancements, etc...) prior to this patch. However, for some reason around the time that this patch was submitted, patch acceptance slowed down.<br /><br />Apparently the community acceptance process is actually working again, because today it got <a href="http://lists.mysql.com/commits/73069">queued</a> and pushed to a team tree which will be merged into MySQL 6.0. It only took 4.5 years to get in. Hopefully, my next patch will be included in a much shorter time now with the new process.<br /><br />Note that I haven't done anything with this bug recently; <a href="http://trainedmonkey.com/">jimw</a> apparently found it himself or was prodded by someone to handle it.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com3tag:blogger.com,1999:blog-1448143498716311783.post-14300745650147664402009-05-13T12:50:00.000-07:002009-05-13T15:01:08.896-07:00Production example of Dtrace + MySQL (Part 1)I had read a lot about using dtrace with MySQL, but recently had the chance to use it and thought it worked extremely well, so I wanted to share. <a href="http://dammit.lt/">Domas</a> mentioned some of this in the presentation he did on dtrace + MySQL at the MySQL conference.<br /><br />The basics of this are true; some things have been changed to help protect the identity of the customer.<br /><br />I am going to spread this out over a few postings to show how dtrace can be used to incrementally find out things that aren't available otherwise.<br /><br /><br />The customer was reporting that queries were extremely fast, taking a few milliseconds normally, but sometimes they would get queries that took 500+ milliseconds. They wanted to know why and how this was occurring. 
They happened to be on Solaris 10, so dtrace to the rescue!<br /><br />So the first task was to find out what were the slow queries. In MySQL 5.1, we could use the slow query log with millisecond precision, but naturally this was MySQL 5.0. The first script was designed just to capture the slow queries and see if they had any sort of pattern. The initial script looked like:<br /><br /><code><br />#!/usr/sbin/dtrace -s<br /><br />#pragma D option quiet<br /><br />/* how much memory to use maximum<br /> * if you start getting errors regarding dropped data from the script, then <br /> * you might get incorrect information<br />*/<br />#pragma D option dynvarsize=512m<br /><br /><br />/* how much of the query to copy, large takes up more memory */<br />#pragma D option strsize=256<br /><br />/*<br /> value in nanoseconds of when to log a query as slow and information<br /> about it<br /> 1000000 is 1 ms<br /> 50000000 is 50 ms<br /> 250000000 is 250 ms<br /> */<br />inline int long_query_time = 250000000;<br /><br />BEGIN<br />{<br /> /* initialize our aggregates in case there aren't any */<br /> @total["Total queries"] = count();<br /> @total["Total slow queries"] = count();<br /> clear(@total);<br /> self->in_query = 0;<br /> self->query = "";<br /> printf("Capturing any queries longer than %d milliseconds.\n", long_query_time /<br /> 1000000);<br /> printf("Please press Ctrl+C to exit when done...\n\n");<br />}<br /><br />/*<br /> * catches the beginning of the execution of our query<br /> */<br /><br />pid$target::*dispatch_command*:entry<br />/ self->in_query==0 /<br />{<br /> @total["Total queries"] = count();<br /> self->start_time = timestamp;<br /> self->in_query=1;<br />}<br /><br />/*<br /> * Find our slow queries<br /> */<br /><br />pid$target::*dispatch_command*:return<br />/ (timestamp - self->start_time) > long_query_time && self->in_query==1 /<br />{<br /> self->total_time = timestamp - self->start_time;<br /><br /> @total["Total slow queries"] = 
count();<br /> printf("Date/time: %Y\n", walltimestamp);<br /> printf("Query: %s\n",self->query);<br /> printf("Query time: %d microseconds\n", self->total_time / 1000);<br /><br /> printf("\n");<br />}<br /><br />/*<br /> * executes for every query to reset them<br /> */<br /><br />pid$target::*dispatch_command*:return<br />/ self->in_query == 1/<br />{<br /> self->query="";<br /> self->in_query=0;<br /> self->start_time=0;<br /> self->total_time=0;<br />}<br /><br />/*<br /> * catch the query string<br /> */<br /><br />pid$target::*mysql_parse*:entry<br />/ self->in_query == 1/<br />{<br /> self->query = copyinstr(arg1); <br />}<br /><br /><br />END <br />{<br /> printa("%s: %@d\n", @total); <br />}<br /></code><br /><br />This then results in output such as:<br /><br /><pre><br /># ./test.d -p 27609<br />Capturing any queries longer than 250 milliseconds.<br />Please press Ctrl+C to exit when done...<br /><br />Date/time: 2009 May 13 23:11:52<br />Query: select sleep(2)<br />Query time: 2004913 microseconds<br /><br />^C<br />Total queries: 1048<br />Total slow queries: 1<br /></pre><br /><br />I did a few tests on the overhead to determine if it would slow down their system (they do several thousand requests a second), and to my surprise it had virtually no overhead once running. There was a bit of extra overhead when the script first started and when it ended, which would cause a very slight hiccup in processing (< 1 second). It was decided this was okay, and we scheduled it to run during some lower load times so this hiccup wasn't a big deal.<br /><br />This caught a few thousand slow queries over the course of an hour. Some were easy to discard due to the nature of the queries, but some others needed further investigation. 
The next post will discuss what the next steps were and will start getting at data that we can't see in the slow query log.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com0tag:blogger.com,1999:blog-1448143498716311783.post-80377190777252934672009-02-19T12:37:00.000-08:002009-02-20T12:25:54.484-08:00My Favorite New Feature of MySQL 5.1: Less InnoDB LockingMySQL 5.1 has a lot of nice new features: partitioning, the event scheduler, dynamic logging, etc... Most of these require changing some application code or design logic to make use of them. However, there is one new capability that almost every user of InnoDB can take advantage of right away, and that is the ability to set up MySQL so that less locking is required for many statements in InnoDB. There are a few changes which make this possible.<br /><br />The first one is the new auto_increment locking for InnoDB. There is a good article which talks about this <a href="http://dev.mysql.com/tech-resources/articles/mysql-5.1-recap.html">here</a> and a large section in the <a href="http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html#innodb-auto-increment-configurable">manual</a>. <br /><br />In MySQL 5.0 and earlier, InnoDB used a special table level lock, held for the duration of the query, to acquire the AUTO-INC lock. This basically caused all inserts into the same table to serialize. With single row inserts it normally wasn't too bad, but it could limit scalability with multiple threads inserting. However, where it was quite bad was large data insertions, such as INSERT...SELECT and LOAD DATA INFILE. While these were running, all other insertions into the table would be prevented due to this table level lock.<br /><br />MySQL 5.1 has changed this behavior. There is a new option innodb_autoinc_lock_mode which controls how InnoDB will handle this. The default is a value of 1, which works well if you are doing only inserts where MySQL knows the number of rows being inserted. 
For example, a single row insert or a multi-row insert would be fine in this mode. In these cases, it will use a very short term mutex which will be released immediately after acquiring the needed values. It should remove all contention issues if you are using only these statements.<br /><br />Statements where the number of rows being inserted is unknown, such as INSERT...SELECT and LOAD DATA INFILE, still use the AUTO-INC table level lock and will cause other single row statements to block on it. To fix this you need to set innodb_autoinc_lock_mode=2. This will make each row in these long running statements get a new value as it is inserted, rather than holding the lock for the duration. The problem with this is that the values generated won't be consecutive anymore. Other transactions could grab values to use in the middle of the sequence. <br /><br />For some applications this might be a problem (but it shouldn't be if you are just using the auto_increment for uniqueness). The other issue comes from the binary log. In statement level binary logging, the auto_increment starting point is logged and then the statement. However, since the values are no longer consecutive, the slave can't create the appropriate values and replication will break. The solution to this also comes in 5.1 in the form of row level binary logging. You can enable row based binary logging and it will then handle this without issue.<br /><br />So to summarize, to pretty much completely remove auto_increment locking in InnoDB, you will need to set both innodb_autoinc_lock_mode=2 and binlog_format=row.<br /><br /><br />There is a second locking improvement in InnoDB related to row level binary logging. To enable this, you will need to use the READ COMMITTED isolation level. When you do this in 5.1, you will then cause InnoDB to reduce locking for DML statements a great deal. 
It will not keep locks on rows that were searched but not changed by the DML statements.<br /><br />In older versions every row that was searched in a data changing statement would be locked. For example, take the following statement:<pre>DELETE FROM tbl WHERE key < 1000 AND non_key = 500</pre>In 5.0, assuming the key was used, every row with key less than 1000 would be exclusively locked, even if only a few were changed. In 5.1 with READ COMMITTED, only the rows that are actually changed will be locked. As you can imagine, this can potentially reduce locking contention a great deal.<br /><br /><br />One final topic I want to mention is the innodb_locks_unsafe_for_binlog option. People have asked if they should enable this when binlog_format=row. It would seem to make sense to do so, since with row level binary logging, the restrictions for the binary log are gone. The answer is that this variable does not matter anymore in 5.1 if you set READ COMMITTED. In all cases where innodb_locks_unsafe_for_binlog used to reduce locking, READ COMMITTED now does as well. <br /><br />From the <a href="http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_locks_unsafe_for_binlog">manual</a>:<br /><blockquote>This also means that if new items are added to the database, InnoDB does not guarantee serializability. Therefore, if innodb_locks_unsafe_for_binlog is enabled, InnoDB guarantees at most an isolation level of READ COMMITTED. (Conflict serializability is still guaranteed.) 
</blockquote>This means that you also can not run higher than READ COMMITTED while using innodb_locks_unsafe_for_binlog, so there is no point in setting it.<br /><br /><br />So to summarize and get the best locking out of InnoDB in 5.1, you want to set the following options:<pre>[mysqld]<br />innodb_autoinc_lock_mode=2<br />binlog_format=row<br />tx_isolation=READ-COMMITTED</pre><br /><br /><br />Keep in mind that row based binary logging does potentially have additional disk overhead compared to statement binary logging, so as always, please test these values before using them in production.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com4tag:blogger.com,1999:blog-1448143498716311783.post-4068797359964608952009-02-18T12:13:00.001-08:002009-02-18T12:53:53.373-08:00MySQL, Windows 32-bit, and /3GBOne question that comes up relatively commonly in support is how much memory MySQL can use on a given system. As you may know, MySQL runs in a single process with many threads. So the memory limit it has is based upon the operating system limits for a single process. Different operating systems have different limits, but the one I want to discuss is 32-bit Windows (such as 2000, XP, 2003, Vista).<br /><br />The short answer is that your mysqld-nt.exe can use up to ~1.8G of RAM. The question is why that is so. Since we have a 32-bit address space, in theory there are 2^32 addressable bytes (4G). However, Windows uses what is called a 2G/2G split of that addressable space: 2G is allowed for the actual application to use and 2G is for the kernel mappings.<br /><br />This 2G is then the address space for pointers, which normally ends up with around 1.8G of memory actually being usable. If you try to use more than this, then you end up getting Out of Memory errors from MySQL.<br /><br />Now Windows 2000 and above have a way you can adjust this split. 
Not all versions of Windows within these families support this, so it depends upon what version you are using. The specific feature is called <a href='http://blogs.technet.com/askperf/archive/2007/03/23/memory-management-demystifying-3gb.aspx'>'4 GT RAM Tuning' by Microsoft</a>.<br /><br />To adjust this, you will need to do a few steps to enable it, first on Windows and then within MySQL.<br /><br />First, you need to boot Windows with the /3GB flag given to it. You will need to edit boot.ini, find the instance you are booting off of, and add that flag to it.<br /><br />Second, you will need to enable it for any program that wants to use this split. MySQL does not ship with this done by default (it is normally a compile time option), so we will need to manually set it (or compile your own mysqld-nt.exe on Windows).<br /><br />There is a program that comes with Visual Studio C++ called 'editbin'. I use the Express version, which is free. There is a /LARGEADDRESSAWARE flag you can pass to it to enable the option for your server. I am running this via the Visual Studio Command Prompt.<br /><br /><pre>C:\Program Files\MySQL\MySQL Server 5.0\bin>editbin /LARGEADDRESSAWARE mysqld-nt.exe<br />Microsoft (R) COFF/PE Editor Version 8.00.50727.42<br />Copyright (C) Microsoft Corporation. All rights reserved.</pre><br /><br />To confirm that this worked, you can use dumpbin /HEADERS; there is a line near the very top which says whether it is enabled or not.<br /><pre>C:\Program Files\MySQL\MySQL Server 5.0\bin>dumpbin /headers mysqld-nt.exe<br />Microsoft (R) COFF/PE Dumper Version 8.00.50727.42<br />Copyright (C) Microsoft Corporation. 
All rights reserved.<br /><br /><br />Dump of file mysqld-nt.exe<br /><br />PE signature found<br /><br />File Type: EXECUTABLE IMAGE<br /><br />FILE HEADER VALUES<br /> 14C machine (x86)<br /> 6 number of sections<br /> 487FA5AC time date stamp Thu Jul 17 16:03:56 2008<br /> 0 file pointer to symbol table<br /> 0 number of symbols<br /> E0 size of optional header<br /> 12F characteristics<br /> Relocations stripped<br /> Executable<br /> Line numbers stripped<br /> Symbols stripped<br /> Application can handle large (>2GB) addresses<br /> 32 bit word machine</pre><br /><br />Notice the second to last line, "Application can handle large (>2GB) addresses". This means it is enabled for the binary, if it is missing, then the binary does not have it enabled.<br /><br />Once these two things are done you will now have a 3G address space for MySQL, and you should be able to go up to ~2.6G or so. You are still limited to smaller allocations (ie, you can't set innodb_buffer_pool_size=2.5G), but it can give you more memory for handling connections and per-connection buffers which can effectively allow you to increase the large global buffers.<br /><br />For more details, I would like to recommend you read the <a href='http://blogs.technet.com/askperf/archive/2007/03/23/memory-management-demystifying-3gb.aspx'>technet blog</a> about the topic, it is quite a good explanation.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com10tag:blogger.com,1999:blog-1448143498716311783.post-8555321223509189442009-02-05T14:02:00.000-08:002009-02-05T13:44:10.836-08:00My take on MySQL 5.1So being that I am a support engineer with Sun/MySQL, I figured I would chime in with my opinion of MySQL 5.1 GA so far since I have been using it for a while. For history, I have been with MySQL AB/Sun since the 4.0 release cycle. 
This post does not represent any official views of my employer, Sun Microsystems, and all such disclaimers.<br /><br />First off, I want to say that Sun assisted greatly with MySQL 5.1. I think it is a <u>much</u> better release than it would have been with MySQL AB alone doing the release. I do think this will continue in the future from everything I have experienced and seen so far. Sun knows how to make solid software.<br /><br /><br />So do I think that 5.1 was released too early? Not at all, based on my experience of past MySQL releases. 5.1 is faring quite well, and I think everyone that has a reason to upgrade to it (such as less locking for InnoDB auto_increment, partitioning, dynamic logging, etc...) should certainly start the process of doing so.<br /><br />MySQL 5.1 is in quite good shape compared to my experiences with MySQL 4.0 to 5.0. As a chart of my experiences:<table border="1"><tr><th>Release</th><th>GA Version</th><th>My first recommended version</th></tr><br /><tr><td>MySQL 4.0</td><td>4.0.12</td><td>4.0.15</td></tr><br /><tr><td>MySQL 4.1</td><td>4.1.7</td><td>4.1.12</td></tr><br /><tr><td>MySQL 5.0</td><td>5.0.15</td><td>5.0.36</td></tr><br /><tr><td>MySQL 5.1</td><td>5.1.30</td><td>5.1.31</td></tr><br /></table><br />These versions aren't based on exact bugs fixed, but on interactions with many customers using the versions and the various issues they experienced. There is a bit of gut feeling involved too, so it certainly is not a purely scientific method for determining this.<br /><br />Also, by no means is MySQL 5.1 perfect. There have been some <a href="http://bugs.mysql.com/bug.php?id=38883">pretty</a> <a href="http://bugs.mysql.com/bug.php?id=40954">big</a> <a href="http://bugs.mysql.com/bug.php?id=40972">bugs</a> found and some <a href="http://bugs.mysql.com/bug.php?id=38066">annoying</a> ones too. 
However, overall it has been working pretty well for most users.<br /><br />I still recommend you test out all new functionality to ensure it works for you, as there are many limitations. However, that applies to any major software upgrade with any database system; it isn't specific to MySQL or the 5.1 release.<br /><br />So go ahead and give 5.1 a try-out if you have been on the fence. Have a look around, kick the tires, and be sure to <a href="http://bugs.mysql.com">report any bugs</a> you find.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com2tag:blogger.com,1999:blog-1448143498716311783.post-79568138259642769472009-01-29T06:38:00.000-08:002009-01-29T06:49:10.507-08:00XA and Persistent InnoDB LocksRecently a customer mentioned that they were rarely seeing corruption when they copied InnoDB files using LVM to set up a new slave. Investigating further, it turned out that replication would start, but would then hit a lock wait timeout. This locking issue persisted across restarts, causing replication to always fail. They would solve this by taking a new LVM snapshot and setting it up again.<br /><br />This is a classic case of an <a href="http://dev.mysql.com/doc/refman/5.0/en/xa.html">XA transaction</a> in a prepared state persisting across restarts, with LVM sometimes taking a snapshot at the exact instant when this can occur.<br /><br />Most people don't actually use XA transactions, so they may not consider this possibility. However, it can still occur even if you do not use them, due to XA being used internally for the binary log and InnoDB. 
We will see how to identify that case below.<br /><br />You can tell this is occurring by the mention of prepared transactions in the error log:<br /><pre>090128 11:09:09 [Note] Starting crash recovery...<br />090128 11:09:09 InnoDB: Starting recovery for XA transactions...<br />090128 11:09:09 InnoDB: Transaction 0 18707 in prepared state after recovery<br />090128 11:09:09 InnoDB: Transaction contains changes to 1 rows<br />090128 11:09:09 InnoDB: 1 transactions in prepared state after recovery<br />090128 11:09:09 [Note] Found 1 prepared transaction(s) in InnoDB<br />090128 11:09:09 [Warning] Found 1 prepared XA transactions</pre><br /><br />In addition, you will see a transaction listed in the PREPARED state without any MySQL or OS thread id in the output of SHOW INNODB STATUS:<br /><pre>---TRANSACTION 0 18707, ACTIVE (PREPARED) 11 sec, OS thread id 0<br />, undo log entries 1</pre><br /><br />So how do you solve this when it occurs? The key is the <a href="http://dev.mysql.com/doc/refman/5.0/en/xa-statements.html">XA commands</a>, which allow you to manipulate XA transactions in the prepared state.<br /><br />The first command is XA RECOVER, which shows you a list of the transactions to handle. The data column has the XID of the transaction to roll back. If the XID starts with MySQLXid, then it is an internally generated XID. The XID can contain characters that are unprintable in your terminal, so keep in mind that it may look a bit different than it actually is.<br /><pre>mysql> xa recover;<br />+----------+--------------+--------------+------+<br />| formatID | gtrid_length | bqual_length | data |<br />+----------+--------------+--------------+------+<br />|        1 |            3 |            0 | foo  |<br />+----------+--------------+--------------+------+<br />1 row in set (0.00 sec)</pre><br /><br />After that, you can use XA ROLLBACK to remove the transaction or XA COMMIT to save the change. 
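When there are several prepared transactions, the XIDs can be pulled out of XA RECOVER's batch output with a little text processing. Here is a minimal sketch using made-up sample data in place of a live server; the batch output's columns are tab-separated, so it splits on tabs rather than whitespace in case an XID itself contains spaces:

```shell
# Extract the data column (the XID) from XA RECOVER batch output.
# The printf below stands in for:
#   mysql --batch --skip-column-names -e 'XA RECOVER'
# whose columns are tab-separated: formatID, gtrid_length, bqual_length, data.
extract_xids() {
    awk -F'\t' 'NF >= 4 { print $4 }'
}

# Demo with two hypothetical rows of sample output:
printf '1\t3\t0\tfoo\n1\t5\t0\tMySQLXid1\n' | extract_xids
```

Each extracted XID could then be fed to XA ROLLBACK or XA COMMIT, e.g. `mysql -e "XA ROLLBACK 'foo'"`.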
Both of these require that you specify the XID you got from XA RECOVER.<br /><br />I wrote a one-line shell script which might work when you have a single XID to commit. I haven't tested it very thoroughly, so don't get mad at me if it doesn't work in all cases.<br /><pre>echo xa commit \'`mysql --skip-column-names -e 'xa recover' | awk '{ print $4 }'`\' | mysql</pre>Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com11tag:blogger.com,1999:blog-1448143498716311783.post-46329292787331214242009-01-16T11:41:00.000-08:002009-01-16T12:13:14.368-08:00Enabling InnoDB Large Pages on LinuxIn MySQL 5.0, InnoDB gained the ability to use Linux large page support for allocating memory for the buffer pool and additional memory pool.<br /><br />A few customers have asked about using it, but there is virtually no documentation on what is required on Linux to enable it. I actually ended up having to read some of the Linux kernel source code to figure out parts of this.<br /><br />This uses the API as documented at:<br /><a href="http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage.txt">http://www.mjmwired.net/kernel/Documentation/vm/hugetlbpage.txt</a><br /><br />To set this up and use it, you first need a kernel that supports it. All of the recent RHEL kernels do by default, from what I can tell. 
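A quick way to check your own kernel is a sketch like the following; the HugePages_* counters only appear in /proc/meminfo when the kernel was built with hugetlb support:

```shell
# Check for hugetlb support by looking for the HugePages_* counters.
# On a kernel without hugetlb support (or a non-Linux system), the
# grep finds nothing and we report "no".
if grep -q '^HugePages_Total' /proc/meminfo 2>/dev/null; then
    hugetlb=yes
else
    hugetlb=no
fi
echo "hugetlb support: $hugetlb"
```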
On my Ubuntu systems, it does not appear to be enabled by default.<br /><br />Then at the OS level you will need to do the following:<br /><pre><br /> # Set the number of pages to be used<br /> # Each page is normally 2MB, so this would be 40 MB<br /> # This actually allocates memory, so it requires that much memory to be available<br /> echo 20 > /proc/sys/vm/nr_hugepages<br /><br /> # Set the group number that is allowed to access this memory<br /> # The mysql user must be a member of this group<br /> echo 102 > /proc/sys/vm/hugetlb_shm_group<br /><br /> # Increase the amount of shared memory allowed per segment<br /> # 256MB in this case<br /> echo 268435456 > /proc/sys/kernel/shmmax<br /><br /> # Increase the total amount of shared memory<br /> # This is measured in 4KB pages, i.e. 16GB below<br /> echo 4194304 > /proc/sys/kernel/shmall<br /></pre><br />For MySQL usage, you would normally want shmmax to be close to shmall.<br /><br />You would normally want to put these into an rc file or similar so they are applied on every boot (early in the boot sequence, before MySQL starts).<br /><br />Verify that it worked with:<br /><pre>cat /proc/meminfo | grep -i huge</pre>The final step is to give the mysql user an 'unlimited' value for the memlock limit, which is required in order to make use of hugetlb_shm_group. This can either be done by editing /etc/security/limits.conf or by adding the following to your mysqld_safe:<br /><pre>ulimit -l unlimited</pre>This will cause the root user to set it to unlimited before switching to the mysql user.<br /><br />Finally, you will want to add the large-pages option to your my.cnf:<br /><pre> [mysqld]<br /> large-pages</pre>With this option, InnoDB will use large pages automatically for the two memory pools. 
If it cannot, it will fall back to traditional memory and output a warning to the error log.<br /><br />You can verify it is being used by looking at:<br /><pre>cat /proc/meminfo | grep -i huge</pre><br />Solaris also has the ability to use large pages (of different sizes as well), but MySQL doesn't support that yet.Harrisonhttp://www.blogger.com/profile/05318054903467929570noreply@blogger.com4
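One closing note on the Linux setup: the nr_hugepages value used earlier (20 pages) was just an example. Sizing it for your actual buffer pool is simple arithmetic; a sketch, where the 4GB pool size and the common 2MB huge page size are assumptions (check the Hugepagesize line in /proc/meminfo for your system's actual page size):

```shell
# Compute /proc/sys/vm/nr_hugepages for a desired InnoDB buffer pool.
pool_mb=4096    # desired buffer pool size in MB (assumed)
page_kb=2048    # huge page size in KB; 2MB is the common default (assumed)

# Round up so the pool always fits in whole huge pages.
pages=$(( (pool_mb * 1024 + page_kb - 1) / page_kb ))
echo "nr_hugepages: $pages"

# At boot you would then do something like:
#   echo $pages > /proc/sys/vm/nr_hugepages
```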