Why does an insert query occasionally take so long to complete?

This is a pretty simple problem. Inserting data into the table normally works fine, except for a few times where the insert query takes a few seconds. (I am not trying to bulk insert data.) So I setup a simulation for the insert process to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be being adjusted; I removed the id (primary key field), but the delay still happens.

I have a MyISAM table: daniel_test_insert (this table starts completely empty):

create table if not exists daniel_test_insert ( 
    id int unsigned auto_increment not null, 
    value_str varchar(255) not null default '', 
    value_int int unsigned default 0 not null, 
    primary key (id) 
)

I insert data into it, and sometimes a insert query takes > 2 seconds to run. There are no reads on this table - Only writes, in serial, by a single threaded program.

I ran the exact same query 100,000 times to find why the query occasionall takes a long time. So far, it appears to be a random occurrence.

This query for example took 4.194 seconds (a very long time for an insert):

Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status               | duration | cpu_user  | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting             | 0.000042 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
checking permissions | 0.000024 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Opening tables       | 0.000024 | 0.001000  | 0.000000   | 0                 | 0                   | 0                
System lock          | 0.000022 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Table lock           | 0.000020 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
init                 | 0.000029 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
update               | 4.067331 | 12.151152 | 5.298194   | 204894            | 18806               | 477995           
end                  | 0.000094 | 0.000000  | 0.000000   | 8                 | 0                   | 0                
query end            | 0.000033 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
freeing items        | 0.000030 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
closing tables       | 0.125736 | 0.278958  | 0.072989   | 4294              | 604                 | 2301             
logging slow query   | 0.000099 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
logging slow query   | 0.000102 | 0.000000  | 0.000000   | 7                 | 0                   | 0                
cleaning up          | 0.000035 | 0.000000  | 0.000000   | 7                 | 0                   | 0

(This is an abbreviated version of the SHOW PROFILE command, I threw out the columns that were all zero.)

Now the update has an incredible number of context switches and minor page faults. Opened_Tables increases about 1 per 10 seconds on this database (not running out of table_cache space)

Stats:

MySQL 5.0.89
Hardware: 32 Gigs of ram / 8 cores @ 2.66GHz; raid 10 SCSI harddisks (SCSI II???)
I have had the hard drives and raid controller queried: No errors are being reported. CPUs are about 50% idle.
iostat -x 5 (reports less than 10% utilization for harddisks) top report load average about 10 for 1 minute (normal for our db machine)
Swap space has 156k used (32 gigs of ram)

I'm at a loss to find out what is causing this performance lag. This does NOT happen on our low-load slaves, only on our high load master. This also happens with memory and innodb tables. Does anyone have any suggestions? (This is a production system, so nothing exotic!)

Solution

I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single table INSERT/UPDATE/REPLACE statements --- not on any SELECTs. No load, locking, or thread build up is evident.

I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.

Also Ruled Out

Server load -- no correlation with high load
Engine -- happens with InnoDB/MyISAM/Memory
MySQL Query Cache -- happens whether it's on or off
Log rotations -- no correlation in events

The only other observation I have at this point is derived from the fact I'm running the same db on multiple machines. I have a heavy read application so I'm using an environment with replication -- most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more there. Even though I see no locking issues, maybe it's Innodb/Mysql having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.

MySQL Verion 5.1.48

Update

I think I have a lead for the problem on my case. On some of my servers, I noticed this phenomenon on more than the others. Seeing what was different between the different servers, and tweaking things around, I was lead to the MySQL innodb system variable innodb_flush_log_at_trx_commit.

I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values of 1,2,0:

For 1, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk for every commit.
For 2, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk approximately every 1-2 seconds.
For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.

Effectively, in the order (1,2,0), as reported and documented, you're supposed to get with increasing performance in trade for increased risk.

Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).

So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that it's context is related to this issue.