As this is my first post it seems I can only post 1 link so I have listed the sites I'm referring to at the bottom. In a nutshell my goal is to make the database return the results faster, I have tried to include as much relevant information as I could think of to help frame the questions at the bottom of the post.
8 processors
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
cache size : 6144 KB
cpu cores : 4
top - 17:11:48 up 35 days, 22:22, 10 users, load average: 1.35, 4.89, 7.80
Tasks: 329 total, 1 running, 328 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 87.4%id, 12.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8173980k total, 5374348k used, 2799632k free, 30148k buffers
Swap: 16777208k total, 6385312k used, 10391896k free, 2615836k cached
However we are looking at moving the mysql installation to a different machine in the cluster that has 256 GB of ram
My MySQL Table looks like
CREATE TABLE ClusterMatches
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cluster_index INT,
matches LONGTEXT,
tfidf FLOAT,
INDEX(cluster_index)
);
It has approximately 18M rows, there are 1M unique cluster_index's and 6K unique matches. The sql query I am generating in PHP looks like.
$sql_query="SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters."))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";
where $cluster contains a string of approximately 3,000 comma separated cluster_index's. This query makes use of approximately 50,000 rows and takes approximately 15s to run, when the same query is run again it takes approximately 1s to run.
Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time I believe my subquery can be indexed.
mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | derived2 | ALL | NULL | NULL | NULL | NULL | 48528 | Using temporary; Using filesort |
| 2 | DERIVED | ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 53689 | Using where |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
According to this older article [Optimizing MySQL: Queries and Indexes][2] in Extra info - the bad ones to see here are "using temporary" and "using filesort"
Query cache is available, but effectively turned off as the size is currently set to zero
mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name | Value |
+---------------------------------+----------------------+
| bdb_cache_size | 8384512 |
| binlog_cache_size | 32768 |
| expire_logs_days | 0 |
| have_query_cache | YES |
| flush | OFF |
| flush_time | 0 |
| innodb_additional_mem_pool_size | 1048576 |
| innodb_autoextend_increment | 8 |
| innodb_buffer_pool_awe_mem_mb | 0 |
| innodb_buffer_pool_size | 8388608 |
| join_buffer_size | 131072 |
| key_buffer_size | 8384512 |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 18446744073709547520 |
| sort_buffer_size | 2097144 |
| table_cache | 64 |
| thread_cache_size | 0 |
| query_cache_limit | 1048576 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 0 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| read_rnd_buffer_size | 262144 |
+---------------------------------+----------------------+
Based on this article on [Mysql Database Performance turning][3] I believe that the values I need to tweak are
matches
occurs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]To tweak perform I plan to use
The goal is to build a system that can have 1M unique cluster_index values 1M unique match values, approx 3,000,000,000 table rows with a response time to the query of around 0.5s (we can add more ram as necessary and distribute the database across the cluster)
Based on the advice in this post on How to pick indexes for order by and group by queries the table now looks like
CREATE TABLE ClusterMatches
(
cluster_index INT UNSIGNED,
match_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup
(
match_index INT UNSIGNED NOT NULL PRIMARY KEY,
image_match TINYTEXT
);
The query without sorting the results by the SUM(tfidf) looks like
SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
Which eliminates using temporary and using filesort
explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 14938 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
However if i add the ORDER BY SUM(tfdif) in
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total |
+-------------+--------------------+
| 868 | 0.11126546561718 |
| 4182 | 0.0238558370620012 |
| 2162 | 0.0216601379215717 |
| 1406 | 0.0191618576645851 |
| 4239 | 0.0168981291353703 |
| 1437 | 0.0160425212234259 |
| 2599 | 0.0156466849148273 |
| 394 | 0.0155945559963584 |
| 3116 | 0.0151005545631051 |
| 4028 | 0.0149106932803988 |
+-------------+--------------------+
10 rows in set (0.03 sec)
The result is suitably fast at this scale BUT having the ORDER BY SUM(tfidf) means it uses temporary and filesort
explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 65369 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
Im looking for a solution that doesn't use temporary or filesort, along the lines of
where I dont need to hardcode a threshold for total, any ideas?SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index
HAVING total>0.01 ORDER BY cluster_index;