Sunday, June 28, 2015

MySQL query optimiser: better performance from a subquery with partitions than a LEFT JOIN with an index

Using MySQL version 5.6.14-enterprise-commercial-advanced-log on Ubuntu 12.04 LTS, I encounter the following behaviour when querying data from these tables:

CREATE TABLE `a` (
          `id` varchar(32) DEFAULT NULL,
          `request_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
        ) ENGINE=MyISAM DEFAULT CHARSET=latin1
        /*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(request_time))
        (PARTITION p15062116 VALUES LESS THAN (1434895200) ENGINE = MyISAM,
         PARTITION p15062117 VALUES LESS THAN (1434898800) ENGINE = MyISAM,
         PARTITION p15062118 VALUES LESS THAN (1434902400) ENGINE = MyISAM,
        ...
        PARTITION rest VALUES LESS THAN MAXVALUE ENGINE = MyISAM) */

CREATE TABLE `b` (
          `id` varchar(50) NOT NULL,
          `start_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
          `item` int(11) DEFAULT NULL,
          `item2` int(11) DEFAULT NULL,
          PRIMARY KEY (`id`,`start_time`)
        ) ENGINE=MyISAM DEFAULT CHARSET=utf8
        /*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(start_time))
        (PARTITION p15062516 VALUES LESS THAN (1435240800) ENGINE = MyISAM,
         PARTITION p15062517 VALUES LESS THAN (1435244400) ENGINE = MyISAM
        ....
        PARTITION rest VALUES LESS THAN MAXVALUE ENGINE = MyISAM) */

This query runs in about 1 second:

SELECT SQL_NO_CACHE count(*)
FROM a
LEFT JOIN (SELECT * FROM b
           WHERE start_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00') c
  USING (id)
WHERE request_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00';

EXPLAIN output:

+----+-------------+------------+------+---------------+-------------+---------+------+--------+-------------+
| id | select_type | table      | type | possible_keys | key         | key_len | ref  | rows   | Extra       |
+----+-------------+------------+------+---------------+-------------+---------+------+--------+-------------+
|  1 | PRIMARY     | a   | ALL  | NULL          | NULL        | NULL    | NULL | 336972 | Using where |
|  1 | PRIMARY     | <derived2> | ref  | <auto_key0>   | <auto_key0> | 152     | func |     10 | Using where |
|  2 | DERIVED     | b  | ALL  | NULL          | NULL        | NULL    | NULL |  39508 | Using where |
+----+-------------+------------+------+---------------+-------------+---------+------+--------+-------------+
3 rows in set (0.00 sec)

This query runs in about 30 seconds:

SELECT SQL_NO_CACHE count(*)
FROM a
LEFT JOIN b USING (id)
WHERE request_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00'
  AND start_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00';

EXPLAIN output:

+----+-------------+-----------+------+---------------+---------+---------+------+--------+--------------------------+
| id | select_type | table     | type | possible_keys | key     | key_len | ref  | rows   | Extra                    |
+----+-------------+-----------+------+---------------+---------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | a  | ALL  | NULL          | NULL    | NULL    | NULL | 336972 | Using where              |
|  1 | SIMPLE      | b | ref  | PRIMARY       | PRIMARY | 152     | func |    395 | Using where; Using index |
+----+-------------+-----------+------+---------------+---------+---------+------+--------+--------------------------+
2 rows in set (12.17 sec)

I expected similar or better results from the second query, given that it can use the `id` index and the one-hour partition. Both tables hold roughly 1,000,000 records.
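For reference, MySQL 5.6 can report which partitions a query actually touches via `EXPLAIN PARTITIONS` (the statement below is just the slow query with that keyword added; I have not included its output here):

```sql
-- Show the `partitions` column in the plan, to verify whether
-- the BETWEEN predicates actually prune partitions on a and b.
EXPLAIN PARTITIONS
SELECT SQL_NO_CACHE count(*)
FROM a
LEFT JOIN b USING (id)
WHERE request_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00'
  AND start_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00';
```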

Can you explain why the first query is so much more efficient than the second one?

Can we refactor the first query so that it becomes a view or an otherwise reusable query, instead of rewriting the subquery for each join?
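To make the question concrete, this is the kind of refactor I have in mind (an untested sketch; the view name `b_window` is mine, and the one-hour window has to be hard-coded because a view cannot take the time range as a parameter, which is part of what I'm asking about):

```sql
-- Hypothetical reusable wrapper around the fast subquery.
CREATE VIEW b_window AS
  SELECT * FROM b
  WHERE start_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00';

-- The fast query, rewritten against the view.
SELECT SQL_NO_CACHE count(*)
FROM a
LEFT JOIN b_window c USING (id)
WHERE request_time BETWEEN '2015-06-28 10:00:00' AND '2015-06-28 11:00:00';
```

I don't know whether the optimiser would still materialise the view the way it materialises the inline derived table, which is why I'm asking.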

Thanks
