Very slow subquery due to HAVING
Wednesday, September 12th, 2007
Today, I was quite mystified by a very slow running query. There was a table named ‘bar’ with about 3000 rows. I wanted to list all the rows that had a duplicate value for a certain field (‘foo’), and only those rows. The solution was to build a query that selected the rows where the value of the ‘foo’ field was in the results of a subquery that selected ‘foo’ for duplicate values of foo. The query finally looked something like this:
SELECT
foo
FROM bar
WHERE foo IN (
SELECT
foo
FROM bar
GROUP BY foo
HAVING count(foo) > 1
)
The inner query (explained in this post) was very fast, and returned only two rows. The outer query, when I ran it like this:
SELECT foo FROM bar WHERE foo IN (1, 2);
also ran very fast. However the combination of the two was extremely slow. I thought this was weird, since there were only two results in the inner query. A colleague of mine and me took a look at the EXPLAIN of the query, and found out it was actually doing a full join of 3000×3000 rows. The use of HAVING threw me off because it appeared in the inner join. But HAVING is always applied very late in the execution process, just before the results are sent to the client. This means MySQL doesn’t even look at the HAVING to optimize queries. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization.
Putting an index on the ‘foo’ column solved the speed problem, though it’s still not as fast as it could be because it’s still doing a JOIN on ‘foo’ with itself, only this time only doing 2×3000 rows.