To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
This answer was written in 2009. Some of the query optimization tricks described here are obsolete by now, others can be made more efficient, yet others still apply. The statements about feature support by different database systems apply to versions that existed at the time of this writing.
That's exactly what optimizers do for a living (not that I'm saying they always do this well).
Since SQL is a set based language, there are usually more than one way to transform one query to other.
Like this query:
SELECT *
FROM mytable
WHERE col1 > @value1 OR col2 < @value2
can be transformed into this one (provided that mytable
has a primary key):
SELECT *
FROM mytable
WHERE col1 > @value1
UNION
SELECT *
FROM mytable
WHERE col2 < @value2
or this one:
SELECT mo.*
FROM (
SELECT id
FROM mytable
WHERE col1 > @value1
UNION
SELECT id
FROM mytable
WHERE col2 < @value2
) mi
JOIN mytable mo
ON mo.id = mi.id
, which look uglier but can yield better execution plans.
One of the most common things to do is replacing this query:
SELECT *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)
with this one:
SELECT *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)
In some RDBMS's (like PostgreSQL 8.4), DISTINCT
and GROUP BY
use different execution plans, so sometimes it's better to replace the one with the other:
SELECT mo.grouper,
(
SELECT SUM(col)
FROM mytable mi
WHERE mi.grouper = mo.grouper
)
FROM (
SELECT DISTINCT grouper
FROM mytable
) mo
vs.
SELECT mo.grouper, SUM(col)
FROM mytable
GROUP BY
mo.grouper
In PostgreSQL, DISTINCT
sorts and GROUP BY
hashes.
MySQL 5.6 lacks FULL OUTER JOIN
, so it can be rewritten as following:
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT OUTER JOIN
table2 t2
ON t1.id = t2.id
vs.
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL
, but see this article in my blog on how to do this more efficiently in MySQL:
This hierarchical query in Oracle 11g:
SELECT DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id
can be transformed to this:
SELECT DISTINCT(animal_id) AS animal_id
FROM (
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
START WITH
animal_id = :id
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id
, the latter one being more efficient.
See this article in my blog for the execution plan details:
To find all ranges that overlap the given range, you can use the following query:
SELECT *
FROM ranges
WHERE end_date >= @start
AND start_date <= @end
, but in SQL Server this more complex query yields same results faster:
SELECT *
FROM ranges
WHERE (start_date > @start AND start_date <= @end)
OR (@start BETWEEN start_date AND end_date)
, and believe it or not, I have an article in my blog on this too:
SQL Server 2008 also lacks an efficient way to do cumulative aggregates, so this query:
SELECT mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id
can be more efficiently rewritten using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).
See this article in my blog on how to do it:
There is a certain kind of query, commonly met in financial applications, that pulls effective exchange rate for a currency, like this one in Oracle 11g:
SELECT TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
(
SELECT xac_currency, MAX(rte_date)
FROM t_rate
WHERE rte_currency = xac_currency
AND rte_date <= xac_date
)
This query can be heavily rewritten to use an equality condition which allows a HASH JOIN
instead of NESTED LOOPS
:
WITH v_rate AS
(
SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
FROM (
SELECT cur_id, dte_date,
(
SELECT MAX(rte_date)
FROM t_rate ri
WHERE rte_currency = cur_id
AND rte_date <= dte_date
) AS rte_effdate
FROM (
SELECT (
SELECT MAX(rte_date)
FROM t_rate
) - level + 1 AS dte_date
FROM dual
CONNECT BY
level <=
(
SELECT MAX(rte_date) - MIN(rte_date)
FROM t_rate
)
) v_date,
(
SELECT 1 AS cur_id
FROM dual
UNION ALL
SELECT 2 AS cur_id
FROM dual
) v_currency
) v_eff
LEFT JOIN
t_rate
ON rte_currency = cur_id
AND rte_date = rte_effdate
)
SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM (
SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
FROM t_transaction x
GROUP BY
xac_currency, TRUNC(xac_date)
)
JOIN v_rate
ON eff_currency = xac_currency
AND eff_date = xac_date
Despite being bulky as hell, the latter query is six times as fast.
The main idea here is replacing <=
with =
, which requires building an in-memory calendar table to join with.