General rules for simplifying SQL statements

I'm looking for some "inference rules" (similar to set operation rules or logic rules) that I can use to reduce a SQL query in complexity or size. Does anything like that exist? Any papers, any tools? Any equivalences that you found on your own? It's somewhat similar to query optimization, but not in terms of performance.

To state it differently: given a (complex) query with JOINs, SUBSELECTs and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result, by applying some transformation rules?

So, I'm looking for equivalent transformations of SQL statements like the fact that most SUBSELECTs can be rewritten as a JOIN.

Yashmak answered 1/7, 2009 at 14:14 Comment(1)
My approach is to learn relational theory in general and relational algebra in particular. Then learn to spot the constructs used in SQL to implement operators from the relational algebra (e.g. universal quantification a.k.a. division) and calculus (e.g. existential quantification). The gotcha is that SQL has features not found in the relational model, e.g. nulls, which are probably best refactored away anyhow. Recommended reading: SQL and Relational Theory: How to Write Accurate SQL Code by C. J. Date. – Operetta

To state it differently: given a (complex) query with JOINs, SUBSELECTs and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result, by applying some transformation rules?


This answer was written in 2009. Some of the query optimization tricks described here are obsolete by now, others can be made more efficient, yet others still apply. The statements about feature support by different database systems apply to versions that existed at the time of this writing.


That's exactly what optimizers do for a living (not that I'm saying they always do this well).

Since SQL is a set-based language, there is usually more than one way to transform one query into another.

Like this query:

SELECT  *
FROM    mytable
WHERE   col1 > @value1 OR col2 < @value2

can be transformed into this one (provided that mytable has a primary key):

SELECT  *
FROM    mytable
WHERE   col1 > @value1
UNION
SELECT  *
FROM    mytable
WHERE   col2 < @value2

or this one:

SELECT  mo.*
FROM    (
        SELECT  id
        FROM    mytable
        WHERE   col1 > @value1
        UNION
        SELECT  id
        FROM    mytable
        WHERE   col2 < @value2
        ) mi
JOIN    mytable mo
ON      mo.id = mi.id

The latter looks uglier but can yield better execution plans.
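For readers who want to convince themselves of the OR-to-UNION equivalence, here is a minimal sketch that runs the same two queries through SQLite via Python's sqlite3 module (the table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, col1 INT, col2 INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                 [(1, 5, 5), (2, 15, 5), (3, 5, 1), (4, 15, 1)])
value1, value2 = 10, 3  # row 4 satisfies both predicates

# Original single-statement form
or_rows = conn.execute(
    "SELECT * FROM mytable WHERE col1 > ? OR col2 < ? ORDER BY id",
    (value1, value2)).fetchall()

# UNION form: the PRIMARY KEY guarantees that a row matching both
# predicates still appears exactly once (UNION removes the duplicate)
union_rows = conn.execute(
    "SELECT * FROM mytable WHERE col1 > ? "
    "UNION SELECT * FROM mytable WHERE col2 < ? "
    "ORDER BY id",
    (value1, value2)).fetchall()

assert or_rows == union_rows == [(2, 15, 5), (3, 5, 1), (4, 15, 1)]
```

Note that row 4 matches both predicates yet appears once in both result sets, which is exactly the point of the PRIMARY KEY caveat.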

One of the most common things to do is replacing this query:

SELECT  *
FROM    mytable
WHERE   col IN
        (
        SELECT  othercol
        FROM    othertable
        )

with this one:

SELECT  *
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    othertable o
        WHERE   o.othercol = mo.col
        )
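The IN/EXISTS pair can be checked the same way; a quick sketch in SQLite via Python's sqlite3 (duplicates and a NULL in othercol are included on purpose, since neither affects a positive IN, unlike NOT IN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (col INT);
    CREATE TABLE othertable (othercol INT);
    INSERT INTO mytable VALUES (1), (2), (3);
    INSERT INTO othertable VALUES (2), (3), (3), (NULL);
""")

in_rows = conn.execute(
    "SELECT * FROM mytable WHERE col IN "
    "(SELECT othercol FROM othertable) ORDER BY col").fetchall()

exists_rows = conn.execute(
    "SELECT * FROM mytable mo WHERE EXISTS "
    "(SELECT NULL FROM othertable o WHERE o.othercol = mo.col) "
    "ORDER BY col").fetchall()

assert in_rows == exists_rows == [(2,), (3,)]
```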

In some RDBMSs (like PostgreSQL 8.4), DISTINCT and GROUP BY use different execution plans, so it's sometimes better to replace one with the other:

SELECT  mo.grouper,
        (
        SELECT  SUM(col)
        FROM    mytable mi
        WHERE   mi.grouper = mo.grouper
        )
FROM    (
        SELECT  DISTINCT grouper
        FROM    mytable
        ) mo

vs.

SELECT  grouper, SUM(col)
FROM    mytable
GROUP BY
        grouper

In PostgreSQL, DISTINCT sorts and GROUP BY hashes.

MySQL 5.6 lacks FULL OUTER JOIN, so this query:

SELECT  t1.col1, t2.col2
FROM    table1 t1
FULL OUTER JOIN
        table2 t2
ON      t1.id = t2.id

vs.

SELECT  t1.col1, t2.col2
FROM    table1 t1
LEFT JOIN
        table2 t2
ON      t1.id = t2.id
UNION ALL
SELECT  NULL, t2.col2
FROM    table1 t1
RIGHT JOIN
        table2 t2
ON      t1.id = t2.id
WHERE   t1.id IS NULL

See this article in my blog on how to do this more efficiently in MySQL.
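The same emulation can be demonstrated in SQLite (which also lacked FULL OUTER JOIN before version 3.39); in this sketch the RIGHT JOIN branch is written as a swapped LEFT JOIN so it runs on any SQLite version, via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INT, col1 TEXT);
    CREATE TABLE table2 (id INT, col2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES (2, 'x'), (3, 'y');
""")

# First branch keeps all of table1; the second branch (a swapped
# LEFT JOIN standing in for the RIGHT JOIN) adds the table2 rows
# that have no match in table1.
rows = conn.execute("""
    SELECT t1.col1, t2.col2
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.id = t2.id
    UNION ALL
    SELECT NULL, t2.col2
    FROM table2 t2
    LEFT JOIN table1 t1 ON t1.id = t2.id
    WHERE t1.id IS NULL
""").fetchall()

assert sorted(rows, key=repr) == sorted(
    [('a', None), ('b', 'x'), (None, 'y')], key=repr)
```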

This hierarchical query in Oracle 11g:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    animal
START WITH
        animal_id = :id
CONNECT BY
        PRIOR animal_id IN (father, mother)
ORDER BY
        animal_id

can be transformed to this:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
START WITH
        animal_id = :id
CONNECT BY
        parent = PRIOR animal_id
ORDER BY
        animal_id

The latter is more efficient.

See this article in my blog for the execution plan details.

To find all ranges that overlap the given range, you can use the following query:

SELECT  *
FROM    ranges
WHERE   end_date >= @start
        AND start_date <= @end

In SQL Server, however, this more complex query yields the same results faster:

SELECT  *
FROM    ranges
WHERE   (start_date > @start AND start_date <= @end)
        OR (@start BETWEEN start_date AND end_date)

And believe it or not, I have an article in my blog on this too.

SQL Server 2008 also lacks an efficient way to do cumulative aggregates, so this query:

SELECT  mi.id, SUM(mo.value) AS running_sum
FROM    mytable mi
JOIN    mytable mo
ON      mo.id <= mi.id
GROUP BY
        mi.id

can be more efficiently rewritten using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).

See this article in my blog on how to do it.
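For what it's worth, engines released since then (SQL Server 2012+, PostgreSQL 8.4+, SQLite 3.25+) support window functions, which replace both the triangular join and the cursor with a single ordered pass. A sketch in SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, value INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30)])

# Triangular self-join: every row is joined to all preceding rows,
# so the work grows quadratically with the row count
triangular = conn.execute("""
    SELECT mi.id, SUM(mo.value)
    FROM mytable mi
    JOIN mytable mo ON mo.id <= mi.id
    GROUP BY mi.id
    ORDER BY mi.id
""").fetchall()

# Window function: one ordered pass over the table
windowed = conn.execute("""
    SELECT id, SUM(value) OVER (ORDER BY id)
    FROM mytable
    ORDER BY id
""").fetchall()

assert triangular == windowed == [(1, 10), (2, 30), (3, 60)]
```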

There is a certain kind of query, commonly met in financial applications, that pulls the effective exchange rate for a currency, like this one in Oracle 11g:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM    t_transaction x
JOIN    t_rate r
ON      (rte_currency, rte_date) IN
        (
        SELECT  xac_currency, MAX(rte_date)
        FROM    t_rate
        WHERE   rte_currency = xac_currency
                AND rte_date <= xac_date
        )

This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:

WITH v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )
SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM    (
        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)
        )
JOIN    v_rate
ON      eff_currency = xac_currency
        AND eff_date = xac_date

Despite being bulky as hell, the latter query is six times as fast.

The main idea here is replacing <= with =, which requires building an in-memory calendar table to join with.

Leavings answered 1/7, 2009 at 14:17 Comment(8)
Bug in your first example: UNION does an OR, not an AND. – Teressaterete
+1 Those are some great examples of query transformations. It also shows that some of the optimised queries are not actually the simple-looking ones, e.g. the first query vs. the third one, which is a pity, as one could assume that the "simple" query would be easier for the optimiser to analyse. In other words, it seems optimising is not necessarily equal to simplifying. – Aerostation
Patriot ;), I disagree with this, because UNION eliminates duplicates; these are not equivalent: SELECT * FROM mytable WHERE col1 > @value1 OR col2 < @value2 cannot always be transformed into SELECT * FROM mytable WHERE col1 > @value1 UNION SELECT * FROM mytable WHERE col2 < @value2 – Denumerable
@Alex: as long as the table has a PRIMARY KEY defined, they are equivalent. A row that satisfies both OR'ed conditions will be selected exactly once, be it with an OR or with a UNION. If the table has exact duplicates (which implies having no PRIMARY KEY), then yes, they will be eliminated with UNION but not with OR. – Leavings
@Quassnoi: The first part: rewriting OR into UNION. After that I got Index Seeks instead of Clustered Index Scans (on SQL Server 2000) ... pretty good. – Yashmak
I love that you pointed out that in SQL, ugly code is often the best for performance. It drives me crazy when people want to take well-performing code and make it more "elegant" and kill performance. – Naughty
The first UNION query is equivalent to a query using both DISTINCT and OR. Without DISTINCT, it might remove too many duplicates. – Erasmoerasmus
@LukasEder: I've cleaned it up a little bit and moved the UNION thing from the comments discussion above into the post body. – Leavings

Here's a few from working with Oracle 8 & 9 (of course, sometimes doing the opposite might make the query simpler or faster):

Parentheses can be removed if they are not used to override operator precedence. A simple example is when all the boolean operators in your where clause are the same: where ((a or b) or c) is equivalent to where a or b or c.

A sub-query can often (if not always) be merged with the main query to simplify it. In my experience, this often improves performance considerably:

select foo.a,
       bar.a
  from foomatic  foo,
       bartastic bar
 where foo.id = bar.id and
       bar.id = (
         select ban.id
           from bantabulous ban
          where ban.bandana = 42
       )
;

is equivalent to

select foo.a,
       bar.a
  from foomatic    foo,
       bartastic   bar,
       bantabulous ban
 where foo.id = bar.id and
       bar.id = ban.id and
       ban.bandana = 42
;

Using ANSI joins separates a lot of "code monkey" logic from the really interesting parts of the where clause: The previous query is equivalent to

select foo.a,
       bar.a
  from foomatic    foo
  join bartastic   bar on bar.id = foo.id
  join bantabulous ban on ban.id = bar.id
 where ban.bandana = 42
;

If you want to check for the existence of a row, don't use count(*); instead, use either rownum = 1 or put the query in a where exists clause, so that only one row is fetched instead of all of them.
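A sketch of the existence-check point in SQLite via Python's sqlite3 (the table is invented for illustration, and since rownum is Oracle-specific, the short-circuiting form is shown with EXISTS):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INT, customer TEXT);
    INSERT INTO orders VALUES (1, 'alice'), (2, 'alice'), (3, 'bob');
""")

# COUNT(*) touches every matching row just to answer a yes/no question
count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer = 'alice'").fetchone()[0]
has_orders_count = count > 0

# EXISTS can stop scanning at the first matching row
has_orders_exists = bool(conn.execute(
    "SELECT EXISTS (SELECT 1 FROM orders WHERE customer = 'alice')"
).fetchone()[0])

assert has_orders_count == has_orders_exists
```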

Daren answered 1/7, 2009 at 15:3 Comment(1)
Wow, nice suggestion at the end. I never thought to pull the join logic out of the where clause and put it with the table defs, and I haven't seen it used commonly before, but it does make a lot of sense. – Win
  • I suppose the obvious one is to look for any cursors that can be replaced with a set-based SQL operation.
  • Next on my list is to look for any correlated sub-queries that can be rewritten as uncorrelated queries.
  • In long stored procedures, break separate SQL statements out into their own stored procedures. That way they will get their own cached query plan.
  • Look for transactions that can have their scope shortened. I regularly find statements inside a transaction that can safely be outside.
  • Sub-selects can often be rewritten as straightforward joins (modern optimisers are good at spotting simple ones).

As @Quassnoi mentioned, the Optimiser often does a good job. One way to help it is to ensure indexes and statistics are up to date, and that suitable indexes exist for your query workload.

Guitarist answered 1/7, 2009 at 14:19 Comment(2)
About breaking stored procedures into more: do not do that when you use temporary tables: then SQL Server (don't know about others) will re-calculate the query plan on each execution, thus hurting performance! – Salient
@Hans Kesting: I don't think that is true if all the DDL creation statements for all your temp tables are the first statements in your stored procedure. – Guitarist

I like everyone on a team to follow a set of standards to make code readable, maintainable, understandable, washable, etc. :)

  • everyone uses the same alias
  • no cursors. no loops
  • why even think of IN when you can EXISTS
  • INDENT
  • Consistency in coding style

There is some more stuff here: What are some of your most useful database standards?

Diarrhoea answered 1/7, 2009 at 14:53 Comment(1)
Agree. Having standards in a team boosts readability, maintainability and often performance too. At least for readability there are a couple of tools available, e.g. the SQLinForm formatter / beautifier. – Patent

I like to replace all sorts of subselects with join queries.

This one is obvious:

SELECT  *
FROM    mytable mo
WHERE   EXISTS
        (
          SELECT  *
          FROM    othertable o
          WHERE   o.othercol = mo.col
        )

with

SELECT  mo.*
FROM    mytable mo inner join othertable o on o.othercol = mo.col

And this one is underestimated:

SELECT  *
FROM    mytable mo
WHERE   NOT EXISTS
        (
          SELECT  *
          FROM    othertable o
          WHERE   o.othercol = mo.col
        )

with

SELECT  mo.*
FROM    mytable mo left outer join othertable o on o.othercol = mo.col
WHERE   o.othercol is null

It can help the DBMS choose a good execution plan for a big query.
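The second rewrite (an anti-join) is robust even when othertable contains duplicates, since every joined copy of a matching row is filtered out. A quick check in SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (col INT);
    CREATE TABLE othertable (othercol INT);
    INSERT INTO mytable VALUES (1), (2), (3);
    INSERT INTO othertable VALUES (2), (2);  -- duplicate on purpose
""")

not_exists = conn.execute(
    "SELECT * FROM mytable mo WHERE NOT EXISTS "
    "(SELECT * FROM othertable o WHERE o.othercol = mo.col) "
    "ORDER BY col").fetchall()

# Matching rows join (possibly several times) and are filtered out;
# non-matching rows join once against NULL and survive
anti_join = conn.execute(
    "SELECT mo.* FROM mytable mo "
    "LEFT OUTER JOIN othertable o ON o.othercol = mo.col "
    "WHERE o.othercol IS NULL ORDER BY mo.col").fetchall()

assert not_exists == anti_join == [(1,), (3,)]
```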

Adrell answered 1/7, 2009 at 14:33 Comment(2)
These won't necessarily always give exactly the same results: JOINing on a table will cause duplicates if there is more than one match in the "right" table for any particular value being joined in the "left" table. EXISTS and NOT EXISTS don't have this issue. (It could be resolved by using DISTINCT, but that reduces efficiency.) – Crosse
Replacing a semi-join with an inner join is usually wrong, and only correct in some cases where the inner joined table is guaranteed to produce 0-1 rows. But these days, hardly any optimiser needs users to do that transformation for them. – Erasmoerasmus

Given the nature of SQL, you absolutely have to be aware of the performance implications of any refactoring. Refactoring SQL Applications is a good resource on refactoring with a heavy emphasis on performance (see Chapter 5).

Keitel answered 1/7, 2009 at 15:58 Comment(0)

Although simplification may not equal optimization, simplification can be important in writing readable SQL code, which is in turn critical to being able to check your SQL code for conceptual correctness (not syntactic correctness, which your development environment should check for you). It seems to me that in an ideal world, we would write the most simple, readable SQL code and then the optimizer would rewrite that SQL code to be in whatever form (perhaps more verbose) would run the fastest.

I have found that thinking of SQL statements as based on set logic is very useful, particularly if I need to combine where clauses or figure out a complex negation of a where clause. I use the laws of boolean algebra in this case.

The most important ones for simplifying a where clause are probably DeMorgan's Laws (note that "·" is "AND" and "+" is "OR"):

  • NOT (x · y) = NOT x + NOT y
  • NOT (x + y) = NOT x · NOT y

This translates in SQL to:

NOT (expr1 AND expr2) -> NOT expr1 OR NOT expr2
NOT (expr1 OR expr2) -> NOT expr1 AND NOT expr2

These laws can be very useful in simplifying where clauses with lots of nested AND and OR parts.

It is also useful to remember that the expression field1 IN (value1, value2, ...) is equivalent to field1 = value1 OR field1 = value2 OR ... . This allows you to negate the IN () in one of two ways:

NOT field1 IN (value1, value2)  -- for longer lists
NOT field1 = value1 AND NOT field1 = value2  -- for shorter lists

A sub-query can be thought of this way also. For example, this negated where clause:

NOT (table1.field1 = value1 AND EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2))

can be rewritten as:

NOT table1.field1 = value1 OR NOT EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2)

These laws do not tell you how to transform a SQL query using a subquery into one using a join, but boolean logic can help you understand join types and what your query should be returning. For example, with tables A and B, an INNER JOIN is like A AND B, a LEFT OUTER JOIN is like (A AND NOT B) OR (A AND B), which simplifies to just A, and a FULL OUTER JOIN is A OR (A AND B) OR B, which simplifies to A OR B.
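De Morgan's laws are easy to verify mechanically by enumerating all four truth-value combinations; a sketch in SQLite via Python's sqlite3 (note that in SQL, comparison binds tighter than NOT, so NOT x = 1 parses as NOT (x = 1)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INT, y INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(0, 0), (0, 1), (1, 0), (1, 1)])

lhs = conn.execute(
    "SELECT * FROM t WHERE NOT (x = 1 AND y = 1) ORDER BY x, y").fetchall()

# NOT binds looser than =, so NOT x = 1 means NOT (x = 1)
rhs = conn.execute(
    "SELECT * FROM t WHERE NOT x = 1 OR NOT y = 1 ORDER BY x, y").fetchall()

assert lhs == rhs == [(0, 0), (0, 1), (1, 0)]
```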

Pameliapamelina answered 13/3, 2012 at 15:52 Comment(1)
I also find I use the implication rewrite rule a lot, i.e. ( P => Q ) <=> ( NOT ( P ) OR Q ). – Operetta

jOOQ supports pattern based transformation, which can be used in the online SQL parser and translator (look for the "patterns" dropdown), or as a parser CLI, or programmatically.

Since you're mainly looking for ways to turn your query into something simpler, not necessarily faster (which may depend on the target RDBMS), jOOQ could help you here.

Some examples include:

CASE to CASE abbreviation

-- Original
SELECT
  CASE WHEN x IS NULL THEN y ELSE x END,
  CASE WHEN x = y THEN NULL ELSE x END,
  CASE WHEN x IS NOT NULL THEN y ELSE z END,
  CASE WHEN x IS NULL THEN y ELSE z END,
  CASE WHEN x = 1 THEN y WHEN x = 2 THEN z END
FROM tab;

-- Transformed
SELECT
  NVL(x, y),       -- If available in the target dialect, otherwise COALESCE
  NULLIF(x, y),
  NVL2(x, y, z),   -- If available in the target dialect
  NVL2(x, z, y),   -- If available in the target dialect
  CHOOSE(x, y, z)  -- If available in the target dialect
FROM tab;

COUNT(*) scalar subquery comparison

-- Original
SELECT (SELECT COUNT(*) FROM tab) > 0;

-- Transformed
SELECT EXISTS (SELECT 1 FROM tab);

Flatten CASE

-- Original
SELECT
  CASE 
    WHEN a = b THEN 1 
    ELSE CASE
      WHEN c = d THEN 2
    END 
  END
FROM tab;

-- Transformed
SELECT
  CASE 
    WHEN a = b THEN 1 
    WHEN c = d THEN 2
  END
FROM tab;

NOT AND (De Morgan's rules)

-- Original
SELECT
  NOT (x = 1 AND y = 2),
  NOT (x = 1 AND y = 2 AND z = 3)
FROM tab;

-- Transformed
SELECT
  NOT (x = 1) OR NOT (y = 2),
  NOT (x = 1) OR NOT (y = 2) OR NOT (z = 3)
FROM tab;

Unnecessary EXISTS subquery clauses

-- Original
SELECT EXISTS (SELECT DISTINCT a, b FROM t);

-- Transformed
SELECT EXISTS (SELECT 1 FROM t);

There's a lot more.

Disclaimer: I work for the company behind jOOQ

Erasmoerasmus answered 18/11, 2022 at 13:44 Comment(0)

My approach is to learn relational theory in general and relational algebra in particular. Then learn to spot the constructs used in SQL to implement operators from the relational algebra (e.g. universal quantification a.k.a. division) and calculus (e.g. existential quantification). The gotcha is that SQL has features not found in the relational model e.g. nulls, which are probably best refactored away anyhow. Recommended reading: SQL and Relational Theory: How to Write Accurate SQL Code By C. J. Date.

In this vein, I'm not convinced "the fact that most SUBSELECTs can be rewritten as a JOIN" represents a simplification.

Take this query for example:

SELECT c 
  FROM T1 
 WHERE c NOT IN ( SELECT c FROM T2 );

Rewrite using JOIN

SELECT DISTINCT T1.c 
  FROM T1 NATURAL LEFT OUTER JOIN T2 
 WHERE T2.c IS NULL;

The join is more verbose!

Alternatively, recognize that the construct implements an antijoin on the projection of c, e.g. in pseudo-algebra:

T1 { c } antijoin T2 { c }

Simplification using relational operators:

SELECT c FROM T1 EXCEPT SELECT c FROM T2;
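A quick check that the NOT IN form and the EXCEPT form agree on plain data (SQLite via Python's sqlite3); beware that they diverge when T2.c contains NULLs, in which case NOT IN returns no rows at all, or when T1.c has duplicates, which EXCEPT removes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (c INT);
    CREATE TABLE T2 (c INT);
    INSERT INTO T1 VALUES (1), (2), (3);
    INSERT INTO T2 VALUES (2);
""")

not_in = conn.execute(
    "SELECT c FROM T1 WHERE c NOT IN (SELECT c FROM T2) ORDER BY c"
).fetchall()

except_rows = conn.execute(
    "SELECT c FROM T1 EXCEPT SELECT c FROM T2 ORDER BY c"
).fetchall()

assert not_in == except_rows == [(1,), (3,)]
```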
Operetta answered 13/3, 2012 at 16:33 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.