Measuring the complexity of SQL statements

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?

It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?

[Edit/Note] While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.

[Edit/Note 2] Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.

Hulbig answered 28/7, 2010 at 14:2 Comment(8)
Is it the complexity of the source code, or the complexity of the processing it requires, that you're interested in measuring?Gimble
I am actually wondering if there is a way to measure the complexity of the source code. With C#/C++/Java, I've often used Cyclomatic Complexity to determine which methods should be tested first. Here, I'd like to know which queries need the most attention from testing.Hulbig
Do you intend to expand your VIEW definitions into queries? Does the use of a SQL user defined function make code less complex, or does its definition need to be expanded out too?Marr
Note that some of my more complex queries are actually simply a 'copy & paste' of an established SQL 'pattern' e.g. a relational division construct.Marr
@onedaywhen: I don't know how useful it would be to compare a query to a stored procedure, but you could compare the complexity of multiple stored procedures to see which one is the most complex.Hulbig
I had not heard of cyclomatic complexity until you posted this question. Thanks for forcing me to educate myself.Explain
@Brian Gideon You might also want to look into the related metric, Halstead complexity. See Ira Baxter's answer below.Hulbig
@Marr I think that using views (instead of inline subqueries) or functions is a way to reduce the readability complexity of a particular SQL query. So they must not be expanded.Malines

Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of how complex the arithmetic is).

The "control flow" in a SQL query is best related to "and" and "or" operators in query.

The "computational complexity" is best related to operators such as SUM or implicit JOINS.

Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
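
As a concrete illustration, here is a minimal sketch in Python of how such a categorization could be turned into numbers. It is an assumption-laden toy: the keyword sets and the Halstead-style counts are my own illustrative choices, and a regular expression stands in for a real SQL lexer.

import re

# Illustrative (assumed) categorization: AND/OR/WHEN/JOIN as "control flow",
# aggregates and arithmetic operators as "computation".
DECISION_KEYWORDS = {"AND", "OR", "WHEN", "JOIN"}
COMPUTATION_TOKENS = {"SUM", "AVG", "COUNT", "MIN", "MAX", "+", "-", "*", "/"}

def sql_complexity(query: str) -> dict:
    tokens = re.findall(r"[A-Za-z_]+|[+\-*/]", query.upper())
    decisions = [t for t in tokens if t in DECISION_KEYWORDS]
    operators = [t for t in tokens if t in COMPUTATION_TOKENS]
    operands = [t for t in tokens if t not in DECISION_KEYWORDS and t not in COMPUTATION_TOKENS]
    return {
        "cyclomatic": len(decisions) + 1,                                 # decision points + 1
        "halstead_length": len(operators) + len(operands),                # N = N1 + N2
        "halstead_vocabulary": len(set(operators)) + len(set(operands)),  # n = n1 + n2
    }

print(sql_complexity(
    "SELECT dept, SUM(salary) FROM emp JOIN dept ON emp.dept_id = dept.id "
    "WHERE active = 1 AND (grade = 2 OR grade = 3) GROUP BY dept"
))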

What the SQL optimizer does to queries is, I think, absolutely irrelevant. The purpose of complexity measures is to characterize how hard it is for a person to understand the query, not how efficiently it can be evaluated.

Similarly, what the DDL says or whether views are involved or not shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of the machinery inside a used abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractions.

Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long as they reflect some truth about complexity and you can compare them relative to one another. That way you can identify which SQL fragments are the most complex, sort them all, and focus your testing attention on the most complicated ones.

Factitive answered 31/7, 2010 at 4:27 Comment(5)
As far as you know, does any such tool exist?Hulbig
Well, sort of yes. My company offers a Source Code Search Engine (SCSE) (semanticdesigns.com/Products/SearchEngine) which scans over a set of files to prepare an index for searching. The SCSE happens to compute a number of simple metrics (SLOC, CommentCount, Cyclomatic, Halstead) over each file as a whole during the scan, and it will process many languages, including PLSQL. PLSQL of course has SQL as a sublanguage, and IIRC, SCSE computes software complexity numbers pretty much as I have described above. If you put your SQL fragments into files, the SCSE would probably do it.Factitive
... There is always the question of where your SQL fragments are. If they are embedded in string fragments in ODBC calls, extracting them and measuring them is going to be difficult because the parts are scattered across the code and it isn't instantly obvious that any particular string literal is part of a query or, if so, where it goes. If your SQL queries are embedded in a stored procedure language such as PLSQL, they are obviously much easier to "extract". But the ideal tool in that case is one that measures the SQL queries separately, in situ, so you don't have to extract them by hand or by some hack.Factitive
... in this latter case, what you need is a tool to compute the complexity of fragments of the stored procedure file. My company also offers Metrics tools for many languages (but not presently any stored procedure language) that compute metrics over program elements (e.g. functions/methods) and summary rollups, based on a framework for computing such metrics over abstract syntax trees produced by parsing the source code. That metrics machinery could be focused on PLSQL or TSQL to produce probably exactly what you want, but it is custom work.Factitive
I've found that once you decide on the "scoring" it's extremely easy to implement and measure metrics such as cyclomatic complexity! I was actually thinking about a metric which combined COCOMO-style size with cyclomatic complexity. My reasoning behind this was that if you had a 1000 line query which had a complexity measure of, say, 20, it may be that this is less complicated than a 20 line query of the same complexity... just a thought.Sava

I'm not sure the retrieval of the query plans will answer the question: the query plans hide a part of the complexity of the computation performed on the data before it is returned (or used in a filter), and they require a significant database to be relevant. In fact, complexity and execution time are somewhat opposed; something like "Good, Fast, Cheap - pick any two".

Ultimately, it's about the chances of making a mistake, or of not understanding the code I've written, isn't it?

Something like:

  • number of tables times (1
  • +1 per join expression (+1 per outer join?)
  • +1 per predicate after WHERE or HAVING
  • +1 per GROUP BY expression
  • +1 per UNION or INTERSECT
  • +1 per function call
  • +1 per CASE expression
  • )
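
A rough sketch of this scoring in Python might look like the following; the regex patterns and the final "tables * (1 + features)" combination are just my reading of the list above, not a calibrated metric:

import re

# Crude keyword counting stands in for real SQL parsing; the patterns are illustrative.
FEATURE_PATTERNS = [
    r"\bJOIN\b",                                      # +1 per join expression
    r"\b(?:LEFT|RIGHT|FULL)\s+(?:OUTER\s+)?JOIN\b",   # +1 more per outer join
    r"\b(?:WHERE|HAVING)\b",                          # first predicate of each clause
    r"\b(?:AND|OR)\b",                                # each additional predicate
    r"\bGROUP\s+BY\b",
    r"\b(?:UNION|INTERSECT)\b",
    r"\b[A-Z_]+\s*\(",                                # function calls (very rough)
    r"\bCASE\b",
]

def query_score(sql: str) -> int:
    q = sql.upper()
    tables = max(len(re.findall(r"\b(?:FROM|JOIN)\b", q)), 1)   # crude table count
    features = sum(len(re.findall(p, q)) for p in FEATURE_PATTERNS)
    return tables * (1 + features)

print(query_score(
    "SELECT region, COUNT(*) FROM orders o LEFT JOIN customers c ON o.cid = c.id "
    "WHERE o.total > 100 AND c.region = 'EU' GROUP BY region"
))
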
Butterflies answered 28/7, 2010 at 14:36 Comment(4)
This is exactly the kind of thing I am looking for. If I can't find one, I might brew my own, similar to this.Hulbig
You could also remove some points (half a point?) for doing a search on an indexed field. And don't forget your ORDER BYs too.Kathline
As someone mentioned, this measure would not be about the efficiency of the SQL statements. It's about their complexity, or the risk they present to the tests (e.g. missing a predicate, using an inner instead of a left join, or the infamous "why is my simple query taking forever to execute?", a.k.a. the missing join). In that sense, I don't see why the presence of an index should be taken into account.Butterflies
Also, LEFT/RIGHT JOINs should have a higher complexity score than INNER JOIN.Malines

Please feel free to try my script that gives an overview of the stored procedure size, the number of object dependencies and the number of parameters -

Calculate TSQL Stored Procedure Complexity

Bacteriophage answered 23/12, 2012 at 9:20 Comment(0)

SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement; it will be a crude description of the steps the engine will use to execute your query.
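
For example, with SQLite (used here only because it ships with Python; the table and query are invented for illustration), EXPLAIN QUERY PLAN returns one row per step:

import sqlite3

# SQLite is used only because it ships with Python; other engines spell this
# differently (EXPLAIN, EXPLAIN PLAN FOR, SET SHOWPLAN_ALL, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, SUM(total) FROM orders WHERE total > 100 GROUP BY customer_id"
).fetchall()

for step in plan:
    print(step)   # each row describes one step the engine intends to take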

Troat answered 28/7, 2010 at 14:5 Comment(3)
"SQL queries are declarative rather than procedural" -- which is why you can't consider the SQL DML in isolation from the SQL DDL.Marr
In principle then cyclomatic complexity could be calculated for an execution plan thus giving an indirect measure of the complexity of the SQL source that produced it. The problem is that execution plans are typically an amalgamation containing the execution described by all of the 'sub-routines' (in this case, views, table valued functions, etc.), so that wouldn't work either!Plectognath
One part of the job of the query optimizer of an SQL engine is to simplify queries and to choose the best plan according to its knowledge of the data. Two different SQL queries can lead to the same execution plan. Here the question is more about readability of the query, independently of the data.Malines

Well, I don't know of any tool that does such a thing, but it seems to me that what would make a query more complicated could be measured by:

  • the number of joins
  • the number of where conditions
  • the number of functions
  • the number of subqueries
  • the number of casts to different datatypes
  • the number of case statements
  • the number of loops or cursors
  • the number of steps in a transaction

However, while it is true that the more complex queries might appear to be the ones with the most possible defects, I find that the simple ones are also very likely to contain defects, as they are more likely to be written by someone who doesn't understand the data model; they may appear to work correctly but in fact return the wrong data. So I'm not sure such a metric would tell you much.

Beshore answered 28/7, 2010 at 15:32 Comment(1)
Like any static code analysis, the usefulness would be limited. So I agree with what you are saying. But let's consider a situation where a single developer or three equally skilled developers wrote 20 queries. If it were possible to determine which queries were the most complex, and hence most likely to contain defects, testing could focus first and/or most on those queries. Static code analyzers are never indicators of correctness; they are just indicators. They give you something else with which to sniff for 'code smells'.Hulbig

In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.

Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).

Not brilliant, but a reasonable proxy for complexity in the absence of anything else, I think.
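
For instance, assuming the third-party sqlparse library is available (an assumption on my part; any formatter that produces consistent output would do), the whole measurement fits in a few lines of Python:

import sqlparse  # third-party: pip install sqlparse

def loc_proxy(sql: str) -> int:
    # Reformat consistently, then count non-blank lines as a crude complexity proxy.
    formatted = sqlparse.format(sql, reindent=True, keyword_case="upper")
    return sum(1 for line in formatted.splitlines() if line.strip())

print(loc_proxy(
    "select a, b from t join u on t.id = u.id "
    "where a > 1 and b in (select x from v)"
))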

Plectognath answered 9/8, 2012 at 13:16 Comment(1)
Very pragmatic approach and easy to implement once you have a good formatter!Malines

Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:

  1. Apply a simple lexer to the query (like the ones used for syntax coloring in text editors or here on SO) to split the query into tokens and give each token a class:
    • SQL keywords
    • SQL function names
    • string literals with character escapes
    • string literals without character escapes
    • string literals which are dates or date+time
    • numeric literals
    • comma
    • parenthesis
    • SQL comments (--, /* ... */)
    • quoted user words
    • non quoted user words: everything else
  2. Give a score to each token, using different weights for each class (and different weights for particular SQL keywords).
  3. Add the scores of each token.
  4. Done.

This should work quite well since, for example, counting subqueries amounts to counting the number of SELECT and FROM keywords.

By using this algorithm with different weight tables you can even measure the complexity along different dimensions: for example, to make more nuanced comparisons between queries, or to score higher the queries that use keywords or functions specific to a particular SQL engine (e.g. GROUP_CONCAT on MySQL).

The algorithm can also be tweaked to take into account the case of SQL keywords (increasing complexity if they are not consistently upper case), or to account for indentation (carriage returns, the position of keywords on a line).
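
A bare-bones version of steps 1-3 in Python could look like this; the token classes are a subset of the list above and the weights are arbitrary examples rather than calibrated values:

import re

# A tiny regex lexer: each alternative is a named token class, each class has a weight.
TOKEN_SPEC = [
    ("comment",  r"--[^\n]*|/\*.*?\*/"),
    ("string",   r"'(?:[^']|'')*'"),
    ("number",   r"\b\d+(?:\.\d+)?\b"),
    ("keyword",  r"\b(?:SELECT|FROM|WHERE|JOIN|LEFT|RIGHT|INNER|OUTER|ON|GROUP|BY|HAVING"
                 r"|ORDER|UNION|CASE|WHEN|THEN|ELSE|END|AND|OR|NOT|IN|EXISTS)\b"),
    ("function", r"\b[A-Z_][A-Z_0-9]*(?=\s*\()"),
    ("word",     r"\b[A-Z_][A-Z_0-9]*\b"),
    ("punct",    r"[(),.;*=<>]"),
]
WEIGHTS = {"comment": 0, "string": 1, "number": 1, "keyword": 3,
           "function": 2, "word": 1, "punct": 0}

LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
                   re.IGNORECASE | re.DOTALL)

def readability_score(sql: str) -> int:
    # Steps 2 and 3 of the algorithm above: weight each token's class and sum.
    return sum(WEIGHTS[match.lastgroup] for match in LEXER.finditer(sql))

print(readability_score(
    "SELECT name, COUNT(*) FROM users u LEFT JOIN orders o ON o.user_id = u.id "
    "WHERE u.active = 1 GROUP BY name"
))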

Note: I have been inspired by @redcalx's answer, which suggested applying a standard formatter and counting lines of code. My solution is simpler, however, as it doesn't require building a full AST (abstract syntax tree).

Malines answered 12/3, 2020 at 8:33 Comment(0)

In programming languages we have several methods to compute the time complexity or space complexity.

Similarly, we could do something like this for SQL: in a procedure you can count the number of lines and loops, much as in a programming language. Unlike a program, though, where the cost usually depends only on the input, in SQL it also depends entirely on the data in the tables/views being operated on, plus the overhead complexity of the query itself.

For example, a simple row-by-row query:

   SELECT * FROM table;
   -- depends entirely on the number of records, say n, hence O(n)

   SELECT MAX(input) FROM table;
   -- MAX adds an extra evaluation per row, therefore t * O(n),
   -- where t is the cost of each MAX evaluation
Mascagni answered 12/3, 2020 at 8:40 Comment(0)

Toad has a built-in feature for measuring McCabe cyclomatic complexity on SQL: https://blog.toadworld.com/what-is-mccabe-cyclomatic-complexity

Gratis answered 14/7, 2022 at 13:58 Comment(0)

Well, if you're using SQL Server, I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).

Here is a link that goes over some of the things you should look at in the execution plan.

Catcall answered 28/7, 2010 at 14:5 Comment(0)

Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.

SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).

Golden answered 28/7, 2010 at 14:10 Comment(0)

A good question. The problem is that for a SQL query like:

SELECT * FROM foo;

the complexity may depend on what "foo" is and on the database implementation. For a function like:

int f( int n ) {
   if ( n == 42 ) {
      return 0;
   }
   else {
      return n;
   }
}

there is no such dependency.

However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.

Serin answered 28/7, 2010 at 14:12 Comment(2)
I somewhat disagree about the foo example. That would be like factoring in the complexity of the called functions when measuring the complexity of procedural code.Butterflies
Agreed. Cyclomatic complexity, for example, tells you the number of possible paths through a section of source code, and in normal usage does not count the additional paths within subroutines that are being called. It's about the complexity of the section of code at hand, i.e. how readable and thus maintainable it is.Plectognath

It's reasonable enough to consider the complexity to be what it would be if you coded the query yourself. If the table has N rows, then:

  1. A simple SELECT would be O(N)
  2. An ORDER BY is O(N log N)
  3. A JOIN is O(N*M)
  4. A DROP TABLE is O(1)
  5. A SELECT DISTINCT is O(N^2)
  6. A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )
Pythian answered 22/8, 2019 at 10:22 Comment(1)
Some of those might be the case if there were no indices, but they're certainly not true in general.Ithyphallic
