What is the "filtered" column in MySQL EXPLAIN telling me, and how can I make use of it?
Asked Answered
T

6

10

The MySQL 5.7 documentation states:

The filtered column indicates an estimated percentage of table rows that will be filtered by the table condition. That is, rows shows the estimated number of rows examined and rows × filtered / 100 shows the number of rows that will be joined with previous tables.

To attempt to understand this better, I tried it out on a query using the MySQL Sakila Sample Database. The table in question has the following structure:

mysql> SHOW CREATE TABLE film \G
*************************** 1. row ***************************
       Table: film
Create Table: CREATE TABLE `film` (
  `film_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `description` text,
  `release_year` year(4) DEFAULT NULL,
  `language_id` tinyint(3) unsigned NOT NULL,
  `original_language_id` tinyint(3) unsigned DEFAULT NULL,
  `rental_duration` tinyint(3) unsigned NOT NULL DEFAULT '3',
  `rental_rate` decimal(4,2) NOT NULL DEFAULT '4.99',
  `length` smallint(5) unsigned DEFAULT NULL,
  `replacement_cost` decimal(5,2) NOT NULL DEFAULT '19.99',
  `rating` enum('G','PG','PG-13','R','NC-17') DEFAULT 'G',
  `special_features` set('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') DEFAULT NULL,
  `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`film_id`),
  KEY `idx_title` (`title`),
  KEY `idx_fk_language_id` (`language_id`),
  KEY `idx_fk_original_language_id` (`original_language_id`),
  CONSTRAINT `fk_film_language` FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`) ON UPDATE CASCADE,
  CONSTRAINT `fk_film_language_original` FOREIGN KEY (`original_language_id`) REFERENCES `language` (`language_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1001 DEFAULT CHARSET=utf8

And this is the EXPLAIN plan for the query:

mysql> EXPLAIN SELECT * FROM film WHERE release_year=2006 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1000
     filtered: 10.00
        Extra: Using where

This table's sample dataset has 1,000 total rows, and all of them have release_year set to 2006. Using the formula in the MySQL documentation:

rows x filtered / 100 = "number of rows that will be joined with previous tables

So,

1,000 x 10 / 100 = 100 = "100 rows will be joined with previous tables"

Huh? What "previous table"? There is no JOIN going on here.

What about the first portion of the quote from the documentation? "Estimated percentage of table rows that will be filtered by the table condition." Well, the table condition is release_year = 2006, and all records have that value, so shouldn't filtered be either 0.00 or 100.00 (depending on what they mean by "filtered")?

Maybe it's behaving strangely because there's no index on release_year? So I created one:

mysql> CREATE INDEX test ON film(release_year);

The filtered column now shows 100.00. So, shouldn't it have shown 0.00 before I added the index? Hm. What if I make half the table have release_year be 2006, and the other half not?

mysql> UPDATE film SET release_year=2017 ORDER BY RAND() LIMIT 500;
Query OK, 500 rows affected (0.03 sec)
Rows matched: 500  Changed: 500  Warnings: 0

Now the EXPLAIN looks like this:

mysql> EXPLAIN SELECT * FROM film WHERE release_year=2006 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film
   partitions: NULL
         type: ref
possible_keys: test
          key: test
      key_len: 2
          ref: const
         rows: 500
     filtered: 100.00
        Extra: Using index condition

And, since I decided to confuse myself even further:

mysql> EXPLAIN SELECT * FROM film WHERE release_year!=2006 \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film
   partitions: NULL
         type: ALL
possible_keys: test
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1000
     filtered: 50.10
        Extra: Using where

So, an estimate of 501 rows will be filtered by the table condition and "joined with previous tables"?

I simply do not understand.

I realize it's an "estimate", but on what is this estimate based? If an index being present moves the estimate to 100.00, shouldn't its absence be 0.00, not 10.00? And what's with that 50.10 result in the last query?

Is filtered at all useful in determining if a query can be optimized further, or how to optimize it further, or is it generally just "noise" that can be ignored?

Towny answered 8/1, 2017 at 16:16 Comment(3)
This is from v8.0 refman. "For example, if rows is 1000 and filtered is 50.00 (50%), the number of rows to be joined with the following table is 1000 × 50% = 500" Please NOTE the word FOLLOWING not PREVIOUS as documented in someone comments.Dismuke
@WilsonHauck Good point. It seems that older versions had documented the word previous and almost all posts I've seen used this to describe filtered which obviously has a lot of ambiguity.Gracegraceful
@michael-moussa Please consider my Aug 6 Answer to be your most accurate and Accepted Answer.Dismuke
J
17

…number of rows that will be joined with previous tables…

In the absence of any joins, I believe this can be taken to mean number of rows

UPDATE - the documentation, now at least, says "following tables" but the point still stands, thanks @WilsonHauck


To take each of your examples in turn

1000 rows, all from 2006, no index…

EXPLAIN SELECT * FROM film WHERE release_year = 2006

key: NULL
rows: 1000
filtered: 10.00
Extra: Using where

Here the engine expects to visit 1000 rows, and expects to return around 10% of these

As the query is not using an index, it makes sense to predict that every row will be checked, but unfortunately the filtered estimate is inaccurate. I don't know how the engine makes this prediction, but as it doesn't know all the rows are from 2006 (until it checks them).. it's not the craziest thing in the world

Perhaps in the absence of further information, the engine expects any simple = condition to reduce the result set to 10% of the available rows

1000 rows, half from 2006, with index…

EXPLAIN SELECT * FROM film WHERE release_year = 2006

key: test
rows: 500
filtered: 100.00
Extra: Using index condition

Here the engine expects to visit 500 rows and expects to return all of them

Now the query is using the new index, the engine can make more accurate predictions. It can very quickly see that 500 rows match the condition, and will have to visit only and exactly these to satisfy the query

EXPLAIN SELECT * FROM film WHERE release_year != 2006

key: NULL
rows: 1000
filtered: 50.10
Extra: Using where

Here the engine expects to visit 1000 rows and return 50.10% of them

The engine has opted not to use the index, maybe the != operation is not quite as simple as = in this case, and therefore it makes sense to predict that every row will be visited

The engine has, however, made a fairly accurate prediction on how many of these visited rows will be returned. I don't know where the .10% comes from, but perhaps the engine has used the index or the results of previous queries to recognise that around 50% of the rows will match the condition


It's a bit of a dark art, but the filtered value does give you some fairly useful information, and some insight into why the engine has made certain decisions

If the number of rows is high and the filtered rows estimate is low (and accurate), it may be a good indication that a carefully applied index could speed up the query

Jerold answered 5/8, 2019 at 11:0 Comment(4)
What does it mean then if rows = 1 and filtered = 44.57?Gracegraceful
@Gracegraceful I don't know, what gave you that result? Perhaps the engine has estimated that one row will be visited and has estimated that there is a 44.57% chance that that row will be returnedJerold
That's from a third row (a dependent subquery) in explain extended but the question is that whether or not this interpreting is right 44.57% chance that that row will be returned?Gracegraceful
@revo, it's an estimate, the engine expects either 1 or 0 rows to be returned, and has used that estimate to build the query plan. Without further information, it's hard to see what that means for your query.. perhaps ask in another qu?Jerold
A
4

how can I make use of it?

High numbers (ideally filtered: 100.00) indicate, that the query is using a "good" index, or an index would be useless.

Consider a table with a deleted_at TIMESTAMP NULL column (soft deletion) without an index on it, and like 99% of rows contain NULL (are not deleted). Now with a query like

SELECT * FROM my_table WHERE deleted_at IS NULL

you might see

filtered: 99.00

In this case an index on deleted_at would be useless, due to the overhead of a second lookup (finding the filtered rows in the clustered index). In worst case the index might even hurt the performance, if the optimizer decides to use it.

But if you query for "deleted" rows with

SELECT * FROM my_table WHERE deleted_at IS NOT NULL

you should get something like

filtered: 1.00

The low number indicates, that the query could benefit from an index. If you now create the index on (deleted_at), EXPLAIN will show you

filtered: 100.00

I would say: Anything >= 10% is not worth creating an index. That at least for single-column conditions.

A different story, is when you have a condition on multiple columns like

WHERE a=1 AND b=2

Assuming 1M rows in the table and a cardinality of 10 for both columns (each column contains 10 distinct values) randomly distributed, with an index on (a) the engine would analize 100K rows (10% due to the index on a) and return 10K rows (10% of 10% due to condition on b). EXPLAIN should show you rows: 100000, filtered: 10.00. In this case extending the single column index on (a) to a composite index on (a, b) should improve the query time by factor 10. And EXPLAIN sould show you rows: 10000, filtered: 100.00.

However - That all is more a theory. The reason: I often see filtered: 100.00 when it should be rather 1.00, at least for low cardinality columns and at least on MariaDB. That might be different for MySQL (I can't test that right now), but your example shows a similar behavior (10.00 instead of 100.00). Actually I don't remember when the filtered value has ever helped me. First things I look at are: The order of the tables (if it's a JOIN), the used key, the used key length and the number of examined rows.

Adams answered 7/8, 2019 at 19:15 Comment(0)
D
2

From existing 5.7 documentation today at url https://dev.mysql.com/doc/refman/5.7/en/explain-output.html

filtered (JSON name: filtered)

The filtered column indicates an estimated percentage of table rows that will be filtered by the table condition. The maximum value is 100, which means no filtering of rows occurred. Values decreasing from 100 indicate increasing amounts of filtering. rows shows the estimated number of rows examined and rows × filtered shows the number of rows that will be joined with the following table. For example, if rows is 1000 and filtered is 50.00 (50%), the number of rows to be joined with the following table is 1000 × 50% = 500.

Dismuke answered 6/8, 2019 at 15:9 Comment(1)
Note the end of the last sentence is FOLLOWING, not previous as documented in SOME comments.Dismuke
S
0

So you have to write one of these to understand perfectly but the estimate is based not on the contents but meta data about the contents and statistics.

Let me give you a specific made up example I'm not saying any sql platform does what I describe here this is just an example:

You have a table with 1000 rows and max value for year column is 2010 and min value for year column is 2000 -- without any other information you can "guess" that where year = 2007 will take 10% of all items assuming an average distribution.

In this case it would return 1000 and 10.

To answer your final question filtered might be useful if (as shown above) you only have one "default" value that is throwing everything off -- you might decide to use say null instead of a default to get your queries to perform better. Or you might see that statistics needs to be run on your tables more often because the ranges change a lot. This depends a lot on a given platform and your data model.

Spirituous answered 8/1, 2017 at 16:44 Comment(4)
Your example makes sense for that hypothetical made up database, but doesn't really help me understand how MySQL does it or what it means. The release_year column is nullable and there is no default value. The 1,000 rows in that table right now all have release_year = 2006, but I still get filtered: 10.00 regardless of what value I use in the query (2006, 2016, IS NULL, = NULL, etc). Only variation I get is if I do !=, which results in an equally cryptic filtered: 90.00.Towny
@MichaelMoussa -- I can come up with a reason why it would give these results or we could look at the source code to find out. But this seems like a non-useful exercise to me.Spirituous
@MichaelMoussa -- I just want to check and make sure -- you did ANALYZE TABLE on this table, right? (dev.mysql.com/doc/refman/5.7/en/analyze-table.html). If you haven't then whatever it is doing is not based on an analysis of the table contents but some sort of "default" value.Spirituous
@MichaelMoussa, if the engine estimates that a generic = condition will return 10% of the rows, it makes sense that it would estimate that a != condition would return 90% of the rowsJerold
T
0

I find the "filtered" column to be useless.

EXPLAIN (today) uses crude statistics to derive many of the numbers it shows. "Filtered" is an example of how bad they can be.

To get even deeper into numbers, run EXPLAIN FORMAT=JSON SELECT ... This, in newer versions of MySQL, will provide the "cost" for each possible execution plan. Hence, it gives you clues of what options it thought about and the "cost basis" for the plan that was picked. Unfortunately, it uses a constant for fetching a row -- giving no weighting to whether the row came from disk or was already cached.

A more precise metric of what work was done can be derived after the fact via the STATUS "Handler%" values. I discuss that, plus simple optimization techniques in http://mysql.rjweb.org/doc.php/index_cookbook_mysql .

Histograms exist in 8.0 and 10.0; they will provide more precision. They probably help make "filtered" be somewhat useful.

Technicality answered 5/8, 2019 at 23:15 Comment(0)
T
0

The factor filtered is the proportionality factor between rows_examined_per_scan and rows_produced_per_join from the EXPLAIN FORMAT=JSON SELECT ... query.

Torrey answered 28/11, 2023 at 16:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.