How can I improve query performance for 200+ million records?

Background

I have a MySQL test environment with a table that contains over 200 million rows. On this table I have to execute two types of queries:

  1. Do certain rows exist? Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to know which sgtins are present in the table.
  2. Select those rows. Given a client_id and a list of sgtins, which can hold up to 50,000 items, I need to fetch the full rows (store, gtin, ...).

The table can grow to 200+ million records for a single client_id.

Test environment

Xeon E3-1545M / 32GB RAM / SSD. InnoDB buffer pool 24GB. (Production will be a larger server with 192GB RAM)

Table

CREATE TABLE `sgtins` (
  `client_id` INT UNSIGNED NOT NULL,
  `sgtin` varchar(255) NOT NULL,
  `store` varchar(255) NOT NULL,
  `gtin` varchar(255) NOT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  INDEX (`client_id`, `store`, `sgtin`),
  INDEX (`client_id`),
  PRIMARY KEY (`client_id`,`sgtin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Tests

First I generated random sgtin values, spread over 10 client_ids, to fill the table with 200 million rows.

I created a benchmark tool that executes the various queries I tried, and I used the explain plan to find out which performs best. For every test the tool reads fresh random values from the data I used to fill the database, to ensure every query is different.

For this post I will use 28 sgtins.

Temp table

CREATE TEMPORARY TABLE sgtins_tmp_table (`sgtin` varchar(255) primary key)
 engine=MEMORY;
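
To fill it, the benchmark tool loads the batch of sgtins; a minimal sketch with made-up placeholder values (the real values come from the generated test data):

-- placeholder values; the real tool loads up to 50K random sgtins from the test data
INSERT INTO sgtins_tmp_table (sgtin)
VALUES ('sgtin-0001'), ('sgtin-0002'), ('sgtin-0003');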

Exists query

I use this query to find out whether the sgtins exist; it is also the fastest query I found. For 50K sgtins this query takes between 3 and 9 seconds.

-- cost = 17 for 28 sgtins loaded in the temp table.
SELECT sgtin
FROM sgtins_tmp_table
WHERE EXISTS 
  (SELECT sgtin FROM sgtins 
  WHERE sgtins.client_id = 4 
  AND sgtins.sgtin = sgtins_tmp_table.sgtin);

Explain plan (screenshot not included)

Select queries

-- cost = 50.60 for 28 sgtins loaded in the temp table. 50K not usable.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins_tmp_table, sgtins
WHERE sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Explain plan (screenshot not included)

-- cost = 64 for 28 sgtins loaded in the temp table.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins
WHERE sgtins.client_id = 4
AND sgtins.sgtin IN ( SELECT sgtins_tmp_table.sgtin
 FROM sgtins_tmp_table);

Explain plan (screenshot not included)

-- cost = 50.60 for 28 sgtins loaded in the temp table.
SELECT sgtins_tmp_table.sgtin, sgtins.store
FROM sgtins_tmp_table, sgtins
WHERE EXISTS (SELECT client_id, sgtin FROM sgtins WHERE client_id = 4 AND sgtins.sgtin = sgtins_tmp_table.sgtin)
AND sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Explain plan (screenshot not included)

Summary

The exists query is usable, but the selects are too slow. What can I do about it? Any advice is welcome :)

Congress answered 13/6, 2019 at 8:21 Comment(18)
Can you do an EXPLAIN on your queries so we can see the execution plan? At first sight, I would say that the first index, INDEX (client_id, store, sgtin), is useless, as I don't see any query that has these 3 params in the where clause in this specific order. However, if you can provide the execution plans we can understand better what is happening. - Galatea
"The table can grow to 200+ million records for a single 'client_id'." Consider deploying range/list partitioning per client_id. But keep its limits in mind: in the past there was a hard limit of 1024 table partitions; I am not sure if MySQL 8 still has that limit, I believe it is now 8192. - Piegari
@ClaudioCorchez this index was added for future purposes. I will add explain plans. - Congress
@RaymondNijland I also considered this. But this test's focus is on the 200 million records for a single client_id. If I understand correctly, partitioning by client_id will be useful when I have more client_ids. Partitioning by store is also tricky, because I do not know, for all given sgtins, which store they belong to. But 90% of the time I do. Maybe I can still use this by first querying with store, filtering out what I did not find, and issuing a second query without store? - Congress
Not sure why you are using an (unindexed) temporary table to get the values; I think it's some way to prevent using OR all the time? But still, you should rewrite the first correlated subquery (EXISTS) to an INNER JOIN; most likely that is even faster.. - Piegari
As you can do something like SELECT sgtins.* FROM (SELECT 1 AS sgtin UNION [ALL] ...) AS filter INNER JOIN sgtin ON ... WHERE sgtins.client_id = 4, which makes a bit more sense memory-wise and most likely also performance-wise. - Piegari
@RaymondNijland The temp table has `sgtin varchar(255) primary key`; this is an index? - Congress
"The temp-table has: ``sgtin` varchar(255) primary key` this is an index?" Coffee time now noticed now..Piegari
"The exist query is usable but the selects are to slow. What can I do about it?" consider this example .. Keep in mind you can extend (add more sgtin) in FROM (SELECT 'a' AS sgtin) AS filter when you use UNION.Piegari
Let us continue this discussion in chat.Congress
Add an INDEX (client_id, sgtin) this will improve a bit. After you add the index run again the queries using explain and compare the 2 of them. Adding this index it should perform an index scanGalatea
What are you basing "The exist query is usable but the selects are to slow" on? The cost value? Keep in mind that subquery cost is not added to main cost. Hence, the EXISTS query may look cheaper, but in practice, the execution time should be about the same for both the two variants.Anzus
@oysteing I tested both with 50K sgtins and on the table with 200 million rows. The EXIST query is done in < 10s and the SELECT takes > 30 seconds and I got disconnected from MySQL.Congress
@MarkEbbers Strange! In both cases we are talking about a table scan with primary key look-ups. I do not really see what could make the difference. Why do you get disconnected? Have you set a low timeout?Anzus
Huh? PRIMARY KEY (client_id,sgtin) implies the pair is unique. But testing with 10 clientids and 28 sgtins -- That limits the table to about 280 rows, not 200M??Foliar
"For this post I will use 28 sgtins. Temp table" -- Is that a requirement in the long run? Or might you build a long IN(...)?Foliar
Do not trust EXPLAINs "cost" at this level of detail; you really need to run real timing tests.Foliar
@MarkEbbers - I don't what to answer a benchmark query; please provide the 'real' specs -- number of distinct client_ids (really 10?), size of the table compared to the buffer_pool, how many sgtins (really 50K), whether you have already built the temp table, etc. In the typical case, the temp table (vs IN list vs UNIONs) would need to be dynamic, hence part of the cost.Foliar

I would suggest rewriting your EXISTS SQL, as correlated subqueries tend to optimize badly most of the time.
The suggestion is to use an INNER JOIN instead:

SELECT filter.sgtin
FROM (SELECT '<value>' AS sgtin UNION ALL SELECT '<value>' ..) AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4

Most likely this is faster than using a temporary table.
But you are dealing with 50K values, so it would make sense to generate the needed derived-table SQL with dynamic SQL, directly from the temporary table.

Also, as I suggested in the chat:
making an index (sgtins, client_id) would most likely make more sense, depending on the data selectivity, which is not really clear.
That index might make your correlated subquery faster.

Query

# May also need to be changed for 50K values
# SET SESSION max_allowed_packet = ??; 


# Needed for GROUP_CONCAT, as it defaults to only 1024
SET SESSION group_concat_max_len = @@max_allowed_packet;

SET @UNION_SQL = NULL;

SELECT
  CONCAT(
       'SELECT '
    ,  GROUP_CONCAT(
          CONCAT("'", sgtins_tmp_table.sgtin,"'", ' AS sgtin')
          SEPARATOR ' UNION ALL SELECT '
       )
  )
FROM
 sgtins_tmp_table
INTO
 @UNION_SQL;


SET @SQL = CONCAT("
SELECT filter.sgtin
FROM (",@UNION_SQL,") AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
");


PREPARE q FROM @SQL;
EXECUTE q;
DEALLOCATE PREPARE q;

see demo
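
For illustration, with three values in the temporary table, the generated @SQL would take this shape (the values 'a', 'b', 'c' are placeholders):

-- illustrative only; the real statement is built by the GROUP_CONCAT above
SELECT filter.sgtin
FROM (SELECT 'a' AS sgtin UNION ALL SELECT 'b' AS sgtin UNION ALL SELECT 'c' AS sgtin) AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4;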

Edited because of comments

A more ideal approach would be to use a fixed (permanent) table, which you index, and use CONNECTION_ID() to separate the search values.

CREATE TABLE sgtins_filter (
    connection_id INT
  , sgtin varchar(255) NOT NULL
  , INDEX(connection_id, sgtin)
);

Then you can simply join the two tables:

SELECT sgtins_filter.sgtin
FROM sgtins_filter
INNER JOIN sgtins
ON
    sgtins_filter.sgtin = sgtins.sgtin
  AND
    sgtins_filter.connection_id = CONNECTION_ID()
  AND 
    sgtins.client_id = 4; 

see demo
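
A hypothetical usage sketch (not part of the original answer): each connection tags its search values with its own connection id, runs the join, then removes its rows:

-- load this connection's batch (placeholder values)
INSERT INTO sgtins_filter (connection_id, sgtin)
VALUES (CONNECTION_ID(), 'sgtin-0001'), (CONNECTION_ID(), 'sgtin-0002');

-- ... run the join query above ...

-- clean up only this connection's rows
DELETE FROM sgtins_filter WHERE connection_id = CONNECTION_ID();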

Piegari answered 13/6, 2019 at 10:1 Comment(16)
The UNION query is already much faster and has the same cost as the EXISTS query (17) according to the explain plan. It also works with 50K items. I also tried the script, but for that you need to update the thread_stack option of MySQL. (50K items results in a required stack of around 7MB.) I don't know if this is an option, because every connection will require 7MB instead of 192KB of memory. - Congress
"I don't know if this is an option because every connection will require 7MB instead of 192KB of memory" Well @MarkEbbers, not to mention your temporary table will also consume ~38/39 MB of memory per connection for 50K values. If you really want to optimize this you should make a fixed table; I believe you can use CONNECTION_ID() to keep the values separated, then you can also use a join with a smaller query.. If you fill the table, I think the best way is using an updatable view. - Piegari
You are right about the temp table, but this is only for the connections that are issuing that specific query. The thread_stack option is global for every connection, and there are a lot more applications using that database (with a lot more connections). But it is worth investigating how many connections there are, to determine whether this is acceptable. - Congress
A query with 50K union clauses also causes a stack overflow. (With the default setting of 256KB) - Congress
"A query with 50K union clauses also causes a stack overflow. (With the default setting of 256KB)" As I suggested before: use a fixed table and optimize this; see this approach. - Piegari
I will test this approach. - Congress
@RaymondNijland - "faster than using a temporary table" - Eh? You are building the equivalent temp table via that lengthy UNION. - Foliar
"(sgtins, client_id) would most likely make more sense" - No. The task is on the brink of blowing out the buffer_pool (200M rows in 24GB); having the opposite order provides better 'clustering', hence better caching. - Foliar
"You are building the equivalent temp table via that lengthy UNION." No? @RickJames, I am generating a SQL statement based on the values from that temporary table, to later be merged into another SQL statement and executed. - Piegari
"faster than using a temporary table" @RickJames Also, that generating approach was confirmed to be much faster by the topic starter himself, but it used a lot of resources.. That's why approach two was also added. - Piegari
"(sgtins, client_id) would most likely make more sense" - No. The task is on the brink of blowing out the buffer_pool (200M rows in 24GB); having the opposite order provides better 'clustering', hence better caching" Maybe, @RickJames; I tend to place the most selective column first in a multi-column index, which 9 out of 10 times is a pretty sound way to limit the scanned records even more, versus a not-so-selective column. - Piegari
@RaymondNijland - Contrary to the overwhelming sentiment, the selectivity of each component of a composite index is irrelevant. Think of a composite index as being the concatenation of the components; then think about how it drills down the BTree. It gets to the bottom just as fast, no matter how the bytes within it are arranged. - Foliar
@RaymondNijland - And in this example, there is some need to take advantage of clustering the lookups. See my point about table size versus buffer_pool_size. - Foliar
@RickJames If by "clustering the lookups" you mean that, considering the (InnoDB) BTree structure, with (client_id, sgtins) vs (sgtins, client_id) the index pages "align" better for the index (client_id, sgtins) to service these queries - that would make very much sense; I am rethinking it now.. - Piegari
@RaymondNijland - Since the query is WHERE client_id=1234 AND ..., PRIMARY KEY(client_id, ...) will have all the desired rows less scattered than the opposite ordering for the index. - Foliar
@RickJames Yes, we are on the same page.. that is what I meant with "align"; maybe it was the wrong word to use. - Piegari

I would write your exists query like this:

SELECT stt.sgtin
FROM sgtins_tmp_table stt
WHERE EXISTS (SELECT 1
              FROM sgtins s
              WHERE s.client_id = 4 AND
                    s.sgtin = stt.sgtin
             );

For this query, you want an index on sgtins(sgtin, client_id).
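
A minimal sketch of adding that index (the index name here is made up; any name works):

ALTER TABLE sgtins ADD INDEX sgtin_client (sgtin, client_id);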

Yaelyager answered 13/6, 2019 at 11:30 Comment(7)
I believe the MySQL optimizer is smart enough to recognize, when using WHERE EXISTS (SELECT sgtin FROM sgtins ..), that it does not really have to fetch the data from the table/index or use the select list there.. It seems it does not matter if you replace sgtin with 0, 1, NULL, or even 0/1 for that matter; see demo.. Yes, the index would most likely make more sense, depending on the data selectivity. - Piegari
@RaymondNijland . . . I'm not sure what the purpose of your comment is. I use select 1 because it is easy to type and works in all databases. The point of the answer is getting the indexes correct. - Yaelyager
@GordonLinoff Will try the index change. - Congress
"The point of the answer is getting the indexes correct." I know; I already suggested the same index in the chat.. "I'm not sure what the purpose of your comment is. I use select 1 because it is easy to type and works in all databases." I know that works best across RDBMSs; my point was that, for performance, it does not (really) matter in MySQL if you write sgtin or 1 - that was my point, more or less. - Piegari
In general, unless you are using the latest version, 8.0.16, using IN instead of EXISTS is preferred, since MySQL would convert this to a semijoin. In this specific case it will probably not gain much, since the number of rows from the inner table is probably much bigger than from the outer table, so being able to switch the order of accessing the tables will not be cheaper. In 8.0.16, MySQL will automatically convert EXISTS to IN. - Anzus
I'm stuck with MySQL 5.6 :) - Congress
EXISTS ignores what you put in the SELECT. 1 is the convention. In fact, EXPLAIN EXTENDED shows that whatever you use gets turned into 1 by the parser/optimizer. - Foliar

Assuming 200M rows and no more than 50K sgtins per client, there must be over 4K clients?

To benchmark with only 10 clients is risky. The Optimizer switches between using an index and doing a table scan in certain cases; this might be such a case.

So, please state the ultimate goal; I don't want to advise you on how to make a benchmark run faster, only to have the 'real' case not work with the advice.

Also, is the list of sgtins static? You implied as much by suggesting pre-building a MEMORY table. But this seems uncommon. Perhaps the 'real' case is given a different set of sgtins each time.

So, I will answer this question:

  • 200M rows
  • Table is more than 24GB
  • innodb_buffer_pool_size = 24G
  • Thousands of distinct client_id values. (With only 10, the Optimizer is tempted to ignore indexes and do a table scan.)
  • Thousands of sgtin values for each client_id
  • The pair (client_id, sgtin) is unique
  • Each query may have a different list of sgtins; that is, we cannot assume the same list of sgtins from run to run
  • Want to optimize something like SELECT sgtin FROM t WHERE client_id = 1234 AND sgtin IN (..long list..)
  • Want to optimize something like SELECT * FROM t WHERE client_id = 1234 AND sgtin IN (..long list..)

Regardless of the numbers that EXPLAIN provides, the following is the optimal solution for both queries:

WHERE client_id = 1234 AND sgtin IN (..long list..)
PRIMARY KEY(client_id, sgtin)   -- in this order.

Why?

  • The optimizer is happy to focus on client_id = constant and hop through the list of sgtins.
  • By having client_id first in the PK, all the activity for a SELECT will be focused on a small fraction of the table. This is important because it limits the number of blocks to touch to less than buffer_pool_size.
  • Technically, an independent INDEX(client_id, sgtin) would be faster for the SELECT sgtin..., but I don't recommend it, since it is so redundant and won't save much performance.
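
Putting these points together, here is a sketch of the table revised per this advice (columns assumed from the question; the extra secondary indexes from the question are omitted in this sketch):

CREATE TABLE `sgtins` (
  `client_id` INT UNSIGNED NOT NULL,
  `sgtin` varchar(255) NOT NULL,
  `store` varchar(255) NOT NULL,
  `gtin` varchar(255) NOT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`client_id`, `sgtin`)  -- client_id first: clusters each client's rows together
) ENGINE=InnoDB DEFAULT CHARSET=utf8;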

Comments on Cost analysis:

  • It does not take into account whether the block(s) are cached or not. With HDD drives, this can make a huge (10x) difference.
  • It does not take much account of index vs data, nor index + data (as in a non-covering secondary index)
  • It knows nothing about the distribution of values. (unless using MariaDB or MySQL 8.0, which have histograms)
Foliar answered 14/6, 2019 at 17:39 Comment(2)
"MariaDB has histograms; MySQL does not." MySQL also has it now in version 8 see manualPiegari
@RaymondNijland - Thanks for the correction; I made a change.Foliar
