MySQL Partitioning a VARCHAR(60)
Asked Answered
T

2

6

I have a very large 500 million rows table with the following columns:

  • id - Bigint - Autoincrementing primary index.
  • date - Datetime - Approximately 1.5 million rows per date, data older 1 year is deleted.
  • uid - VARCHAR(60) - A user ID
  • sessionNumber - INT
  • start - INT - epoch of start time.
  • end - INT - epoch of end time.
  • More columns not relevant for this query.

The combination of uid and sessionNumber forms a uinque index. I also have an index on date.

Due to the sheer size, I'd like to partition the table.

Most of my accesses would be by date, so partitioning by date ranges seems intuitive, but as the date is not part of the unique index, this is not an option.

Option 1: RANGE PARTITION on Date and BEFORE INSERT TRIGGER

I don't really have a regular issue with the uid and sessionNumber uniqueness being violated. The source data is consistent, but sessions that span two days may be inserted on two consecutive days with midnight being the end time of the first and start time of the second.

I'm trying to understand if I could remove the unique key and instead use a trigger that would

  • Check if there is a session with the same identifiers the previous day and if so,
  • Updates the end date.
  • cancels the actual insert.

However, I am not sure if I can 1) trigger an update on the same table. or 2) prevent the actual insert.

Option 2: LINEAR HASH PARTITION on UID

My second option is to use a linear hash partition on the UID. However I cannot see any example that utilizes a VARCHAR and converts it to an INTEGER which is used for the HASH partitioning.

However I cannot finde a permitted way to convert from VARCHAR to INTEGER. For example

ALTER TABLE mytable
PARTITION BY HASH (CAST(md5(uid) AS  UNSIGNED integer)) 
PARTITIONS 20

returns that the partition function is not allowed.

Trangtranquada answered 21/11, 2017 at 22:32 Comment(0)
V
10

HASH partitioning must work with a 32-bit integer. But you can't convert an MD5 string to an integer simply with CAST().

Instead of MD5, CRC32() can take an arbitrary string and converts to a 32-bit integer. But this is also not a valid function for partitioning.

mysql> alter table v partition by hash(crc32(uid));
ERROR 1564 (HY000): This partition function is not allowed

You could partition by the string using KEY Partitioning instead of HASH partitioning. KEY Partitioning accepts strings. It passes whatever input string through MySQL's built-in PASSWORD() function, which is basically related to SHA1.

However, this leads to another problem with your partitioning strategy:

mysql> alter table v partition by key(uid);
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function

Your table's primary key id does not include the column uid that you want to partition by. This is a restriction of MySQL's partitioning:

every unique key on the table must use every column in the table's partitioning expression.

Here's the table I'm testing with (it would have been a good idea for you to include this in your question):

CREATE TABLE `v` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `date` datetime NOT NULL,
  `uid` varchar(60) NOT NULL,
  `sessionNumber` int(11) NOT NULL,
  `start` int(11) NOT NULL,
  `end` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uid` (`uid`,`sessionNumber`),
  KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Before going any further, I have to wonder why you want to use partitioning anyway? "Sheer size" is not a reason to partition a table.

Partitioning, like any optimization, is done for the sake of specific queries you want to optimize for. Any optimization improves one query at the expense of other queries. Optimization has nothing to do with the table. The table is happy to sit there with 5 billion rows, and it doesn't care. Optimization is for the queries.

So you need to know which queries you want to optimize for. Then decide on a strategy. Partitioning might not be the best strategy for the set of queries you need to optimize!

Vague answered 22/11, 2017 at 0:9 Comment(3)
I'd like to partition by date range because I will store a rolling 13 months of data and would like to be able to delete historic data without locking the database. If I cannot do the latter, all my production queries will be SELECT ... FROM v WHERE uid = '123' order by session_number ascTrangtranquada
So based on your above response I have now ... 1) removed the id column, 2) declared uid/session_number as the primary key, 3) partitioned on those colums using key partitioning. This is now copying the table, but I'll be doing some performance comparison between retrieval of rows in the current un-partitioned table vs the partitioned able. I'm also assuming that using partitioning I can effectively reduce the lock times when deleting historic data by running the DELETE statements one partition at a time.Trangtranquada
Even faster would be ALTER TABLE...DROP PARTITION. That's not a DELETE, it's more like DROP TABLE. It can't be rolled back, but it's much faster. And yes, it locks only the partition(s) you name. But it can only be used with RANGE or LIST partitioning, not HASH or KEY partitioning. See dev.mysql.com/doc/refman/5.7/en/…Vague
J
0

I'll assume your 'uid' is a 128-bit UUID kind of value, which can be stored as a BINARY(16), because that is generally worth the trouble.

Next, stay away from the 'datetime' type, as it is stored like a packed string, and doesn't hold any timezone information. Store date-time-values either as pure numerical values (the number of seconds since the UNIX-epoch), or let MySQL do that for you and use the timestamp(N) type. Also don't call a column 'date', not just because that is a reserved word, but also because the value contains time details too.

Next, stay away from using anything else than latin1 as the CHARSET of (all) your tables. Only ever do UTF-8-ness at the column level. This to prevent unnecessarily byte-wide columns and indexes creeping in over time. Adopt this habit and you'll happily look back on it after some years, promised.

This makes the table look like:

CREATE TABLE `v` (
  `uuid` binary(16) NOT NULL,
  `mysql_created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `visitor_uuid` BINARY(16) NOT NULL,
  `sessionNumber` int NOT NULL,
  `start` int NOT NULL,
  `end` int NOT NULL,
  PRIMARY KEY (`uuid`),
  UNIQUE KEY (`visitor_uuid`,`sessionNumber`),
  KEY (`mysql_created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITIONED BY RANGE COLUMNS (`uuid`)
( PARTITION `p_0` VALUES LESS THAN (X'10')
, PARTITION `p_1` VALUES LESS THAN (X'20')
...
, PARTITION `p_9` VALUES LESS THAN (X'A0')
, PARTITION `p_A` VALUES LESS THAN (X'B0')
...
, PARTITION `p_F` VALUES LESS THAN (MAXVALUE)
);

To make the KEY (mysql_created_at) be only on the date-part, needs a calculated column, which can be added in-place, and then an index on it is also light to add, so I'll leave that as homework.

Jessen answered 25/10, 2020 at 8:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.