What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.

From this data I am building some graphs so you can compare the stats retrieved from the MBeans. This means I will be grabbing some data at an interval using timestamps.

So the real question is: is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?

Thanks,

Josh

Gualtiero answered 14/1, 2009 at 18:17

There are several things you can do:

  1. Build your indexes to match the queries you are running. Run EXPLAIN on the queries you actually issue and make sure they all use an index where possible (a sketch covering points 1 and 2 follows this list).

  2. Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a chosen partition key. MySQL supports this natively from version 5.1.

  3. If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
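A minimal sketch of points 1 and 2, assuming a hypothetical jmx_stats table (the table, column, and partition names are invented for illustration; the question does not give a schema):

-- Range-partition by month so each partition stays small and old data is cheap to drop (MySQL 5.1+).
CREATE TABLE jmx_stats (
    server_id INT NOT NULL,
    metric    VARCHAR(64) NOT NULL,
    ts        DATETIME NOT NULL,
    value     BIGINT NOT NULL
)
PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p200901 VALUES LESS THAN (TO_DAYS('2009-02-01')),
    PARTITION p200902 VALUES LESS THAN (TO_DAYS('2009-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Index shaped like the graphing query: equality on server/metric, range on timestamp.
CREATE INDEX idx_server_metric_ts ON jmx_stats (server_id, metric, ts);

-- Verify the index is actually used before trusting the query on 388M rows.
EXPLAIN
SELECT ts, value
FROM jmx_stats
WHERE server_id = 42
  AND metric = 'HeapMemoryUsage.used'
  AND ts BETWEEN '2009-01-01 00:00:00' AND '2009-01-14 23:59:59';

EXPLAIN should report idx_server_metric_ts with a row estimate far below the full table size; if it shows a full table scan, the index does not match the query.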

Medora answered 14/1, 2009 at 18:34

3 suggestions:

  1. index
  2. index
  3. index

P.S. For timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (seconds since 1970, or similar), as in the sketch below.
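A small sketch of the integer-timestamp idea (table and column names are made up):

-- Store the sample time as seconds since the epoch in a 4-byte unsigned integer.
CREATE TABLE sample (
    taken_at  INT UNSIGNED NOT NULL,   -- UNIX_TIMESTAMP() value
    heap_used BIGINT NOT NULL,
    KEY (taken_at)
);

INSERT INTO sample (taken_at, heap_used) VALUES (UNIX_TIMESTAMP(), 123456789);

-- Convert back to a readable date only when presenting the data.
SELECT FROM_UNIXTIME(taken_at) AS taken_at, heap_used
FROM sample
WHERE taken_at BETWEEN UNIX_TIMESTAMP('2009-01-01') AND UNIX_TIMESTAMP('2009-01-14');

UNIX_TIMESTAMP() and FROM_UNIXTIME() do the conversions at the edges, so the stored column stays a plain integer.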

Connelly answered 14/1, 2009 at 18:29

Well, for a start, I would suggest you use "offline" processing to produce 'graph-ready' data (for most of the common cases) rather than trying to query the raw data on demand, along the lines of the sketch below.
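One way to sketch that, assuming the raw samples sit in a table like the jmx_stats example above (all names hypothetical); a scheduled job could run the rollup every hour:

-- Pre-aggregated, graph-ready points; the raw table is only touched by the batch job.
CREATE TABLE graph_points (
    server_id  INT NOT NULL,
    metric     VARCHAR(64) NOT NULL,
    hour_start DATETIME NOT NULL,
    avg_value  DOUBLE NOT NULL,
    max_value  BIGINT NOT NULL,
    PRIMARY KEY (server_id, metric, hour_start)
);

-- Roll up recent raw samples into one row per server/metric/hour; re-running is safe.
INSERT INTO graph_points (server_id, metric, hour_start, avg_value, max_value)
SELECT server_id, metric,
       DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'),
       AVG(value), MAX(value)
FROM jmx_stats
WHERE ts >= NOW() - INTERVAL 1 HOUR
GROUP BY server_id, metric, DATE_FORMAT(ts, '%Y-%m-%d %H:00:00')
ON DUPLICATE KEY UPDATE avg_value = VALUES(avg_value), max_value = VALUES(max_value);

The graphing queries then read graph_points, which holds at most one row per server/metric/hour instead of one row per raw sample.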

Quillen answered 14/1, 2009 at 18:29

If you are using MySQL 5.1 you can use the new features, but be warned that they still contain quite a few bugs.

First, you should use indexes. If that is not enough, you can try splitting the tables using partitioning.

If that still isn't enough, you can also try load balancing.

Unravel answered 14/1, 2009 at 18:36

A few suggestions.

You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate it: for instance, pre-compute totals by hour, by user, by week, whatever fits your reports, and store those in cache tables that your graphing queries read from. If you can shrink your dataset by an order of magnitude that way, good for you!

    "This means I will be grabbing some data at an interval using timestamps."

So this means you only use data from the last X days?

Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition, as in the sketch below). It also groups all records from the same time period close together on disk, so it is much more cache-efficient.
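A sketch of the partition-drop approach, assuming a table range-partitioned by month as in the jmx_stats example further up (names hypothetical):

-- Expiring data older than ~90 days becomes a near-instant metadata operation...
ALTER TABLE jmx_stats DROP PARTITION p200901;

-- ...instead of a DELETE that has to touch tens of millions of rows:
-- DELETE FROM jmx_stats WHERE ts < '2009-02-01';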

Now, if you use MySQL, I strongly suggest MyISAM tables for this. You don't get crash safety or transactions, and locking is coarse (whole-table locks), but the table is much smaller than with InnoDB, which means it is more likely to fit in RAM, which means much quicker access.

Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.

    "Is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?"

That depends on the table and the queries; can't give any advice without knowing more.

If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy joins or hash aggregates; basically the only thing it can do is a nested-loop index scan, which is fine on a cached table and absolutely atrocious otherwise when random access is involved.

I suggest you test with Postgres. For big aggregates its smarter optimizer does work well.

Example:

CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;

(serie is a pre-filled table containing 16M rows with n = 1 .. 16000000)

MySQL    Postgres    Operation
58 s     100 s       INSERT
75 s     51 s        CREATE INDEX ON (category, id) (useless)
9.3 s    5 s         SELECT category, SUM(counter) FROM t GROUP BY category;
1.7 s    0.5 s       SELECT category, SUM(counter) FROM t WHERE id > 15000000 GROUP BY category;

On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).

Inflexible answered 30/4, 2011 at 10:16
  1. EXPLAIN Your SELECT Queries.

  2. LIMIT 1 When Getting a Unique Row:

     SELECT * FROM user WHERE state = 'Alabama'          -- wrong
     SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1  -- better

  3. Index the Search Fields: Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.

  4. Index and Use Same Column Types for Joins: If your application contains many JOIN queries, you need to make sure that the columns you join on are indexed on both tables. This affects how MySQL internally optimizes the join operation.

  5. Do Not ORDER BY RAND(): If you really need random rows out of your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets exponentially worse as your data grows. The problem is that MySQL has to evaluate RAND() (which takes processing power) for every single row in the table before sorting them and giving you just one row.

  6. Use ENUM over VARCHAR: ENUM columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.

  7. Use NOT NULL If You Can: Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.

    "NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."

  8. Store IP Addresses as UNSIGNED INT: In your queries you can use INET_ATON() to convert an IP to an integer and INET_NTOA() for the reverse (see the sketch below). There are also similar functions in PHP called ip2long() and long2ip().
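A small sketch of tip 8, using a hypothetical login_log table:

-- Store IPv4 addresses as 4-byte unsigned integers instead of VARCHAR(15).
CREATE TABLE login_log (
    user_id INT NOT NULL,
    ip      INT UNSIGNED NOT NULL
);

INSERT INTO login_log (user_id, ip) VALUES (1, INET_ATON('192.168.0.10'));

-- Convert back to dotted-quad form only for display, and range-scan as integers.
SELECT user_id, INET_NTOA(ip) AS ip
FROM login_log
WHERE ip BETWEEN INET_ATON('192.168.0.0') AND INET_ATON('192.168.0.255');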

Alastair answered 12/10, 2017 at 4:20
