Efficiently storing 7.300.000.000 rows
How would you tackle the following storage and retrieval problem?

Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:

  • id (unique row identifier)
  • entity_id (takes on values between 1 and 2.000.000 inclusive)
  • date_id (incremented by one each day; will take on values between 1 and 3.650 (ten years: 365*10))
  • value_1 (takes on values between 1 and 1.000.000 inclusive)
  • value_2 (takes on values between 1 and 1.000.000 inclusive)

entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years worth of daily data (7.300.000.000 rows (3.650*2.000.000)).

What is described above is the write pattern. The read pattern is simple: all queries will be made on a specific entity_id. I.e. retrieve all rows describing entity_id = 12345.

Transactional support is not needed, but the storage solution must be open source. Ideally I'd like to use MySQL, but I'm open to suggestions.

Now - how would you tackle the described problem?

Update: I was asked to elaborate regarding the read and write patterns. Writes to the table will be done in one batch per day where the new 2M entries will be added in one go. Reads will be done continuously with one read every second.

Tuberculin answered 20/3, 2009 at 10:32 Comment(4)
Ten years is very likely not exactly 3650 days, due to leap years.Nephron
How will you access it? From PHP across a network? From C++ on local machine...?Grannias
@unwind: good point. The correct number to memorize is 1461 days in 4 years, which is closer to being correct. It's still not exact, but the error doesn't matter until you get to 100 years of data.Clemence
@knorv: that one read per second in total or one read per second per entity?Chengtu

Use partitioning. With your read pattern you'd want to partition by entity_id hash.
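
A minimal sketch of how that routing works (the partition count of 16 is an arbitrary choice of mine; the DDL would be along the lines of PARTITION BY HASH(entity_id) PARTITIONS 16 on the CREATE TABLE):

```python
# Hypothetical partition count; for an integer column, MySQL's
# PARTITION BY HASH(entity_id) routes a row to MOD(entity_id, n).
NUM_PARTITIONS = 16

def partition_for(entity_id: int) -> int:
    """Mirror of MySQL's routing for HASH partitioning."""
    return entity_id % NUM_PARTITIONS

# All ten years of rows for one entity land in the same partition,
# so "retrieve all rows for entity_id = 12345" touches one partition:
rows = [(12345, date_id) for date_id in range(1, 3651)]
partitions = {partition_for(e) for e, _ in rows}
assert len(partitions) == 1
```

Since every row for a given entity_id maps to the same partition, the single read pattern only ever has to open one partition.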

Chengtu answered 20/3, 2009 at 10:36 Comment(6)
I think this is the best solution. It makes it possible, for example, to count the number of records without opening 2000 files. Or something like select count(*) from table where value_1 = 100. Your data comes more alive.Vitalize
@S.Lott: why should it? It'll use index with logarithmic cost of access.Chengtu
@tuinstoel: That's not a use case. The use cases are write data and fetch by a specific entity_id. Optimizing for something that's not a use case isn't helpful, is it?Clemence
@S.Lott. It is not about optimizing, it is about making something not impossible in the early stage. Data has value on its own.Vitalize
@usr: you see no benefit of spreading data evenly among multiple servers?Chengtu
How does this relate to sharding?Candlelight

"Now - how would you tackle the described problem?"

With simple flat files.

Here's why

"all queries will be made on a specific entity_id. I.e. retrieve all rows describing entity_id = 12345."

You have 2.000.000 entities. Partition based on entity number:

level1= entity/10000
level2= (entity/100)%100
level3= entity%100

Each file of data is then level1/level2/level3/batch_of_data

You can then read all of the files in a given part of the directory to return samples for processing.

If someone wants a relational database, then load files for a given entity_id into a database for their use.
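
A minimal Python sketch of this layout (the base directory, the rows.csv file name, and the CSV row format are my assumptions; the answer only fixes the three directory levels):

```python
import os

BASE = "data"  # hypothetical root directory; not specified in the answer

def entity_path(entity: int) -> str:
    """Map an entity id to the level1/level2/level3 directory above."""
    level1 = entity // 10000        # 0..200 for entities 1..2.000.000
    level2 = (entity // 100) % 100  # 0..99
    level3 = entity % 100           # 0..99
    return os.path.join(BASE, str(level1), str(level2), str(level3))

def append_row(entity: int, date_id: int, value_1: int, value_2: int) -> None:
    """Daily batch write: append one row to the entity's file."""
    path = entity_path(entity)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "rows.csv"), "a") as f:
        f.write(f"{date_id},{value_1},{value_2}\n")

def read_entity(entity: int) -> list:
    """The read pattern: all rows for one entity, one directory lookup."""
    try:
        with open(os.path.join(entity_path(entity), "rows.csv")) as f:
            return [tuple(int(v) for v in line.split(",")) for line in f]
    except FileNotFoundError:
        return []
```

The read for entity_id = 12345 resolves to the single path data/1/23/45 with no index lookup at all.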


Edit On day numbers.

  1. The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.

  2. The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.

Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.

To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.


Edit on open/close

For writing, you have to leave the file(s) open. You do periodic flushes (or close/reopen) to assure that stuff really is going to disk.

You have two choices for the architecture of your writer.

  1. Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.

  2. Have several files open concurrently for writing. When querying, merge these files into a single result. This is helpful if queries are relatively rare. You pay for merging the data at query time.
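
A sketch of option 2, assuming each writer's per-entity file is already in date_id order (plain lists stand in for the open files here; actual file handling is elided):

```python
import heapq

# Each concurrent writer produces its own per-entity stream of
# (date_id, value_1, value_2) rows, already in date order; a query
# merges the streams lazily into one ordered result.

def read_merged(row_streams):
    """Merge several date-ordered row streams, paying the cost at query time."""
    return list(heapq.merge(*row_streams, key=lambda row: row[0]))

writer_a = [(1, 10, 20), (3, 11, 21)]
writer_b = [(2, 12, 22), (4, 13, 23)]
merged = read_merged([writer_a, writer_b])
assert [row[0] for row in merged] == [1, 2, 3, 4]
```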

Clemence answered 20/3, 2009 at 10:44 Comment(18)
I'm voting this up - too often people miss the simpler solution. Level 1 has 201 directories, levels 2 and 3 have 100 each, well within the bounds of any file system.Behindhand
My only concern is that the dates are stored as day numbers 1-3650ish. What happens if you get to 10 years and 1 day and you need to roll them over - that's a lot of rename operations.Behindhand
In a DB, it would be "delete from X where daynum = 1; update X set daynum = daynum -1;".Behindhand
Instead of a lot of flat files you can also use a lot of Sqlite files, but I think partitioning one big table is better.Vitalize
If you use a lot of Sqlite files you can create a unique index on date_id, not possible with a flat file. If you use one big table in MySQL you can create a unique index on the combination of entity_id and date_id.Vitalize
@tuinstoel: a lot of SQLite files is a lot of SQL overhead for no real value. Even though SQLite is fast, this is much, much faster. Similarly, a big MySQL table is going to be much slower than this. The complexity of SQL adds no real value.Clemence
@Pax: a delete and update on 7 billion rows is unlikely to finish in your lifetime. A rename on 2.000.000 files will probably take about 36 hours -- and won't require any undo storage or fill up a database transaction log.Clemence
Appending entities to large files would still be slow though wouldn't it? Opening and closing a new file for each insert. Don't know just asking.Omalley
@Clemence There is extra value, you get a unique index that checks whether date_id is really unique (data consistency). SQL isn't such an overhead because SQLite caches SQL statements.Vitalize
@Clemence I don't know whether a partitioned MySQL table is slow? I never tried, but why should it be slow? MySQL only needs to read one partition. If you want to do some different reporting you can, because all the data is in one table. It is much more flexible.Vitalize
Well, that might even work if you'd set this up with GFS. Although I'm not sure setting up a GFS cluster would be simpler than setting up a cluster of MySQL slaves.Chengtu
@tuinstoel: any SQL is more overhead than direct OS files. Since this is so simple, OS files are sufficient. The "unique index" is also part of the OS file naming -- with less overhead than in SQL.Clemence
@Clemence No you can't have a unique index on date_id inside a flat file cause flat files don't have indexes. Your file path corresponds with the entity_id, not the date_id.Vitalize
@S.Lott, Don't understand how you want to use the file name to ensure uniqueness of the entity_id, date_id combination?Vitalize
@tuinstoel: unique file name IS your unique index. Don't need a separate index schema.Clemence
Doesn't that mean you will have 7.300.000.000 files in the end? That is quite a lot.Vitalize
"entity_id combined with date_id is unique" -- it's provided as unique data. I'm not sure anything needs to be done to ensure this. I think it's imposed from outside this storage schema.Clemence
Other advantages: This can readily be scaled across machines if necessary (scatter some NFS mounts in those directories, for example). Using mv, updates are "atomic". Write to a new file, rename it over the old. Bang, the file is updated "instantly", and atomically.Embrocation

You might want to look at these questions:

Large primary key: 1+ billion rows MySQL + InnoDB?

Large MySQL tables

Personally, I'd also think about calculating your row width to give you an idea of how big your table will be (as per the partitioning note in the first link).

HTH,

S

Grillparzer answered 20/3, 2009 at 10:36 Comment(0)

Your application appears to have the same characteristics as mine. I wrote a MySQL custom storage engine to efficiently solve the problem. It is described here

Imagine your data is laid out on disk as an array of 2M fixed length entries (one per entity) each containing 3650 rows (one per day) of 20 bytes (the row for one entity per day).
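
With that layout, locating any row is pure arithmetic rather than an index lookup; a sketch (constant and function names are mine):

```python
ROW_BYTES = 20   # one entity's row for one day, per the layout above
DAYS = 3650      # ten years of daily slots per entity

def row_offset(entity_id: int, date_id: int) -> int:
    """Byte offset of one row in the fixed-length on-disk array."""
    return ((entity_id - 1) * DAYS + (date_id - 1)) * ROW_BYTES

# One entity's ten years of rows are contiguous: 3650 x 20 = 73.000
# bytes, readable with a single seek.
start = row_offset(12345, 1)
end = row_offset(12345, DAYS) + ROW_BYTES
assert end - start == DAYS * ROW_BYTES
```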

Your read pattern reads one entity. It is contiguous on disk, so it takes 1 seek (about 8 millisecs) and reads 3650x20 bytes = about 73K at maybe 100MB/sec ... so it is done in a fraction of a second, easily meeting your 1-query-per-second read pattern.

The update has to write 20 bytes in 2M different places on disk. In the simplest case this would take 2M seeks, each of which takes about 8 millisecs, so it would take 2M*8ms = 4.5 hours. If you spread the data across 4 "raid0" disks it could take 1.125 hours.

However, the places are only about 73K apart, which means there are roughly 200 such places within a 16MB block (typical disk cache size), so it could operate at anything up to 200 times faster (1 minute). Reality is somewhere between the two.
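
Checking the arithmetic above (constants taken from the answer; the exact cache-block figure comes out a bit above the round 200 quoted):

```python
SEEK_MS = 8                 # average seek time assumed in the answer
ENTITIES = 2_000_000
SPACING = 3650 * 20         # bytes between one entity's slot and the next

# Worst case: one seek per updated entity.
worst_hours = ENTITIES * SEEK_MS / 1000 / 3600      # about 4.4 hours

# Best case bound: slots are SPACING bytes apart, so one 16MB cache
# block covers several update targets at once.
targets_per_block = (16 * 1024 * 1024) // SPACING   # roughly 200
```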

My storage engine operates on that kind of philosophy, although it is a little more general purpose than a fixed length array.

You could code exactly what I have described. Putting the code into a MySQL pluggable storage engine means that you can use MySQL to query the data with various report generators etc.

By the way, you could eliminate the date and entity id from the stored row (because they are the array indexes), and maybe the unique id too, if you don't really need it, since (entity_id, date_id) is unique. Store the 2 remaining values as 3-byte ints. Then your stored row is 6 bytes, you have about 700 update targets per 16MB, and therefore faster inserts and a smaller file.
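
The 3-byte packing works because 1.000.000 fits comfortably in 24 bits (2**24 = 16.777.216); a sketch (function names are mine):

```python
def pack_row(value_1: int, value_2: int) -> bytes:
    """Pack two values in 1..1.000.000 (each fits in 3 bytes) into 6 bytes."""
    return value_1.to_bytes(3, "big") + value_2.to_bytes(3, "big")

def unpack_row(row: bytes) -> tuple:
    """Recover (value_1, value_2) from a 6-byte stored row."""
    return (int.from_bytes(row[:3], "big"), int.from_bytes(row[3:], "big"))

row = pack_row(1_000_000, 42)
assert len(row) == 6
assert unpack_row(row) == (1_000_000, 42)
```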

Edit Compare to Flat Files

I notice that the comments generally favor flat files. Don't forget that directories are just indexes implemented by the file system, and they are generally optimized for relatively small numbers of relatively large items. Access to files is generally optimized so that it expects a relatively small number of files to be open, and has a relatively high overhead for open and close, and for each file that is open. All of those "relatively" are relative to the typical use of a database.

Using file system names as an index for an entity_id, which I take to be a non-sparse integer from 1 to 2 million, is counter-intuitive. In a program you would use an array, not a hash table, for example, and you are inevitably going to incur a great deal of overhead for an expensive access path that could simply be an array indexing operation.

Therefore if you use flat files, why not use just one flat file and index it?

Edit on performance

The performance of this application is going to be dominated by disk seek times. The calculations I did above determine the best you can do (although you can make INSERT quicker by slowing down SELECT - you can't make them both better). It doesn't matter whether you use a database, flat files, or one flat file, except that you can add more seeks that you don't really need and slow it down further. For example, indexing (whether it's the file system index or a database index) causes extra I/Os compared to "an array look up", and these will slow you down.

Edit on benchmark measurements

I have a table that looks very much like yours (or almost exactly like one of your partitions). It was 64K entities not 2M (1/32 of yours), and 2788 'days'. The table was created in the same INSERT order that yours will be, and has the same index (entity_id,day). A SELECT on one entity takes 20.3 seconds to inspect the 2788 days, which is about 130 seeks per second as expected (on 8 millisec average seek time disks). The SELECT time is going to be proportional to the number of days, and not much dependent on the number of entities. (It will be faster on disks with faster seek times. I'm using a pair of SATA2s in RAID0 but that isn't making much difference).

If you re-order the table into entity order (ALTER TABLE x ORDER BY (ENTITY,DAY)), then the same SELECT takes 198 millisecs (because it is reading the entity's rows in order in a single disk access). However, the ALTER TABLE operation took 13.98 DAYS to complete (for 182M rows).

There are a few other things the measurements tell you:

  1. Your index file is going to be as big as your data file. It is 3GB for this sample table. That means (on my system) all of the index is accessed at disk speeds, not memory speeds.

  2. Your INSERT rate will decline logarithmically. The INSERT into the data file is linear, but the insert of the key into the index is log. At 180M records I was getting 153 INSERTs per second, which is also very close to the seek rate. It shows that MySQL is updating a leaf index block for almost every INSERT (as you would expect, because it is indexed on entity but inserted in day order). So you are looking at 2M/153 secs = 3.6 hrs to do your daily insert of 2M rows. (Divided by whatever effect you can get by partitioning across systems or disks.)

Amory answered 21/3, 2009 at 3:51 Comment(0)

I had a similar problem (although with much bigger scale - about your yearly usage every day)

Using one big table got me screeching to a halt - you can pull a few months but I guess you'll eventually partition it.

Don't forget to index the table, or else you'll be messing with a tiny trickle of data every query; oh, and if you want to do mass queries, use flat files

Boyles answered 20/3, 2009 at 10:46 Comment(0)

Your description of the read patterns is not sufficient. You'll need to describe what amounts of data will be retrieved, how often and how much deviation there will be in the queries.

This will allow you to consider doing compression on some of the columns.

Also consider archiving and partitioning.

Hackery answered 20/3, 2009 at 10:43 Comment(0)

If you want to handle huge data with millions of rows, it can be considered similar to a time-series database, which logs the time and saves the data to the database. Some of the ways to store the data are using InfluxDB or MongoDB.

Cuda answered 24/10, 2014 at 4:7 Comment(0)
