Is DynamoDB right for my 1M events-per-day scenario where I need access to both records and summary (aggregate) information

Software engineer from the C world breaking into the cloud space. I am trying to understand whether I can use DynamoDB for all my scenario needs, whether I shouldn't use it at all, or whether I should use DynamoDB for some scenarios and offload others to another database (maybe a more traditional RDBMS).

I have a bunch of machines which send telemetry to my AWS APIs. My current idea is to put these events in a DynamoDB table. In the first year I am expecting about 1M new entries total in the table per day (average > 10 per second), with peak times being closer to 100 events per second. The size of the telemetry events will be 0.5-1KB. Firstly, I simply want to capture every telemetry event and log it. That is about 1GB per day.

I would use an RDBMS, but I am concerned that, with its ACID properties, database access will slow down significantly as the database grows, given we have 10s to 100s of writes per second. Am I right to be concerned, or will a traditional RDBMS be suitable for my needs? I don't feel I need the atomic guarantees of an RDBMS (I can live with eventual consistency), especially for reading/aggregating queries.

I was thinking my primary key would be a composite key of machine ID (partition key) and time as a Unix millisecond epoch (sort key). This should support a key scenario such as retrieving a list of all telemetry events for a specific time range and a set of machines.
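
To make that concrete, here is a rough sketch of the kind of table layout and query I have in mind (Python/boto3; the table name "Telemetry" and the attribute names "machineId" and "ts" are just placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Hypothetical table: partition key = machineId (string),
# sort key = ts (number, Unix epoch milliseconds).
table = dynamodb.Table("Telemetry")

def events_for_machine(machine_id, start_ms, end_ms):
    """All telemetry events for one machine within a time window."""
    items = []
    kwargs = {
        "KeyConditionExpression": Key("machineId").eq(machine_id)
        & Key("ts").between(start_ms, end_ms)
    }
    while True:
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return items

# "A set of machines" would mean one Query per machine ID,
# since a single Query can only target a single partition.
```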

However, there are then other things I wish to do, such as getting daily totals for various telemetry events, like the number of times a machine sensor was triggered (i.e. a count), or listing all the telemetry events recorded when the temperature was above a certain value for a set of machines. Telemetry events are heterogeneous: there are many types, such as temperature, triggered, and so on, and I need to be able to quickly filter for a specific type of telemetry event for a given set of machine IDs and time window.

So you can imagine one of my queries might look like 'for this time window and this set of machines, list all temperature (type) telemetry events where a temperature over 70 degrees was recorded'. The first two constraints come from my partition key (machineId) and sort key (time), but the latter two would select only telemetry events whose type is temperature and whose value is greater than 70. That's four fields I'm constraining on. Should I accept the reality that I can query on machineId and time, but will have to scan for the telemetry type and value from the results returned by that query?
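
For illustration, here is how I imagine that query would look with the key schema above (just a sketch; the attribute names "type" and "value" are placeholders). Only the first two constraints go into the key condition; the other two would have to be applied as a filter on the results:

```python
from boto3.dynamodb.conditions import Key, Attr

def hot_temperature_events(table, machine_id, start_ms, end_ms, threshold=70):
    """Temperature events above a threshold for one machine and time window.

    machineId and ts are constrained by the key condition; type and value can
    only be checked by a filter, which DynamoDB applies after the items in the
    time window have already been read.
    """
    resp = table.query(
        KeyConditionExpression=Key("machineId").eq(machine_id)
        & Key("ts").between(start_ms, end_ms),
        FilterExpression=Attr("type").eq("temperature") & Attr("value").gt(threshold),
    )
    return resp["Items"]  # pagination omitted for brevity
```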

Further, this DynamoDB table ingests telemetry, but the frontend needs to present summary information such as the last time a telemetry event was received, the last time a machine was serviced, and daily counts of specific telemetry events. I don't want to have to re-query them every time the summary page reloads in the frontend. Should these aggregates be calculated when the telemetry arrives and then stored in a separate table (or, if performance requires it, in memory in something like Redis)? Is this technically stream processing?

I read online that good design typically only uses one table (single table design), but I am finding it hard to understand how to get there. And maybe it doesn't make sense in my scenario where I want to be able to view individual telemetry events but also have quick access to aggregate information.

After writing out this question, here is what my intuition is telling me right now:

  1. Do use DynamoDB.
  2. As you ingest telemetry events, create aggregates for the most common user scenarios (hourly, daily, weekly, monthly totals) and store them somewhere for quick access. Investigate stream processing and/or Kinesis here.
  3. For less common user scenarios requiring aggregates, use another technology like Redshift or Athena.

Thanks, Thomas

Tithe answered 5/1, 2021 at 20:23 Comment(0)

My current idea is to put these events in a DynamoDB table. In the first year I am expecting about 1M new entries total in the table per day (average > 10 per second), with peak times being closer to 100 events per second. The size of the telemetry events will be 0.5-1KB.

Sounds like a very good use case for DynamoDB — lots of relatively small writes per second.

I would use an RDBMS, but I am concerned that, with its ACID properties, database access will slow down significantly as the database grows, given we have 10s to 100s of writes per second. Am I right to be concerned, or will a traditional RDBMS be suitable for my needs?

An RDBMS can scale to handle that kind of load, yes. Not trivial, but nothing out of this world, either.

Since you mention DynamoDB, I assume you're ok with other services in AWS — I'd suggest looking at Aurora for this kind of thing, if you decide to use an RDBMS.

But it sounds like a waste, especially because of this:

I don't feel I need the atomic guarantees of an RDBMS (I can live with eventual consistency), especially for reading/aggregating queries.

So, that points away from an RDBMS. Not necessarily toward DynamoDB, though.

You'll need to be careful with the data model, though.

Now...

Should I accept the reality that [...] will have to scan for the telemetry type and value from the results returned by that query?

Almost certainly not. Designing DynamoDB-based data models that, from the start, require you to Scan a growing amount of data is a bad, bad idea. You'll hate yourself in the future for it if you go that way.

The very important thing to keep in mind here is that you don't want operations to take longer as the data size increases (which is exactly your concern with an RDBMS).

A Scan operation will take longer the larger the table is. So your system's performance will degrade over time as you accumulate more and more data in your table. You can possibly work around this by offloading "non-hot" data from a "main" table to an "archival" table, for example. But this increases the complexity of managing the whole system.
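
To make that concrete, here is a sketch (attribute names borrowed from your example) of why a Scan gets worse as the table grows: even with a filter on machine, type, and value, DynamoDB still reads every item in the table, page by page, and you pay for all of it:

```python
from boto3.dynamodb.conditions import Attr

def scan_temperature_events(table, machine_id, threshold=70):
    """Anti-pattern: a Scan touches every item in the table.

    The FilterExpression only trims what is returned; the whole table is still
    read underneath, so latency and cost grow with the total volume of data.
    """
    items = []
    kwargs = {
        "FilterExpression": Attr("machineId").eq(machine_id)
        & Attr("type").eq("temperature")
        & Attr("value").gt(threshold)
    }
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return items
```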

Instead, try to design your data model so that operations maintain their performance regardless of the total volume of data you have. E.g., make sure that your queries are always restricted to one (or at least O(1)) item collection (i.e., items that have the same partition key).

the frontend needs to present summary information [...]. I don't want to have to re-query them every time [...]. Should these aggregates be calculated when the telemetry arrives and then stored in a separate table (or, if performance requires it, in memory in something like Redis)? Is this technically stream processing?

Stream processing is a good approach here.

You are absolutely right in avoiding recalculating this summary information every time the frontend is displayed. Aggregating it "on the fly" and storing it somewhere is a very good approach. The frontend would then simply have to "show the most recent pre-calculated piece of data": trivial.

Now, the question becomes when to aggregate, where to store, and what to do if this fails.

  • When to aggregate.

You could do it as the telemetry arrives, yes, but if you're going to be writing it to DynamoDB, keep in mind that you can use DynamoDB Streams — every new item gets written into a durable stream that can be replayed (in case you need to recalculate something) and you can have it automatically invoke a Lambda function, for example, to perform your aggregation.
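
As a rough sketch of that idea (the "TelemetrySummary" table, its machineId/day key, and the attribute names are all assumptions on my part, and the stream is assumed to be configured to include new images), the Lambda attached to the stream could maintain daily counters like this:

```python
import boto3
from datetime import datetime, timezone

# Hypothetical summary table: partition key = machineId, sort key = day (YYYY-MM-DD).
summary = boto3.resource("dynamodb").Table("TelemetrySummary")

def handler(event, context):
    """Invoked by DynamoDB Streams; maintains daily per-type event counts."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]  # attributes in DynamoDB JSON
        machine_id = image["machineId"]["S"]
        ts_ms = int(image["ts"]["N"])
        event_type = image["type"]["S"]
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

        # ADD is an atomic counter, so concurrent invocations don't lose updates.
        # (lastEventTs is naive here: it just takes the record being processed.)
        summary.update_item(
            Key={"machineId": machine_id, "day": day},
            UpdateExpression="ADD #c :one SET lastEventTs = :ts",
            ExpressionAttributeNames={"#c": f"count_{event_type}"},
            ExpressionAttributeValues={":one": 1, ":ts": ts_ms},
        )
```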

  • Where to store.

The risk of storing in memory is that you increase the chance of losing your summary, and having to recalculate it. I'd prefer at least some durability here. The best solution would depend on a lot of factors, including how you're serving this frontend (static html? a non-cacheable API request?), and how often you want it to be updated (every X seconds? every new piece of telemetry data?).

One thing that you may want to investigate here is Kinesis Data Analytics: it can run all the aggregation automatically for you, and write the results to some durable storage that you choose. You can then decide how that summary gets into the front end (e.g., request from it every time, write it into something like Redis, invert control and push it to servers serving the frontend so they don't have to make requests, etc).
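
If the summary ends up in a DynamoDB table like the hypothetical "TelemetrySummary" above, whatever serves the frontend can fetch it with a single constant-time read per page load, along the lines of this sketch:

```python
import boto3

summary = boto3.resource("dynamodb").Table("TelemetrySummary")  # hypothetical

def get_daily_summary(machine_id, day):
    """One cheap GetItem per page load; no aggregation at request time."""
    resp = summary.get_item(Key={"machineId": machine_id, "day": day})
    return resp.get("Item", {})  # empty if the summary hasn't been written yet
```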

  • What to do if this fails.

This part is very important. You don't want your frontend to break if the summary is outdated due to a failure in the processing pipeline, or if the data is missing because the in-memory cache servers were lost or the durable storage is unavailable. So it's important to make sure you have a clear idea of how you'll handle these situations.

The other area you should think about here is what to do if you lose your current aggregation. The DynamoDB table contains all the measurements, so you should be able to recalculate the whole summary. But remember: the data is growing, and this will require a Scan, so it may take a very long time to recalculate from scratch.

You may want to write "checkpoints" (e.g., "all summary data up to day X") or coarser pre-aggregations (e.g., "summary data for day X only"), assuming the statistics you are calculating can be correctly derived from pre-aggregations (e.g., you can do that with an average or a max; you cannot with a p99). Saving those checkpoints to S3 can be a good idea. Alternatively, using multiple DynamoDB tables as I mentioned earlier can be useful (a "hot" table which is being added to, and a "non-hot" table to which you move archived data, for which you always maintain a summary somewhere; this way, you would only have to scan the small portion of "hot" data).
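
As a sketch of what writing such a checkpoint to S3 could look like (bucket name and key layout are made up):

```python
import json
import boto3

s3 = boto3.client("s3")

def save_checkpoint(machine_id, day, aggregates):
    """Persist a coarse pre-aggregation so a rebuild never has to rescan
    everything before this day. Bucket and key layout are illustrative only."""
    s3.put_object(
        Bucket="telemetry-checkpoints",                 # hypothetical bucket
        Key=f"summaries/{machine_id}/{day}.json",
        Body=json.dumps({"machineId": machine_id, "day": day, **aggregates}),
        ContentType="application/json",
    )
```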

I read online that good design typically only uses one table (single table design), but I am finding it hard to understand how to get there.

Yeah... be careful with that. While I don't think it's inherently a bad design, the fact is that a lot of people are absurdly over-hyping it. The main issue is that it makes it much more difficult to evolve the design.

With that said, keep in mind that this is not an RDBMS. I say this because I got the feeling that that's what you're more used to. But designing a data model in DynamoDB is very different (single-table or not): duplication of data is a normal part of it, and "normal forms" are not desirable.

The main thing to always keep in mind is that you want both your read and write operations to be O(1) with respect to the volume of data, while at the same time having the flexibility to query all the information that you need. Single-table design isn't something that automatically gives you that, nor does it necessarily simplify achieving it.


After writing out this question, here is what my intuition is telling me right now: [...]

I'd say you have very good intuition. Spot on.

Regarding point number 3, Athena and Redshift: keep in mind that they're usually better suited for "batch processing" (although not exclusively). In other words, you don't want your frontend, for example, ever sending a query to either of them and waiting for the results. You may want a process to query either of them and store the pre-calculated result on, e.g., S3, which will then be displayed on something like your frontend. But they should be kept out of the low-latency request path.
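
As a sketch of that pattern with Athena (the database, table, and output bucket names are all made up): a scheduled job kicks off the query asynchronously, Athena writes the result set to S3, and the frontend only ever reads the pre-calculated output:

```python
import boto3

athena = boto3.client("athena")

def run_daily_rollup(day):
    """Start an Athena query outside the user-facing request path.

    Results land in S3; poll get_query_execution() with the returned id to
    know when they are ready. All names here are placeholders.
    """
    resp = athena.start_query_execution(
        QueryString=f"""
            SELECT machineId, type, COUNT(*) AS events
            FROM telemetry_events            -- hypothetical external table
            WHERE day = '{day}'
            GROUP BY machineId, type
        """,
        QueryExecutionContext={"Database": "telemetry"},
        ResultConfiguration={"OutputLocation": "s3://telemetry-rollups/athena/"},
    )
    return resp["QueryExecutionId"]
```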

Psychosomatics answered 5/1, 2021 at 21:39 Comment(7)
Great answer that touches on a lot of points. I'd maybe add a reference to two great re:Invent 2018 talks by Rick Houlihan if OP is serious about using DynamoDB: Advanced Design Patterns for DynamoDB (DAT401) and Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321). These are a great primer for people new to the world of NoSQL. – Quin
Fantastic response. I can't thank you enough. I really can't. Regarding what I was saying about the scan operation... I suppose for a given day and machine ID, the number of telemetry events may be fairly similar, so you could argue that the performance won't decrease since you will be roughly 'scanning' the same number of elements. In an RDBMS I guess you would build an index over 3 fields instead (machineId, time, telemetry type), but in DynamoDB we would have to walk the results returned by using the index (machineId, time) and discard records with non-matching telemetry types. – Tithe
Not sure if I fully understood what you mentioned about the Scan operation here in the comment, but the main issue is that a Scan will necessarily go over the entire table. You can't specify something like "only look at this value for the partition key". So if you add more days to your table, the scan operation will have to go over all the days, even if you're interested in just one. Same for machine ID: if you add more machines, any scan will have to go over all machines. The filter parameter is applied only after the data is accessed (so you're charged for it and it takes time) [...] – Psychosomatics
Unlike the Query operation: it takes (at least) a partition key, and then DynamoDB loads only the elements with that partition key. Similarly, a filter would be applied after the read (time and money spent). – Psychosomatics
I think I am abusing the word scan. What I meant was: if I performed a query in DynamoDB over machineId and time, that would return a bunch of records. In my application logic I would then have to manually, in linear time, walk (scan) over them. My bad. Just wondering, given MongoDB can include more fields per index, are you able to advise on its performance for "lots of small writes per second"? Any idea if MongoDB could work well for my situation? Thank you again for your speedy and thorough response. It is so appreciated. – Tithe
Ohhh, there you go. Yes, in DynamoDB Query is perfectly fine, and Scan is evil. Now, if we think of Query, then what you described is very good design: Query to get roughly the same amount of data (based on your assumptions about the machines and the amount of telemetry each one sends), and then application-side logic for any extra filtering, aggregation, etc. Cheers! – Psychosomatics
I have very limited experience with operating MongoDB at scale in production, so I can't give you any useful info there, sorry! Maybe if you add a question specifically about MongoDB, someone with experience might give you some good recommendations! – Psychosomatics
