Software engineer from the C world breaking into the cloud space. I am trying to understand whether I can use DynamoDB for all my scenarios, whether I shouldn't use it at all, or whether I should use DynamoDB for some scenarios and offload others to another database (maybe a more traditional RDBMS).
I have a bunch of machines which send telemetry to my AWS APIs. My current idea is to put these events in a DynamoDB table. In the first year I am expecting about 1M new entries in the table per day (an average of roughly 12 per second), with peak times closer to 100 events per second. Each telemetry event will be 0.5-1KB, so that works out to roughly 0.5-1GB per day. Firstly I simply want to capture and log every telemetry event.
I would use an RDBMS, but I am concerned that, given 10s to 100s of writes per second, the ACID properties will slow database access down significantly as the database grows. Am I right to be concerned, or would a traditional RDBMS be suitable for my needs? I don't feel I need the ACID guarantees of an RDBMS (I can live with eventual consistency), especially for reading/aggregating queries.
I was thinking my primary key would be a composite key between machine ID (partition key) and unix millisecond epoch time (sort key). This should support a key scenario such as retrieving a list of all telemetry events for a specific time range and for a set of machines. Something like the sketch below is what I have in mind.
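For concreteness, here is a rough sketch of that key design in Python with boto3. The table name `Telemetry` and the attribute names `machineId` / `ts` are placeholders of my own, not anything DynamoDB prescribes:

```python
import boto3

# Sketch of the proposed table. "Telemetry", "machineId" and "ts"
# (unix epoch milliseconds) are placeholder names of my own.
dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Telemetry",
    AttributeDefinitions=[
        {"AttributeName": "machineId", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "machineId", "KeyType": "HASH"},  # partition key
        {"AttributeName": "ts", "KeyType": "RANGE"},        # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity, since ingest is spiky
)
```

One thing I am already aware of: if two events from the same machine ever land on the same millisecond, the second write would silently overwrite the first, so I may need a uniqueness suffix on the sort key.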
However, there are other things I wish to do, such as getting daily totals for various telemetry events, like the number of times a machine sensor was triggered (i.e. a count), or listing all the telemetry events recorded when the temperature was above a certain value for a set of machines. Telemetry events are heterogeneous - there are many types, such as temperature, triggered, and so on - and I need to be able to quickly filter for a specific type of telemetry event for a given set of machine IDs and time window.
So you can imagine one of my queries might look like 'for this time window, and this set of machines, list all temperature (type) telemetry events where a value over 70 degrees was recorded'. The first two constraints map onto my partition key (machineId) and sort key (time), but the last two constrain the telemetry type and its value. That's four fields I'm constraining on. Should I accept that I can only query on machineId and time, and will have to filter for the telemetry type and value within the results returned by that query? A sketch of what I mean follows.
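Here is how I picture that query in boto3, again with `eventType` and `value` as placeholder attribute names of mine. As I understand it, the `FilterExpression` is applied *after* the items are read, so read capacity is still consumed for the items it discards - this is the "filter within the query result" I described, not a server-side index lookup on type/value:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Telemetry")

def temperature_events(machine_ids, start_ms, end_ms, threshold=70):
    """All temperature events above a threshold, per machine, in a time window."""
    results = []
    for machine_id in machine_ids:
        # One Query per machine: the partition key only supports equality,
        # so a "set of machines" means one request per machineId.
        kwargs = dict(
            KeyConditionExpression=Key("machineId").eq(machine_id)
            & Key("ts").between(start_ms, end_ms),
            # Applied after the read; filtered-out items still cost RCUs.
            FilterExpression=Attr("eventType").eq("temperature")
            & Attr("value").gt(threshold),
        )
        while True:  # follow pagination until the window is exhausted
            resp = table.query(**kwargs)
            results.extend(resp["Items"])
            if "LastEvaluatedKey" not in resp:
                break
            kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return results
```

An alternative I have seen mentioned is encoding the type into the sort key (e.g. `temperature#<ts>`) or adding a GSI, so that the type constraint becomes part of the key condition instead of a post-read filter - I would be interested to know if that is the idiomatic fix here.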
Further, this DynamoDB table ingests telemetry, but the frontend needs to present summary information such as the last time a telemetry event was received, the last time a machine was serviced, and daily counts of specific telemetry events. I don't want to re-query the raw events every time the summary page reloads in the frontend. Should these aggregates be calculated as the telemetry arrives and then stored in a separate table (or, if performance demands it, in memory in something like Redis)? Is this technically stream processing? Something like the sketch below is what I imagine.
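To make my question concrete, this is roughly what I imagine: a Lambda handler attached to the table's DynamoDB Stream, maintaining per-machine daily counts in a separate summary table. `TelemetrySummary` and its `bucket` key are names I made up:

```python
import boto3
from datetime import datetime, timezone

# Placeholder summary table: partition key "machineId", sort key "bucket"
# (a string like "2024-01-15#temperature").
summary = boto3.resource("dynamodb").Table("TelemetrySummary")

def handler(event, context):
    """Lambda handler for a DynamoDB Stream on the Telemetry table."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        item = record["dynamodb"]["NewImage"]
        machine_id = item["machineId"]["S"]
        event_type = item["eventType"]["S"]
        ts_ms = int(item["ts"]["N"])
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

        # ADD is atomic per item, so concurrent invocations incrementing
        # the same counter do not lose updates.
        summary.update_item(
            Key={"machineId": machine_id, "bucket": f"{day}#{event_type}"},
            UpdateExpression="ADD #c :one SET lastSeen = :ts",
            ExpressionAttributeNames={"#c": "count"},  # COUNT is a reserved word
            ExpressionAttributeValues={":one": 1, ":ts": ts_ms},
        )
```

I understand stream delivery is at-least-once, so a retried batch could double-count; if exact counts matter I would presumably need to make this idempotent somehow.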
I read online that good DynamoDB design typically uses only one table (single-table design), but I am finding it hard to understand how to get there. Maybe it doesn't make sense in my scenario, where I want to be able to view individual telemetry events but also have quick access to aggregate information.
After writing out this question, here is what my intuition is telling me right now:
- Do use DynamoDB
- As you ingest telemetry events, create aggregates for the most common user scenarios (hourly, daily, weekly, monthly totals) and store them somewhere for quick access. Investigate stream processing and/or Kinesis here.
- For less common user scenarios requiring aggregates, use another technology like Redshift or Athena (rough sketch below)
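For that last point, the pattern I have in mind (again just a sketch, and all the names here - the `telemetry_lake` database, the `events` table, the results bucket - are placeholders of mine) is to get the table's data into S3, e.g. via a DynamoDB export, and then run ad hoc SQL over it with Athena:

```python
import boto3

athena = boto3.client("athena")

# Ad hoc aggregate over exported telemetry; Athena reads from S3,
# so this puts zero load on the live DynamoDB table.
resp = athena.start_query_execution(
    QueryString="""
        SELECT machineId,
               date_trunc('day', from_unixtime(ts / 1000)) AS day,
               count(*) AS events
        FROM events
        WHERE eventType = 'temperature' AND value > 70
        GROUP BY 1, 2
    """,
    QueryExecutionContext={"Database": "telemetry_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll this id for the result set
```

Does this split - DynamoDB for hot ingest and lookups, precomputed aggregates for the dashboard, Athena/Redshift for the long tail - sound right, or am I overcomplicating it?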
Thanks, Thomas