SimpleDB Select VS DynamoDB Scan

I am making a mobile iOS app. A user can create an account and upload strings. It will be like Twitter: you can follow people, have profile pictures, etc. I cannot estimate the user base, but if the app takes off, the total dataset may be fairly large.

I am storing the actual objects on Amazon S3 and their keys in a database, because listing Amazon S3 keys is slow. So which database would be better for storing the keys?

This is my knowledge of SimpleDB and DynamoDB:

SimpleDB:

  • Cheap
  • Performs well
  • Designed for small/medium datasets
  • Can query using select expressions

DynamoDB:

  • Costly
  • Extremely scalable
  • Performs great; millisecond response
  • Cannot query

As far as I understand, these points are correct: DynamoDB is more about raw speed and scalability, while SimpleDB is more about querying and price (still delivering good performance). But look at it this way: which will be faster, downloading ALL keys from DynamoDB, or doing a select query with SimpleDB? Hard, right? One uses a blazing fast database to download a lot (and then the keys have to be matched on the client), and the other uses a reasonably well-performing database to query and download only the few correct objects. So, which is faster:

DynamoDB downloading everything and matching, OR SimpleDB querying and downloading only the matches?

(NOTE: Matching just means using -rangeOfString and string comparison on the device; nothing CPU-intensive, slow, or server-side.)

My S3 keys will use this format for every type of object:

accountUsername:typeOfObject:randomGeneratedKey

E.g. if you are referring to an account object:

Rohan:Account:shd83SHD93028rF

Or a profile picture:

Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh

The randomly generated part is there only for uniqueness. It does not refer to anything; it simply ensures that keys are never repeated, so two objects can never collide on the same key.
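
A minimal sketch of the key format described above, written in Python for brevity (the app itself is iOS). The helper names `make_key`, `parse_key` and `matches` are hypothetical, and the 16-character random part is an assumption; `matches` plays the role of the `-rangeOfString` comparison.

```python
import secrets
import string
from typing import NamedTuple

ALPHABET = string.ascii_letters + string.digits


class S3Key(NamedTuple):
    username: str
    object_type: str
    random_part: str


def make_key(username: str, object_type: str, length: int = 16) -> str:
    """Build accountUsername:typeOfObject:randomGeneratedKey."""
    if ":" in username or ":" in object_type:
        raise ValueError("':' is the separator and cannot appear in a field")
    random_part = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{username}:{object_type}:{random_part}"


def parse_key(key: str) -> S3Key:
    """Split a key back into its three fields."""
    username, object_type, random_part = key.split(":", 2)
    return S3Key(username, object_type, random_part)


def matches(key: str, username: str, object_type: str) -> bool:
    """Client-side match on the first two fields of the key."""
    parsed = parse_key(key)
    return parsed.username == username and parsed.object_type == object_type
```

Rejecting `:` inside the username or type keeps the format unambiguous; without that check, a username containing a colon would parse into the wrong fields.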

In my app, I can either choose SimpleDB or DynamoDB, so here are the two options:

  • Use SimpleDB: store keys in the format above, but do not rely on the format for lookups; instead, use attributes stored in SimpleDB. So I store each key with attributes like username and type (and any others that I would otherwise have to encode in the key). If I want the account object for user 'Rohan', I just use a SimpleDB Select that matches the attribute 'username' against 'Rohan' and the attribute 'type' against 'account'.

  • Use DynamoDB: store keys, each in the illustrated format. I scan the whole table, returning every single key. Then, taking advantage of the key format, I use -rangeOfString to match the keys I want and download those objects from S3.
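
For the SimpleDB option, the Select expression could be assembled roughly as follows. This is a sketch under assumptions: the domain name `S3Keys` and the helper names are hypothetical, the attribute names `username` and `type` come from the description above, and the quoting rules (backticks around the domain name, doubled single quotes inside values) reflect my understanding of the SimpleDB Select syntax. The code only builds the string; it would then be handed to the Select call of whichever SDK is in use.

```python
def quote_value(value: str) -> str:
    """Quote a string literal; an embedded single quote is doubled."""
    return "'" + value.replace("'", "''") + "'"


def quote_domain(name: str) -> str:
    """Backtick-quote a domain name; an embedded backtick is doubled."""
    return "`" + name.replace("`", "``") + "`"


def build_select(domain: str, username: str, object_type: str) -> str:
    """Build a Select expression matching one user's objects of one type."""
    return (
        f"select * from {quote_domain(domain)} "
        f"where username = {quote_value(username)} "
        f"and type = {quote_value(object_type)}"
    )


if __name__ == "__main__":
    print(build_select("S3Keys", "Rohan", "Account"))
```

Escaping the values matters here because usernames are user-supplied; concatenating them unquoted would let a stray quote break the expression.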

Also, SimpleDB is apparently geographically distributed. How can I enable that?

So which is quicker and more reliable: using SimpleDB to query keys by attributes, or using DynamoDB to store all keys, scan (download all keys) and match using e.g. -rangeOfString? Bear in mind that these are just short keys that point to S3 objects.

Here is my last question (the number of objects in the database will depend on the answer). Should I:

  • Create a separate key/object for every single object a user has
  • Create an account key/object and store all information inside there

Obviously, the two options have different advantages and disadvantages. For example, retrieval would be quicker if everything is separate, but storing everything inside one account object per user is more organized and keeps the number of records smaller.
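
To make the two layouts concrete, here is a small in-memory sketch in Python. The record shapes and the function names are assumptions for illustration only; neither is tied to SimpleDB or DynamoDB.

```python
# Layout A: one record per object, each carrying its own attributes.
per_object = [
    {"key": "Rohan:Account:shd83SHD93028rF",
     "username": "Rohan", "type": "Account"},
    {"key": "Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh",
     "username": "Rohan", "type": "ProfilePic"},
]

# Layout B: one record per account, holding every key the user owns.
per_account = {
    "Rohan": {
        "Account": ["Rohan:Account:shd83SHD93028rF"],
        "ProfilePic": ["Rohan:ProfilePic:Nck83S348DD93028rF37849SNDh"],
    },
}


def find_keys_per_object(records, username, object_type):
    """Layout A: filter the records by their attributes."""
    return [r["key"] for r in records
            if r["username"] == username and r["type"] == object_type]


def find_keys_per_account(accounts, username, object_type):
    """Layout B: fetch the single account record, then pick one field."""
    return accounts.get(username, {}).get(object_type, [])
```

Layout B needs only one read per user, but that one record grows with every upload, and both services limit how large a single item can be. Layout A keeps each record small at the cost of many more records.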

So what do you think?

Thanks for the help! I have put a bounty on this, really need an answer ASAP.

Humes answered 4/1, 2013 at 2:59 Comment(8)
Just a couple of notes for clarity's sake: 1. DynamoDB does have a query operation; it just requires use of a RangeKey. 2. The scan operation allows you to find data across the entire table, but does not require that you download the whole table. 3. SimpleDB has redundant replicas within the same region your domain was created in; it does not act like a CDN for your database. - Arrowworm
@BobKinney What do you mean by you can find data throughout the whole table but don't need to download it? - Humes
I mean exactly what I said. A scan operation will scan all the data on a DynamoDB table, and only return items in the table that match your scan parameters, and only these will need to be downloaded to your application. Scan operations can be bounded so that you only look for the first N matching results, but it will use as much read throughput as necessary to find those N results. - Arrowworm
@BobKinney Oh ok, so presuming SimpleDB and DynamoDB have 10,000 keys each, would select be faster, or would scan then match? - Humes
I don't have hard numbers, so I can't speak to actual performance, but there is more at play here than just the number of keys. I encourage you to do some small scale tests and make a judgement for yourself with the understanding that in general DynamoDB will scale better. - Arrowworm
@BobKinney Thanks for the discussion, it helped. I'm going to go ahead and build a nice dataset for both (looping) and start doing some tests. - Humes
How about Google Cloud Storage via the JSON API? You can query keys. - Spectra
@KyawTun I want to use AWS: it has a great iOS API and docs, it's cheap, and it has a suite of useful services. - Humes

Wow! What a Question :)

Ok, let's discuss some aspects:

S3

S3 listing performance is most likely low because you're not using a prefix when listing keys.

If you shard by storing the objects like type/owner/id, listing all the ids for a given owner (using the prefix type/owner/) will be fast. Or at least, faster than listing everything at once.
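
Why a prefix helps can be sketched with a toy model in Python. S3 returns keys in lexicographic order, so all keys sharing a prefix sit next to each other; the sorted list below stands in for the bucket, and `list_with_prefix` is a hypothetical helper, not an S3 API (the real listing request takes a prefix parameter instead).

```python
import bisect


def s3_key(object_type: str, owner: str, object_id: str) -> str:
    """Build a key in the type/owner/id layout."""
    return f"{object_type}/{owner}/{object_id}"


def list_with_prefix(sorted_keys, prefix):
    """Return the keys that start with prefix from an already sorted list.

    Binary search finds the first candidate; the loop stops at the first
    key that no longer matches, so unrelated keys are never visited.
    """
    i = bisect.bisect_left(sorted_keys, prefix)
    result = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        result.append(sorted_keys[i])
        i += 1
    return result


if __name__ == "__main__":
    bucket = sorted([
        s3_key("ProfilePic", "Rohan", "p1"),
        s3_key("ProfilePic", "Alice", "p2"),
        s3_key("Account", "Rohan", "a1"),
        s3_key("Post", "Rohan", "x1"),
        s3_key("Post", "Rohan", "x2"),
    ])
    print(list_with_prefix(bucket, "Post/Rohan/"))
```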

Dynamo Versus SimpleDB

In general, that's my advice:

  • Use SimpleDB when:

    • Your entity storage isn't going to exceed 10GB
    • You need to apply complex queries involving multiple fields
    • Your queries aren't well defined
    • You can take advantage of Multi-Valued Data Types
  • Use DynamoDB when:

    • Your entity storage will exceed 10GB
    • You want to scale demand / throughput as it goes
    • Your queries and model are well-defined, and unlikely to change.
    • Your model is dynamic, involving a loose schema
    • You can cache your queries on the client side (so you can save on throughput by querying the cache prior to Dynamo)
    • You want to do aggregate/rollup summaries, by using Atomic Updates

Given your current description, it seems SimpleDB is actually better, since:

  • Your model isn't completely defined
  • You can defer some design decisions, since it takes a while to hit the 10GB limit
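
The client-side caching point in the DynamoDB list can be sketched as a simple read-through cache. This is a Python illustration only: `CachedLookup` and the `fetch` callback are hypothetical, and `backend_reads` merely stands in for the read throughput that would otherwise be consumed.

```python
class CachedLookup:
    """Serve repeated lookups from memory, hitting the backend only on a miss."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that really queries the database
        self._cache = {}
        self.backend_reads = 0   # how many times the backend was asked

    def get(self, key):
        if key not in self._cache:
            self.backend_reads += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the user uploads something new."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    fake_db = {"Rohan": ["Rohan:Account:shd83SHD93028rF"]}
    lookup = CachedLookup(lambda user: fake_db.get(user, []))
    lookup.get("Rohan")
    lookup.get("Rohan")
    print(lookup.backend_reads)
```

The trade-off is staleness: a cached result does not notice writes made elsewhere until it is invalidated.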

Geographical SimpleDB

SimpleDB doesn't support that. A domain lives in a single region (the one it was created in), and afaik there is no cross-region distribution.

Key Naming

This applies mostly to Dynamo: whenever you can, use a Hash + Range key. You could also create keys using only a Hash key and still apply some queries, like:

  • List all my records on table T whose key starts with accountid:
  • List all my records on table T whose key starts with accountid:image

However, those are all Scans. Bear that in mind.
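
The difference between the two approaches can be illustrated with a toy in-memory model in Python. This is not the DynamoDB API: `ToyTable` and its methods are hypothetical, the hash key is assumed to be the account name and the range key `type:id`, and `items_read` only stands in for consumed read throughput.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Items grouped by hash key, range keys kept sorted inside each group."""

    def __init__(self):
        self._partitions = defaultdict(list)
        self.items_read = 0

    def put(self, hash_key, range_key):
        bisect.insort(self._partitions[hash_key], range_key)

    def query_begins_with(self, hash_key, range_prefix):
        """Hash + Range: touch only matching items under one hash key."""
        ranges = self._partitions.get(hash_key, [])
        i = bisect.bisect_left(ranges, range_prefix)
        found = []
        while i < len(ranges) and ranges[i].startswith(range_prefix):
            self.items_read += 1
            found.append((hash_key, ranges[i]))
            i += 1
        return found

    def scan_begins_with(self, full_prefix):
        """Hash only: examine every item, keep those whose key matches."""
        found = []
        for hash_key, ranges in self._partitions.items():
            for range_key in ranges:
                self.items_read += 1
                if f"{hash_key}:{range_key}".startswith(full_prefix):
                    found.append((hash_key, range_key))
        return found


if __name__ == "__main__":
    table = ToyTable()
    for user, rng in [("Rohan", "Account:a1"), ("Rohan", "ProfilePic:p1"),
                      ("Rohan", "Post:x1"), ("Alice", "Account:a2"),
                      ("Alice", "Post:x2")]:
        table.put(user, rng)
    table.query_begins_with("Rohan", "ProfilePic:")
    print("query read", table.items_read, "item(s)")
    table.items_read = 0
    table.scan_begins_with("Rohan:ProfilePic:")
    print("scan read", table.items_read, "item(s)")
```

Both calls return the same single item, but the scan pays for every item in the table while the query pays only for what it returns.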

(See this for an overview: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/API_Scan.html)

Bonus Track

If you're using Java, cloudy-data on Maven Central includes SimpleJPA with some extensions to Map Blob Fields to S3. So give it a look:

http://bitbucket.org/ingenieux/cloudy

Thank you

Gorgerin answered 4/1, 2013 at 2:59 Comment(0)
