Amazon DynamoDB is a managed NoSQL database service that provides fast, predictable performance with seamless scalability. Its main purpose is to offload the administrative burden of operating and scaling a distributed database, so you don't have to worry about hardware provisioning, software setup and maintenance, or scalability.
Even though DynamoDB offers those advantages and lets you store unstructured data with minimal effort, the way you query that data can come at a significant cost.
With DynamoDB, you not only pay for reads and writes (through read capacity units, or RCUs, and write capacity units, or WCUs), but you also pay for additional features to be able to query your data.
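To make the capacity model concrete: one RCU covers a single strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one WCU covers a single write per second of an item up to 1 KB. A minimal sketch of that arithmetic (the item sizes are made up for illustration):

```python
import math

def rcus_per_read(item_size_kb: float, strongly_consistent: bool = True) -> float:
    """RCUs consumed by one read: items are billed in 4 KB chunks;
    an eventually consistent read costs half as much."""
    units = math.ceil(item_size_kb / 4)
    return units if strongly_consistent else units / 2

def wcus_per_write(item_size_kb: float) -> int:
    """WCUs consumed by one write: items are billed in 1 KB chunks."""
    return math.ceil(item_size_kb)

# A hypothetical 9 KB item:
print(rcus_per_read(9))         # 3 RCUs, strongly consistent
print(rcus_per_read(9, False))  # 1.5 RCUs, eventually consistent
print(wcus_per_write(9))        # 9 WCUs
```

Writes are nine times as expensive as reads for this item, which is why write-heavy access patterns deserve extra scrutiny.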
You pay for global secondary indexes (GSIs for short), which behave like separate DynamoDB tables containing a subset of attributes from the source table. Each GSI has an alternate primary key to support additional query patterns.
As your data set grows, your GSIs grow with it, and you end up paying for additional indexes whenever you need more than one attribute as a key to query your data as expected.
You might end up using multiple DynamoDB tables, each with its own GSIs, just to be able to query different sets of data the way you want.
These tables and their GSIs all add to your bill, but you can achieve similar outcomes at a lower cost by applying Single Table Design to your DynamoDB table.
The Single Table Design method requires only one DynamoDB table and one GSI. To be able to query your data, you store every record with two important keys: a partition key (PK) and a sort key (SK).
DynamoDB requires the partition key, while the sort key doubles as the key for your GSI so you can query by it as well.
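A sketch of what such a table definition could look like, shaped like the parameters boto3's create_table call expects (the table name "app-table" and index name "GSI1" are placeholders; the GSI inverts the key pair, using SK as its partition key, which is one common way to query in the other direction):

```python
# Sketch of a single-table definition with one inverted GSI.
# Names are placeholders, not prescribed by DynamoDB.
table_definition = {
    "TableName": "app-table",
    "AttributeDefinitions": [
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
        {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "SK", "KeyType": "HASH"},
                {"AttributeName": "PK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}
```

Only PK and SK appear in AttributeDefinitions; all other attributes are schemaless and can vary per item.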
Instead of creating multiple physical DynamoDB tables and assigning GSIs to each one, you create virtual tables within the same DynamoDB table and differentiate them using a Table attribute. Since DynamoDB is a NoSQL database, you can insert this data without worrying about schema validation issues.
Example of a regular DynamoDB record:
{ Key: "1234", Title: "test data", CreatedBy: "user-123", Content: "this is a test post" }
The same record under Single Table Design:
{ PK: "Post#1234", SK: "User#user-123", Title: "test data", Content: "this is a test post", Table: "Posts" }
The key difference here is that instead of adding a GSI just to be able to query by Key and CreatedBy, you now query on the single PK-SK GSI. Hence, you can focus on designing those keys for your queries instead of creating multiple GSIs (keep in mind that the default limit is 20 GSIs per DynamoDB table).
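As a rough illustration of the two access patterns this layout enables (a pure-Python simulation over an in-memory list, not real DynamoDB calls; in practice these would be Query operations against the base table and the GSI):

```python
# Single-table items: posts keyed by PK, with the creating user in SK.
items = [
    {"PK": "Post#1234", "SK": "User#user-123", "Title": "test data", "Table": "Posts"},
    {"PK": "Post#5678", "SK": "User#user-123", "Title": "second post", "Table": "Posts"},
    {"PK": "Post#9999", "SK": "User#user-456", "Title": "other post", "Table": "Posts"},
]

def query_base_table(pk: str) -> list:
    """Base-table access pattern: fetch items by partition key."""
    return [item for item in items if item["PK"] == pk]

def query_gsi_by_sk(sk: str) -> list:
    """GSI access pattern: fetch items by sort-key value
    (the inverted index lets you query by user)."""
    return [item for item in items if item["SK"] == sk]

# All posts created by user-123:
print([i["PK"] for i in query_gsi_by_sk("User#user-123")])
```

One table and one index cover both "get this post" and "get all posts by this user", which would otherwise have required a dedicated GSI on CreatedBy.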
One issue with utilizing Single Table Design is that you are required to write additional application code to realize those cost savings on your DynamoDB table. While there are ready-made open-source solutions that do a good portion of the required job, none of them will provide a ready solution that fits everyone's needs.
It also requires you to think up front about how you will structure your data in DynamoDB to be able to query it as expected. This becomes less cumbersome if you structure your code properly, ideally organizing it into modules and treating each module as a virtual table in DynamoDB.
You might consider using composite partition keys and composite sort keys to increase your querying ability rather than creating new GSIs.
Example of a composite key:
Post#12345#User#user-1234
You would write additional code to split and combine the individual values that make up a composite key.
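A minimal sketch of such split/combine helpers (the function names are illustrative, and it assumes the individual values never contain the "#" separator):

```python
def make_composite_key(*parts: str) -> str:
    """Join entity-type/ID pairs into a composite key string."""
    return "#".join(parts)

def split_composite_key(key: str) -> dict:
    """Split 'Type#id#Type#id' back into a {type: id} mapping.
    Assumes the values themselves never contain '#'."""
    tokens = key.split("#")
    return dict(zip(tokens[0::2], tokens[1::2]))

key = make_composite_key("Post", "12345", "User", "user-1234")
print(key)                       # Post#12345#User#user-1234
print(split_composite_key(key))  # {'Post': '12345', 'User': 'user-1234'}
```

Because sort keys compare lexicographically, composite keys like this also let you query by prefix (e.g. everything under "Post#12345#") without any extra index.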
If you have a large number of relationships and key pairs to query in your datasets, then it is recommended to use a SQL database like Amazon Aurora, which preserves the relationships between your datasets and lets you query as much as you want for a monthly price, without worrying about read and write capacity units.
If your data is not structured but you have a lot of records to query, enough to exceed DynamoDB's allowed read capacity units, then consider running ETL jobs on your data with AWS Glue and loading the results into Amazon Redshift to query them there.