11.2: Denormalization by duplicating data
Here, we’ll violate the principles of second normal form by duplicating data across multiple items.
To get to second normal form, each non-key attribute must depend on the whole key. This is a confusing way to say that data should not be duplicated across multiple records. If data is duplicated, it should be pulled out into a separate table. Each record that uses that data should refer to it via a foreign key reference.
Imagine we have an application that contains Books and Authors. Each Book has an Author, and each Author has some biographical information, such as their name and birth year.
In a relational database, we would have two separate tables for Books and Authors so that no data gets duplicated. But look at the DynamoDB table below. We can ignore the rules of second normal form and include the Author’s biographical information on each Book item.
Partition key: AuthorName | Sort key: BookName | AuthorBirthDate | ReleaseYear
---|---|---|---
"Stephen King" | "It" | "21st Sep, 1947" | "1986"
"Stephen King" | "The Shining" | "21st Sep, 1947" | "1977"
"JK Rowling" | "Harry Potter and the Sorcerer's Stone" | "31st Jul, 1965" | "1997"
Notice that there are multiple Books that contain the biographical information for the Author Stephen King. Because this information won’t change, we can store it directly on the Book item itself. Whenever we retrieve the Book, we will also get information about the parent Author item.
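To make this concrete, here is a minimal sketch using boto3. The table name "Books" is an assumption for illustration; the key and attribute names come from the table above. It writes a Book item with the Author’s biographical data duplicated onto it, then reads it back in a single request.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Books")  # hypothetical table name

# Write a Book item with the Author's biographical data duplicated onto it.
table.put_item(
    Item={
        "AuthorName": "Stephen King",         # partition key
        "BookName": "It",                     # sort key
        "AuthorBirthDate": "21st Sep, 1947",  # duplicated Author data
        "ReleaseYear": "1986",
    }
)

# Reading the Book returns the Author's details in the same request --
# no second table and no join required.
response = table.get_item(
    Key={"AuthorName": "Stephen King", "BookName": "It"}
)
print(response["Item"]["AuthorBirthDate"])  # "21st Sep, 1947"
```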
There are two main questions you should ask when considering this strategy:
1. Is the duplicated information immutable?
2. If the data does change, how often does it change and how many items include the duplicated information?
In our example above, we’ve duplicated biographical information that isn’t likely to change. Because it’s essentially immutable, it’s OK to duplicate it without worrying about consistency issues when that data changes.
If the data changes fairly infrequently and the denormalized items are read a lot, it may be OK to duplicate the data to save money on all of those subsequent reads. When the duplicated data does change, you’ll need to work to ensure it’s updated in every item that contains it.
Which leads us to the second factor—how many items contain the duplicated data. If you’ve only duplicated the data across three items, it can be easy to find and update those items when the data changes. If that data is copied across thousands of items, it can be a real chore to discover and update each of those items, and you run a greater risk of data inconsistency.
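If you do need to propagate a change, the work looks something like the following sketch, under the same assumptions as before (a hypothetical "Books" table with the attributes shown above). Because AuthorName is the partition key in this example, a Query can find every affected item; in a model where the duplicated data isn’t part of the key, you might need a Scan or a secondary index instead. Either way, the cost grows with the number of items you have to touch.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Books")  # hypothetical table name

def update_author_birth_date(author_name, new_birth_date):
    # Find every Book item that duplicates this Author's data.
    response = table.query(
        KeyConditionExpression=Key("AuthorName").eq(author_name)
    )
    items = response["Items"]
    # Keep paginating if the result set was truncated.
    while "LastEvaluatedKey" in response:
        response = table.query(
            KeyConditionExpression=Key("AuthorName").eq(author_name),
            ExclusiveStartKey=response["LastEvaluatedKey"],
        )
        items.extend(response["Items"])

    # Update the duplicated attribute on each item -- one write per item.
    for item in items:
        table.update_item(
            Key={"AuthorName": item["AuthorName"], "BookName": item["BookName"]},
            UpdateExpression="SET AuthorBirthDate = :d",
            ExpressionAttributeValues={":d": new_birth_date},
        )
```

The more items that carry the copy, the more reads and writes a change like this requires, which is exactly the trade-off described above.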