Denormalization is a database optimization technique used to improve the performance of certain queries. We may need to apply it when the usual normalization incurs performance penalties — typically when the volume of data has grown but we can no longer expand the database's resources.
Before we begin, let’s clarify normalization and denormalization in SQL. We briefly explain what these concepts are, then review an example of a normalized database with performance issues, apply denormalization, and explain how it improves performance.
What is normalization?
Let’s refresh our memory on what normalization is and why it is used. Normalization is a database design process used primarily to minimize data redundancy in tables. In a normalized database, each piece of information is stored only once, in a single table, with relationships pointing to other tables. As a bonus, we also avoid certain edge cases where we might encounter update anomalies.
Normalization is applied using one of several normal forms. The most common is the third normal form (3NF). Sometimes this normal form is not strict enough; for that reason, the Boyce-Codd Normal Form (BCNF) was invented. For further reading, I recommend this great article, “A Unified View of Normal Database Forms.”
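As a quick illustration of what normalization buys us, the sketch below (a hypothetical flat schema, not this article’s model) splits a redundant table into two normalized tables using SQLite, so that each customer’s name is stored only once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized: the customer's name is repeated on every order row.
cur.executescript("""
CREATE TABLE flat_order (order_id INTEGER, customer_name TEXT, amount REAL);
INSERT INTO flat_order VALUES (1, 'Alice', 10.0), (2, 'Alice', 25.0), (3, 'Bob', 5.0);
""")

# Normalized: each customer is stored once; orders reference it by key.
cur.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE "order" (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer (name) SELECT DISTINCT customer_name FROM flat_order;
INSERT INTO "order" (id, customer_id, amount)
    SELECT f.order_id, c.id, f.amount
    FROM flat_order f JOIN customer c ON c.name = f.customer_name;
""")

# 'Alice' now appears exactly once in stored data.
alice_rows = cur.execute(
    "SELECT COUNT(*) FROM customer WHERE name = 'Alice'").fetchone()[0]
print(alice_rows)  # 1
```

Renaming a customer now means updating one row, which is exactly the update-anomaly protection normalization provides.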
What is denormalization?
Now that we have clarified what normalization is, the concept of denormalization is relatively simple. Normalization results in data being divided into multiple tables; denormalization takes data from those normalized tables and combines it into a single table.
The goal of denormalization is to move data from normalized tables into a single table so the data is where it is needed. For example, if a query joins multiple tables to get the data and indexing is not enough, denormalization may be the better option.
We trade increased data redundancy and storage for shorter query times and possibly lower resource consumption during query execution. Denormalization has drawbacks and can bring its own set of problems, which we will discuss later.
A normalized model
Whenever we consider denormalization, we must start from a normalized data model.
We will use the following model for our example. It represents part of an app’s data model for tracking orders and tickets submitted by customers about their orders.
As we can see, the data is perfectly normalized. There are no redundancies in the stored data apart from the columns needed to maintain relationships.
Let’s take a quick look at tables, their structures, and the types of data stored:
- The customer table contains basic information about the customers in the store.
- The user table contains the login information for both customers and store employees.
- The employee table contains information about store employees.
- The ticket table contains information about the tickets submitted by customers about the orders they have placed.
- The call table contains information about calls between customers and customer service representatives about tickets.
- The call_outcome table contains information about call results.
- The order table contains basic information about orders placed by customers.
- The order_items table contains many-to-many information, linking the products that customers purchased in orders. It also stores the quantity of each product ordered.
- The product table contains information about the products that customers can order.
This normalized data model is acceptable for a relatively small amount of data. As long as we pay attention during the design of the data model and consider the types of queries needed, we should not have any performance issues.
New business requirements
Suppose we have two new business requirements.
First, every customer service representative needs quick access to some metrics about the customer in question. A new application feature needs a customer dashboard showing total sales, the number of tickets submitted, and the number of customer service calls.
Second, customer service representatives need a ticket dashboard that quickly shows a list of tickets and their details. They need this to choose which tickets to take on based on priority.
Because our data model is normalized, we need to join quite a few tables to get all this information: customer, ticket, call, order, order_items, and product. As the data grows, this query consumes more and more resources. Since indexing doesn’t give us the performance we need in this case, we have to find another way to improve it.
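To make the cost concrete, here is a hedged sketch of what such a dashboard query might look like on a minimal stand-in for the normalized model. Table and column names are assumptions based on the article’s description, and the data is invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal stand-ins for the normalized tables (schemas are assumptions).
cur.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ticket (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE call (id INTEGER PRIMARY KEY, ticket_id INTEGER);
CREATE TABLE "order" (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE product (id INTEGER PRIMARY KEY, price REAL);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);

INSERT INTO customer VALUES (1, 'Alice');
INSERT INTO ticket VALUES (10, 1), (11, 1);
INSERT INTO call VALUES (100, 10);
INSERT INTO "order" VALUES (1000, 1);
INSERT INTO product VALUES (7, 19.99);
INSERT INTO order_items VALUES (1000, 7, 2);
""")

# The customer dashboard touches six tables on every page load.
row = cur.execute("""
SELECT c.name,
       (SELECT COALESCE(SUM(oi.quantity * p.price), 0)
          FROM "order" o
          JOIN order_items oi ON oi.order_id = o.id
          JOIN product p      ON p.id = oi.product_id
         WHERE o.customer_id = c.id)                                AS total_sales,
       (SELECT COUNT(*) FROM ticket t WHERE t.customer_id = c.id)   AS tickets,
       (SELECT COUNT(*) FROM call cl
          JOIN ticket t ON cl.ticket_id = t.id
         WHERE t.customer_id = c.id)                                AS calls
  FROM customer c
 WHERE c.id = 1
""").fetchone()
print(row)
```

Every aggregate here has to walk the relationship chain at read time; denormalization precomputes exactly these values.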
We apply denormalization to our data model to improve performance. As we’ve mentioned, we take a step back and accept increased data redundancy in exchange for faster queries.
The denormalized version of the model, with the solutions to our performance issues, is shown below.
We have made changes to address new business requirements. Let’s analyze the solution for each problem.
Customer metrics dashboard. We create a new table, customer_statistics, highlighted in green in the data model. We use this table to store up-to-date information about each customer’s buying habits. Each time a change is made through the application, the new data is stored in the appropriate normalized tables and customer_statistics is updated. For example, when a customer places a new order, orders_total_amount increases by the total of the new order for that customer.
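A minimal sketch of that write path is shown below, assuming a `customer_statistics` table with an `orders_total_amount` column as described (the exact schema and helper function are assumptions for illustration). The normalized order row and the denormalized statistic are written in one transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "order" (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
CREATE TABLE customer_statistics (
    customer_id         INTEGER PRIMARY KEY,
    orders_total_amount REAL DEFAULT 0,
    tickets_count       INTEGER DEFAULT 0,
    calls_count         INTEGER DEFAULT 0
);
INSERT INTO customer_statistics (customer_id) VALUES (1);
""")

def place_order(customer_id, total):
    """Write the normalized row and update the denormalized statistic together."""
    with conn:  # a single transaction keeps both tables consistent
        conn.execute('INSERT INTO "order" (customer_id, total) VALUES (?, ?)',
                     (customer_id, total))
        conn.execute("""UPDATE customer_statistics
                           SET orders_total_amount = orders_total_amount + ?
                         WHERE customer_id = ?""", (total, customer_id))

place_order(1, 50.0)
place_order(1, 25.0)
total = conn.execute("SELECT orders_total_amount FROM customer_statistics "
                     "WHERE customer_id = 1").fetchone()[0]
print(total)  # 75.0
```

The dashboard then reads a single precomputed row instead of aggregating over the order history.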
The ticket dashboard for customer service representatives. We denormalize the data in the ticket table, adding columns that contain the information we need. The app can then read all the summary information from a single table, without joining customer, ticket, call, order, order_items, and product. It is true that we now have duplicate information about product names, prices, customer names, phone numbers, and so on. But the execution time of our query is now much shorter, which is the goal.
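A sketch of the single-table read is below. The extra columns (`customer_name`, `product_name`, etc.) are assumed names for the duplicated data, not the article’s exact schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized ticket table: customer and product details are copied in.
cur.executescript("""
CREATE TABLE ticket (
    id             INTEGER PRIMARY KEY,
    priority       INTEGER,
    customer_name  TEXT,
    customer_phone TEXT,
    product_name   TEXT,
    product_price  REAL
);
INSERT INTO ticket VALUES (1, 2, 'Alice', '555-0100', 'Widget', 19.99);
INSERT INTO ticket VALUES (2, 1, 'Bob',   '555-0101', 'Gadget',  9.99);
""")

# The dashboard now reads everything it needs from one table -- no joins.
rows = cur.execute("""SELECT id, customer_name, product_name
                        FROM ticket
                    ORDER BY priority""").fetchall()
print(rows)  # [(2, 'Bob', 'Gadget'), (1, 'Alice', 'Widget')]
```

Compare this with the six-table join the normalized model required for the same screen.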
When to apply denormalization
As we mentioned, one of the main situations in which to apply denormalization is when we are trying to solve a performance problem. Because we read data from fewer tables, or even from a single table, the runtime is usually much shorter.
Sometimes applying denormalization is the only way to address business requirements. Specifically, you may need denormalization to keep historical values.
In our example, consider the following scenario: a customer service representative notices that the price paid for a product does not match the price displayed on the dashboard. If we look closely at our initial data model, product_price is available only in the product table. Each time the representative opens the dashboard, a query runs that joins four tables.
But because the price has changed since the order was placed, the price on the representative’s dashboard no longer matches the amount paid by the customer. This is a major functional problem, and a scenario where we need the price history of each order.
To do this, we have to duplicate product_price in the order_items table so that it is stored for each order. The order_items table now has the following schema.
We can now read the product_price value at the level of the ordered item, not just at the product level.
Although denormalization provides great benefits in performance, it comes with trade-offs. Some of the most important are:
- Updates and insertions are more expensive. If a piece of data is updated in one table, all duplicate values in other tables must also be updated. Similarly, when inserting new values, we need to store the data in both the normalized and denormalized tables.
- More storage is needed. Due to data redundancy, the same data takes up more space depending on how many times it is duplicated. For example, the updated ticket table stores the customer’s name, the support representative’s name, the product name, and other information in every row.
- Data anomalies may occur. When updating data, we must remember that it can be present in several tables; we need to update every duplicate copy.
- More code is needed. We need additional code to update table schemas, migrate data, and keep the duplicated data consistent in order to achieve the same functionality. This means issuing multiple INSERT and UPDATE commands whenever new data arrives or old data is modified.
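The consistency-maintenance cost described above looks roughly like this in application code — a hedged sketch in which `rename_customer` and the duplicated `customer_name` column are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ticket (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     customer_name TEXT);  -- denormalized copy of the name
INSERT INTO customer VALUES (1, 'Alice');
INSERT INTO ticket VALUES (10, 1, 'Alice');
INSERT INTO ticket VALUES (11, 1, 'Alice');
""")

def rename_customer(customer_id, new_name):
    """Every duplicate copy must be updated, ideally in one transaction."""
    with conn:
        conn.execute("UPDATE customer SET name = ? WHERE id = ?",
                     (new_name, customer_id))
        conn.execute("UPDATE ticket SET customer_name = ? WHERE customer_id = ?",
                     (new_name, customer_id))

rename_customer(1, 'Alicia')
names = [r[0] for r in conn.execute("SELECT customer_name FROM ticket")]
print(names)  # ['Alicia', 'Alicia']
```

Forgetting the second UPDATE is exactly how the data anomalies mentioned above creep in; some systems push this logic into triggers instead of application code.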
Denormalization doesn’t have to be scary
Denormalization is not something you need to fear. Before you start denormalizing a normalized data model, make sure you actually need it. Track database performance to see whether it degrades, then analyze the queries and the data model. If frequently accessed data really is the problem, then denormalizing some tables is probably an option. If you evaluate the need for it and track the changes you make, you should not have any problems.