Controlling database redundancy is one of the first subjects we learn as young web developers. We are taught to be aware of normalization and to construct our databases according to strict normal forms. But what if being redundant can be a good thing? Let’s take a look at how the use of redundant data can significantly decrease query times, allow our web applications to scale better, and, in general, help us build faster web apps with denormalization.
Database Normalization
First, an introduction to database normalization for those unfamiliar with the concept. This will be a fairly quick summary, so for more information, please see the references at the end of the article or the Wikipedia entry on normalization. Relational databases are made up of a collection of “tables” containing data. These tables consist of a series of columns, each holding a specific data item, and a series of rows, each containing a record, or tuple. Each row/record has an ID to identify it, known as a primary key. For an example, let’s assume we are developing a simple blog where authors can write posts and users can comment on each post. The Authors table may look something like this:
id | first_name | last_name | level |
1 | John | Doe | Editor |
2 | Sally | Smith | Writer |
Though this is a rather simple example, we have already begun the normalization process: the author’s first and last names have been split up. This allows us to store a single piece of data in each field of the table. By normalizing data, we reduce redundancy throughout our database, so a change only needs to be made in one table and is carried through the rest of the database via relations. But what if we wanted to add blog posts? In relational databases, the general practice is to create a new table. This table could be called Posts:
id | post_title | post_content | posted_date |
1 | Programming 101 | Blog content… | 01-01-2014 |
2 | Web Security | Blog Content… | 02-01-2014 |
How do we tell the database which author wrote a post? We can create a relationship between Posts and Authors by including the Authors primary key, author_id, in the Posts table, like this:
id | post_title | post_content | posted_date | author_id |
1 | Programming 101 | Blog content… | 01-01-2014 | 2 |
2 | Web Security | Blog Content… | 02-01-2014 | 1 |
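To make the structure concrete, here is a minimal sketch of how these two tables and the join described below might be defined; the column types and the example lookup of the Web Security post are assumptions for illustration, not a schema from the original article.

CREATE TABLE authors (
    id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    level VARCHAR(20)
);

CREATE TABLE posts (
    id INT PRIMARY KEY,
    post_title VARCHAR(100),
    post_content TEXT,
    posted_date DATE,
    author_id INT,
    FOREIGN KEY (author_id) REFERENCES authors(id)
);

-- Which author wrote the "Web Security" post?
SELECT authors.first_name, authors.last_name
FROM posts
INNER JOIN authors ON posts.author_id = authors.id
WHERE posts.post_title = 'Web Security';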
When the primary key of one table is placed into a different table, that field is called a foreign key. By querying our database and performing a join between the Authors table and the Posts table, we can see that Sally wrote the Web Security post. We have normalized our data up to this point, but why is this process important? Well, suppose that instead of placing the author’s ID in our Posts table, we placed the author’s name in each post. What are the negative side effects of this approach?
- Suppose we wanted to display the author’s level alongside their post. If we only have the author’s name, querying for that information becomes tricky.
- What if an author changes their name? We would have to update every one of their posts to reflect the new name. With the normalized approach from before, there is only one field we need to update.
- We are creating redundant data. Databases often take up enormous amounts of space, and by storing the same piece of data in multiple places, we increase the size of our database.
Web Applications and Scalability
With the problems outlined previously, why would we ever not normalize our database? Although the web has matured over the last couple of decades, the discipline of building for scalability is still relatively young. Social networks and large e-commerce applications such as Facebook, Reddit and eBay have to handle enormous databases with petabytes of data and resolve queries in mere milliseconds. Furthermore, new web applications have to handle rapid growth, allowing their databases to be quickly spread across multiple servers.
Facebook had roughly 1 million users in 2004. By the next year it had about 6 million, a six-fold increase. For each of the following four years, the number of users roughly doubled. This is an extreme case, but web apps often grow rapidly, and the infrastructure must be able to handle a sudden surge.
Denormalization
One way we can create a faster database is by denormalizing certain areas of it. Let’s revisit the example from earlier. We decided not to put author names in the Posts table, because updating an author’s name would become difficult and because we would lose the connection with the Authors table. But we can insert the author_id into the Posts table as well as the author_name. This lets us keep the relation between the Posts and Authors tables in case we need any further data beyond the author’s name. Only two issues remain with this approach: the size of the database, which increases with redundancy, and the task of updating data in multiple places. Suppose an author changes their name; we now need to propagate that change to both the Authors table and the Posts table. Let’s see how we can handle these issues.
Denormalization Issue One – Size
There are two main reasons that size is an issue for relational databases:
- When looking at the cost of a web application, the size of the database is a factor. Each GB of space used by the database carries some cost for the host/company.
- If the redundant data creates new rows/records in the database, queries can take longer because they have to iterate through more records.
Handling number two is always a concern. When denormalizing your data, you want to avoid creating more records with redundant data. If denormalizing your data means adding more records to your database, you are likely making a poor choice. On the other hand, if you denormalize data without adding new records or null data, you can efficiently cut query times. When the costs of your web application are added up, you will frequently find that hard drive space is not one of your primary expenses; load balancing, CPUs, and read replicas will usually cost far more. That’s not to say storage can’t be expensive, but we are essentially paying this price to speed up our application, not to mention the CPU and server costs saved by quicker queries.
Denormalization Issue Two – Data Propagation
There isn’t a strong answer to this issue. You will have to handle updating multiple areas of the database. With database I/O, you will hit a bottleneck on either the read side or the write side of your data. By denormalizing, you are shifting the load toward database writes rather than reads. This trade-off is frequently chosen, as a database is usually read far more often than it is written to. For more information, please read the section ahead called “The Cost of Denormalization.”
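As a hedged sketch of what that propagation might look like in our example, an author’s name change could be applied to both tables inside a single transaction. The author id and new name here are made up for illustration, and the exact transaction syntax varies by RDBMS.

-- Sally Smith (author_id 2) changes her last name
BEGIN;

UPDATE authors
SET last_name = 'Jones'
WHERE id = 2;

-- Keep the denormalized copy in the Posts table in sync
UPDATE posts
SET author_name = 'Sally Jones'
WHERE author_id = 2;

COMMIT;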
Denormalization Examples
Continuing with our example, let’s assume we have now added an author_name column alongside the author_id column in the Posts table. It would look like the table below.
id | post_title | post_content | posted_date | author_id | author_name |
1 | Programming 101 | Blog content… | 01-01-2014 | 2 | Sally Smith |
2 | Web Security | Blog Content… | 02-01-2014 | 1 | John Doe |
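One possible way to get there, sketched with MySQL-style syntax, is to add the column and backfill it from the Authors table; the column size is an assumption for illustration.

-- Add the denormalized column to the Posts table
ALTER TABLE posts ADD COLUMN author_name VARCHAR(100);

-- Backfill it from the Authors table
UPDATE posts
SET author_name = (
    SELECT CONCAT(authors.first_name, ' ', authors.last_name)
    FROM authors
    WHERE authors.id = posts.author_id
);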
We’ve added slightly more redundant data than mentioned before, since we’ve also placed the first and last name together in the Posts table. This isn’t strictly necessary, but if we’re querying by just the first or last name, we will likely be doing so through the Authors table and not through author_name. So where is this useful? Let’s take a look at two possible scenarios and see how this benefits us.
- A user is reading a post
- A user comes to the front page of our blog, which lists the 25 most recent posts.
Scenario 1
While loading the page content, we would query the database for a single post. Let’s look at how these queries compare between the normalized version and our denormalized version.
Normalized

SELECT *
FROM posts
INNER JOIN authors ON posts.author_id = authors.id
WHERE posts.id = 1;

Denormalized

SELECT *
FROM posts
WHERE id = 1;
Though our denormalized query does look simpler and would technically be faster, the difference probably doesn’t amount to much on its own (assuming both tables are housed on the same server; more on that later). With our blog, the number of authors likely isn’t very high, so the number of records to search in the Authors table isn’t very large. The query would be a rather simple lookup in a small table, and the denormalized version wouldn’t offer a significant speed boost. Each join does add some time to a query, but the smaller the tables, the faster the joins.
Scenario 2
Scenario two offers a great chance to show why denormalizing some of our data can be very helpful. We need to display the 25 most recent blog posts on our front page. For the sake of simplicity, I’m going to assume they are simply the first 25 posts returned. This shouldn’t be assumed in practice (and will likely be wrong), but we’re going to avoid cluttering our SQL statements for the sake of learning.
Normalized

SELECT posts.post_title, authors.first_name, authors.last_name, posts.posted_date
FROM posts
INNER JOIN authors ON posts.author_id = authors.id;

Denormalized

SELECT posts.post_title, posts.author_name, posts.posted_date
FROM posts;
With our normalized query, we have to search the Authors table to match each author_id we need. Often, we will be looking up an author we have already found for another post. Each RDBMS has its own way of minimizing the hit from these repeated lookups, but it will still take longer than the query we made in our denormalized version.
When Denormalization Becomes Important
Up to now, we have only seen marginal gains from our denormalization process. How useful a technique denormalization becomes depends on two factors:
- The number of joins we need to make, as well as the number of records in each joined table
- Database sharding over multiple servers.
Scenario One
Let’s assume that on our front page we also want to display the number of comments made on each blog post. To go along with this, let’s assume there is a Comments table with the following schema:
id | post_id | user_id | comment |
Now, if we want to show the number of comments for each blog post, along with the other information we already displayed, we will have to include another join, counting the rows in the Comments table that match each post’s id. We are creating two bottlenecks here: we are joining three tables together, which carries a large overhead, and we are counting the comments for 25 posts, which requires 25 count operations.
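As a rough sketch, the normalized front-page query might look something like the following. A LEFT JOIN is used so posts with zero comments still appear, and the ORDER BY/LIMIT needed to pick the 25 most recent posts is omitted for the same simplicity reasons as before.

-- Normalized: join Posts, Authors and Comments, counting comments per post
SELECT posts.post_title, authors.first_name, authors.last_name, posts.posted_date,
       COUNT(comments.id) AS number_of_comments
FROM posts
INNER JOIN authors ON posts.author_id = authors.id
LEFT JOIN comments ON comments.post_id = posts.id
GROUP BY posts.id, posts.post_title, authors.first_name, authors.last_name, posts.posted_date;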
What if we included the number of comments directly in the Posts table? It would make our table look something like this:
id | post_title | post_content | posted_date | author_id | author_name | number_of_comments |
1 | Programming 101 | Blog content… | 01-01-2014 | 2 | Sally Smith | 34 |
2 | Web Security | Blog Content… | 02-01-2014 | 1 | John Doe | 13 |
This means we don’t have to do a single join with another table, and we don’t have to count the comments for each post every time a user visits our front page. Instead, whenever a comment is made, we add that comment to the Comments table and increase the number_of_comments field by 1. This technique also lets us avoid the costly count operation altogether. We can take this example even further if there are also tables tracking likes or up/down votes. This saves our database a huge amount of query time, but it does increase write times.
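A minimal sketch of that write path, reusing the table and column names from above, might look like this; the specific ids and comment text are made up for illustration, and the two statements would ideally run inside one transaction.

-- Store the new comment...
INSERT INTO comments (post_id, user_id, comment)
VALUES (1, 42, 'Great post!');

-- ...and bump the denormalized counter on the post
UPDATE posts
SET number_of_comments = number_of_comments + 1
WHERE id = 1;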
Scenario Two
One of the most important areas for denormalization is in the development of web apps with huge growth expectations. When web apps become large enough, databases frequently need to be sharded across multiple servers. Creating joins across multiple servers is extremely costly and should be avoided whenever possible.
The Cost of Denormalization
Using smart denormalization is great, but developers must be mindful of its weak point. Denormalization works well for features that rely on frequent database reads; on the flip side, it puts a heavier burden on writes. If a feature of your web application attracts more reads than writes (such as the front page of a blog), using denormalization wisely can greatly improve performance. On the other hand, it complicates database writes, so you should be cautious around write-heavy features. There is no one-size-fits-all answer, and as a developer you must evaluate each feature accordingly.
Conclusion
In conclusion, developers can build faster web apps with denormalization. It is a very common tactic that can feel out of the norm to a young web developer. A common error web developers make is to normalize their data into 3NF and then require several joins to collect the data needed for a single query. With the demand for lightning-fast load times on the web, denormalization is often necessary to keep up with users’ expectations. In fact, there has recently been a shift in web technology away from relational databases toward NoSQL, non-relational technologies such as MongoDB and CouchDB, which use JSON-style documents to better serve real-time and big-data web apps. There are further technologies that help support denormalization and cut query times, such as Memcached, which will be discussed in a later post. Until then, make sure to stay redundant!
Note: Steve Huffman, one of the founders of Reddit, has a great piece about the lessons he learned while creating Reddit. Lesson 6 is the topic of this article, and lesson 5 will be one of the upcoming articles.
References
http://www.theguardian.com/news/datablog/2014/feb/04/facebook-in-numbers-statistics
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
http://webandphp.com/DataModeling104%3ADenormalization
http://en.wikipedia.org/wiki/Database_normalization#Normal_forms