Database Normalization

Database normalization is the process of organizing data into tables so that queries against the database always return unambiguous, intended results. Normalization is intrinsic to relational database theory. It can duplicate some data within the database, and it often results in the creation of additional tables.

The concept of database normalization is generally traced back to E.F. Codd, an IBM researcher who, in 1970, published a paper describing the relational database model. What Codd described as “a normal form for database relations” was an essential element of the relational technique. Normalization found a ready audience in the 1970s and 1980s, when disk drives were quite expensive and a highly efficient means of data storage was essential. Since that time, other techniques, including denormalization, have also found favor.

Data Normalization Rules

While data normalization rules tend to increase data duplication, they do not introduce redundancy, which is unnecessary duplication. Database normalization is typically a refinement step that follows the initial exercise of identifying the data objects in the relational database, identifying their relationships, and defining the tables required and the columns within each table.

Several degrees of normalization, known as normal forms, have been defined for relational database tables. They include:

First normal form (1NF). This is the “basic” level of database normalization, and it generally corresponds to the definition of any relational database, namely:

  • It contains two-dimensional tables with rows and columns.
  • Each column corresponds to a sub-object or an attribute of the object represented by the entire table.
  • Each row represents a unique instance of the object and must differ from every other row (that is, duplicate rows are not possible).
  • All entries in any column must be of the same kind. For example, in the column labeled “Customer,” only customer names or numbers are permitted.
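
The requirement that no two rows be identical can be sketched with Python's built-in sqlite3 module. The table name customer and its two columns are hypothetical, chosen only for illustration, and declaring a primary key is just one common way to make duplicate rows impossible.

```python
import sqlite3


def duplicate_row_rejected() -> bool:
    """Return True if the database refuses to store the same row twice."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: one column per attribute, one row per customer.
    conn.execute(
        "CREATE TABLE customer ("
        " customer_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
    try:
        # A second, identical row violates the primary key.
        conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()
    return False
```

Calling duplicate_row_rejected() should report that the second insert was refused. Without the primary key, SQLite would accept both rows, and the table would no longer meet the uniqueness requirement.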

Second normal form (2NF). At this level of normalization, each column in a table that is not part of a key must be a function of the whole key, not just part of it. For example, in a table with three columns containing the customer ID, the product sold, and the product price when sold, the price would be a function of both the customer ID (the customer may be entitled to a discount) and the specific product. In this instance, the data in the third column depends on the data in the first and second columns together. The 1NF does not require this dependency.
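
The dependency in the customer, product, and price example can be checked against sample data. The following is a minimal sketch in Python: the helper depends_on, the sales rows, and the prices are all made up for illustration. Sample data can only refute a dependency, never prove that it holds for all future rows.

```python
def depends_on(rows, determinant, dependent):
    """Return True if, in these rows, `dependent` is a function of `determinant`.

    That is, no two rows agree on every determinant column
    while disagreeing on the dependent column.
    """
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = row[dependent]
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical data: customer 1 is entitled to a discount.
sales = [
    {"customer_id": 1, "product": "widget", "price": 9.00},
    {"customer_id": 2, "product": "widget", "price": 10.00},
    {"customer_id": 1, "product": "gadget", "price": 18.00},
    {"customer_id": 2, "product": "gadget", "price": 20.00},
]

full_key = depends_on(sales, ["customer_id", "product"], "price")
product_only = depends_on(sales, ["product"], "price")
customer_only = depends_on(sales, ["customer_id"], "price")
```

In this sample, the price is determined by the customer ID and product together, but by neither column alone, which is the situation the example describes.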

In this example, the customer ID and product columns together are considered the primary key because, in combination, they uniquely identify the rows in that table, and they meet the other accepted requirements of a primary key: they do not contain NULL values, and their values won’t change over time.

More generally, any minimal set of columns that uniquely identifies the rows in a table is a candidate key, and the primary key is the candidate key chosen to identify rows. The attributes that belong to a candidate key are called prime attributes.

Third normal form (3NF). In the second normal form, modification anomalies are still possible because changing one row in a table may affect data that refers to this information from another table. For example, using the customer table just cited, removing a row describing a customer purchase (because of a return, perhaps) will also remove the fact that the product has a specific price. In the third normal form, this table would be divided into two tables so that product pricing would be tracked separately.
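
The split into two tables can be sketched with Python's built-in sqlite3 module. The table names product and purchase, the column names, and the price are hypothetical, and the sketch simplifies the example by storing a single list price per product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pricing is tracked on its own, independent of any purchase.
    CREATE TABLE product (
        product    TEXT PRIMARY KEY,
        list_price REAL NOT NULL
    );
    -- A purchase only records who bought what.
    CREATE TABLE purchase (
        customer_id INTEGER NOT NULL,
        product     TEXT NOT NULL REFERENCES product(product),
        PRIMARY KEY (customer_id, product)
    );
    INSERT INTO product VALUES ('widget', 10.0);
    INSERT INTO purchase VALUES (1, 'widget');
""")

# The customer returns the product, so the purchase row is removed.
conn.execute(
    "DELETE FROM purchase WHERE customer_id = 1 AND product = 'widget'"
)

remaining_purchases = conn.execute(
    "SELECT COUNT(*) FROM purchase"
).fetchone()[0]
surviving_price = conn.execute(
    "SELECT list_price FROM product WHERE product = 'widget'"
).fetchone()[0]
conn.close()
```

After the delete, the purchase table is empty, but the product table still records the price of the widget, so removing a purchase no longer erases pricing information.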

Extensions of these basic normal forms include the domain/key normal form (DKNF), in which a key uniquely identifies each row in a table, and the Boyce-Codd normal form (BCNF), which refines and enhances the techniques used in the 3NF to handle some types of anomalies.

Database normalization’s ability to avoid or reduce data anomalies, data redundancies, and data duplications while improving data integrity has made it an essential part of the data developer’s toolkit for many years. It has been one of the hallmarks of the relational data model.

The relational model arose in an era when business records were, first and foremost, on paper. Its use of tables was, in part, an effort to mirror the tables used on paper that acted as the original representation of the (mostly accounting) data. The need to support that representation has waned as digital-first representations of data have replaced paper-first records.

But other factors have also contributed to challenging the dominance of database normalization.

Over time, continued reductions in the cost of disk storage and new analytical architectures have cut into normalization’s supremacy. The rise of denormalization as an alternative began with the advent of data warehouses in the 1990s. More recently, document-oriented NoSQL databases have arisen; these and other non-relational systems often tap into non-disk-oriented storage types. Now, more than in the past, data architects and developers balance data normalization and denormalization as they design their systems.

Database Normalization Tools

Data modeling software can incorporate features that help automate the preparation of incoming data for analysis. IT managers must still develop a plan to address common problems, including data normalization. Vendors offering data normalization tools include 360Science, ApexSQL, and many smaller niche developers.