Introduction
Database normalization can be complicated. The definitions of the various "normal forms" are academic and not very practical, and all of this can leave one feeling lost. Hopefully, this guide can help: it aims to be a practical, succinct introduction to database normalization for the working software engineer. In what follows we will look at the first few normal forms and their properties in simple language. (If you just want a table of normal forms and their properties, check out this table.)
Basic columns and special columns
In what follows I will refer to table columns as either "basic columns" or "special columns". When I say "basic", I really just mean "not special", which raises the core question: what is a "special column"? Simply put, a special column is a column you could use to uniquely identify rows within a table. (The technical version is that when I say "special column" I am mostly referring to an attribute set, isolated by the column name, that is a member of a candidate key, a.k.a. a minimal superkey, of a relational schema, isolated by the table name. Sometimes it instead refers to a non-minimal superkey.) It is of critical importance in database normalization to identify all the special columns of a table. This might naively seem easy to do, but determining the special columns of a table can quickly become a subtle and debatable business.
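As a concrete (and entirely hypothetical) illustration, in a table of book loans the special columns might be book_id and loan_date taken together, and you could even declare that fact to the database with a uniqueness constraint:

CREATE TABLE loans (
    book_id integer,
    loan_date DATE,
    loanee text,
    -- The "special columns": together, they uniquely identify a row.
    UNIQUE (book_id, loan_date)
);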
That said, there is one special column that is always straightforward, and will show up in essentially every production database table you will ever interact with: the serial id column. It's arguably the most special column of all, because it exists for the sole purpose of uniquely identifying rows in a table. It's so special, in fact, that it's really too special, and so we will completely ignore it when discussing database normalization. I discuss this oddity in more detail in the next section.
Ignoring the serial id column
The reason we ignore the serial id column is that what we are really looking for is the naturally special columns. We are looking for the fundamental attributes, or properties, of the entities in question. We want to normalize the "real" data, the data you need in order to appropriately describe the entities. The serial id column isn't "real" data; it's not a property of the entity, it's just tacked on to a row to guarantee unique identification. Only by normalizing against the "real" data will you actually gain the benefits of normalization, which are, among other things, less duplication and less redundancy.
So, while I will include the id column in examples, because that's the most practical way to interact with tables, please keep in mind (1) that we will be ignoring it, and (2) why we are choosing to do so (which, as a reminder, is because we are only concerned with "real" data).
First Normal Form, 1NF
First Normal Form, simply put, means don't put lists in table cells. (The technical version is: The data-domain of each attribute contains only atomic values.)
Putting a list in a cell is strongly discouraged unless you have a very good reason to do so. There are multiple datatypes that will allow you to do this, for example the ARRAY keyword in SQL, but just because you can doesn't mean you should.
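For illustration, here is what the discouraged approach looks like in PostgreSQL (a minimal, hypothetical sketch):

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name text,
    -- A list in a cell: legal SQL, but a 1NF violation.
    books text[]
);
INSERT INTO users (name, books) VALUES
    ('Austin', ARRAY['East of Eden', 'Green Eggs and Ham']);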
Before I discuss the most common way I see this violation occur, I want to briefly discuss JSON and JSONB data in table cells. JSON(B) is a valid data type, and has many justifiable use cases, including NoSQL-like data storage, but it should still be avoided unless you have a good reason to use it. Section 8.14.2 of the PostgreSQL docs states the following on this topic:
Ideally, JSON documents should each represent an atomic datum that business rules dictate cannot reasonably be further subdivided into smaller datums that could be modified independently.
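As a rough, hypothetical sketch of that advice: a JSONB column holding an opaque payload that you always store and read as a whole is defensible, whereas one whose individual fields you regularly update on their own is a sign those fields should be real columns:

CREATE TABLE webhook_events (
    id SERIAL PRIMARY KEY,
    received_at timestamptz DEFAULT now(),
    -- Stored and read as one atomic datum: a reasonable use of JSONB.
    payload jsonb
);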
Now, probably the most common way you'll see 1NF being violated is with cells containing comma-separated values as follows:
id | name | books |
---|---|---|
1 | Austin | East of Eden,Green Eggs and Ham |
2 | Bertha | Grapes of Wrath,The Cat in the Hat |
The solution to the problem is to create a row per book instead:
id | name | book |
---|---|---|
1 | Austin | East of Eden |
2 | Austin | Green Eggs and Ham |
3 | Bertha | Grapes of Wrath |
4 | Bertha | The Cat in the Hat |
See Appendix A for a SQL script that transforms the unnormalized example data into the 1NF data shown above. It is in a format that's easy to copy, paste, and experiment with yourself, which I encourage you to do. (For this and every other script, I will be using PL/pgSQL and writing .sql files against a local PostgreSQL instance.)
Second Normal Form, 2NF
Second Normal Form, simply put, means that, in addition to being in 1NF, every basic column is determined by all the special columns. (To phrase it in the negative: no basic column can be determined by just some of the special columns.) (The technical version is: no non-prime attribute is functionally determined by a proper subset of any candidate key.)
An example 1NF table that violates 2NF is as follows:
id | name | book | price |
---|---|---|---|
1 | Austin | East of Eden | 25 |
2 | Austin | Green Eggs and Ham | 10 |
3 | Bertha | East of Eden | 25 |
4 | Bertha | Grapes of Wrath | 20 |
The price column is determined completely by the book column, and nothing else. But the special columns are both book and name. This partial determination of price by just book, and not also name, is a violation of 2NF. The logic here is that if a special column has no impact on a basic column, then the basic column probably shouldn't be in the table at all.
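To see the practical consequence, consider what happens when a price changes. With the 1NF table above (called users_and_books in the Appendix B script), every row for the book must be updated in lockstep, and a WHERE clause that misses one leaves two conflicting prices:

-- Must touch every 'East of Eden' row; a missed row silently corrupts the data.
UPDATE users_and_books SET price = 30 WHERE book = 'East of Eden';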
The solution is separating out the book details into a second table:
id | title | price |
---|---|---|
1 | East of Eden | 25 |
2 | Grapes of Wrath | 20 |
3 | Green Eggs and Ham | 10 |
id | name | book_id |
---|---|---|
1 | Austin | 1 |
2 | Austin | 3 |
3 | Bertha | 1 |
4 | Bertha | 2 |
This is sufficient to satisfy Second Normal Form. (Note that if this data were, for example, representing the books owned by each user, you'd be better off creating a join table, users_books, with foreign key references to both users and books. In that situation, the last table above could represent half of a full join on users and users_books. I'm mostly ignoring join tables in this guide, though I may cover them later, since they are pertinent to good database maintenance; a minimal sketch follows below.)
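For the curious, such a join table might be sketched as follows, assuming users and books tables like the ones above (with the book_id column dropped from users):

CREATE TABLE users_books (
    user_id integer REFERENCES users ON DELETE CASCADE,
    book_id integer REFERENCES books ON DELETE CASCADE,
    -- One row per (user, book) pair; the pair itself is the key.
    PRIMARY KEY (user_id, book_id)
);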
See Appendix B for a SQL script that transforms the 1NF table into 2NF tables as shown above. It is in a format that's easy to copy and paste and experiment with yourself, which I encourage you to do.
Third Normal Form, 3NF
Third Normal Form, simply put, means that, in addition to being in 2NF, every basic column is directly determined by the special columns. (To phrase it in the negative: no basic column is determined by the special columns through another basic column.) (The technical version is: no non-prime attribute is transitively functionally determined through another non-prime attribute. Or: for every non-trivial functional dependency, either the determining attributes form a superkey or the determined attribute is part of some candidate key.) This might sound confusingly similar to 2NF, but the key difference is direct determination. 2NF still allows for indirect determination, otherwise known as transitive dependence.
An example 2NF table that violates 3NF is as follows:
id | loan_date | loanee | loanee_dob | book_id |
---|---|---|---|---|
1 | 2021-03-03 | Bertha | 1902-02-02 | 3 |
2 | 2021-03-03 | Bertha | 1902-02-02 | 7 |
3 | 2021-02-02 | Austin | 1900-01-01 | 3 |
The problem is that loanee_dob is determined by loanee, but loanee is not one of the special columns, which are book_id and loan_date. It is true that, if you give me the special columns book_id and loan_date, I will find the loanee_dob unique to that row, but the "truth" of loanee_dob is dependent on the loanee. This indicates that loanee_dob is indirectly (i.e., transitively) dependent on the special columns through the loanee column. This is the kind of indirectness we are trying to avoid.
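The practical danger is the usual one: redundancy breeds update anomalies. Using the loans table as defined in the Appendix C script, a single careless UPDATE can leave Bertha with two different dates of birth:

-- Corrects the dob on loan 1 but not on loan 2; the table now contradicts itself.
UPDATE loans SET loanee_dob = '1902-02-03' WHERE id = 1;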
The solution is extracting the transitively dependent columns into their own table. For example:
id | name | dob |
---|---|---|
1 | Bertha | 1902-02-02 |
2 | Austin | 1900-01-01 |
id | loan_date | user_id | book_id |
---|---|---|---|
1 | 2021-03-03 | 1 | 3 |
2 | 2021-03-03 | 1 | 7 |
3 | 2021-02-02 | 2 | 3 |
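Nothing is lost in the split: the original rows can always be reassembled with a join (column names as in the Appendix C script):

SELECT l.id, l.loan_date, u.name AS loanee, u.dob AS loanee_dob, l.book_id
FROM loans l
JOIN users u ON u.id = l.user_id;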
See Appendix C for a SQL script that transforms the 2NF table into 3NF tables as shown above. It is in a format that's easy to copy and paste and experiment with yourself, which I encourage you to do.
A note about Boyce-Codd Normal Form, BCNF, 3.5NF
There are edge cases in the technical definition of 3NF that BCNF attempts to address. (This is the best articulation of the difference between 3NF and BCNF/3.5NF that I've read; it occurs near the end of the section.) To the best of my knowledge, the simple takeaway is essentially the same as it was for 3NF, i.e., every basic column is directly determined by the special columns.
Fourth Normal Form, 4NF
Fourth Normal Form, in a sense, cannot be put simply. If I had to try, I'd say that 4NF, "simply put", means that, in addition to being in 3NF, if a group of cells across one or more columns can determine a group of cells in one or more other columns, then the original column(s) must at least contain the special columns. Since that "simple" sentence barely makes sense, I'll just include the technical version, instead of hiding it in a footnote. The technical version is: every non-trivial multivalued dependency is determined by a superkey, minimal or otherwise.
I have not found a dataset that, in my opinion, is unequivocally both in 3NF and in violation of 4NF, so at the moment I have neither an example nor a script to transform the relation. The Wikipedia pages for 3NF, BCNF, and 4NF are good starting points for learning more about 4NF. When I find a good example, I will include it here.
Fifth Normal Form and beyond
Things get quite academic and arguably non-practical after 4NF, and maybe even after 3NF, although it does seem that 4NF violations occur in practice.
I have chosen not to cover 5NF and the other higher-order properties that can be found in this table (which I also linked to in the intro). It is an interesting field of research, so I encourage you to dig deeper and look into these properties yourself.
Conclusion
I hope you found this guide helpful. If you have any questions or feedback, feel free to reach out! Happy coding :D
Appendices
Appendix A: SQL script converting unnormalized data to 1NF
This is probably not the best solution, but it is fully self-contained, and you are encouraged to copy, paste, and experiment with it. For example, if you saved this file as example.sql, you could run the following from the command line (assuming you are using PostgreSQL):
createdb test
psql -d test -f example.sql
Similar functionality can be achieved with other datatypes. For example, if your data were in a text ARRAY column, you could use unnest(books) instead of regexp_split_to_table(); see the sketch after the script below.
DROP TABLE IF EXISTS users;
CREATE TABLE users (id SERIAL PRIMARY KEY, name text, books text);

-- Unnormalized seed data: comma-separated lists in a single cell.
INSERT INTO users (name, books) VALUES
    ('Austin', 'East of Eden,Green Eggs and Ham'),
    ('Bertha', 'Grapes of Wrath,The Cat in the Hat');

-- Add the 1NF column, then fan each list out into one row per book.
ALTER TABLE users ADD COLUMN book text;

DO $$
DECLARE
    user_record record;
    user_book text;
BEGIN
    FOR user_record IN
        SELECT * FROM users
    LOOP
        FOR user_book IN
            SELECT regexp_split_to_table(user_record.books, ',')
        LOOP
            INSERT INTO users (name, book) VALUES (user_record.name, user_book);
        END LOOP;
    END LOOP;
END;
$$;

-- Remove the original list-holding rows and column.
DELETE FROM users WHERE book IS NULL;
ALTER TABLE users DROP COLUMN books;

SELECT * FROM users;
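For comparison, here is roughly what the set-based unnest() approach mentioned above might look like, assuming books had been a text[] column instead of comma-separated text (a hypothetical variant, not part of the script above):

-- Fan each array out into one row per book, in a single statement.
INSERT INTO users (name, book)
SELECT u.name, b.book
FROM users u, unnest(u.books) AS b(book)
WHERE u.books IS NOT NULL;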
Appendix B: SQL script converting 1NF to 2NF
This is probably not the best solution, but it is fully self-contained, and you are encouraged to copy, paste, and experiment with it. For example, if you saved this file as example.sql, you could run the following from the command line (assuming you are using PostgreSQL):
createdb test
psql -d test -f example.sql
DROP TABLE IF EXISTS users, books, users_and_books;
CREATE TABLE users_and_books (
    id SERIAL PRIMARY KEY,
    name text,
    book text,
    price integer
);

-- 1NF seed data; note that price is determined by book alone.
INSERT INTO users_and_books (name, book, price) VALUES
    ('Austin', 'East of Eden', 25),
    ('Austin', 'Green Eggs and Ham', 10),
    ('Bertha', 'East of Eden', 25),
    ('Bertha', 'Grapes of Wrath', 20);

CREATE TABLE books (id SERIAL PRIMARY KEY, title text, price integer);
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name text,
    book_id integer REFERENCES books ON DELETE CASCADE
);

-- Each distinct book gets exactly one row, so each price is stored once.
WITH
books_rows AS (SELECT DISTINCT book AS title, price FROM users_and_books)
INSERT INTO books (title, price) SELECT title, price FROM books_rows;

-- Re-point each original row at its row in books.
DO $$
DECLARE
    r users_and_books%ROWTYPE;
    the_book_id books.id%TYPE;
BEGIN
    FOR r IN
        SELECT * FROM users_and_books
    LOOP
        SELECT id INTO the_book_id FROM books WHERE title = r.book;
        INSERT INTO users (name, book_id) VALUES (r.name, the_book_id);
    END LOOP;
END;
$$
LANGUAGE plpgsql;

DROP TABLE users_and_books;

SELECT * FROM users;
SELECT * FROM books;
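As an aside, the DO block above is written as an explicit loop for clarity; the same back-fill can be expressed as a single set-based statement, which is generally preferable in SQL:

-- Equivalent to the DO block above, in one statement.
INSERT INTO users (name, book_id)
SELECT uab.name, b.id
FROM users_and_books uab
JOIN books b ON b.title = uab.book;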
Appendix C: SQL script converting 2NF to 3NF
This is probably not the best solution, but it is fully self-contained, and you are encouraged to copy, paste, and experiment with it. For example, if you saved this file as example.sql, you could run the following from the command line (assuming you are using PostgreSQL):
createdb test
psql -d test -f example.sql
DROP TABLE IF EXISTS loans, users;
CREATE TABLE loans (
    id SERIAL PRIMARY KEY,
    book_id integer,
    loan_date DATE,
    loanee text,
    loanee_dob DATE
);

-- 2NF seed data; loanee_dob depends on loanee, not on the special columns.
INSERT INTO loans (book_id, loan_date, loanee, loanee_dob) VALUES
    (3, '2021-03-03', 'Bertha', '1902-02-02'),
    (7, '2021-03-03', 'Bertha', '1902-02-02'),
    (3, '2021-02-02', 'Austin', '1900-01-01');

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name text,
    dob DATE
);

-- One row per distinct loanee, so each date of birth is stored once.
INSERT INTO users (name, dob)
SELECT DISTINCT loanee AS name, loanee_dob AS dob
FROM loans;

ALTER TABLE loans
    ADD COLUMN user_id integer REFERENCES users ON DELETE RESTRICT;

-- Back-fill the new foreign key from the old inline columns.
DO $$
DECLARE
    r users%ROWTYPE;
BEGIN
    FOR r IN
        SELECT * FROM users
    LOOP
        UPDATE loans SET user_id = r.id
        WHERE loanee = r.name AND loanee_dob = r.dob;
    END LOOP;
END;
$$
LANGUAGE plpgsql;

-- The transitively dependent data now lives only in users.
ALTER TABLE loans
    DROP COLUMN loanee,
    DROP COLUMN loanee_dob;

SELECT * FROM users;
SELECT * FROM loans;
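As in Appendix B, the DO block is an explicit loop for clarity; it could be replaced with a single set-based statement (run before the loanee columns are dropped):

-- Equivalent to the DO block above, in one statement.
UPDATE loans
SET user_id = u.id
FROM users u
WHERE loans.loanee = u.name AND loans.loanee_dob = u.dob;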