This course is closed. The accuracy of this page's content is not guaranteed. [Return to college.yukondude.com]
|COMP 210 - Fall 2003
Database Design I
For our purposes, normalization is the process of converting an Entity-Relationship model to a relational database schema. A "schema" is really nothing more than the table-column-key equivalent of an entity-attribute-identifier-relationship model. The primary difference is that a schema can be directly translated into the SQL Data Definition Language (DDL) in order to implement this physical design in an RDBMS.
The 10 Easy Steps to Normalization
The following steps outline the procedure that you will always take when normalizing a conceptual model. Specific types of normalization conversions are explained in the the sections that follow.
Relational databases introduce the useful, but occasionally misused, concept of a NULL value. NULL should be thought of the absence of any meaningful value: it is not the same as zero, or false, or an empty string. Bear in mind that any mathematical operation including a NULL will always result in NULL: contrast 2 + 0 = 2, and 2 + NULL = NULL (actually, we'll see later that the equals sign can't be used with NULL, because NULL does not equal NULL, and even more strangely, NULL does not not equal NULL).
NULLs are most often used in columns to indicate either Unknown or Not Applicable. Of the two, Not Applicable is the more "proper" interpretation, but both uses are common.
A column in a table may be declared to either allow or disallow NULL values. Despite their utility, columns that allow NULL values should be kept to an absolute minimum. For one thing, relational theory--and most RDBMSs--do not allow primary keys composed of columns with NULL values.
Keys are very important to relational schemas; much more so that identifiers to entity-relationship models. We've already covered some of the key types when discussing Normal Forms, but there are some new types that we have not yet seen. We'll redefine all of them now in terms of tables and columns.
A candidate key is the set of one or more columns in a table that uniquely identify a row in that table. A table may have many candidate keys, and if the table is in 1NF, always has one candidate key: the set of all columns in the table (1NF requires that all rows be distinct).
A primary key is chosen from a table's candidate keys. The choice of primary key is a design decision, often influenced by either performance or business concerns. A primary key must not contain any columns that allow NULL values. A primary key should never change values.
A foreign key is the set of one or more columns in a table that together are the designated primary key of another, related table. Relational schemas, oddly enough, have no concept of a relationship as in the E-R model. Instead, foreign keys are used to define the relationships between tables. A table may have many foreign keys, and the column or columns making up the foreign key may be included in the table's primary key. The column used to store the foreign key does not have to use the same name as that used by the primary key in the related table.
A surrogate, or artificial, key is a system-supplied unique value for a column that will serve as a table's primary key. The value of a surrogate key has no business meaning and is intended only as a means for establishing relationships between tables (through foreign keys).
There are three main reasons for using surrogate keys:
The last of these reasons may occasionally induce a design decision to break the "All tables must have a primary key" interpretation of 1NF, especially if no other table must implement a foreign key of this table. While a surrogate key will satisfy the commandment and can safely be used as a primary key, it may also be overkill. The tables that represent many-to-many relationships have little intrinsic meaning to business users, and so identifying each row in these tables by a unique integer is of questionable value.
Other than the situations that demand them, using surrogate keys is a personal choice. I prefer to use them for almost every table, more to preserve consistency in the schema design than for any other reason (although performance is undoubtedly a consideration). There are a few stumbling blocks in using RDBMS-supplied values: J2EE's Container Managed Persistance does not coexist well with surrogate keys, nor do other database-disguising technologies; and any client-layer application that inserts a new row into a table with a surrogate key will have to do a little extra work to retrieve that new system-generated value from the database.
The actual mechanics of converting entities into tables is relatively straightforward. Each of the special cases are described in the following sections.
The notation used for relational schemas is arbitrary at best. There are no standards. I have used a simplified version of Crow's Foot notation, although it's not unusual to just use simple straight lines to connect tables and infer their relationship from the primary and foreign keys. Note that primary key columns are underlined in the schema notation.
One-to-one relationships are recognized in tables by inserting a foreign key column or columns into one of the tables that refers to the primary key of the other table. Which table will receive the foreign key is a design decision, but usually the "weaker" of the two entities (even if both are technically strong) should store the foreign key.
In this example, the USER_ACCOUNT entity is reckoned to be the weaker of the two (it likely would be a true weak entity) and so its corresponding table has an additional column, full_name, which is the foreign key that refers to the primary key of the same name in the PERSON table.
Note that the primary key chosen for the USER_ACCOUNT table is also the foreign key to the PERSON table: full_name. The user_name column is also a reasonable candidate to be the primary key (chances are you can't have two identical user names in the system). However, making the foreign key in a 1:1 relationship the primary key as well enforces the one-to-one nature of the relationship.
A one-to-many relationship is converted by inserting a foreign key into the table that lies on the "many" side of the relationship.
In this example, the full_name column is the foreign key that is inserted into the BANK_ACCOUNT table. The primary key is then chosen to be both the account_type and full_name columns. Since the primary key must be a unique identifier for the row, both columns are necessary for the case that a person has multiple bank accounts. Note that the primary key in this case also limits a person to at most one account of any account type.
To implement a many-to-many relationship in a schema, an additional table must be added that contains foreign keys for each of the two related tables. The additional (sometimes called "mapping") table may also have additional columns for any attributes of the many-to-many relationship.
The example above shows the new table, SUBSCRIPTION, that is composed of foreign keys to both the PERSON and MAGAZINE tables (the foreign keys are numbered just to show that they're not a single foreign key made up of two columns). The combination of these two foreign keys forms the primary key for the SUBSCRIPTION table, and the subscribe_date attribute of the many-to-many relationship is now a column in the mapping table. The foreign keys in such a mapping table are often used as the primary key for the table. Note that the relationship between the mapping tables and the two previous tables are now one-to-many. Following normalization, a relational schema will never contain a direct many-to-many relationship.
The participation of a relationship (optional or mandatory) is converted to schema form depending upon the location of the foreign keys. If the table opposite an optional relationship possesses a foreign key to the related table, then that foreign key column or columns must allow NULL values. If the table opposite an optional relationship does not possess a foreign key, then no further action need be taken.
In the example above, we have the strange case where a bank account need not be assigned to a person (perhaps the bank has special accounts just for balancing the books, or dealing with Enron-esque corporate clients). The relationship is one-to-many, with BANK_ACCOUNT on the many side, so it must have a foreign key that points back to PERSON: full_name. Since the foreign key is on the opposite side of the relationship, it must accept NULL values. Therefore, a row that represents a bank account that is not assigned to a person would have a NULL value in the full_name column.
Note that because full_name now accepts NULL values, we can't use the combination of account_type and full_name as the primary key. Instead, we create a surrogate key, account_id, and use it as the primary key.
If the (much more reasonable) relationship was instead that a person may or may not have a bank account, but that a bank account must be assigned to a person, then the conversion would be as follows:
In this case, the table opposite the optional relationship, PERSON, has no foreign key to BANK_ACCOUNT, and therefore we're done. No columns need to be changed to accept NULL values. If a person is added to the database without any bank account, then we'll simply add the record to PERSON, and not add anything to BANK_ACCOUNT.
As I've alluded to in the past, the concept of strong and weak does not translate to relational schemas. Instead, a weak entity is represented in table form just as would be a strong entity, except that it's often necessary to change the participation of the relationship on the side of the weak entity to optional. This will allow rows to be added to the "strong" tables, without having to invent "weak" table data to preserve the relationship.
In this example, the relationship on the USER_ACCOUNT side is converted to optional. Because the optional end of the relationship is on the same side as the foreign key, no columns need to change in their NULLability. Now a new person can be added to the database without having to immediately also add a user account. Note too that a weak entity table will always contain the foreign key to any strong entity table with which it is related.
At the relational schema level, a subtype relationship is nothing more than a one-to-one relationship that is optional on the side of the subtype table.
In this example, DOCTOR is a subtype of PERSON, and has an extra attribute, speciality. Once converted to schema form, the relationship is nothing more than one-to-one. A subtype table will always carry a foreign key to the supertype or parent table, and that foreign key will almost always serve as the primary key for the subtype table. Because the optional end of the relationship is on the same side as the foreign key, no columns need to change in their NULLability: if a new person is added who is not a doctor, a record will be added to PERSON, and nothing will be added to DOCTOR.
Recursive Entity Relationships
Finally, recursive relationships are really no different than inter-entity relationships. Foreign keys must still be allocated, just as in the case for the other cardinality cases, including extra mapping tables for many-to-many recursive relationships. The only difference is that the foreign keys will always end up in the same table as the primary keys to which they refer.
In the example above, the foreign key mother_full_name is added which refers to the full_name primary key. They can't have the same names by the rules of a Relation, but there's no rule that a foreign key has to have the same name as the primary key to which it refers. The recursive relationship can be read as: a person has to have a mother, and a mother can have many children, or none at all. It takes some thinking, but the mother_full_name foreign key is actually on the optional side of the relationship and so there is no need for it to allow NULL values.
The PERSON table does have one peculiarity in that eventually we'll run out of mothers, but the relationship says that every person must have a mother. This is what we otherwise call: The Chicken and the Egg Problem.