Practical database design, Part 1

Keys and datatypes

Comments

In this series, I discuss some general best practices that I have found to be particularly helpful. Nothing in it is specific to any one vendor's product and everything should, therefore, be applicable, regardless of which database implementation is being used. In this frist part of the article, I want to talk about both simple and complex datatypes, and about primary and foreign keys, which are the plumbing that holds the entire database together.

Primary keys and related matters

A relational database (DB) stores two kinds of information -- data and plumbing. Data comprises the customer names, inventory numbers, item descriptions, and so on, that the application uses. Plumbing refers to the primary and foreign keys that the DB needs to find database records and relate them to one another.

Basic plumbing

For the purpose of data modeling, the plumbing should be largely transparent. In fact, purist DB lore makes no distinction between data and plumbing. However, you will see that it is more efficient for administration and maintenance, as well as in terms of runtime performance, to have some additional fields to serve as DB keys.

Every table must have a primary key: an attribute or combination of attributes that are guaranteed to be unique and not-null. It is generally helpful to introduce a surrogate key -- a table attribute which has no business meaning, but simply serves as unique identifier for each record in the table. This is the plumbing that I have been referring to.

The requirements for a primary key are very strict. It must:

  • Exist
  • Be unique
  • Not change over time

Surrogate keys help to mitigate the fact that real business data never reliably fulfills these requirements. Not every person has a Social Security Number (think of those outside the U.S.), people change their names, and other important information.

Business data might also simply be bad -- glitches in the Social Security Administration's system may lead to different persons having the same Social Security Number. A surrogate key helps to isolate the system from such problems.

The second reason that surrogate keys are favorable has to do with efficiency and ease of maintenance, since you can choose the most efficient datatype for the surrogate key. Furthermore, the surrogate key typically is a single field (not a compound key), which simplifies the schema (particularly when the key is used in other tables as a foreign key).

Every table should have a dedicated column to serve as this table's primary key. This column may be called id or pk (or possibly <table_name>_id or <table_name>_pk). Most databases are tuned for queries on integers, so it makes sense to use this datatype as primary key. Many databases, including Postgres and Oracle, also provide a special serial or sequence integer type, which generates a sequence of unique integers. Declaring a column to be of this type guarantees that a unique key is generated for each inserted row.

Foreign keys are table attributes, the values of which are the primary keys of another table. It often makes sense to label foreign key columns explicitly, for instance, by adopting a naming convention such as <other_table_name>_fk. A referential integrity constraint (references) should be declared as part of the CREATE statement when creating the table.

It bears repeating that the surrogate keys discussed earlier are part of the plumbing only -- their existence does not obviate the modeling requirement to be able to form a primary key from the business data alone. Such a business data candidate key is a subset of all attributes, the values of which are never null, and each combination of values is unique. As a check on correct data modeling, such a candidate key must exist and should be documented for every table.

Strictly speaking, you may not always find a candidate key among the business data. Imagine a table recording the first and last name for each user, but having no further attributes. Now assume that there are two different persons, both of whom have the first name "Joe" and last name "Blow." In such a case, there exists no combination of table attributes that can form a suitable candidate key.

The underlying problem here is whether you are talking about the uniqueness of datasets or about the uniqueness of the underlying entities -- users, in this example. It is generally more intuitive, in particular to developers used to object-oriented analysis, to model the uniqueness of the underlying entities. Surrogate keys as discussed earlier can help to achieve this.

Alternate keys and visible identifiers

As part of the plumbing, the surrogate key has no need to ever be visible outside the DB. In particular, it should never be revealed to the user. This allows the DB administrator to change the representation of the keys at will if necessary. If a business need arises for providing the user with a unique identifier to a particular dataset, this identifier should be considered real business data and kept separate from the plumbing. For instance, an additional column called VisibleAccountNumber or the like can be introduced. Of course, this attribute should be non-null and unique so that it forms an alternative candidate key (an alternate key). Having a separate column for visible identifiers also makes it possible to generate and format the values for this attribute in a user-friendly way so that it is easy to read over the phone to a customer support person, for instance.

A borderline case is when the identifier is not directly visible, but may still be accessible to the user. Examples include hidden fields in Web pages in which an identifier is shuttled to the client to be used as a parameter in the following request. Although there is no need for the user to handle the identifier, a malicious user may read and attempt to spoof it. Using the numerical values of a primary key directly, in principle, allows any attacker to walk the entire table!

Defences against this problem include either encrypting and decrypting the value of the primary key, or protecting the key by appending a Message Authentication Code (MAC). An alternative is to use a hard-to-spoof visible identifier attribute for the table, such as the hash of the record's primary key or creation timestamp. (Of course, the uniqueness of this attribute must be assured.)

Whether the key is visible to the application (as opposed to the end user) depends on the specifics of the project. Using a numeric type directly carries the key's database representation straight into the application code and should be avoided to prevent coupling. In small-scale developments, a String representation of the key's value may be acceptable (all datatypes that can be stored in a DB must be able to be serialized).

But a better solution is a simple wrapper object that adds very little complexity, but provides strong decoupling of the database keys' representation from their interfaces. A danger exists in making the wrapper object too smart. The intention with surrogate keys is to make them simple and efficient for the database to handle. Settings from a database value and possibly from a String, comparing with another key object, and possibly serializing are all the methods that are required. Smarts, such as the ability to verify the contents based on a checksum calculation, suggest that this object probably belongs to the business data domain (like the visible record identifiers, introduced earlier).

The problem of the Universally Unique Identifier

A final consideration concerns the possible need for a Universally Unique Identifier (UUID). The short answer is that relational databases do not require UUIDs at all. In fact, the entire UUID concept is somewhat unrelated to relational database management. Relational database keys -- the plumbing -- need only be unique per table, which can be achieved by using an auto-incrementing datatype such as the serial type mentioned earlier.

UUIDs can have some technical difficulties. To ensure uniqueness, all UUIDs must be generated by a centralized service -- which leads to scalability problems and can become a single point of failure. (The scalability issue can be mitigated by a stratified approach in which a central master is used to give out seeds to several slaves, which in turn generate the final identifiers in batches, and so on.) To represent a UUID in a database, use either a string attribute or a compound key comprising several integer columns. Both approaches are significantly slower than operations based on keys made up of long integers. Compound keys also increase the complexity of the DB schema when used as foreign keys.

In the end, whether or not a record in a database needs to have a truly globally unique ID is dictated by the business rules, not the database architecture. Some records may already contain some form of UUID (merchandise items, for instance, typically possess a Universal Product Code as barcode). Some other records, possibly corresponding to principal business entities, may otherwise already contain a unique identifier as part of their business data (such as the combination of timestamp and account name for a ledger entry). If this is not the case, a UUID can be generated and stored alongside the business data for those records that require it. In any case, UUIDs should be considered part of the business data -- not of the plumbing.

Even if (and this is a big if) the object-relational mapping approach chosen requires every business object to have a persistent, unique ID, there is no need to base the internal workings of the underlying relational database engine on this fact.

In summary, I argue to keep business data apart from the database's internal plumbing. Building a relational database around UUIDs breaks this principle by using attributes, which, in the end, are really part of the business data, as internal infrastructure. (For a totally different point of view on this issue and a careful discussion of the problems involved in generating UUIDs in a scalable fashion, see Scott Ambler's paper, "Mapping objects to relational databases," in Related topics.)

Datatypes

The SQL standard defines a number of standard datatypes and most DB vendors support some additional ones which are specific to their own product. In the absence of truly compelling reasons to the contrary, avoid such extensions for the sake of portability.

Strings and numbers

In general, numerical types pose few problems -- just select one that is large enough to support the necessary range of values.

The attempt to find the optimal width of a string column is usually not worth the effort. You can avoid a lot of confusion later on by making all text messages of type varchar(n) and limiting yourself to a few standard string lengths and introducing aliases for them, such as: 32 bytes ("Label"), 256 bytes ("Note"), and 4k ("Text").

Even if other business requirements restrict the maximum length of certain fields to specific values, the DB schema is arguably not the best place to enforce these rules. By the time the data reaches the DB, it is too late to do anything about it (except reject it). Individual restrictions, stemming from business rules and requirements, should be enforced by the business logic layer, which handles user interaction and input validation. On the other hand, maintenance of the DB schema is considerably simplified if it is restricted to a handful of different string attributes.

Limit the use of fixed-width strings to codes of all sorts (as opposed to variable-length strings for real text). Keep in mind however, that many seemingly fixed-length codes do actually become wider over time. The prudent DB designer tries to avoid anything similar to the Y2K problem for new development work.

Time and money

A type to record timestamps (date/time combinations) is always necessary and is, fortunately, covered by the SQL standard. No fully satisfying way exists to record a monetary value, however.

Saving monetary values and treating them in program code as a floating-point values always leads to round-off errors. Recording the value as an exact integer of the smallest currency subdivision (such as "cent" for US dollars, as well as for Euros and other appropriate currencies) may not be sufficient either. Many values carry more digits behind the decimal point than the two for which actual coins exist (just visit your local gas station). A choice of decimal with 5 to 9 digits should work, though.

It goes without saying that no monetary value should ever be recorded without also recording the currency -- even if you think your application will never handle anything but US dollars. Consider setting up a currency table and relating it to the monetary values using foreign keys rather than embedding currency information directly. This helps with internationalization (different currency names and symbols), as well as with formatting issues.

Booleans and the System Types Table

The use of the type bool anywhere in the design is a hint to rethink this particular module again. Very few attributes are truly limited to only two values -- even a gender column has a malicious tendency to morph towards (at least) three states -- male, female, and unknown. Allowing nulls only masks the real problem. A more flexible approach to type codes is needed.

In many places in the database, attributes determine a record's type in some way. The GenderType mentioned above is such an instance. Other examples may include ItemType (such as Merchandise, Freight, Packaging, Insurance), PaymentType (Cash, Check, MoneyOrder, CreditCard, Barter), and things like StoreType, MembershipType, DeviceType, and many more. This also includes instances in which you want to store the type of an object in some applicable object model.

For each type, you need some form of documentation telling you not only the name of the type, but also the characteristics associated with it; for instance, you may want to know what permissions each UserType entails. What better place to keep this information than in the database itself?

Any record that has some form of type information associated with it should contain a type code column, which itself is a foreign key referencing a type code table. The foreign key constraint ensures that no records with invalid types exist. The type code table might have the following attributes:

  • typeCode_pk
  • label (unique mnemonic, such as varchar(32))
  • description (varchar(256) should be sufficient)
  • uri (pointing to additional resources, where necessary)
  • codeGroup_fk

Additional attributes are, of course, conceivable -- such as a three-letter code or a visible numeric code.

The codeGroup_fk attribute serves to organize related type codes. For instance, all subscriber types might form a group. The codeGroup_fk attribute is a foreign key into a separate code group table. However, realizing that a code group is nothing but a type code itself, the relationship can be made recursive so that codeGroup_fk references typeCode_pk. Not only does this make the additional type code table unnecessary, but it also makes it possible to order groups in an arbitrarily deep hierarchical structure. It is best to keep the type of code system relatively simple and straightforward.

Complex datatypes

Finally, there are some common but complex datatypes -- such as phone numbers, postal addresses, contact information, and credit cards -- that occur in almost every database schema. Typically, such records need to be accessed from many tables in the database. In a typical eCommerce system, for instance, it might be necessary to store contact information for users, suppliers, warehouses, and admins.

Rather than including those attributes in the respective user, supplier, or other record. (and thereby repeating those columns throughout the database), it makes sense to set up a single table for the contact information that is referenced through foreign keys by all other tables. This has two immediate benefits:

  • It is easier to later change the cardinality of the relationships.
  • It localizes any future changes to the complex datatype.

Anticipating the attributes that will likely be required for each of those complex types is something of an art. My suggestion is to try to strive for completeness from the outset rather than being forced to change the schema each time an additional field becomes necessary.

A sampling of possible attributes for postal addresses includes:

  • Department
  • Company
  • Mail Stop
  • Address Line 1
  • Address Line 2
  • Address Line 3
  • City
  • State
  • Postal Code ("Zip")
  • Country

Full contact information might include the following attributes:

  • Title
  • First Name
  • Middle Name (or Initial)
  • Last Name
  • Suffix (such as jr. or sr.)
  • Home address
  • Work address
  • Home Phone
  • Work Phone
  • Cell Phone
  • Fax
  • Pager
  • Email

Finally, phone numbers should never be considered flat numbers. In fact, they break down into the following fields:

  • CountryCode
  • AreaCode (Number Plan Area)
  • ExchangeCode (Prefix)
  • LineNumber (Suffix)
  • Extension

In a phone number such as 987-1234, the prefix is the 987 and the suffix is the 1234. The extension is the only part of the phone number that is optional. It is probably sufficient to use char(4) for all columns, but one might consider char(6) to be on the safe side. Note that area codes in the US are limited to three digits, but this is not true for other countries.

Sensitive data should be kept in encrypted form. Even if the database system itself is compromised, the data is still protected from misuse. The most famous example of this kind of data management is the Unix password system which only stores hashes of the users' passwords rather than the passwords themselves. Some data, such as credit card numbers, needs to be encrypted in a recoverable fashion; however, a one-way encryption (as for the Unix password file) will not do. This leads to the problem of encryption key management -- clearly, it should not be stored in the DB, along with the secrets, but supplied at boot-time, for instance.

Summary

In this article, I've discussed some general best practices when designing relational databases, including:

  • The benefits of maintaining additional table attributes without any business meaning to serve as surrogate keys.
  • The recommendation not to base the internal workings of the database on Universally Unique Identifiers.
  • The use of a centralized type codes facility to express attributes with finite, predetermined ranges of values.
  • Some considerations in designing complex datatypes to be used throughout the database schema.

In the next half of this article, I'll cover database normalization and some additional uses for a database within a project, like the use of history tables and event logs.


Downloadable resources


Related topics

  • Learn about the importance of the surrogate primary key when designing stable, flexible, and well-performing enterprise databases in Mike Lonigro's article, "The Case for the Surrogate Key."
  • Read the book, Translucent Databases, as Peter Wayner offers a better, deeper protection paradigm for databases, one that works by scrambling the data with encryption algorithms, using the minimal amount of encryption that ensures the database is still highly functional.
  • Search developerWorks' vast library for more than 1,000 articles on database design.
  • Discover how DB2 Relational Connect helps with information integration by allowing several databases to be viewed and manipulated as one.

Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Web development
ArticleID=11804
ArticleTitle=Practical database design, Part 1
publish-date=05272003