Dimension keys

Natural keys are the business keys, operational keys, or production keys that are used in source systems and consideration needs to be given as to how they can be used in dimension entities.

For example, customer number, product number, a person’s social security number or driving license number, an organization’s VAT number, are keys that are often used in operational systems. The temptation is high to reuse them as the primary key of dimension entities, however the following should be considered:
  • What if their value changes over time, for example, when two source systems are merged?
  • How to handle history, that is, attribute change tracking?
  • Adding effective dates to the primary key, would result in a compound primary key. When concatenating all dimension keys, this might result in very long primary keys in the fact entities and therefore generate performance issues.
  • How to handle null conditions in fact entities, such as “unknown” or “not applicable” dimensions, with such compound primary keys?

Natural keys are not excluded from dimension entities, but they must be designed as business attributes. For example, a business attribute in the .

The primary key must be a surrogate key, of which the values are independently created and managed within the DWM itself, without any business or source system dependency. The surrogate key is a 4-byte integer attribute that identifies uniquely and without business meaning, the dimension entity. This attribute can handle approximately 2 billion positive values (232) or 4 billion positive and negative values (-232 to +232), which suffices for about any dimension.1

The values of the surrogate keys are assigned and managed by the ETL. Typically, the first row of a dimension table gets the value 1, the next one 2, and so forth. Some specific values, such as "-1" or "-2", can be restricted to a special purpose.

One commonly accepted exception to the surrogate key rule is the . In the DWM, a standard integer surrogate key is defined as the primary key for the , as with any other dimension entity. This is perfectly in line with the benefits of a surrogate key explained above. Moreover, if the values are assigned to the integer attribute according to the YYYYMMDD format (for example, 20160527 is the integer value for the 27 May 2016) rather than as purely sequential values which start at 1, it eases the partitioning management, therefore improving the time for loading, back-up, archiving, and so on, as well as the query response time. However, some designers can still prefer to replace the surrogate key by an attribute with a DATE data type.

Reasons to use the YYYYMMDD format include:
  • it allows business users to know the date in the fact table without having to join it to the , especially if they do not care about the specific attributes that qualify the date.
  • the data base system’s optimizer may be intelligent enough to know that there are maximum 31 days in a month, that a year has only 12 months, which provides an opportunity for a smarter access strategy and therefore better query performance.

Another type of key, which also plays a fundamental role in dimension entities, is the durable supernatural key. It is usually known as the Anchor Id in the DWM. It behaves like a surrogate key, except that it uniquely distinguishes occurrences of the dimension entity independently of its history. The supernatural key is qualified as "durable" because it is an identifier that uniquely and reliably identifies the dimension entity across its attribute changes. The durable supernatural key keeps the same value in all rows that represent the different versions of the dimension over time. In some cases, the durable supernatural key can be an alternative for representing the dimension key in fact entities, such as with slowly changing facts.

1 The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Third Edition, Chapter 3 - Retail Sales, John Wiley & Sons © 2013