Output columns

Each rule set contains a list of output columns. Standardized data is added to these columns.

Output columns are defined in the dictionary (previously called the .DCT file).

You can add or modify rules that map input data to output columns. You can also specify leading separators, which separate strings in an output column.

Most rule sets include the types of output columns that are shown in the following table.
Table 1. Types of output columns
Type Description Example columns
Business intelligence Contain the standardized values. These columns are often arranged in an order that is suited to the domain. For example, in the predefined rule set that is applied to US address data, the house number output column is before the street name output column.
  • HouseNumber
  • HouseNumberSuffix
  • StreetPrefixDirectional
Matching Contain information that is used exclusively for matching.
  • StreetNameSOUNDEX
  • MatchFirstNameNYSIIS
  • CityNameRVSNDX
Reporting Contain information that is used to evaluate the standardization results.
  • UnhandledData
  • InputPattern
  • ExceptionData

Dictionary (.DCT file)

The dictionary (previously called the .DCT file) defines the output columns that the rule set provides on the output page of the Standardize stage.

The dictionary holds a list of domain, matching, and reporting fields. Each field is identified by a unique descriptive name. The dictionary also provides the data type, such as character or integer, and length information.

The following example shows the format for the dictionary.


field-identifier field-type field-length
 missing-value-identifier [ description ;comments] 

The table explains the dictionary format.

Table 2. Dictionary format
Format Description
field-identifier A descriptive field name that follows the package rules and is unique for all dictionaries.
field-type The type of information in the field. For more information about field types see Field types.
field-length The field length in characters.
missing-value-identifier Optional. Serve as placeholders. The possible values are:
  • S - spaces
  • Z - zero or spaces
  • N - negative number, such as -1
  • 9 - all nines, such as 9999
  • X - no missing value

Generally, use X or S for this argument.

description Optional. Describe the field.
; comments Optional. Any additional comments about the field. Comments can continue on separate lines if the comments are preceded by a semicolon.

The following string must appear as the first two lines of a dictionary:


;;QualityStage vn.n
\FORMAT\ SORT=N

The following example shows part of a USADDR dictionary:


;;QualityStage v8.0
\FORMAT\ SORT=N
;----------------------------------
; USADDR Dictionary File
;----------------------------------
; Total Dictionary Length = 411
;----------------------------------
; Business Intelligence Fields
;----------------------------------
HouseNumber C 10 S HouseNumber ;0001-0010
HouseNumberSuffix C 10 S HouseNumberSuffix ;0011-0020
StreetPrefixDirectional C 3 S StreetPrefixDirectional ;0021-0023
.
.

The order of fields in the dictionary is the order in which the fields appear on the output tab in the stage. When you map input data to the output columns, you can change the order.

Field types

The following field types are supported by the dictionary. In the output, the SQL type is varchar.

Table 3. Field type definitions
Field Type Definitions
N Numeric fields, which are right-aligned and filled with leading blanks.
C Alphabetic fields, which are left-aligned and filled with trailing blanks.
NS Numeric field in which leading zeros are stripped. For example, you define the house number field as NS and the ZIP Code as N. Use NS because leading zeros are relevant with ZIP codes, but could interfere with matching on the house number.
M Mixed alphabetic and numeric information, in which numeric values are right-aligned and alphabetic values are left-aligned. Leading zeros are retained. The U.S. Postal Service uses this type of field for house numbers and apartment numbers. For example, a four-character type M field, where b represents a space, is:
  • 102 becomes b102
  • A3 becomes A3bb
MN Mixed name, which is often used for representing street names. Field values that begin with a letter are left-aligned. Field values that begin with a number are indented as if the number is a separate four-character field.

In the following examples, b represents a space. The one-digit numbers are indented three spaces, two-digit numbers are indented two spaces, and so on. The U.S. Postal Service uses this type of field for street names in the ZIP+4 files.

  • MAIN
  • CHERRY HILL
  • bbb2ND
  • bb13TH
  • b123RD
  • 1023RD