Output columns
Each rule set contains a list of output columns. Standardized data is added to these columns.
Output columns are defined in the dictionary (previously called the .DCT file).
You can add or modify rules that map input data to output columns. You can also specify leading separators, which separate strings in an output column.
| Type | Description | Example columns |
|---|---|---|
| Business intelligence | Contain the standardized values. These columns are often arranged in an order that is suited to the domain. For example, in the predefined rule set that is applied to US address data, the house number output column is before the street name output column. |
|
| Matching | Contain information that is used exclusively for matching. |
|
| Reporting | Contain information that is used to evaluate the standardization results. |
|
Dictionary (.DCT file)
The dictionary (previously called the .DCT file) defines the output columns that the rule set provides on the output page of the Standardize stage.
The dictionary holds a list of domain, matching, and reporting fields. Each field is identified by a unique descriptive name. The dictionary also provides the data type, such as character or integer, and length information.
The following example shows the format for the dictionary.
field-identifier field-type field-length
missing-value-identifier [ description ;comments]
The table explains the dictionary format.
| Format | Description |
|---|---|
field-identifier |
A descriptive field name that follows the package rules and is unique for all dictionaries. |
field-type |
The type of information in the field. For more information about field types see Field types. |
field-length |
The field length in characters. |
missing-value-identifier |
Optional. Serve as placeholders. The possible
values are:
Generally, use X or S for this argument. |
description |
Optional. Describe the field. |
; comments |
Optional. Any additional comments about the field. Comments can continue on separate lines if the comments are preceded by a semicolon. |
The following string must appear as the first two lines of a dictionary:
;;QualityStage vn.n
\FORMAT\ SORT=N
The following example shows part of a USADDR dictionary:
;;QualityStage v8.0
\FORMAT\ SORT=N
;----------------------------------
; USADDR Dictionary File
;----------------------------------
; Total Dictionary Length = 411
;----------------------------------
; Business Intelligence Fields
;----------------------------------
HouseNumber C 10 S HouseNumber ;0001-0010
HouseNumberSuffix C 10 S HouseNumberSuffix ;0011-0020
StreetPrefixDirectional C 3 S StreetPrefixDirectional ;0021-0023
.
.
The order of fields in the dictionary is the order in which the fields appear on the output tab in the stage. When you map input data to the output columns, you can change the order.
Field types
The following field types are supported by the dictionary. In the output, the SQL type is varchar.
| Field Type | Definitions |
|---|---|
| N | Numeric fields, which are right-aligned and filled with leading blanks. |
| C | Alphabetic fields, which are left-aligned and filled with trailing blanks. |
| NS | Numeric field in which leading zeros are stripped. For example, you define the house number field as NS and the ZIP Code as N. Use NS because leading zeros are relevant with ZIP codes, but could interfere with matching on the house number. |
| M | Mixed alphabetic and numeric information, in which numeric values are
right-aligned and alphabetic values are left-aligned. Leading zeros are retained. The U.S. Postal
Service uses this type of field for house numbers and apartment numbers. For example, a
four-character type M field, where b represents a space, is:
|
| MN | Mixed name, which is often used for representing street names. Field values
that begin with a letter are left-aligned. Field values that begin with a number are indented as if
the number is a separate four-character field. In the following examples, b represents a space. The one-digit numbers are indented three spaces, two-digit numbers are indented two spaces, and so on. The U.S. Postal Service uses this type of field for street names in the ZIP+4 files.
|