Converting information
You use the CONVERT action to convert data according to a lookup table or a literal you supply.
- CONVERT: Changes tokens.
- CONVERT_S: Concatenates a suffix onto specified tokens.
- CONVERT_P: Concatenates the first prefix in a token that matches a value in a lookup table onto specified tokens.
- CONVERT_PL: Concatenates the longest prefix in a token that matches a value in a lookup table onto specified tokens.
- CONVERT_R: Runs tokens through the tokenization process again when you implement changes to the tokens.
- TRANS_KH: Converts Katakana characters to Hiragana characters.
- TRANS_HK: Converts Hiragana characters to Katakana characters.
- TRANS_WN: Converts full-width characters to half-width characters.
- TRANS_NW: Converts half-width characters to full-width characters.
Converting place codes
You use conversion with input records that use numeric codes for place names.
The codes are converted to actual place names. You must first create a table file with two columns. The first column is the input value and the second column is the replacement value. For example, the file CODES.TBL contains:
001 "SILVER SPRING"
002 BURTONSVILLE 800.0
003 LAUREL
...
Replacement values that contain multiple words must be enclosed in quotation marks (""). An optional weight can follow the second operand, as in the BURTONSVILLE entry of the previous example, to indicate that uncertainty comparisons can be used. The string comparison routine is used on BURTONSVILLE, and any score of 800 or greater is acceptable. The following pattern converts tokens according to the preceding table:
&
CONVERT [1] @CODES.TBL TKN
The tokens remain converted for all patterns that follow, as if the code is permanently changed to the text.
Convert files must not contain duplicate input values (first token or first set of tokens enclosed in quotes). If duplicate entries are detected, the Standardize stage issues an error message and stops.
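The table format and the duplicate check described above can be illustrated with a small Python sketch. This is not how the Standardize stage reads convert tables; the function name load_convert_table and the use of shlex for quote handling are assumptions made only for illustration.

```python
import shlex

def load_convert_table(lines):
    """Parse convert-table lines into {input value: (replacement, weight)}.

    Hypothetical helper: mimics the documented format (input value,
    replacement value, optional weight) and rejects duplicate input values.
    """
    table = {}
    for line in lines:
        fields = shlex.split(line)  # keeps "quoted multi-word values" together
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"missing replacement value: {line!r}")
        key, value = fields[0], fields[1]
        weight = float(fields[2]) if len(fields) > 2 else None
        if key in table:
            raise ValueError(f"duplicate input value: {key!r}")
        table[key] = (value, weight)
    return table

CODES = load_convert_table([
    '001 "SILVER SPRING"',
    '002 BURTONSVILLE 800.0',
    '003 LAUREL',
])
print(CODES["001"])
```

In this sketch a duplicate input value raises an error, which parallels the documented behavior of the stage stopping with an error message.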
Temporary conversion
With the CONVERT action, you can specify TEMP to apply a conversion to only the current set of actions.
The TEMP mode is a temporary conversion. The following example converts the suffix of the first operand according to the entries in the table SUFFIX.TBL.
CONVERT_S [1] @SUFFIX.TBL TEMP
If you have an operand value of HESSESTRASSE and a table entry in SUFFIX.TBL of:
STRASSE STRASSE 800.0
Operand [1] is replaced with the value:
HESSE STRASSE
There is now a space between the words. Subsequent actions in this pattern-action set operate as expected. For example, COPY_S copies the two words HESSE STRASSE to the target. COPY, CONCAT, and PREFIX copy the string without spaces. For example, if the table entry is:
STRASSE STR 800.0
The result of the conversion is HESSE STR. COPY_S preserves both words, but COPY copies HESSESTR as one word. The source of a CONVERT_P, CONVERT_PL, or CONVERT_S action can be an operand (as in the example), a dictionary field, or a user variable with equivalent results.
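The difference between COPY_S and COPY after a suffix conversion can be sketched in Python. The functions below are hypothetical stand-ins for the actions, they use exact suffix matching only, and they ignore the weight column.

```python
def convert_suffix(value, suffix_table):
    """Split a known suffix off value and replace it with the table value.

    Returns the resulting words, or [value] if no suffix matches.
    Simplified: exact matching, no uncertainty weights.
    """
    for suffix, replacement in suffix_table.items():
        if value.endswith(suffix) and len(value) > len(suffix):
            return [value[: -len(suffix)], replacement]
    return [value]

def copy_s(words):
    """Stand-in for COPY_S: spaces between words are preserved."""
    return " ".join(words)

def copy(words):
    """Stand-in for COPY: the words are copied without spaces."""
    return "".join(words)

words = convert_suffix("HESSESTRASSE", {"STRASSE": "STR"})
print(copy_s(words))  # HESSE STR
print(copy(words))    # HESSESTR
```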
Permanent conversion
The mode of TKN provides permanent conversion of a token.
Generally, when you are making a permanent conversion, you specify a retype argument that applies to the suffix with CONVERT_S or the prefix with CONVERT_P or CONVERT_PL. For example, assume that you are using the following CONVERT_S statement:
CONVERT_S [1] @SUFFIX.TBL TKN T
You also have an operand value of HESSESTRASSE and a table entry in SUFFIX.TBL of:
STRASSE STRASSE 800.0
HESSE retains the class ? because you did not specify a fifth argument to retype the body or root of the word, and STRASSE is given the type for the suffix, such as T, for street type. To perform further actions on these two tokens, you need a pattern of:
? | T
If no retype class is given, both tokens retain the original class ?.
You might want to retype both the prefix or suffix and the body. When checking for dropped spaces, a token such as APT234 can occur. In this case, the token has a class of < (leading alphabetic character). An optional fourth argument can retype the prefix APT to U for multiunit, and an optional fifth argument can retype the body 234 to ^ for numeric. In the following example, the PREFIX.TBL table contains an APT entry:
CONVERT_P [1] @PREFIX.TBL TKN U ^
If you want to retype just the body, you must specify a dummy fourth argument that repeats the original class.
Converting multi-token operands
If you are converting multi-token operands that matched to patterns ** or ?, the format of the convert table depends on whether the third argument to CONVERT is TKN or TEMP.
If the third argument is TKN, each token is separately converted.
Thus, to convert SOLANO BEACH to MALIBU SHORES, the convert table must have the following two lines:
Original Token | Converted Token |
---|---|
SOLANO | MALIBU |
BEACH | SHORES |
This might produce unwanted side effects, since any occurrence of SOLANO is converted to MALIBU and any occurrence of BEACH is converted to SHORES.
To avoid this situation, the TEMP option for CONVERT must be used. The combined tokens are treated as a single string with no spaces. Thus, SOLANOBEACH becomes the representation for a ? pattern containing the tokens SOLANO and BEACH. The following entry in the CONVERT table accomplishes the proper change:
Original Token | Converted Token |
---|---|
SOLANOBEACH | "MALIBU SHORES" |
In this convert table there must be no spaces separating the original concatenated tokens. When copying the converted value to a dictionary field, COPY does not preserve spaces. Therefore, use COPY_S if you need to keep the spaces.
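The contrast between the two modes can be sketched in Python. The functions convert_tkn and convert_temp are hypothetical and only model the lookup behavior described above: per-token lookup for TKN, and lookup of the concatenated string for TEMP.

```python
def convert_tkn(tokens, table):
    """TKN-style lookup: each token is converted on its own."""
    return [table.get(token, token) for token in tokens]

def convert_temp(tokens, table):
    """TEMP-style lookup: the tokens are joined without spaces first.

    Returns the converted string, or None if the joined string is not in the table.
    """
    return table.get("".join(tokens))

per_token_table = {"SOLANO": "MALIBU", "BEACH": "SHORES"}
joined_table = {"SOLANOBEACH": "MALIBU SHORES"}

print(convert_tkn(["SOLANO", "BEACH"], per_token_table))   # ['MALIBU', 'SHORES']
print(convert_tkn(["SOLANO", "AVENUE"], per_token_table))  # ['MALIBU', 'AVENUE'] (side effect)
print(convert_temp(["SOLANO", "BEACH"], joined_table))     # MALIBU SHORES
print(convert_temp(["SOLANO", "AVENUE"], joined_table))    # None
```

The second call shows the side effect: with per-token entries, SOLANO is changed even when it is not followed by BEACH.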
Assigning fixed values
You can use CONVERT to assign a fixed value to an operand or dictionary column.
This is accomplished by:
CONVERT operand literal TEMP | TKN
CONVERT dictionary-field literal
For example, to assign a city name of LOS ANGELES to a dictionary column, you can use either of the following actions:
COPY "LOS ANGELES" {CT}
CONVERT {CT} "LOS ANGELES"
More importantly, CONVERT can be used to convert an operand to a fixed value:
CONVERT [1] "LOS ANGELES" TKN
TKN makes the change permanent for all actions involving this record, and TEMP makes the change temporary for the current set of actions.
An optional class can follow the TKN. Since converting to a literal value is always successful, the retyping always takes place.
Converting prefixes and suffixes
When a single token can be composed of two distinct entities, a CONVERT-type action can be used to separate and standardize both parts.
An example of this is in German addresses, where the suffix STRASSE can be concatenated onto the proper name of the street, such as HESSESTRASSE.
If a list of American addresses has a significant error rate, you might need to check for occurrences of dropped spaces such as in MAINSTREET. To handle cases such as these, you can use the CONVERT_P or CONVERT_PL action to examine the token for a prefix and CONVERT_S for a suffix.
Like CONVERT_P, the CONVERT_PL action examines the token for a prefix. However, CONVERT_P takes the first prefix that matches a value in the lookup table and CONVERT_PL takes the longest prefix that matches.
For example, assume that a lookup table contains entries for NORTH and NORTHWEST. For the token NORTHWESTPOINT, the CONVERT_P action takes the prefix NORTH and the CONVERT_PL action takes the prefix NORTHWEST.
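The two selection rules can be sketched in Python. The sketch assumes that "first" means the first matching entry in table order, which the text does not state explicitly; the function names are hypothetical.

```python
def first_prefix(token, prefixes):
    """CONVERT_P-style choice: first table entry that is a prefix of token.

    Assumption: "first" is interpreted as table order.
    """
    return next((p for p in prefixes if token.startswith(p)), None)

def longest_prefix(token, prefixes):
    """CONVERT_PL-style choice: longest table entry that is a prefix of token."""
    matches = [p for p in prefixes if token.startswith(p)]
    return max(matches, key=len) if matches else None

PREFIXES = ["NORTH", "NORTHWEST"]
print(first_prefix("NORTHWESTPOINT", PREFIXES))    # NORTH
print(longest_prefix("NORTHWESTPOINT", PREFIXES))  # NORTHWEST
```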
CONVERT_P, CONVERT_PL, and CONVERT_S use almost the same syntax as CONVERT. The first difference is that you must use a lookup table with these actions. The second difference is that you have an optional fifth argument.
CONVERT_P source @table_name TKN | TEMP retype1 retype2
CONVERT_PL source @table_name TKN | TEMP retype1 retype2
CONVERT_S source @table_name TKN | TEMP retype1 retype2
Argument | Description |
---|---|
source | Can be either an operand, a dictionary field, or a user variable. |
retype1 | Refers to the token class that you want assigned to the prefix with a CONVERT_P or CONVERT_PL action or the suffix with a CONVERT_S. This argument is optional. |
retype2 | Refers to the token class that you want assigned to the remainder of the token after the conversion, if the source is an operand. This argument is optional. |
Converting with retokenization
You can use the CONVERT_R action to force the new tokens through the tokenization process so that classes and abbreviations are correct.
Use the CONVERT_R action when you want to convert a single token into two or more tokens, some of which can be of classes different from the original class.
The syntax for CONVERT_R is simpler than the syntax for CONVERT since an operand is the target of the action and the argument of TKN is assumed. Use a convert table and the tokenization process retypes automatically:
CONVERT_R source @table_name
With CONVERT_R, the source is the operand.
An example of using CONVERT_R is for street aliases. For example, the file ST_ALIAS.TBL contains the following entries:
OBT "ORANGE BLOSSOM TRL"
SMF "SANTA MONICA FREEWAY"
WBN "WILSHIRE BLVD NORTH"
WBS "WILSHIRE BLVD SOUTH"
The pattern-action set looks like:
*+
CONVERT_R [1] @ST_ALIAS.TBL
The alias is expanded and its individual tokens are properly classified. Using the preceding example, WBN is expanded into three tokens with classes ?, T, and D. The remaining pattern-action sets work as intended on the new address string.
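The expand-then-reclassify behavior can be sketched in Python. The CLASSES dictionary below is a hypothetical stand-in for the rule set's classification table, and convert_r is an illustrative function rather than the actual tokenization process.

```python
ALIASES = {
    "OBT": "ORANGE BLOSSOM TRL",
    "SMF": "SANTA MONICA FREEWAY",
    "WBN": "WILSHIRE BLVD NORTH",
    "WBS": "WILSHIRE BLVD SOUTH",
}

# Hypothetical classification entries: T = street type, D = direction.
CLASSES = {"TRL": "T", "FREEWAY": "T", "BLVD": "T", "NORTH": "D", "SOUTH": "D"}

def convert_r(token, aliases, classes):
    """Expand an alias and classify each resulting word separately.

    Words that are not in the classification table get the class "?".
    """
    expanded = aliases.get(token, token)
    return [(word, classes.get(word, "?")) for word in expanded.split()]

print(convert_r("WBN", ALIASES, CLASSES))
# [('WILSHIRE', '?'), ('BLVD', 'T'), ('NORTH', 'D')]
```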
Retyping tokens
An optional class can follow TKN. The token is retyped to this class if the conversion is successful. If no class is specified, the token retains its original class. The following example converts a single unknown alphabetic token based on the entries in three table files. If there is a successful conversion, the token is retyped to C, T, or U, depending on which table matched.
+
CONVERT [1] @CITIES.TBL TKN C
CONVERT [1] @TOWNS.TBL TKN T
CONVERT [1] @UNINCORP.TBL TKN U
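The retype-on-success rule can be sketched in Python. The table contents are invented for illustration, and convert_and_retype is a hypothetical function that applies the lookups in sequence, as the three actions above do.

```python
def convert_and_retype(token, token_class, steps):
    """Apply (table, new_class) steps in order.

    The token is converted and retyped only when a lookup succeeds;
    otherwise it keeps its current value and class.
    """
    for table, new_class in steps:
        if token in table:
            token, token_class = table[token], new_class
    return token, token_class

# Hypothetical table contents.
CITIES = {"LA": "LOS ANGELES"}
TOWNS = {"LAUREL": "LAUREL"}
UNINCORP = {"BURT": "BURTONSVILLE"}
STEPS = [(CITIES, "C"), (TOWNS, "T"), (UNINCORP, "U")]

print(convert_and_retype("LA", "+", STEPS))      # ('LOS ANGELES', 'C')
print(convert_and_retype("BURT", "+", STEPS))    # ('BURTONSVILLE', 'U')
print(convert_and_retype("SPRING", "+", STEPS))  # ('SPRING', '+')
```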
Conversion actions for languages that are not Latin-based
You can use conversion with records that contain characters from languages that are not Latin-based. These actions can be used in data quality stages such as the Investigate stage and Standardize stage.
These conversion actions can be applied to any type of character. However, the actions are most useful when they are applied to languages that are processed by the CJK tokenizer, such as languages that are spoken in China, Japan, and Korea.
Converting Katakana and Hiragana characters
The TRANS_KH action converts Katakana characters to Hiragana characters. The TRANS_HK action converts Hiragana characters to Katakana characters.
If you use these actions, specify the CJK tokenizer by using the TOK command in the PRAGMA section of the pattern-action file. The CJK tokenizer handles tokens based on the conventions of languages that are not Latin-based.
Although the syntax for these actions is similar to the syntax for the CONVERT action, these actions do not use lookup tables and cannot convert an object to a literal. For example, you can use the following syntax for the TRANS_KH action:
TRANS_KH source TKN | TEMP retype1
Argument | Description |
---|---|
source | The object that is converted. The source can be an operand, dictionary field, or user variable. |
retype1 | The token class that you want assigned to the converted token. This argument is optional. |
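The character mapping behind such a conversion can be sketched in Python. In Unicode, the Katakana letters U+30A1 to U+30F6 sit exactly 0x60 code points above the corresponding Hiragana letters U+3041 to U+3096. Whether TRANS_KH and TRANS_HK cover exactly this range is an assumption; the function names are hypothetical.

```python
OFFSET = 0x60  # distance between the Hiragana and Katakana blocks

def katakana_to_hiragana(text):
    """Map Katakana U+30A1..U+30F6 to Hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - OFFSET) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )

def hiragana_to_katakana(text):
    """Map Hiragana U+3041..U+3096 to Katakana; leave other characters alone."""
    return "".join(
        chr(ord(ch) + OFFSET) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )

print(katakana_to_hiragana("トウキョウ"))  # とうきょう
print(hiragana_to_katakana("とうきょう"))  # トウキョウ
```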
Converting character width
The TRANS_WN action converts full-width characters to half-width characters. The TRANS_NW action converts half-width characters to full-width characters.
If you use these actions, specify the CJK tokenizer by using the TOK command with an appropriate locale variable in the PRAGMA section of the pattern-action file. The CJK tokenizer handles tokens based on the conventions of languages that are not Latin-based. For example, to use the TRANS_WN action to convert full-width Japanese characters to half-width characters, specify TOK ja_JP.
Although the syntax for these actions is similar to the syntax for the CONVERT action, these actions do not use lookup tables and cannot convert an object to a literal. For example, you can use the following syntax for the TRANS_WN action:
TRANS_WN source TKN | TEMP retype1
Argument | Description |
---|---|
source | The object that is converted. The source can be an operand, dictionary field, or user variable. |
retype1 | The token class that you want assigned to the converted token. This argument is optional. |
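A width conversion can be sketched in Python for the ASCII range only. Full-width forms U+FF01 to U+FF5E sit 0xFEE0 code points above ASCII U+0021 to U+007E, and the ideographic space U+3000 corresponds to the ASCII space. The actual actions might also handle other characters, such as half-width Katakana, which this sketch leaves unchanged; the function names are hypothetical.

```python
WIDTH_OFFSET = 0xFEE0  # full-width ASCII variants minus their ASCII counterparts

def to_half_width(text):
    """Convert full-width ASCII variants and U+3000 to half-width characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x3000:
            out.append(" ")
        elif 0xFF01 <= cp <= 0xFF5E:
            out.append(chr(cp - WIDTH_OFFSET))
        else:
            out.append(ch)  # outside the range handled by this sketch
    return "".join(out)

def to_full_width(text):
    """Convert printable ASCII and the space to full-width characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x20:
            out.append("\u3000")
        elif 0x21 <= cp <= 0x7E:
            out.append(chr(cp + WIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)

print(to_half_width("ＡＢＣ１２３"))  # ABC123
```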
CONVERT considerations
A CONVERT source can be an operand, a dictionary column, or a user variable, and if the source is an operand, it requires a third argument. The results of CONVERT_P, CONVERT_PL, and CONVERT_S actions can vary based on the pattern classes the actions are used with.
Some considerations to take into account when using the CONVERT action include:
- The source of a CONVERT can be an operand, a dictionary field, or a user variable. In the following example, both actions are valid:
CONVERT temp @CODES.TBL
CONVERT {CT} @CODES.TBL
- Entire path names can be coded for the convert table file specification:
CONVERT {CT} @..\cnvrt\cnvrtfile.dat
- If the source of a CONVERT is an operand, a third argument is required:
CONVERT operand table TKN
CONVERT operand table TEMP
TKN makes the change permanent for all pattern-action sets that follow the conversion and that involve this record.
If TEMP is included, the conversion applies only to the current set of actions.
If the conversion must be applied both to actions further down the program and to the current set of actions, specify two CONVERT actions (one by using TKN for other action sets and rule sets, and one by using TEMP for other actions in the current action set).
The results of CONVERT_P, CONVERT_PL, and CONVERT_S actions are affected by pattern classes. For example, the ? class can match more than one token. Assume that SUFFIX.TBL has the following lines:
STRASSE STRASSE 800.
AVENUE AVE 800.
If the pattern and actions are:
^ | ?
COPY [1] {HouseNumber}
CONVERT_S [2] @SUFFIX.TBL TEMP
COPY_S [2] {StreetName}
EXIT
The following input:
123 AAVENUE BB CSTRASSE
has the following result in the house number and street name fields:
HouseNumber | StreetName |
---|---|
123 | AAVENUEBBC STRASSE |
If the pattern and actions are:
^ | ?
CONVERT_S [2] @SUFFIX.TBL TKN T
^ | + | T | + | + | T
COPY [1] {HouseNumber}
COPY [5] {StreetName}
COPY [6] {StreetSuffixType}
EXIT
The following input:
123 AAVENUE BB CSTRASSE
results in:
HouseNumber | StreetName | StreetSuffixType |
---|---|---|
123 | C | STRASSE |
If the COPY actions instead fill the {HouseNumber}, {StreetName}, and {StreetSuffixType} fields from operands [1], [2], and [3]:
COPY [1] {HouseNumber}
COPY [2] {StreetName}
COPY [3] {StreetSuffixType}
the result is:
HouseNumber | StreetName | StreetSuffixType |
---|---|---|
123 | A | AVENUE |
When you concatenate alphabetic characters with the pattern ?, the CONVERT_P, CONVERT_PL, or CONVERT_S action operates on the entire concatenated string for user variables, dictionary fields, and operands if the mode is TEMP.
For operands with a mode of TKN, each token in the token table that comprises the operand is examined individually and new tokens corresponding to the prefix or suffix are inserted into the table each time the prefix or suffix in question is found.
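The difference between the two modes for a multi-token ? operand can be sketched in Python with the tokens AAVENUE BB CSTRASSE. The functions are hypothetical, use exact suffix matching, and assume that the body of a split token keeps the single-word class +, as the follow-up pattern in the example implies.

```python
SUFFIXES = {"STRASSE": "STRASSE", "AVENUE": "AVE"}

def split_suffix(word, table):
    """Return (body, replacement) if word ends with a table suffix, else None."""
    for suffix, replacement in table.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], replacement
    return None

def convert_s_temp(tokens, table):
    """TEMP: the operand is treated as one concatenated string."""
    joined = "".join(tokens)
    parts = split_suffix(joined, table)
    return " ".join(parts) if parts else joined

def convert_s_tkn(tokens, table, suffix_class="T"):
    """TKN: each token is examined individually; new suffix tokens are inserted."""
    result = []
    for word in tokens:
        parts = split_suffix(word, table)
        if parts:
            result.append((parts[0], "+"))           # body (assumed class +)
            result.append((parts[1], suffix_class))  # suffix value from the table
        else:
            result.append((word, "+"))
    return result

tokens = ["AAVENUE", "BB", "CSTRASSE"]
print(convert_s_temp(tokens, SUFFIXES))  # AAVENUEBBC STRASSE
print([cls for _, cls in convert_s_tkn(tokens, SUFFIXES)])  # ['+', 'T', '+', '+', 'T']
```

With the leading numeric token added, the class sequence from the TKN sketch lines up with the pattern ^ | + | T | + | + | T used in the example.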