Affinity privacy provider

Use the affinity privacy provider to mask data while maintaining the format and character types of the source values. For example, the provider can maintain the format of data such as account numbers or driver's license numbers while masking character data with character data and numeric data with numeric data. That is, the data type remains unchanged.

If the source data is uppercase, alphabetic characters, the affinity privacy provider generates uppercase, alphabetic characters at the destination. The provider masks alphabetic and numeric characters, but copies other characters in the source data to the destination. For example, a credit card number in the format "nnnn nnnn nnnn nnnn" is masked to a different number that includes spaces at the same intervals, while one formatted with dashes is masked as "nnnn-nnnn-nnnn-nnnn". Using this provider, you can generate unique values, a different value for each occurrence of the same source, and values with a length different from the source.

The provider includes two algorithms for masking data, a default algorithm and a format preserving encryption (FPE) algorithm.

The FPE algorithm is encryption-based and offers stronger masking capability than the default. The FPE algorithm is based on the Advanced Encryption Standard 256-bit (AES-256) algorithm, can optionally use an encryption key that is supplied by the user, and can produce masked values that are unique. The same user-supplied key produces repeatable masked values. For this reason, knowledge of the key should be secured from unauthorized users to prevent reverse engineering to discover the original values.

The FPE algorithm also produces outputs that are extremely varied and without a discernible pattern. For example, two close or similar source values (such as 001 and 002) are masked with values that are not similar (such as 196 and 837), thus masking any pattern these values may have had in the source data.

The TWEAKS parameter, available with the FPE algorithm only, can mask identical strings within source values by using other strings within each source value to influence the masking process. For example, for two source values such as ABC-123 and DEF-123, the TWEAKS parameter can mask the 123 string in each value by using the preceding string to produce the masked values ABC-981 and DEF-704.

While the FPE algorithm offers an industry standard level of encryption, consider using the default algorithm when your data does not require such encryption. The most significant differences between the two algorithms are the degree of varied masking patterns, the time that is required to complete the masking effort, and the strength of the masked results. The default algorithm usually outperforms the FPE algorithm in processing time, as the FPE algorithm in most cases takes more time to complete. The complexity of the input data can add more time to the masking operation of the FPE algorithm, depending on the amount of parallel processing and I/O being performed.

Note: Starting from IBM® InfoSphere® Optim™ Version 11.3.0.4, the following enhancements are applicable:
  • Unicode support is enabled by default
  • Default value of the parameter TWEAKS has been changed from YES to NO
  • To get the results similar to the earlier versions of IBM InfoSphere Optim Version 11.3.0.4, you must specify TWEAKS = YES and LANG = "EN". For more information, refer to LANGUAGE and TWEAKS

Examples

Account numbers
Masking account numbers might require that the account number format is maintained while producing repeatable results for testing. To meet these requirements, you might use the following syntax:
pro=aff, algo=fpe, mtd=rep, key="Xyz123", rule=num, whenmatch=prefix(“33”), 
flddef1=(name=acctnbr, datatype=varchar_sz datatype=varchar)

This example uses the following parameters:

ALGO=FPE
The FPE algorithm offers a high level of encryption and produces masked values that are much less likely to reveal the original source value than the default.
MTD=REP
This parameter produces consistent target values when the same data is processed multiple times.
KEY="fpekey123"
This parameter provides the case-sensitive encryption key for the FPE algorithm. To produce repeatable results, use the same key value.
RULE=NUM
This parameter ensures that numeric source values produce numeric target values.
WHENMATCH=PREFIX
This parameter prevents identical source and target values by adding a prefix to target values that match the source.
This syntax produces following results:
Source values Masked values
acct12345 uaho84506
acct12345 uaho84506
acct56789 uaho66769
Driver's license numbers
Masking drivers license numbers might require that the format is maintained, as the format might be unique to the issuing agency. To meet these requirements, you might use the following syntax:
pro=aff, algo=FPE, mtd=rep, key="fpekey123", flddef1=(name=driver_number, dt=wvarchar_szdt=varchar)

This example uses the following parameters:

ALGO=FPE
The FPE algorithm offers a high level of encryption and produces masked values that are much less likely to reveal the original source value than the default.
MTD=REP
This parameter produces consistent target values when the same data is processed multiple times.
KEY="fpekey123"
This parameter provides the case-sensitive encryption key for the FPE algorithm. To produce repeatable results, use the same key value.
This syntax produces following results:
Source values Masked values
M2267 89890 34567 O0280 50902 54225
M2267 89890 34567 O0280 50902 54225
P2267 89890 34567 D9254 86713 09303
French data
Masking French data using the LANGUAGE parameter. To meet these requirements, you might use the following syntax:
PRO=AFF,ALGO=FPE,METHOD=REPEATABLE,KEY="DataPrivacy101",LANG="FR",FLDDEF1=(NAME=SAMPLE_FLD,DT=WVARCHAR_SZ)

This example uses the following parameters:

ALGO=FPE
The FPE algorithm offers a high level of encryption and produces masked values that are much less likely to reveal the original source value than the default.
MTD=REP
This parameter produces consistent target values when the same data is processed multiple times.
KEY="DataPrivacy101"
This parameter provides the case-sensitive encryption key for the FPE algorithm. To produce repeatable results, use the same key value.
LANG="FR"
This parameter specifies the language of the input data.
This syntax produces following results:
Source values Masked values
ABCDEFG ÎÏÊÎJNI
abcdefg îïêîjni
1234567890 9832939316
ADÉLAÏDE PÎÙMRWÔE
ÉMILIE ËËRSÏÏ
Unicode data
Starting from IBM InfoSphere Optim Version 11.3.0.4, Unicode support is enabled by default. To meet these requirements, you might use the following syntax:
PRO=AFF,ALGO=FPE,METHOD=REPEATABLE,KEY="DataPrivacy101",FLDDEF1=(NAME=SAMPLE_FLD,DT=WVARCHAR_SZ)

This example uses the following parameters:

ALGO=FPE
The FPE algorithm offers a high level of encryption and produces masked values that are much less likely to reveal the original source value than the default.
MTD=REP
This parameter produces consistent target values when the same data is processed multiple times.
KEY="DataPrivacy101"
This parameter provides the case-sensitive encryption key for the FPE algorithm. To produce repeatable results, use the same key value.
This syntax produces following results:
Source values Masked values
ABCDEFG RWXLNTL
abcdefg rwxlntl
1234567890 9832939316
ADÉLAÏDE MAÐHHÉAL
ÉMILIE ÐUIXLU
Passport numbers
Passport number data might contain identical strings that occur among multiple values. To produce masked data that does not include such strings, use the TWEAKS parameter. This data might also contain hyphens that can be removed from masked output by using the REMOVE parameter. To meet these requirements, you might use the following syntax:
pro=aff, algo=fpe, mtd=rep, key="fpekey123", iterations=20, tweaks=yes, 
flddef1=(name=passport_nbr, dt=varchar_sz dt=varchar)

This example uses the following parameters:

ALGO=FPE
The FPE algorithm offers a high level of encryption and produces masked values that are much less likely to reveal the original source value than the default.
MTD=REP
This parameter produces consistent target values when the same data is processed multiple times.
KEY="fpekey123"
This parameter provides the case-sensitive encryption key for the FPE algorithm. To produce repeatable results, use the same key value.
ITERATIONS=20
This parameter determines the number of times the FPE algorithm processes each value to ensure uniqueness.
TWEAKS=No
This parameter ensures that identical strings are masked with unique strings.
This syntax produces following results:
Source values Masked values
US20139999 YT03385676
US20139999 YT03385676
CA11233388 AU96923919

Syntax

The affinity privacy provider uses the following syntax:

Masking parameters
PROVIDER = AFFINITY , 
	[ ALGORITHM = { DEFAULT | FPE } ] ,
	METHOD = { REPEATABLE | HASH } ,
ALGO=FPE parameters
	KEY = { "literal" | @environment-variable-name } ,
	[ ITERATIONS = iterations-value ]  ,
	[ TWEAKS = { YES | NO } ]  ,
	[ LANGUAGE  = {"literal" | "exit-name" } ]

ALGO=DEF parameters
	[ SEED = { "literal" | @variable | RANDOM } ]  ,
CHAR parameters
	[ LENGTH = { n | MAX } ]  ,
	[ COPY = (copy-start,copy-len [,"copy-lit"]) …   ]  , 
Numeric parameters
	[ RULE = { NUMERIC | BINARY | SCALED } ]  ,
Source value parameters
	[ CASE = UPPER ]  ,      
	[ REMOVE = “remove-chars“  ]  ,      
	[ TRIM = RIGHT ]  ,      
	[ WHENMATCH = { REVERSE | SHUFFLE | PREFIX( { "literal" | SEQ(low,high) | RAND(low,high) } | 
			SUFFIX( { "literal" | SEQ(low,high) | RAND(low,high) } | 
			OVERLAY( n, m, "literal" ) } ]  ,
Processing parameters
	[ DISCARDLIMIT = discard-limit-value ] ,
Data definition parameters
	FLDDEFn = ( NAME = field-name,    
		DATATYPE = datatype-value, 
		[ PRECISION = field-precision-value ], 
		[ SCALE = field-scale-value ],    
		[ LENGTH = field-length-value ],
		[ CODEPAGE = codepage-value ],
		[ CPTYPE = { DB2ZOS |DB2LUW | ORACLE |SYBASE |ODBC | INFORMIX |NETEZZA |SQLSERVER |TERADATA |ANY |NONE } ] ) ,
	[ CODEPAGE = codepage-value ]  ,
	[ CPTYPE = { DB2ZOS | DB2LUW | ORACLE | SYBASE | ODBC | INFORMIX |
		     NETEZZA | SQLSERVER | TERADATA | ANY | NONE  } ]

Masking parameters

Parameters that determine how to mask data.

PROVIDER (or PRO)
Required. Enter the provider name, AFFINITY (or AFF).
Note: The PRO parameter must be first in the masking string. All other parameters can appear in any order.
ALGORITHM (or ALGO)
The algorithm to use. Enter one of the following options:
DEFAULT (or DEF)
Mask data by using the default algorithm.

ALGO=DEF is not compatible with the parameters KEY, ITERATIONS, and TWEAKS.

FPE
Mask data by using the FPE algorithm. Use the KEY parameter to enter the encryption key.

ALGO=FPE is not compatible with the parameters SEED and MTD=HASH.

METHOD (or MTD)
The masking method to use. For small or less complex data sets, the HASH method has a better strength of masking compared to the REPEATABLE method. The HASH method might provide slightly slower processing compared to the REPEATABLE method.

Enter one of the following options:

REPEATABLE (or REP)
Generates consistently repeatable and unique output values. The same masked value is generated for every instance of a given source value, and each unique source value has a unique output value.
HASH
Generates output values by using a hash algorithm. The output values might not be unique nor repeatable between masking operations.

MTD=HASH is not compatible with the parameter ALGO=FPE.

ALGO=FPE parameters

Parameters for use with ALGO=FPE only.

KEY
The encryption key. To produce repeatable masked data, use the same key value.

Enter one of the following options:

"literal"
A 1 - 32 character alphanumeric string (including special characters such as $, @, and #). A NULL value is not valid.
@environment-variable-name
Name of an environment variable that provides the encryption key. The environment variable name must be prefixed with the at sign (@) and assigned a value before invoking the masking provider.
Note: It is mandatory to mention the KEY parameter when specifying ALGO=FPE.
Note: If you are using the Affinity provider FPE environment variable option with the data privacy UDFs for DB2 for Linux, UNIX, and Windows, you must set the variable using the db2set DB2ENVLIST command.
ITERATIONS
The number of times the FPE algorithm processes each source value to avoid creating a masked value that matches the source value. Enter an integer in the range 1 - 99. The default is 12. A value less than 12 has a negligible improvement on performance.
TWEAKS
If two values contain identical strings, determines whether to mask the identical strings differently for each combination of non-identical characters.

TWEAKS is not applicable to numeric data types or with the parameters ALGO=DEF and RULE.

Enter one of the following options:

YES
Default. Use tweaks to mask identical strings in source values by using other strings within each source value to influence the masking process. For example, if TWEAKS=YES, the provider might mask values AB-1234 and CD-1234 to DX-4795 and NR-3687, rather than DX-4795 and NR-4795.
NO
Do not use tweaks.
LANGUAGE (or LANG)
Specifies the language of the input data.

Enter one of the following options:

"literal"
The abbreviation that specifies the language.
"exit-name"
Custom languages are supported via an exit. The exit name must be prefixed with ioqx0. The sample exit supplied with the ODPP binaries is named ioqx0affexit.dll / libioqx0affexit.so. The exit may be invoked as LANG="ioqx0affexit"

Supported language abbreviations

Table 1. Supported language abbreviations
Abbreviation Language
ca Catalan
ca_ES Catalan
da Danish
de German
el Greek
en English
es Spanish
eu_ES Basque (Spain)
fi Finnish
fr French
gl_ES Galician (Spain)
is Icelandic
it Italian
ja Japanese
ko Korean
nb_NO Norwegian (Bokmal)
nl Dutch
nn_NO Norwegian (Nynorsk)
pl Polish
pt_BR Brazilian Portuguese
pt_PT Portuguese
ru Russian
sr_CS Serbian Cyrillic
sv Swedish
tr_TR Turkish
zh_CN Simplified Chinese
zh_TW Traditional Chinese

ALGO=DEF parameters

Parameters for use with ALGO=DEF only.

SEED
Use the SEED parameter to influence the masking process and to produce repeatable or non-repeatable output values. To produce repeatable masked data, use the same seed value.

Enter one of the following options:

Parameters for producing repeatable values:
"literal"
A character string, within enclosing double quotation marks. Any special characters in the string are ignored. The string is case-sensitive. Note that there is no particular advantage to providing a lengthy string. Using a literal potentially exposes the seed value and might allow reverse engineering to discover the original values.
@variable
The name of an environment variable, preceded by the at sign (@). To prevent reverse engineering to discover the original values, the variable can be created by the Optim administrator or other authorized person and secured from unauthorized access.
Parameter for producing non-repeatable values:
RANDOM (or RAN)
Specifies a random seed value, which produces a non-repeatable output value. This option can be used as a test data generator by concatenating the output of multiple runs. This option might produce duplicate values each time the process is run.

CHAR parameters

Parameters for use with CHAR character types.

LENGTH (or LEN)
For character data types only. Specifies a fixed length for each destination value, regardless of the length of the source value. The source value is repeated or truncated to match the LENGTH value before processing. If this parameter is not specified, the length of the destination value is the same as the source value. The LENGTH parameter specifies the length in wide characters and not in bytes.

The trailing blanks of an unfilled CHAR input value are considered part of a source value, resulting in an output value that also has trailing blanks. To include only non-blank characters in the output value, also specify TRIM=RIGHT.

The destination value might not be unique when the LENGTH parameter is specified. For example, if LEN=2 and the unique values "test1" and "test2" are input, the provider masks only the first two characters in each value, "te", and generates duplicate output values.

Specify MTD=HASH when using LENGTH with ALGO=DEF. To specify LENGTH with ALGO=FPE, use MTD=REP.

LEN is not compatible with the parameter RULE.

Enter the following option:

n
An integer value that specifies the destination value length. The value cannot exceed the length of the destination field.
COPY
For character data types only. Specifies one or more pairs of substrings to copy to the destination value. The provider can also replace characters with a literal string. Note that this parameter might cause non-unique destination values.

If the literal string is shorter than the substring, the provider replicates the literal until the lengths match. If the literal is longer than the substring, it is truncated.

Spaces are supported within literals only (for example, “X Y”).

COPY is not compatible with the parameter RULE.

Enter the following options:

copy-start
The starting position of the substring to copy to the destination.
copy-len
The length of the substring to copy to the destination.
copy-lit
A literal string that replaces the source characters in the specified positions. If the literal string is shorter than the substring, the provider replicates the literal until the lengths match. If the literal is longer than the substring, it is truncated.

Numeric parameters

Parameters for use with numbers and numeric data types.

RULE
Specifies processing rules to ensure that numeric strings are properly masked as numeric values.

Enter one of the following options:

NUMERIC (or NUM)
For numeric values within character data types only, specifies that numeric values generate masked numeric values instead of other characters. Without this option, the value 12345 in a character type could produce a masked value that contains alphabetic characters. If a source value is not a number, the provider processes the value normally without an error or warning message. Use this option for the following situations:
  • To transform integers in a character data type field to match the values of a numeric data type field. This usage maintains DBMS-type foreign key integrity across differing data types.
  • To correctly transform floating point numbers in character data type fields. Without this parameter, the exponent character E (or e) is transformed into another letter.
  • To maintain leading zeros that might otherwise be removed. For example, to prevent a value such as 00015 from being masked as a two-digit value.

RULE=NUM is not compatible with the parameters COPY and LEN.

BINARY (or BIN)
Specifies that source values of a numeric data type are transformed into values that conform to the provider data type restrictions. The output value might differ from the source value in order of magnitude and sign. Without this parameter, source values in numeric data type fields are converted to character strings and then converted back to the original data type.

Without RULE=BIN, there is a discrepancy between the number of digits that are required to represent a numeric value in character format and the maximum value that a given number of digits can express. For example, an unsigned short integer needs five-digit positions to hold the maximum unsigned value of 65535, but five-digit positions can express a maximum string value of 99999, which when converted back to an unsigned short integer results in an overflow. Without RULE=BIN, boundary source values in numeric data type fields might be transformed into values that use this discrepancy and do not conform to the storage constraints.

RULE=BIN is not compatible with the parameters COPY, LEN, TRIM, REMOVE, and CASE.

SCALED (or SCA)
Specifies that the output value has the same sign and approximately the same magnitude as the source value of the numeric data type. For example, when masking data such as salary figures, use this option to ensure that a value such as 12,345.67 does not produce a value such as -3.1.

Source value parameters

Parameters for handling source values.

CASE
Specifies that the provider converts the source values to uppercase before processing.

CASE is not compatible with the parameters RULE=BIN and RULE=SCALED.

Enter the following option:

UPPER (or UP)
Convert the source values to uppercase before processing.
REMOVE
Specifies the characters, within enclosing double quotation marks, to remove from the source value before processing.

REMOVE is not compatible with the parameters RULE=BIN and RULE=SCALED.

TRIM
Specifies to remove trailing spaces from the source value before processing.

TRIM is not compatible with the parameters RULE=BIN and RULE=SCALED.

Enter the following option:

RIGHT
Right-trim the source value to remove trailing spaces before processing.
WHENMATCH
If the source value matches the masked value, determines how to ensure that the masked value does not match the source. Spaces are not supported, except within literals (for example, “X Y”).

WHENMATCH might create masked values that are not unique. Also, format, length, or both might be changed when using WHENMATCH.

Enter one of the following options:

REVERSE
Change a matching value by reversing the order. For example, 1234 is changed to 4321.
SHUFFLE
Change a matching value by shuffling the order. For example, 1234 is changed to 4213.
PREFIX
Change a matching value by including a prefix. If the prefix creates a masked value that exceeds the length of the data type field, the row is skipped. Enter one of the following options:
"literal"
Enter a literal string to use as a prefix.

For numeric data types, the string must be numeric. For all other data types, use alphanumeric and special characters.

For character data types, the length of the string cannot exceed the field length. For numeric data types, if the literal value exceeds the data type limit, the row is skipped.

SEQ(low,high)
Use a sequential value for a prefix, where low is the starting value and high is the highest possible value. Each sequential number is incremented by one. Both values must be integers no greater than 2,000,000,000. The low value must be greater than 0 and less than the high value. When the high value is reached, the sequence repeats.
RAND(low,high)
Use a random value for a prefix, where low is the lowest possible random value and high is the highest possible random value. Both values must be integers no greater than 2,000,000,000. The low value must be greater than 0 and less than the high value.
SUFFIX
Change a matching value by including a suffix. If the suffix creates a masked value that exceeds the length of the data type field, the row is skipped. Enter one of the following options:
"literal"
Enter a literal string to use as a suffix.

For numeric data types, the string must be numeric. For all other data types, use alphanumeric and special characters.

For character data types, the length of the string cannot exceed the field length. For numeric data types, if the literal value exceeds the data type limit, the row is skipped.

SEQ(low,high)
Use a sequential value for a suffix, where low is the starting value and high is the highest possible value. Each sequential number is incremented by one. Both values must be integers no greater than 2,000,000,000. The low value must be greater than 0 and less than the high value. When the high value is reached, the sequence repeats.
RAND(low,high)
Use a random value for a suffix, where low is the lowest possible random value and high is the highest possible random value. Both values must be integers no greater than 2,000,000,000. The low value must be greater than 0 and less than the high value.
OVERLAY(n,m,"literal")
Replace consecutive characters with a literal value, where n is the position of the first character, m is the number of consecutive characters to replace, and "literal" is the literal value. For example, OVERLAY(1,3,"ABC") would mask the value 1234 as ABC4.

If the value of m differs from the length of the "literal" value, the length of the masked value might differ from the length of the source value. For example, OVERLAY(1,4,"ABC") would mask the value 123456 as ABC56, and OVERLAY(1,1,"ABC") would mask the value 123456 as ABC23456.

Processing parameters

Parameters for managing provider processes.

DISCARDLIMIT (or DLIM)
Specifies the number of failed rows to discard before the provider stops processing.

Data definition parameters

Parameters for defining source and target data. For further information see, supported data types.

FLDDEFn
Required. Specifies the attributes of input values to use for processing. See Field definition parameter.
CODEPAGE (or CP)
An integer value that specifies the codepage or character-set identifier of the source fields. The default is UTF-8. The CP parameter within the FLDDEFn parameter overrides this value.
CPTYPE (or CPT)
The codepage type of the source fields. The CPT parameter within the FLDDEFn parameter overrides this value.

When the origin of the data is DBMS-specific but not tied to any one DBMS, specify the value as ANY. When the origin of the data is from a non-DBMS source, specify the value as NONE. As there are no DBMS-specific code pages for Netezza®, a specification of NONE is implied when Netezza is specified.

Enter one of the following values:

Value Description
DBZ (or DB2zOS) DB2® for z/OS®
DB2 (or DB2LUW) DB2 for Linux®, UNIX, and Windows
IFX (or INFORMIX) Informix®
MSS (or SQLSERVER) Microsoft SQL Server
NZ or NETEZZA Netezza
ODBC ODBC
ORA (or ORACLE) Oracle
SYB (or SYBASE) Sybase
TD or TERADATA Teradata
ANY Any DBMS
NONE No DBMS

Supported data types

The provider supports the following data types. Note that some data types are not compatible with RULE=BIN and RULE=SCALED.

Table 2. Supported data types
Data type Description RULE=BIN and RULE=SCALED
CHAR Fixed size character data that is left justified and space padded. No
VARCHAR Character data starting with a short integer value that indicates the length, in bytes, of the character data to follow. No
WVARCHAR Wide character data starting with a short integer value which indicates the length, in bytes, of the wide character data to follow. No
WCHAR Fixed size wide character data that is left justified and space padded. No
VARCHAR_SZ Character data string which is terminated by a NULL character. No
WVARCHAR_SZ Wide character data string that is terminated by a NULL character. No
DOUBLE Double precision floating point number in the range 1.7E +/- 308 (15 digits). Yes
FLOAT Floating point number in the range 3.4E +/- 38 (7 digits). Yes
LONG_LONG An 8 byte signed numeric value in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Yes
U_LONG_LONG An 8 byte unsigned numeric value in the range 0 to 18,446,744,073,709,551,615. Yes
INTEGER A 4 byte signed integer value in the range -2,147,483,648 to 2,147,483,647. Yes
U_INTEGER A 4 byte unsigned integer value in the range 0 to 4,294,967,295. Yes
SMALLINT A 2 byte signed integer value in the range -32,768 to 32,767. Yes
U_SMALLINT A 2 byte unsigned integer value in the range 0 to 65,535. Yes
TINYINT A single byte signed integer value in the range -128 to 127. No
U_TINYINT A single byte unsigned integer value in the range 0 to 255. Yes
DECIMAL_370 Packed decimal encoded buffer. No

Using the provider in the data privacy application CLI

Use the following syntax to include the affinity provider in the data privacy application command-line interface (CLI): MASK="PRO=affinity, algo=...".

The following example shows the data privacy application CLI syntax for an affinity provider that uses the FPE algorithm to transform values.

MASK"pro=aff, algo=fpe, mtd=rep, key="foo", iterations=5, whenmatch=shuffle, 
					 flddef1=(name=col, dt=varchar_sz)"

Using the provider in a column map

Use the following syntax to include the affinity provider in a column map: TRANS PRO=affinity, algo=....

The following example shows the column map syntax for an affinity provider that uses the FPE algorithm to transform values.

TRANS PRO=aff, algo=fpe, mtd=rep,key="foo",iterations=5, whenmatch=shuffle,flddef1=(name=col, dt=varchar_sz)