Mask data with data protection rules (IBM watsonx.data intelligence)
To mask data, the data must conform to these requirements:
- The data is structured. The data must be in relational tables or CSV, Avro, partitioned data, or Parquet files.
- The column headers contain only alphanumeric characters (a-z, A-Z, 0-9). The column headers can't contain unsupported characters, such as multi-byte characters or special characters.
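The header requirement can be checked before a rule is applied. The following is a minimal sketch, assuming that "alphanumeric" means only the ASCII ranges listed above; the function name `invalid_headers` is hypothetical and not part of the product.

```python
import re

# Assumption: only ASCII letters and digits are allowed, as listed in the requirement.
_HEADER_PATTERN = re.compile(r"[A-Za-z0-9]+")


def invalid_headers(headers):
    """Return the headers that contain anything other than a-z, A-Z, or 0-9."""
    return [h for h in headers if not _HEADER_PATTERN.fullmatch(h)]


# The underscore and the multi-byte character both fail the check.
print(invalid_headers(["CustomerID", "phone_number", "Straße", "zip5"]))
# ['phone_number', 'Straße']
```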
When you choose the masking action, you must specify the masking criteria and the masking method.
Masking criteria
The masking criteria identify the columns to mask. You select the type of column property and specify one or more values of that property. Multiple values are combined with the logical OR operator.
| Type of column property | Description | Specific values |
|---|---|---|
| Business term | A business term that is assigned to the column. | Search for and then select one or more published business terms. |
| Data class | The data class that is assigned to the column. | Search for and then select one or more published data classes. |
| Tag | A tag that is assigned to a column in the asset. | Enter one or more tags, separated by commas. |
| Column name | The name of a column. | Enter one or more column names, separated by commas. |
For example, suppose you choose the column property of Data class and the specific values of California State Driver's License and Nevada State Driver's License. Values are then masked in columns that are assigned either the California State Driver's License or the Nevada State Driver's License data class.
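The OR logic in this example can be sketched in a few lines. This is only an illustration of the matching rule, not the product's implementation; the `Column` type, the function `columns_to_mask`, and the column names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_classes: set = field(default_factory=set)


def columns_to_mask(columns, selected_data_classes):
    """Return names of columns assigned at least one of the selected data classes."""
    selected = set(selected_data_classes)
    # A non-empty intersection means the column matches one value OR another.
    return [c.name for c in columns if c.data_classes & selected]


columns = [
    Column("ca_license", {"California State Driver's License"}),
    Column("nv_license", {"Nevada State Driver's License"}),
    Column("email", {"Email Address"}),
]
rule_values = ["California State Driver's License", "Nevada State Driver's License"]
print(columns_to_mask(columns, rule_values))
# ['ca_license', 'nv_license']
```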
Overview of masking methods
The masking methods differ mainly in how much of the data's original characteristics they retain. The more characteristics that are retained, the more useful, but the less secure, the masked data becomes. When you choose a masking method, consider these factors:
- Data integrity: Whether to repeat the same masked value for a repeated original value to maintain referential integrity between tables.
- Data format: Whether to retain the format of the original data. Preserving the format means that letters are replaced by letters with the same case, digits are replaced by digits, and the number of characters is the same.
The following table describes how each masking method affects these characteristics.
| Method | Description | Preserves integrity? | Preserves data format? |
|---|---|---|---|
| Redact | By default, replaces values with ten X characters. The most secure method. You can customize the replacement character and the number of replacement characters. For columns that have some assigned data classes, you can choose partial replacement. | No | No: If you are not using advanced masking options. Yes: If you are using advanced masking options. |
| Substitute | Replaces values with randomly generated values that preserve referential integrity. | Yes | No |
| Obfuscate | Replaces values with values that preserve referential integrity and the original data format. The least secure method. | Yes | Yes |
Redact
You can redact data using two different methods.
- The basic redact method replaces each data value with a string of exactly ten X characters. With redacted data, neither the format of the data nor data integrity is preserved. Redact is the most secure masking method, but results in the least useful masked data.
For example, the phone number 510-555-1234 is replaced with XXXXXXXXXX. All other phone numbers are replaced with the same value.
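The basic redact behavior can be illustrated with a minimal sketch. The function name `redact` and its parameters are hypothetical; they only mirror the customizable replacement character and count mentioned in the methods table.

```python
def redact(value, char="X", length=10):
    """Replace any value with a fixed-length run of one character."""
    # The input is deliberately ignored: every value maps to the same output,
    # so neither the format nor referential integrity survives.
    return char * length


print(redact("510-555-1234"))  # XXXXXXXXXX
print(redact("415-555-0199"))  # XXXXXXXXXX
```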
Substitute
The substitute method replaces data with values that don't match the original format. However, it preserves referential integrity for repeated values across all assets in the catalog. The substituted values are meaningless, and the original format of the values can't be determined from them. In both security and data usefulness, Substitute falls between the Redact and Obfuscate methods.
For example, the phone number 510-555-1234 is always replaced with 500ddcc98133703531re3456.
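One way to get a repeatable but meaningless replacement is a keyed hash. The sketch below is an assumption for illustration only: the source does not say which algorithm the product uses, and the names `substitute` and `SECRET_KEY` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; a real system would keep such a key secret.
SECRET_KEY = b"example-key"


def substitute(value, key=SECRET_KEY, length=24):
    """Map a value to a fixed-length token that is the same for every repeat."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # The token reveals nothing about the original format (dashes, digit groups).
    return digest[:length]


# The same input yields the same token, so joins on masked columns still line up.
print(substitute("510-555-1234") == substitute("510-555-1234"))  # True
```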
Obfuscate
The obfuscate method replaces data values with values that match the original format, and it preserves referential integrity for repeated values. Because the obfuscated values keep the original format, they can be valid values. Obfuscate is the least secure masking method, but results in the most useful masked data.
For example, the phone number 510-555-1234 is always replaced with 415-987-6543.
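Format preservation as defined earlier (letters stay letters of the same case, digits stay digits, length is unchanged) can be sketched as follows. This is an illustrative assumption, not the product's algorithm; the function `obfuscate` and the key are hypothetical, and the sketch does not try to produce valid values for a given data class.

```python
import hashlib
import hmac
import random
import string

# Hypothetical key; a real system would keep such a key secret.
SECRET_KEY = b"example-key"


def obfuscate(value, key=SECRET_KEY):
    """Replace each character with one of the same kind, repeatably per input."""
    # Seeding from a keyed hash of the value makes the output repeatable.
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)  # separators such as "-" are kept in place
    return "".join(out)


masked = obfuscate("510-555-1234")
print(len(masked), masked[3], masked[7])  # 12 - -
```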
However, the obfuscate method is limited to data values in columns that have assigned data classes with the following types of information:
- Personal information, for example, basic attributes of an individual, such as honorific or name suffix.
- Contact details, for example, email addresses, phone numbers, state, postal addresses, latitude, or longitude.
- Financial accounts, for example, credit cards, banking, or other financial account numbers.
- Government identities, for example, personal identification numbers issued by governments, such as SSN (US social security numbers) and CCN (credit card numbers).
- Personal demographic information, for example, religion, ethnicity, marital status, hobbies, or employee status.
- Connectivity data, for example, IP addresses or MAC addresses.
If you create a rule to obfuscate data and the rule is enforced on data that is not assigned a data class that supports obfuscation, the substitute method is used instead.
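The fallback rule can be expressed as a small decision function. This is a sketch under assumptions: the data class names in `OBFUSCATION_CLASSES` are placeholders rather than the product's actual list, and `effective_method` is a hypothetical name.

```python
# Placeholder data class names; the real set of supported classes is product-defined.
OBFUSCATION_CLASSES = {"US Phone Number", "Email Address", "Credit Card Number"}


def effective_method(rule_method, column_data_classes):
    """Return the masking method that is actually applied to a column."""
    supports_obfuscation = bool(set(column_data_classes) & OBFUSCATION_CLASSES)
    if rule_method == "obfuscate" and not supports_obfuscation:
        return "substitute"  # fallback described above
    return rule_method


print(effective_method("obfuscate", {"US Phone Number"}))  # obfuscate
print(effective_method("obfuscate", {"Product Code"}))     # substitute
```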