Masking virtual data in Data Virtualization

You can use a data protection rules to mask virtual data. When queried, masked columns return disguised data.

Data masking applies to the result sets of the queries only. The original data in tables and columns remains untouched. Masking does not apply to query predicates or views. You can use data masking to avoid exposing sensitive data. However, data masking does not stop Data Virtualization users from connecting to the service and running queries against that data. Users can join and group data, generate the reports, perform analytics, and collect insights by using the raw data, while masking the result set only.

Data masking rules apply only to the following objects in Data Virtualization:

  • Virtualized objects
  • Native Db2® tables
  • Materialized Query Tables (MQTs)

Depending on the method of data masking, data is redacted, substituted, or obfuscated. For more information, see Governing virtual data with data protection rules in Data Virtualization. A lock icon (Lock icon) next to the column name in your virtualized data indicates that the data in the column is masked by a data protection rule.

Supported masking actions

The following masking actions are supported in Data Virtualization:
  • Redaction (Basic)
    Redact action data mapping
    The following list describes which data types are mapped for Redact action in Data Virtualization.
    • VARCHAR is masked to 'XXXXXXXXXX' or 'X' * column length, whichever is shorter.
    • CHAR is masked to 'XXXXXXXXXX' or 'X' * column length, whichever is shorter.
    • Numeric data types are masked to 0.
    • DATE is masked to '2001-01-01'.
    • TIME is masked to '00:00:00'.
    • TIMESTAMP is masked to DATE('2001-01-01 00:00:00.0').
    • BOOLEAN is redacted as "False".
  • Partial redaction (Advanced)
  • Substitution
    Substitute action data mapping
    CHAR, VARCHAR
    The Substitute masking action is based on a salted sha256 hash and when it is used with CHAR/VARCHAR columns, it gets base64 encoded. Column length must be at least 43 bytes to avoid collisions (two distinct inputs resulting in the same output).
    SMALLINT, INT, BIGINT, DECIMAL, DOUBLE, REAL, DECFLOAT
    Substitution displays a numeric hash of the original value. If the resulting value has a leading "0", that zero will not be displayed. For example, a masked value of "02" is displayed as just "2".
    DATE, TIMESTAMP
    Substitution displays a hashed representation of the original value.
    CLOB, BLOB, BINARY, VARBINARY

    NULL (or empty value if the column is defined as NOT NULL)

    Other data
    Falls back to the Redact action.
  • Obfuscation (Basic)
  • Obfuscation (Advanced)
The following masking actions are not supported.
  • Reversible masking
  • Character formatting options with Identifier method
For advanced obfuscation, the following list of subproperties are supported.
  • Preserve Format
  • Irreversible masking
  • Repeatable Consistency
  • Random Consistency
  • Input validation
  • Identifier method
For more information about advanced obfuscation, see Obfuscating data method (Masking flow).

For obfuscation and advanced masking options, the following data classes are supported.

  • Person name
  • First name
  • Last name
  • Name Suffix
  • Honorific
  • Gender
  • Email Address
  • Date of Birth
  • Driver’s License
  • All the 50 US state driving license number formats
  • US Social Security Number
  • US Social Security Number Last 4

  • Ethnicity
  • Eye Color
  • Hair Color
  • Hobby/Leisure Activity
  • Legal Marital/Civil Status
  • Political Party
  • Employment Status
  • Relationship
  • Religion
  • Language Code or Name as per ISO 369

  • Organization Name
  • Dun & Bradstreet Business Identifier
  • DUNS Number
  • Fortune 1000 Company
  • International Commercial Terms
  • Commercial and Government Entity Code

  • Address Line
  • Address Line 1
  • Address Line 2
  • Address Line 3
  • Country Name
  • Country Codes
  • US Street Name
  • US State Capital Name
  • US zip code
  • US phone number
  • US State Codes
  • US State Names
  • US County Names
  • State/Province Name
  • City (English)
  • Canada Province Codes
  • Canada Province Names
  • Latitude
  • Longitude
  • ISO 3166-2 code of state/province of country

  • Internet Protocol Address
  • Internet Protocol version 6 Address

  • Bank Identifier Code
  • Account Number

  • Credit Card Number
  • Visa Card Credit Card
  • American Express Credit Card
  • Master Card Credit Card
  • Discover Card Credit Card
  • Diners Club Credit Card
  • Japan Credit Bureau Credit Card

  • Date
  • Datetime
  • Month
  • Temperature

  • German Vehicle Registration Number

  • International Statistical Classification of Diseases v10

Usage notes

  • Data masking support in Data Virtualization is limited to the following data types:
    • Character strings (CHAR, VARCHAR)
    • Numeric (SMALLINT, INT, BIGINT, DECIMAL, DOUBLE, REAL, DECFLOAT)
    • Date and time (DATE, TIMESTAMP)
    • Boolean
  • Columns of the data type TIME are always masked as 00:00:00.000 regardless of how the masking rule is defined.
  • Columns of types CLOB, BLOB, BINARY, and VARBINARY are masked as NULL (or as an empty value if the column is defined as NOT NULL) regardless of the masking action.
  • Masking is not applied to query predicates such as WHERE, GROUP BY, JOIN, HAVING, SELECT DISTINCT, and ORDER BY clauses in a SQL statement.
  • For the data masking rules to apply, virtual objects must be captured in a governed catalog and use the Data Virtualization connection type.
  • Data masking rules cannot be applied to views directly. The result sets of views are masked according to the data protection rules that apply to the objects that are referenced by the view. You can mask identifying details from tables that are referenced in the view definition.
  • Redaction, including partial, of columns of all numeric data types will result in masking as 0.
  • Define data protection rules by using the rule builder user interface. Data Virtualization supports the partial redaction options that are described in Governing virtual data with data protection rules in Data Virtualization.
  • To ensure correct results, use the appropriate data class assignments for the column definitions. For example, do not assign a Zip Code data class to a column of DECIMAL type.
  • Obfuscation of unsupported data classes will always perform Identifier masking, which masks letters with letters and digits with digits, and maintains letter case.
  • When NULL values are substituted or obfuscated, a random value is fabricated (non-deterministic) for the columns with supported data classes. NULL values in columns with unsupported data classes fall back to standard redaction.
  • When the Identifier masking method is used with obfuscation, empty values are redacted.
  • Aggregate functions such as SUM might cause numeric overflow in substitution masking. You can cast the column to a higher degree type as shown in the following example:
    SELECT SUM(CAST(VALUE AS DECIMAL(31,0))) FROM PROJECT1.CUSTOMERS
  • When you use the input validation option in obfuscation rules, any input value that is not known to the internal dictionaries will be considered as invalid and redaction will be performed instead.
  • Redaction yields the best performance. Substitution and obfuscation use more advanced algorithms, resulting in larger overhead.
  • Masked columns cannot be used in SELECT (including views) derived from a set operation that involves an EXCEPT, EXCEPT ALL, INTERSECT, INTERSECT ALL, or UNION set operator.

Limitations

  • Important:

    When you preview, download, or refine Data Virtualization data assets in Watson services other than Data Virtualization, in Cloud Pak for Data (for example, IBM Knowledge Catalog, Watson Studio, and Data Refinery), and in cases when data masking or row-level filtering applies, the preview is subject to the data protection rules and catalog or project access control only. Data Virtualization access controls are not enforced.

    Data Virtualization access control is not applied when data masking or row-level filtering applies to the preview in Watson services (other than Data Virtualization). The Data Virtualization internal access controls, which are controlled by using Manage access in the Data Virtualization UI, do not apply to the preview from the other Watson services with masking or row-level filtering. You must define your rules to manage access to the catalogs, projects, data assets, or connections for access control in the other Watson services.

  • The identifier masking method is not supported on multibyte characters. Original multibyte characters are returned.
  • You might see unexpected results when you mask multi-byte characters by single-byte or mask single-byte characters with multi-byte.
  • Masking of custom data classes is not supported in Data Virtualization. Identifier masking method is used instead.
  • Characters formatting options in obfuscation are not supported and Identifier masking, which masks letters with letters and digits with digits, and maintains letter case, is performed instead.
  • In some circumstances, masked preview in catalogs or projects might be different than in Data Virtualization.