Masking virtual data (Data Virtualization)

Tech preview You can use a data protection rule to mask virtual data. When queried, masked columns return disguised data.

Data masking applies to the result sets of the queries only. The original data in tables and columns remains untouched. Masking does not apply to query predicates. You can use data masking to avoid exposing sensitive data. However, data masking does not stop a Data Virtualization users from connecting to the service and running queries against that data. Users can join and group data, generate the reports, perform analytics and collect insights by using the raw data, while masking the result set only.

Data masking policies apply to the most basic virtual tables in Data Virtualization. Typically, basic virtual tables meet all of the following criteria:
  • Tables that were not virtualized in the strict virtualization mode. See Enabling the strict virtualization mode for details.
  • Tables that have one-to-one mapping with the original (non-virtualized) table. Therefore, data masking is not supported for grouped tables or tables that can be grouped before or during the virtualization process.
Depending on the method of data masking, data is redacted, substituted, or obfuscated. See Masking data with data protection rules for details. A lock icon (Lock icon) next to the column name in your virtualized data indicates that the data in the column is masked by a data protection rule.

Watch the following video for an overview of data masking in Data Virtualization:

When you apply masking rules to virtual data, keep in mind the following considerations:

Data type support for masking virtual data

Data Virtualization supports data masking on columns of several data types. Masking operations behave differently based on the operation and the data type of the column.

Redact action data mapping
The following table shows which data types are mapped for Redact action in Data Virtualization.
Table 1. Redact action data mapping
Data type Masked to
VARCHAR 'XXXXXXXXXX' or 'X' * original string length, whichever is shorter.
CHAR 'XXXXXXXXXX' or 'X' * column length, whichever is shorter.
Numeric 0
DATE '0001-01-01'
TIME '00:00:00'
TIMESTAMP DATE('0001-01-01')
BOOLEAN 0
Substitute action data mapping
VARCHAR
  • If the original value is at least 32 characters or longer, masked data is md5 of the original value.
  • If the original value is less than 32 characters, masked data falls back to the Redact action.
CHAR
  • If column length is at least 32 characters or longer, masked data is md5 of the original value.
  • If column length is less than 32 characters, masked data falls back to the Redact action.
Add other data
Falls back to the Redact action.
Obfuscate action data mapping
Obfuscation support is limited to the following data classes only:
  • Personal Information
    • Gender
    • Honorific
    • Name suffix
  • Contact Details
    • Email address
    • USPHN
    • STREET
    • ADDRESSLINE1
    • ADDRESSLINE2
    • ADDRESSLINE3
    • CITY
    • STATEPROVNAME
    • COUNTRY
    • Latitude
    • Longitude
  • Financial Accounts
    • Amex card
    • Diners club card
    • Discover card
    • Master card
    • VISA card
    • Credit card
    • Japan Credit Bureau (JCB)
  • Government Identities
    • SSN
    • SSN4
  • Organization Information
    • DUNS Number
    • CAGE CODE
    • USSIC
  • Connectivity Data
    • IP v4 address
  • Personal Demographic Information
    • RELIGION
    • ETHNICITY
    • EYECOLOR
    • HAIRCOLOR
    • MARITALSTATUS
    • HOBBY
    • EMPLOYMENTSTATUS
  • Other
    • USSCP
    • RELATIONSHIP
If obfuscation cannot be performed, the behavior falls back to Substitute.
Note: If obfuscated value is longer than the length of the original value, masked data falls back to the Redact action to avoid overflowing column length.

Limitations and known issues for masking virtual data

The following limitations and known issues apply to Data Virtualization for data masking policies.
Data masking support
Data masking is not supported for:
  • WHERE, GROUP BY, JOIN, HAVING, SELECT DISTINCT, and ORDER BY clauses in an SQL statement. Masking is not applied to query predicates.
  • Non-virtualized tables in the catalog on which virtual tables are based.
  • Data masking is not supported for grouped tables or tables that can be grouped before or during the virtualization process.
  • Tables that are virtualized with a subset of columns smaller than the original non-virtualized table. If you choose not to virtualize all table columns, you get this type of tables.
  • Views. As a workaround, you can anonymize tables that are referenced in the view definition:
    1. Ensure that the virtual tables that are referenced in a view definition are basic virtual tables by executing the following statement:
      SELECT BSCHEMA, BNAME, BTYPE FROM SYSCAT.TABDEP WHERE TABSCHEMA = <View schema> AND TABNAME =
              <View name>

      If BTYPE for all referenced objects is N, your view is referencing basic virtual tables only and you can use this workaround.

    2. Publish the referenced tables to the catalog and define masking rules that apply to these virtual tables.
    Note: When the objects that are referenced in the views are anonymized, the views will be anonymized as well.
Additionally, you must consider the following limitations:
  • Substitute masking action can be performed only on character strings with original length >= 32 characters. In all other cases, the Substitute action is downgraded to Redact.
  • Masked columns cannot be used in SELECT (including views) derived from a set operation that involves an EXCEPT, EXCEPT ALL, INTERSECT, INTERSECT ALL or UNION set operator.
Data masking support for data types
Data masking is limited to the following data types:
  • Character strings (CHAR, VARCHAR)
  • Numeric
  • Date/time
  • Boolean
Data masking in connections with personal credentials

When data assets in the catalog use connections with personal credentials, asset preview does not support masked data in the following components: Watson™ Knowledge Catalog, Watson Studio, and Data Refinery.

If you try to preview an asset in these conditions, you get the following error:
An error occurred attempting to preview this asset. The anonymized data for this data asset can't be retrieved. There might be a temporary server outage.

By default, Data Virtualization's DV connection in Watson Knowledge Catalog uses personal credentials.

Resolving the issue

To solve this issue, you must update the Data Virtualization's DV connection in the catalog to use shared credentials.
  1. Click Organize > All Catalogs.
  2. In the default catalog, edit the DV connection to set the catalog credentials to Shared.
    Important: By changing the catalog credentials to Shared, you allow all catalog collaborators, including collaborators with the Viewer role, to retrieve the connection credentials.

    Optionally, you can edit any other DV connection you’ve used to add data assets to the catalog and to your projects to enable asset preview of masked data.

  3. To apply these changes to existing data assets in the catalog, you can choose one of the following methods:
    • Wait 48 hours after switching the connection credentials to Shared.
      Note: You must not preview or use the data asset with the error in these 48 hours. Every time you use the data asset the expiration count is reset.
    • Delete and re-publish existing data assets to the catalog.
    • Add or modify a masking data protection rule applicable to the data asset with the error. For example, modify the masking action from Redact to Substitute. This modification forces the recompilation of the transformation.

Denying access to masked data

You can configure Data Virtualization to deny access to a data asset instead of applying the mask to applicable columns.
  1. Run the following command to run the dv-engine-0:
    oc exec -it dv-engine-0 -c dv-engine bash
  2. Modify $BIGSQL_HOME/conf/bigsql-conf.xml to add the following property:
    <property>
        <name>bigsql.wkc.allow_on_transform</name>
        <value>false</value>
    </property>
  3. Save the file.
  4. Restart Big SQL scheduler:
    bigsql stop -scheduler 
    bigsql start -scheduler
By denying access to a data asset, you avoid data masking limitations in Data Virtualization. Additionally, by doing so you deny access to the entire table, even if only a single column is masking data.