Use Optim software to manage privatized test data from design to deployment

Integrating InfoSphere Data Architect, Optim Development Studio, and Optim Test Data Management to enforce privacy standards

With Optim™ Development Studio 2.2, built-in integration with Optim Test Data Management enables developers and database administrators to use copy-and-paste function to generate an Optim Test Data Management request. The request can then be used for the efficient creation of a consistent, right-sized, referentially intact subset of data for testing. This methodology also enables you to enforce data privacy standards that were originally defined in the data model that used InfoSphere™ Data Architect. Developers can see which columns are private to ensure that the generated script enforces the requirement to include the Optim Data Privacy option for de-identifying private data. This article offers an end-to-end example scenario depicting a data architect, a database administrator, and a developer completing the steps needed for the solution.

Shawn Moe, Software Engineer, IBM

Shawn Moe is a software engineer at the IBM lab in Lenexa, Kansas. He is currently the architect for the new Optim Development Studio data and object movement functionality. Before working with the ODS team, he worked with the IBM Migration Toolkit and Informix database connectivity teams.



15 October 2009

Also available in Spanish

Introduction

Optim Development Studio (ODS) V2.2 and InfoSphere Data Architect (IDA) V7.5.2 include enhancements to provide role-specific support to ensure that your confidential and personally identifiable information is properly privatized when right-sized testing environments are generated. Optim Test Database Management (TDM) provides this functionality, as well as the capability to work with complete business objects to ensure that databases within development test environments are built as referentially intact entities. The enhancements in IDA and ODS provide the integration features for you to create jobs that can be executed using TDM (versions 6.5 or 7.1), which extends the user base of your TDM installation.

In many environments, database administrators implement and administer the TDM solution. Consequently, the solution might not be readily available for developers to use to create test environments from the production database. Often developers create databases on an ad-hoc basis, and they might even copy databases directly from production databases, without realizing the need to privatize the confidential or personal information contained therein. Using real, non-privatized data for development and testing puts corporations at risk, and it can violate corporate data governance and compliance regulations. Also, manually created databases that do not contain confidential or personal data generally do not ensure that the test data is referentially intact and a consistent representation of the application. This non-confidential data does not enable developers to accurately reproduce a defect or to perform realistic and suitable testing.

With the new support in IDA, data architects and database administrators can define the appropriate data privacy policies and rules needed to protect confidential and personally identifiable data. Developers can use ODS to leverage this data privacy metadata to create an Optim TDM request. Within ODS, this IDA support is integrated with the new data object copy functionality to copy database objects and data between homogeneous and heterogeneous databases, as described in the article "What's new and cool in Optim Development Studio" (see Resources).

When using the Optim TDM integration features, ODS copies the selected database table definitions and referentially related objects. ODS generates an Optim export file (OEF) that contains metadata about:

  • The selected tables
  • The data subset selection criteria
  • The data privacy metadata

The result is the complete set of required database objects created on the target database with a referentially complete subset of privatized data copied from the source database. Developers have a safer and more complete environment in which to perform their tests.

This article assumes you have basic knowledge of database terminology. Familiarity with IDA, ODS, and TDM products is helpful but not required. This article demonstrates the process in enough detail that complete product knowledge is not required.

The following section, Laying out the process, takes a high-level look at the flow of the process. The subsequent sections describe an example scenario that illustrates the steps in more detail.

Laying out the process

The example scenario assumes a typical set of roles in an organization, although your organization might have a different division of responsibilities.

  1. Using IDA, the data architect creates a physical data model for a particular database connection. The data architect then creates a domain model, along with a set of atomic domains, each describing the required data privacy masking methods for a particular domain, such as credit card data. The data architect can then associate these atomic domains to physical columns in the physical data model.
  2. Developers using ODS can then associate this physical data model with their individual database connections. Note that this association also enables developers to see which columns require special handling in the application because of the private nature of the data contained in them.
  3. To create a test database, developers can use the ODS data object copy-and-paste functionality between the source production database (which contains private data) and their target test databases (which contain only de-identified private data). The copy-and-paste operation generates the new target database objects and the Optim export file that contains metadata information that Optim TDM uses to generate the appropriately privatized data.
  4. The export file is imported into Optim TDM. The database administrator uses TDM to copy the appropriate data from the production database to the test database that the developer created.
  5. The developer can be confident that the test data is an intact subset and appropriately de-identified.

Introducing the example scenario

Tom, a fictional data architect for the Sample Outdoors Company, performs the data modeling for the organization. He has defined the data privacy masking rules and methods for the GSDB database that his developers use. He uses InfoSphere Data Architect (IDA) for his modeling tasks.

Roslyn, a developer at Sample Outdoors, supports a database application that runs with a key sales database called GSDB on the DB2® for Linux®, UNIX®, and Windows® platform. She is using ODS to help modify the application and database so that they can also run on the Oracle platform. She made some changes to her DB2 LUW V9.7 baseline, and now she needs to copy the updated database objects from her DB2 database to her Oracle 11g test database for verification and validation on that platform. She uses Optim Development Studio for her database development tasks.

Tom has determined that based on Roslyn's requirements, she will be working with the GOSALECT schema in the GSDB. The following fields from that schema need to be masked in Roslyn's test database:

  • Social security number
  • Credit card number
  • Customer email address
  • Customer last name

Eric, a database administrator at Sample Outdoors, is responsible for maintaining the GSDB application. One of his responsibilities is to create and maintain test database instances for GSDB developers. He uses Optim TDM with the Data Privacy Option.


Defining the privacy domains and associating them with the physical data model

  1. Using InfoSphere Data Architect, Tom creates a new data design project to contain all the artifacts (data models and metadata) for this particular project. From Data Project Explorer, he clicks New > Data Design Project to bring up the new data design project window, as shown in Figure 1. He names the project.
Figure 1. Create a new data design project from IDA
  1. Tom creates a new physical data model (PDM) within the new data design project by selecting Create > New > Physical Data model, as shown in Figure 2. Note that a new PDM is not strictly necessary, because the artifacts can be added to an existing model. In this scenario, Tom chooses to create a new model containing only the domains that contain this particular privacy masking metadata. His rationale is that he only wants to expose this particular metadata to the developers.
Figure 2. Create a new physical data model from IDA
  1. Using the wizard, Tom names the new PDM, and he determines that it should be compatible with his DB2 LUW V9.7 database instance. He wants to generate the model by reverse engineering the existing GSDB database. He selects this database in the connection pane, as shown in Figure 3.
Figure 3. PDM wizard screens
  1. The new PDM is created within the current project. Tom can now manage the PDM using the Data Project Explorer, as shown in Figure 4. The database objects selected from the Database Elements tab appear in the hierarchy. Tom can review the important GOSALES and GOSALESCT schemas from the GSDB database.
Figure 4. New physical data model in the Data Project Explorer
The new PDM is shown in the project explorer tree under Data Models
  1. Tom can now create a new domain model, as shown Figure 5.
Figure 5. Create a new domain model
From the Data project explorer select Data Models>New> Domain Model
  1. After clicking Finish, the new domain model appears in the Data Project Explorer. A default package, Package1, is also created.
  2. Tom now defines a new atomic domain for each of the privacy definitions that he would like to define by selecting Data Models > New > Domain Model, as shown in Figure 6. He iteratively creates the atomic domain definitions by selecting Add Data Object from the Package1 menu.
Figure 6. Create a new atomic domain
  1. The Atomic Domain window describes the key data privacy information for each of the data columns. On the General tab, Tom types the domain name, the base type, the base length, precision, and scale. Because Roslyn's usage will be with a particular database table, this information should correspond to the definition of the underlying database columns. At this point, the domain is not associated with any connection, so this information is not pre-loaded with the database information.

    The Data Privacy tab contains the actual data privacy information and metadata. For classification, Tom can specify whether the particular atom domain represents Personally Identifiable Information (PII) or confidential information. For enforcement, Tom can select whether the enforcement policy is Required, Not Required, or Best Practice. The privacy policy type contains several common categories of PII, including credit card number, social security number, random number, and so on. These categories help determine which of the data privacy masking algorithms to use for the specified data. Based on the privacy policy type, there are several policies that can be used to perform the masking. Depending on the application, some level of authenticity might be preferable, such as a random social security number with a valid source area number.

    Tom creates the atomic domain for the social security number (SSN), identifying this as a numeric field (DOUBLE). Some applications might represent SSNs as character data instead. Tom knows that GSDB performs numeric-based validation on this field, and he marks it as a DOUBLE. The general atomic domain window for SocialSecurityNumber is shown in Figure 7.

Figure 7. Enter General information about the Social Security Atomic Domain
On General tab Name: field contains SocialSecurityNumber, Base type is DOUBLE, Transform as is None
  1. On the Data Privacy tab, Tom indicates the following for social security number, as shown in Figure 8:
    • This domain is Personally Identifiable Information.
    • Enforcement is Required.
    • Data is treated as a Social Security Number (SSN).
    • Use the algorithm that generates a Random SSN with a valid source area number (first 3 digits).
Figure 8. Enter Data Privacy information about the Social Security atomic domain
  1. Tom creates an atomic domain for the credit card number, identifying this as a character field (CHAR 19). The general atomic domain window for CreditCardNumber is shown in Figure 9.
Figure 9. Enter general information about the Credit Card Number atomic domain
Name field contains CreditCard Number, Base Type is CHAR, length is 19, Transform AS is none
  1. On the Data Privacy tab, Tom indicates the following for the credit card number, as shown in Figure 10:
    • This domain is Personally Identifiable Information.
    • Enforcement is Required.
    • Data is treated as a Credit Card Number (CCN).
    • Use the algorithm that generates a Random CCN with the first four digits of issuer identified.
    Tom knows that the GSDB application performs validation on the credit card number, and he wants to ensure that the application logic will correctly identify the randomly generated credit card numbers.
Figure 10. Enter Data Privacy information about the Credit Card Number atomic domain
  1. On the Data Privacy tab, Tom indicates the following for the customer email address, identifying it as a VARCHAR (128) data type, as shown in Figure 11:
    • This domain is Personally Identifiable Information.
    • Enforcement is Required.
    • Use the algorithm that generates a Random e-mail address in lower case for this type of data.
Figure 11. Enter Data Privacy information about the Customer E-mail atomic domain
  1. Lastly, Tom creates the atomic domain for the Customer Last Name. On the Data Privacy tab, Tom indicates the following for the customer last name, identifying it as a VARCHAR (128) data type, as shown in Figure 12:
    • This domain is Personally Identifiable Information.
    • Enforcement is Required.
    • Use the algorithm that generates a Random shuffle algorithm to generate random customer names.
Figure 12. Enter Data Privacy information about the Customer Last Name atomic domain
  1. As the new atomic domains are created, they are added to the package and the domain model. The updated domain model appears in the Data Project Explorer, as shown in Figure 13.
Figure 13. Four new atomic domains shown in Data Project Explorer
Under Data Models > Domain Models >Package1, you can see CustLastName, CustEmail, CreditCardNumber, and SocialSecurityNumber domains listed
  1. Tom makes sure he saves the model, as shown in Figure 14.
Figure 14. Save the new domain model
From DomainModel.DDM* right click and select Save.
  1. Tom now associates the domain models with the specific columns in the physical data model that represent the GSDB schema. First he locates the desired columns in the (previously created) physical data model, the DatabaseModel.dbm. Tom locates the GOSALESCT.CUST_SSN column in the PDM, as shown in Figure 15.
Figure 15. Find columns for privacy masking
Navigate from Data Models, DatabaseModels.dbm, GSDB, GOSALESCT, CUST_CRDT_CHK to CUST_SSN[double]
  1. In the Properties View for this column, Tom selects the Type tab. For the Domain field, Tom selects the ellipsis (circled). A window shows the atomic domains that have been created in the project. Tom sees the four atomic domains he created. He selects the SocialSecurityNumber domain model, and he selects OK.
Figure 16. Associate atomic domain with CUST_SSN column
From Type tab click on the ellipses button which brings up a Select a Data type popup. Select SocialSecurityNumber

Note that this atomic domain is defined as DOUBLE, while the data type for the database column (determined using reverse engineering) is a BIGINT. DOUBLE is the generic type used in various models. BIGINT is the DB2 type. Different DBMS have different database types. Sometimes these are the same names, and other times (like this) they are not. For this particular scenario, these data types are compatible.

  1. Upon selecting OK, the column CUST_SSN is now depicted in the Data Project Explorer with a padlock icon, indicating that it now has privacy attributes, as shown in Figure 17.
Figure 17. CUST_SSN column now shows privacy masking icon
Same view as Figure 16 for list of columns but now the CUST_SSN[DOUBLE] has a tiny padlock icon to the left
  1. Tom repeats this process with the other three atomic domains he has created:
    • The CustEmail domain is associated with the CUST_EMAIL column of the GOSALESCT.CUST table.
    • The CreditCardNumber domain is associated with the CRDTCRD_PRIM_ACCT_NBR column of the GOSALESCT.CUST_CRDTCRD table.
    • The CustLastName domain is associated with the CUST_LAST_NAME column of the GOSALESCT.CUST table.

    When Tom completes the domain association of the four new atomic domains, the Data Project Explorer shows four columns in the physical model having privacy masking attributes, as shown in Figure 18.

Figure 18. New atomic domains associated with four columns
In the GOSALESCT schema shown in data project explorer: CUST.CUST_LAST_NAME and CUT_EMAIL has padlock. CUST_CRDT_CHK.CUST_SSN has padlock and CUST_CRDTCRD.CRDTCRD_PRIM_ACCT_NBR has padlock
  1. Tom wants his changes to be visible when others use this model. Tom saves his changes to the physical data model by right-clicking Database Models.dbm*, and selecting Save, as shown in Figure 19.
Figure 19. Save the new physical model
From Data Models, navigate to Database Models.dbm*, right click and select Save

Generating test databases

Tom has created a physical data model containing privacy masking domains to mask the data contained in the four columns of the GSDB database that he has determined to contain sensitive personally identifiable information. Now the Sample Outdoors developers, such as Roslyn, can use this model to generate test databases with Optim Development Studio and the Optim Test Data Management solution.

  1. Before associating the physical model with her GSDB connection, Roslyn examines her connection to GSDB and looks at the tables in the GOSALES and GOSALESCT schemas. In the Data Source Explorer of Optim Development Studio, the columns that have been identified for masking do not currently show the padlock icon, meaning that the data within these columns has not been masked, as shown in Figure 20. If she were to copy data at this point, they would not have the associated privacy metadata, so the data would not be properly de-identified. She must complete the model association to ensure that her subsequent data copy requests will result in de-identified data.
Figure 20. Roslyn explores initial source database connection before associating with model
The figure is similar to figure 18 except now we are in Data source explorer of Development Studio. None of the columns that Tom defined as priate has privacy icons (padlocks) yet.
  1. Because Tom indicated that he specified private data, Roslyn needs to ensure that her work is properly enforced by associating her database connection with Tom's model. She imports Tom's saved project from the project repository. She uses the Import option from the Data Project Explorer, and selects Existing Project into Workspace, as shown in Figure 21. Finally she browses to the location of the project, selects it, and clicks Finish. The project is then imported to her workspace.
Figure 21. Import the saved Data Design Project metadata
Right click in data project explorer white space, select Import..., then navigate to Existing Project into Workspace and click Next, Then Select the directory to search for projects and then click Finish.
  1. Roslyn associates the physical model from the imported project with her connection to the GSDB database. In the Data Source Explorer, she right-clicks the database connection and selects Properties, as shown in Figure 22.
Figure 22. Set the connection properties
From Data Source explorer navigate to GSDB database connection, right click and select Properties.
  1. From the Data Privacy Modeling tab on the Properties screen, Roslyn browses for the physical data model that she just imported and clicks OK to associate this physical data model with this connection, as shown in Figure 23.
Figure 23. Associate the saved PDM with the connection
From Properties for GSDB, select Data Privacy Modeling then Browse the physical data model file pulldown and navigate to the DB2_V9.7_GSDB pdm and select ok
  1. When Roslyn views the GOSALESCT schema, the columns in the CUST, CUST_CRDT_CHK, and CUST_CRDTCRD are now marked with the padlock icon showing that they have associated privacy metadata, as shown in Figure 24.

    Associating the privacy metadata should be a one-time-per-connection activity, unless the physical data model changes. In the example scenario, if Tom determines that additional columns need masking, he might refresh the model and then notify the impacted developers. Roslyn would then need to refresh her model as well. Now that she has associated the physical model with the GSDB connection in this work space, however, she should not need to do this activity again. Figure 24 shows the updated columns in the Data Source Explorer.

Figure 24. Updated columns in the ODS Data Source Explorer
The columns all have padlocks

Copy-and-paste supported database combinations

The new database object copy-and-paste support within ODS is supported between the following source and target database combinations:

  • DB2 on LUW to DB2 on LUW (Versions 9.5 and 9.7)
  • DB2 LUW 9.7 (compatibility mode) to Oracle 10 or 11g
  • Oracle to Oracle (10 or 11g)

Optim TDM supports additional databases. If you need to copy a privacy masked subset of referentially intact data involving another database platform, use the Optim TDM installation for the entire operation.

  1. Roslyn makes changes to the structure of the CUST, CUST_CRDT_CHK, and CUST_CRDTCRD tables in her GSDB database in her DB2 instance. She creates a test schema, DEMO, in her Oracle 11 database instance in order to continue her development on that platform. She wants to copy these tables to her Oracle test database in order to determine if the structure and application changes she has made are suitable in that environment. She uses the new ODS data object copy-and-paste functionality to copy and paste from the DB2 connection to her Oracle connection. She selects the source tables to be copied and selects Copy, as shown in Figure 25.
Figure 25. Optim Development Studio copy menu
Tables CUST, CUST_CRDT_CHK, and CUST_CRDTCRD are all selected then right click and select Copy...
  1. Roslyn right-clicks the DEMO schema in her Oracle 11 connection and selects Paste, as shown in Figure 26.
Figure 26. Optim Development Studio paste menu
From Data srouce explorere, select DEMO schema in Oracle connection, right click, and select Paste...

A note about drag-and-drop capability

The copy-and-paste functionality can also be invoked using drag-and-drop from within the Data Source Explorer, in which case no subsequent windows appear; the data objects and data are copied from source to target directly. Drag-and-drop is a shortcut for the option to paste directly into the target. As such, it does not use the privacy metadata associated with the physical data model, nor does it generate the export file.

  1. The Paste Database Objects window appears. This window is used to specify the following options:
    • Whether the paste should happen directly from source to target. The data objects are created immediately, and the data are copied directly from source to target. This option does not use the established privacy metadata or generate the export file for Optim TDM.
    • Whether a test data subset with optional privacy masking should be created. This option indicates that the privacy metadata is used and that an export file is generated for use with Optim TDM.
    • Whether to paste using the Change Management functionality of Optim Database Administrator (ODA). This option is available when ODS and ODA are shell sharing, and when the source-target combination can be processed using the ODA change management functionality. In the example environment, Roslyn does not have ODA installed, and because she has a source DB2 database and a target Oracle database, the combination would not be supported in ODA.

    The Upgrade link provides information about other IBM products that could be used to satisfy the requested copy-and-paste operation.

    Roslyn selects to Paste with a test data subset and optional privacy masking, as shown in Figure 27.

    Note that, although not needed in the example scenario, the option to paste directly into a target has a limit on the number of database objects and data that can be copied in a single operation. The limit is currently 100 objects and 10,000 rows of data. When using Optim TDM to create the test data subset, there is no application-enforced limit on the number of rows that can be copied.

Figure 27. Paste Options window
Paste with a test data subset and optional privacy masking is selected. Paste directly into target is NOT selected.
  1. Roslyn clicks the Next button to start the Advanced Options wizard, as shown in Figure 28. These optional screens enable some customization of the copy-and-paste operation. Some of the options in the wizard include the following:
    • In the Data and Object Options section, only the Copy Database Objects and Data Option is available, because Roslyn already selected to create the test subset with privacy masking. The masking is for the data, so it is evident that the table data should be part of this request.
    • In the Other Options section, the option to copy dependent database objects is selected by default. This is because in order to create the referentially intact subset of data, the objects on which the selected objects are dependent must also be copied. Contained objects are those that are fully contained within a selected database object in the Data Source Explorer. For example, triggers, constraints, and indexes are contained within a particular table.
    • If Roslyn wants to use a specific tablespace on the target database for the new tables, she specifies it in this window. If no tablespace is defined, the default tablespace is used for the new tables. If the Generate Tablespace Names box is checked and no name is supplied, the source tablespace name is used when generating the target table.
Figure 28. Advanced Options window showing Source/Target tab
In source/target tabl, copy database objects and data is selected and Copy contained database objects is checked.
  1. On the Type Mapping tab, Roslyn can customize the data type mapping that is used when generating the target database DDL statements and click Next, as shown in Figure 29. The first column represents the source data type (in Roslyn's case: DB2), and the second column represents the target (Oracle) data type. The recommended mappings are set by default.
Figure 29. Advanced Options window showing the Type Mapping tab
In Type Mapping tab, source types and target types are listed. defatuls are taken. Click next to go to next.
  1. From the Error Handling tab, Roslyn can customize how errors are reported, as shown in Figure 30. She selects to report both errors and warnings, which is the default setting. The transactional behavior can be specified In the Error Response section. The default transactional behavior is auto-commit mode, where each statement is committed as it is sent to the target server. If Roslyn wants the DDL execution to stop when the first error is encountered, she can select that option. Based on the target database capabilities, the DDL can also be rolled back. For the example scenario, however, this option is disabled, because Roslyn is creating tables in her Oracle instance, which means she cannot include the table creation DDL in a transaction.
Figure 30. Advanced Options window showing Error Handling tab
Advanced Options window showing default options
  1. In the next window, Roslyn sets the Optim TDM options to generate the referentially intact data subset, as shown in Figure 31. The fields in this window include:
    • Start table: This field contains the key table that is used as a basis to select the data subset. For example, if Roslyn is interested in her subset containing data for 1000 customers, she would select CUST as the start table. The records in the other tables are selected based on the start table.
    • Target schema: This field identifies where the data object and data will be copied. This is informational only.
    • Tables: This field is the list of tables that were selected in the copy operation. ODS determines the relationships between the selected tables and adds dependent tables to the list, if required. In the example scenario, Roslyn did not select the table CUST_STATE_TAX, but ODS determined that it was referentially related to the tables that were selected.
    • Reference tables: This field shows tables that are referentially related to the start table. Initially all of the tables are selected (with checkbox), but after a start table is selected, the table selections change.
    • Filter: This field is an SQL statement used to select rows from this table.
    • Row limit: This field indicates the suggested maximum number of rows that should be selected from this table. A value of 0 means an unlimited value. Limits on tables other than the start table may or may not be honored, because the values in the start table determine which (and how many) rows are selected from the other tables.
    • Documentation: This field show the comments associated with this table in the physical model. This is informational only.
    • Export location: This field is the location where the export file should be saved.
    • Physical Data Model: This field shows the name of the model associated with this connection.
Figure 31. Generate test data with the Optional Privacy Masking window
The Optional Privacy Masking window with defaults selected
  1. Roslyn completes the Privacy Masking window, as shown in Figure 32. She selects CUST as the start table, and she provides c:\temp\GSDB_CustInfo.txt for the local location to which the export file will be saved. She reviews the PDM and target schema to ensure these are correct. After reviewing her application code, she agrees that the CUST_STATE_TAX table should also be copied with the other tables she has selected. She leaves this table checked as a Referenced table.
Figure 32. Privacy masking window after completion
Start table is CUST, and CUST_STATE_TAX is the Reference table (checked). explort location is c:\temp\GSDB_CustInfo.txt.
  1. When Roslyn selects Next, the selected DB2 database objects are translated to the equivalent Oracle database objects, resulting in DDL statements that she can preview on the Preview DDL tab, as shown in Figure 33. She can run the DDL statements on the target server or open the file for editing, or she can do both. If she opens the file for editing, it appears in the default project (which can be changed using the Browse button) using the file name shown.
Figure 33. Preview DDL window
folder name is .sqlexeditor_project, filename is script2.sql. the preview ddl shows create table statements. The checkbox for Run DDL on server is checked.
  1. When Roslyn selects Finish, the DDL statements are executed on the target Oracle database. Four tables (CUST_STATE_TAX, CUST, CUST_CRDT_CHK, and CUST_CRDTCRD) are created, along with the indexes and constraints. The results of this operation can be reviewed in the SQL Results view, as shown in Figure 34.
Figure 34. SQL Results view
All the CREATE statements for the DDL show status 'succeeded'
  1. Using the Data Source Explorer, Roslyn can also view the new tables, indexes, and constraints in the DEMO schema of her Oracle 11 connection, as shown in Figure 35.
Figure 35. Target connection showing copied database objects
In Data source explorer, the tables that were copied all show under the Oracle demo schema.
  1. Roslyn can also preview the export file that is generated at the specified location, as shown in Figure 36. This is a plain text file, but it does not contain natural language instructions, because this file will be processed by Optim TDM to copy the data from the specified source (DB2) tables into the newly created target (Oracle) tables.
Figure 36. Export file to be used with Optim TDM
In Data source explorer, the tables that were copied all show under the oracle demo schema.

Deploying privatized test data

Import and export

The terms import and export are used in several places in the following section. This article describes a sequential series of steps, so in most cases, the export from one operation is then the result of an import operation in a subsequent step. These are mostly transient operations, and they are not the final import or export of the actual table data.

In order to complete the test data generation process, Eric, the database administrator (DBA) that supports GSDB at Sample Outdoors needs to import into Optim Test Data Manager the OEF file that Roslyn created in ODS. The following steps illustrate the process to deploy test data with privacy using the OEF file generated in the example scenario.

  1. From the TDM (Version 6.5) console, Eric selects Utilities > Import, as shown in Figure 37.
Figure 37. Importing the OEF file into TDM
From Optim console, go to Utilities and select Import....from toolbar..

The Import wizard enables the TDM user to import a variety of artifacts into a repository. Roslyn's OEF contains access definitions (AD) and an extract request. The extract request contains the metadata about the selected source database tables. The AD contains additional metadata about the relationships and dependencies between these tables, information about other tables, and privacy masking information.

  1. Using the Import wizard, Eric loads the input file by browsing to the OEF file on the file system. From the Definitions list, Eric selects to import the Access Definition and Extract requests. He clicks the Run icon (the running man icon). The status of the import is shown in the Import Progress list box. Eric selects the OEF and selects to import an access definition, as shown in Figure 38.
Figure 38. Import window
Running man in upper left is circled. Process tab is shown. Access Definition in Definitions box is checked and highlighted. input file is c:\temp\GSDB\CustInfo.txt
  1. Eric wants to work with the extract request that has just been imported. On the Optim console, he selects the Extract option from the Actions menu, as shown in Figure 39.
Figure 39. Extract the source data
From Optim console, go to Actions in toolbar and select Extract....
  1. Eric is presented with the Open an Extract Request window where he can specify which extract request from the repository he wants to work with, as shown in Figure 40. He knows that Roslyn is working with the GOSALES and GOSALESCT schemas in GSDB, so he specifies the GOSALES.% wildcard to locate the extract request. He sees the GOSALES extract request in the Identifier list box.
Figure 40. Open extract request from imported OEF file
  1. Eric clicks Open to open this extract request. The extract request is now opened in the repository, and the Extract Request Editor window for the GOSALES extract request appears with the default options, as shown in Figure 41. This window enables Eric to manage a particular extract request. The following list describes some of the options on the General tab:
    • The name of the current extract request is in the title bar.
    • A text description can be associated with a particular request to help differentiate requests for similar extract requests.
    • As the OEF is processed, it is logically split into separate repository objects. The extract file shows the repository object where the extraction information is contained.
    • The Access Definition Options section indicates either Local (locally generated) definitions or Named (as specified in the access definition) definitions. The Access Definition Name shows the name of the repository object containing the access definition metadata.
    • The Items to Extract field shows whether data, objects, or both are to be extracted.
    • The Row Limit field shows a row limit that should be enforced for the start table.
    • The other checkboxes specify behavior to be followed after the extraction.
Figure 41. Extract request options
Description is blank, extract file is EXTRACT.XF, access definition options is Named and the name is GOSALES.AD. The options to Compress Extract file and Generate Statistical Report are both checked.
  1. Instead of choosing the default to extract both data and objects, Eric wants to extract only data, because Roslyn has already copied the database objects from her DB2 instance to the target Oracle database. After validating all of the options, Eric selects the Run icon to start the extraction process.

    The extract request generates a detailed report about the extracted source data. Eric checks for serious errors in extract process. If Eric finds any errors, he fixes them and reruns the extract request.


Inserting the test data on the target with privacy masking

After Eric successfully extracts the data from the source DB2 database, the next step is to load it into Roslyn's target Oracle database. Note that in TDM, the extraction and inserting processes are distinct. One advantage to this approach is that Eric can use the same set of extracted data to load multiple target databases.

  1. On the Optim console, Eric selects Actions > Insert, as shown in Figure 42.
Figure 42. Insert menu
From Actions in toolbar, select Insert...
  1. The Open an Insert Request window appears, which is understandably similar to the extract request window that Eric just completed. He selects the GOSALES identifier from the repository and enters the name INSERT as the pattern to identify the insert request. He selects Open to open the insert request, as shown in Figure 43.
Figure 43. Opening an imported insert request
The open and insert popup has GOSALES identifier highlighted. the pattern is GOSALES.INSERT
  1. The insert request is located in the repository and opened. The Insert Request Editor appears, which is very similar to the Extract Request Editor. The editor is populated with default metadata based on the operation and names provided within the request. Eric reviews the default settings and selects the Run icon. A status window appears and is updated as the insertion starts and executes, as shown in Figure 44.
Figure 44. Insert Request Editor
The gosales.insert request status window is open.

Production PII

In his role as the GSDB DBA, Eric has access to the production database to view some of the personal identifiable information. His use of the Comparison Editor in this situation still must follow Sample Outdoors' procedures for working with PII.

  1. The insert request also generates a report at the end of processing. Eric checks for any errors, and if there are any, he fixes them and reruns the insert request.
  2. After the insert operation is completed, the status window shows how many rows were processed, how many were inserted, and how many failed. Eric reviews this information. Because using TDM in this cross-database scenario with IDA and ODS is new, he also uses the Comparison Editor to run a comparison between some of the unmasked private source data with the de-identified data in the target tables to ensure that the operation was successful.

    The Comparison Editor shows the results of the comparison of the data between the DB2 GOSALESCT.CUST.CUST_LAST_NAME column with the new CUST.CUST_LAST_NAME column in Roslyn's target Oracle database, as shown in Figure 45. It also shows the comparison of the CUST_EMAIL columns from the same tables. The comparison is done on a row-by-row basis. The source row is shown first in the table (identified as 1 in the Source column), followed by the target row (identified as 2 in the Source column).

Figure 45. Comparison Editor showing production and de-identified values for the CUST_LAST_NAME and CUST_EMAIL columns
Alternating rows of 1 and 2 are shown where you can see that last names and emails are not the same

Conclusion

This article described in detail the process of using IDA and ODS to establish privacy masking domains, associate them with columns in the database (physical) model, and then use that model when copying and pasting database objects and data between heterogeneous data sources (DB2 and Oracle). The article then showed how to use TDM to use the copy-and-paste metadata, including the privacy masking information to copy the appropriate rows from the source (production) DB2 GSDB to the target (test) Oracle database. This is useful functionality for any organization that is committed to keeping its customers' personal identifiable information private while providing a mechanism for software developers and testers to use realistic and referentially intact subsets of production data for their development activities.

Acknowledgments

Thanks to Sailaja Navvluru for her help on describing the interactions with Optim TDM. Thanks to Rick Buglio for his review and edits. Very special thanks to Kathryn Zeidenstein for her contributions to all aspects of this article.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=433828
ArticleTitle=Use Optim software to manage privatized test data from design to deployment
publish-date=10152009