Introduction to the data masking solutions
Companies that build software have evolved their systems to follow a set of development best practices. These generally include the following.
- Separating test and development environments to test changes before they affect your users.
- Using production data to populate the databases for these test and development environments to improve test quality and reduce environment costs.
- Limiting access to sensitive personal data to as few people as necessary.
The second and third best practices conflict with each other. By using production data in test and development environments, you expose sensitive data to software developers and testers. Even so, the ability to use production data in test environments is so compelling that the third best practice often takes a back seat to the second. As a result, many companies are now being pressured to prevent the exposure of sensitive data to their testers and developers. This pressure often comes in the form of privacy legislation from governments, or from industry regulatory organizations such as the Payment Card Industry (PCI) Security Standards Council.
IBM offers two solutions to eliminate the inherent conflict within these best practices. The solutions extract data from production and depersonalize it while still maintaining its realism for high-quality testing. The IBM products refer to this process as data masking. National IDs look like national IDs, names look like names, addresses are valid addresses, but the data is no longer sensitive because all the personally identifiable information (PII) is now fictional. The two solutions that do this are as follows.
- InfoSphere Optim Test Data Management Solution - Data Masking option (now also sold as the InfoSphere Optim Data Privacy Enterprise and Workgroup Editions).
- The InfoSphere DataStage Pack for Data Masking.
The first products, collectively referred to as the InfoSphere Optim products in the rest of this article, are licensed differently, but the underlying technology is the same. So, if both solutions solve the same problem, the natural question is, "What's the difference?" This article attempts to answer that question by looking first at the common core functions shared by the products, and then by examining the differences between them. The article is technical in nature, and its aim is to provide a guide for customers who have identified the need to mask their data and must now decide which product to purchase and which of its features to use.
Test data masking essentials – The commonalities
Before examining the differences between the solutions, let's look at the fundamental functions that both provide. These functions are common between the solutions because they are all required to populate test environments with realistic, but fictional, data that can be used for testing. Many of these commonalities stem from the fact that both solutions use almost identical sets of data masking algorithms.
Some PII data follows a strict format and pattern. These fields include items such as credit card numbers, US Social Security Numbers, Canadian Social Insurance Numbers, or Brazil's Cadastro de Pessoas Físicas. Because the values follow a set of rules that determine their validity, they can be generated using an algorithm. Both solutions provide functions to mask credit card numbers from all of the major issuers, and National IDs from a variety of countries.
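As an illustration of generating a value that follows validity rules, the sketch below produces a replacement credit card number that passes the Luhn check, the publicly documented check-digit scheme used by payment card numbers. It is not the algorithm used by either IBM product. The function names and the choice to keep the first six digits (the issuer prefix) are assumptions made for this example.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the Luhn check digit for a digit string that lacks its final digit."""
    total = 0
    # Walk right to left; the rightmost digit of the partial number is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn check digit of the rest."""
    return number.isdigit() and luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(original: str, rng: random.Random) -> str:
    """Keep the issuer prefix (assumed 6 digits), randomize the rest, fix the check digit."""
    prefix = original[:6]
    body = "".join(str(rng.randrange(10)) for _ in range(len(original) - 7))
    partial = prefix + body
    return partial + luhn_check_digit(partial)


masked = mask_card_number("4111111111111111", random.Random())
```

The masked value has the same length and issuer prefix as the input and still passes `is_luhn_valid`, so application code that validates card numbers keeps working against the test data.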
Another set of PII fields also follows a strict format, but the allowed values are more flexible. One example is email addresses, where every address has a user name, a domain name, and an '@' symbol. Both solutions provide functions to generate new and valid email addresses. In addition, an algorithm is available in both solutions to detect the format of data and replace the value with a new value of the same format. For example, it would detect the positions of the space, numeric, and alphabetic characters in the Canadian postal code L6G 1C7 and replace the value with a generated L3R 9Z7, all without you specifying the format ahead of time.
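The format-detecting replacement described above can be sketched as a character-class substitution: digits become random digits, letters become random letters of the same case, and everything else (spaces, punctuation) is kept in place. This is a minimal sketch of the idea, not the products' implementation, and the function name is hypothetical.

```python
import random
import string


def mask_preserving_format(value: str, rng: random.Random) -> str:
    """Replace each character with a random one of the same class, keeping the layout."""
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            # Spaces, hyphens, and other separators define the format, so keep them.
            out.append(ch)
    return "".join(out)


masked_postal_code = mask_preserving_format("L6G 1C7", random.Random())
```

Note that this only preserves the shape of the value. It does not know, for example, which letters are actually permitted in a Canadian postal code, so a production-grade function would need extra rules for fields with such constraints.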
Masking lookup functions
There are some PII fields that cannot be easily generated by an algorithm, such as first and last names or postal addresses. For these, both solutions have lookup functions that retrieve values from pre-populated tables of names and addresses. The index of the replacement value is chosen either randomly or by hashing an input value. Hashing maintains consistency when masking, because the same input value always selects the same replacement.
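A hash-based lookup of this kind can be sketched as follows. The lookup table contents, the secret key, and the function name are all hypothetical; the point is only that a keyed hash of the original value yields a stable index into the replacement table.

```python
import hashlib
import hmac
from typing import Sequence

# Hypothetical replacement table; the products ship with much larger ones.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dalia", "Emeka", "Farah", "Goran", "Hana"]


def lookup_by_hash(original: str, table: Sequence[str], secret: bytes = b"demo-secret") -> str:
    """Pick a replacement whose index is derived from a keyed hash of the original."""
    digest = hmac.new(secret, original.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(table)
    return table[index]


first_run = lookup_by_hash("John", FIRST_NAMES)
second_run = lookup_by_hash("John", FIRST_NAMES)
```

Both calls return the same replacement name, no matter when they run, as long as the table and the secret stay the same. Different inputs can map to the same replacement, which is usually acceptable for names but not for values that must stay unique.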
Some of these data masking functions are shown in Figure 1.
Figure 1. A sampling of the Optim data masking algorithms for the data masking solution
Both solutions have masking algorithms that are designed for consistency. No matter when the masking process is run, the same values will result if the input values are the same. This is very useful in re-populating or adjusting your test data sets without breaking existing regression tests that may rely on certain values being present in the test environments.
Some values that are masked are located in multiple tables, and the applications that are tested rely on the values between those tables being the same. Both solutions are designed to allow you to mask values and then propagate the results to the other tables.
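One simple way to picture propagation is a shared mapping: mask each distinct key value once, then apply the same old-to-new mapping to every table that carries that key. The sketch below assumes two hypothetical tables, `customers` and `claims`, joined on a `national_id` column; it illustrates the principle and is not how either product is implemented.

```python
import random


def build_key_mapping(keys, rng):
    """Assign each distinct original key one unique fictional replacement."""
    unique = sorted(set(keys))
    # Sampling without replacement guarantees that no two keys collide.
    replacements = rng.sample(range(100000, 1000000), len(unique))
    return {old: str(new) for old, new in zip(unique, replacements)}


def apply_mapping(rows, column, mapping):
    """Return copies of the rows with the key column replaced via the shared mapping."""
    return [{**row, column: mapping[row[column]]} for row in rows]


customers = [
    {"national_id": "ID-001", "name": "Customer A"},
    {"national_id": "ID-002", "name": "Customer B"},
]
claims = [
    {"claim_no": 1, "national_id": "ID-002"},
    {"claim_no": 2, "national_id": "ID-001"},
    {"claim_no": 3, "national_id": "ID-002"},
]

mapping = build_key_mapping((c["national_id"] for c in customers), random.Random(42))
masked_customers = apply_mapping(customers, "national_id", mapping)
masked_claims = apply_mapping(claims, "national_id", mapping)
```

After masking, every claim still joins to exactly the customer it belonged to before, even though none of the original identifiers remain.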
Both solutions allow customers to build custom transformation functions to extend the ones that come with the products. The InfoSphere Optim Data Masking option for Test Data Management allows you to build new data privacy functions using column map exits in C/C++, or by creating scripts in the Lua language. DataStage can be extended using C/C++ or BASIC in Transformer Stages, or by creating custom operators in C/C++.
Both of the solutions extract data, mask it, and then place it into a destination environment. Even so, how they move data is very different. The differences in the movement of data are the focus of the following section.
This article has discussed how, in terms of data masking functionality, both solutions offer a similar set of functions. Both can mask data so it's no longer sensitive but still realistic. Both allow you to do this while maintaining consistency between data masking processes and referential integrity between tables. Both move data from production, mask it, and then place it into a target destination. The following sections examine what makes each of them special.
The InfoSphere Optim products
With the InfoSphere Optim products, there is a version for System z and a version for distributed systems. Because the distributed version was modeled on the System z version, aside from how they deal with IMS and flat file data, they look fairly similar.
Both the System z and Distributed InfoSphere Optim products mask data that has been placed in files. The extraction process results in an extract file. The data is then masked using a convert process, and the masked extract file is sent to the destination environments. If a load request is built, load files are generated from the masked extract file and sent to the loader utility of the database in question. Figure 2 shows this process.
Figure 2. The masking process for the Optim Masking for TDM product
The InfoSphere Optim products work best when incorporated into a larger Test Data Management initiative rather than performing data masking alone. The solutions operate on what's known among InfoSphere Optim practitioners as the complete business object, which is a list of tables and relationships between those tables that define one end-to-end business process. Both Optim solutions are specifically designed to extract enough data for your test environments, and no more. They do this by traversing the relationships in the data and picking up related data elements. The Resources section has an article that explains the complete business object more fully.
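The relationship traversal can be pictured as a breadth-first walk over the table model. The sketch below is a simplified illustration, not Optim's implementation: it assumes in-memory tables represented as lists of dictionaries, a hypothetical `RELATIONSHIPS` model, and it only follows parent-to-child relationships from the starting table.

```python
from collections import deque

# Hypothetical model: parent table -> list of (child table, parent column, child column).
RELATIONSHIPS = {
    "customers": [("orders", "customer_id", "customer_id")],
    "orders": [("order_items", "order_id", "order_id")],
}


def extract_business_object(tables, relationships, start_table, predicate):
    """Select starting rows, then follow relationships to collect related child rows."""
    selected = {name: [] for name in tables}
    selected[start_table] = [row for row in tables[start_table] if predicate(row)]
    queue = deque([start_table])
    while queue:
        parent = queue.popleft()
        for child, parent_col, child_col in relationships.get(parent, []):
            keys = {row[parent_col] for row in selected[parent]}
            new_rows = [
                row for row in tables[child]
                if row[child_col] in keys and row not in selected[child]
            ]
            if new_rows:
                selected[child].extend(new_rows)
                queue.append(child)
    return selected
```

Starting from a single customer, for example, the walk picks up only that customer's orders and only the items on those orders, which is the "enough data, and no more" behavior described above.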
A recent development for the InfoSphere Optim products is that the design time tooling has been completely updated and reworked into an Eclipse-based component called the InfoSphere Optim Designer, which is shown in Figure 3. At the same time, a web-based management framework has been constructed. Having a web-based interface that is separate from the design interface allows you to more easily give test data users control over when and how their test data is refreshed.
Figure 3. Applying data masking policies using the Optim designer
In summary, in terms of data selection for data movement, the InfoSphere Optim products are similar to surgical tools dissecting test cases from production. This is not to say that the Optim solutions cannot handle large amounts of data (they can and have in the past), but they were built around extensive sub-setting capabilities rather than the bulk data-movement abilities of Extract, Transform, and Load (ETL) tools.
InfoSphere DataStage Pack for data masking
The InfoSphere DataStage Pack for data masking is an add-on package for InfoSphere DataStage which, in turn, is part of the IBM Information Server suite. InfoSphere DataStage is an ETL tool built to move large amounts of data from one system to another.
If you want all of your production data moved, masked, and loaded, and you want to do it very quickly in a system that scales to very large amounts of data, InfoSphere DataStage with the addition of the pack for data masking can do that very well. It is able to do so because it is built on an enterprise-grade ETL data movement architecture. While some parallelism does exist in Optim, it is much more extensive in DataStage, allowing for full utilization of symmetric multiprocessing (SMP), clustering, grid deployments, and massively parallel processing (MPP). DataStage excels at splitting workloads across multiple concurrent processes and computers. See the Resources section for an overview of DataStage capabilities for scaling.
Another important difference between the InfoSphere Optim products and the DataStage Pack for data masking is that DataStage does not need to create intermediary extract files. Extracted data can be masked and sent to a destination database without writing the data to persistent storage. As such, jobs that are run with DataStage can be less I/O bound when compared to the InfoSphere Optim products. Reducing disk I/O requirements is useful for environments that have constrained I/O resources, for example, a virtualized environment that shares its disk resources with many other virtual machines.
Because DataStage does not require the writing of data to persistent storage, it also allows for its processes to be pipelined, which means that extraction, masking, and insertion happen concurrently rather than as separate processes, helping reduce the total run time. This is sometimes referred to as data pipelining. See Figure 4 for an illustration, and consider how this compares to Figure 2, which shows the InfoSphere Optim products' process. If desired, DataStage can also output intermediary files, but this is not a requirement.
Figure 4. The masking process for the InfoSphere DataStage Pack for data masking
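The difference between a staged, file-based flow and a pipelined flow can be illustrated with Python generators. This is only an analogy under stated assumptions: the stage functions and the event log are hypothetical, and generators interleave the stages on a single thread, whereas DataStage runs its stages as concurrent processes. What the sketch does show is that each row flows through extraction, masking, and loading without an intermediate copy of the whole data set ever being written out.

```python
def extract(source, log):
    """Yield rows one at a time instead of writing an extract file."""
    for row in source:
        log.append(("extract", row["id"]))
        yield row


def mask(rows, log):
    """Mask each row as it arrives from the upstream stage."""
    for row in rows:
        log.append(("mask", row["id"]))
        yield {**row, "name": "MASKED"}


def load(rows, target, log):
    """Insert each masked row into the target as soon as it is available."""
    for row in rows:
        log.append(("load", row["id"]))
        target.append(row)


source = [{"id": 1, "name": "John"}, {"id": 2, "name": "Jane"}]
target, log = [], []
load(mask(extract(source, log), log), target, log)
```

After the run, `log` records that the first row was extracted, masked, and loaded before the second row was even extracted. In a staged flow, all rows would finish one step before the next step began.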
Even so, there is a lot less flexibility in the InfoSphere DataStage Pack compared to the InfoSphere Optim products for selecting a specific subset of data to be masked. DataStage is driven by supplied SQL statements, whereas InfoSphere Optim is driven by traversing the database model, picking up related data elements from a starting point. So, while DataStage is built for maximum scalability and can move larger sets of data much faster, Optim is better at moving only what is needed for the test environments. DataStage is the chainsaw to Optim's scalpel.
It is also worth remembering that DataStage, complete with the InfoSphere DataStage Pack for data masking, can do a lot more than mask data. It is a complete ETL framework that can help you build systems like data warehouses by restructuring and moving data from transactional databases. Keep in mind too that DataStage is part of a larger data platform called InfoSphere Information Server. Aside from the DataStage ETL capabilities, Information Server contains tools to help you manage your metadata, improve data quality, build a common system vocabulary, and automate data integration tasks. You can purchase DataStage without the Information Server suite, but it is a major advantage for the product to be part of such an extensive and well integrated data platform.
Figure 5 shows one of the simplest data masking jobs you can create in DataStage.
Figure 5. A data masking job in DataStage
Figure 6 shows the masking of an address using a lookup function in the DataStage interface.
Figure 6. Masking an address in DataStage using a lookup table
Masking on demand for flexibility
If the common thread of the masking products is their use of the same data masking functions, you may be wondering if you can use those masking functions without using one of the data movement engines discussed previously in this article. The good news is that the InfoSphere Optim products now include these functions in an externally accessible API. These functions are the same ones used by the InfoSphere Optim products for both System z and Distributed, as well as the data masking pack for DataStage.
One use case of the new API is the creation of data masking stored procedures inside of the database. Because no data movement has to occur in and out of the database, these stored procedures can mask data extremely quickly compared to other methods. Many customers favor in-place masking such as this because they may have already invested a great deal in the necessary infrastructure to rapidly refresh their test environments.
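To make the in-place idea concrete, the sketch below registers a masking function inside a database engine and applies it with a single UPDATE, so no rows leave the database. It uses SQLite from the Python standard library purely as a stand-in: SQLite has no stored procedures, so a user-defined SQL function plays that role here. The `mask_name` function is a hypothetical placeholder and not part of the InfoSphere Optim API, whose actual calls are not shown in this article.

```python
import hashlib
import sqlite3

# Hypothetical replacement values for the example.
REPLACEMENT_NAMES = ["Alex Reed", "Bea Stone", "Cal Young", "Dee Marsh"]


def mask_name(value):
    """Deterministically pick a fictional name for the given original value."""
    if value is None:
        return None
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return REPLACEMENT_NAMES[digest[0] % len(REPLACEMENT_NAMES)]


def mask_customers_in_place(conn):
    """Register the masking function and rewrite the column without moving rows out."""
    conn.create_function("mask_name", 1, mask_name)
    with conn:  # commit on success, roll back on error
        conn.execute("UPDATE customers SET name = mask_name(name)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [("John Smith",), ("Jane Doe",)],
)
mask_customers_in_place(conn)
```

Until the UPDATE commits, the table in this environment still holds the original names, which is exactly the exposure window that needs to be planned for with in-place masking.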
One thing to keep in mind when using these APIs, especially in the stored procedure use case, is that unmasked data may come into contact with a non-production environment. Even if this contact occurs for a short period of time, it is a security concern and should be planned for. In comparison, the processes for creating a clean separation between your unmasked data and non-production environments are well known when using the InfoSphere Optim products and InfoSphere DataStage.
It is also worth mentioning here that having an externally available API for data masking opens up other possibilities to help with managing test data. For example, the API can be used to help facilitate the creation of stubs and test services, or it can be used to mask data sources that are not directly supported by InfoSphere DataStage or the InfoSphere Optim products.
Table 1 shows the comparison of the three IBM data masking solutions.
Table 1. A Comparison of the three IBM data masking solutions
| Feature | InfoSphere Optim data masking option for test data management (Distributed and IBM for z/OS) | InfoSphere DataStage Pack for data masking |
| Realistic masking algorithms | YES | YES |
| Consistent masking across systems and time periods | YES | YES |
| Maintain referential integrity | YES | YES |
| Can be customized | YES (C, C++, or Lua for Distributed. Assembler, VS COBOL II, PL/I, C, or Lua for z/OS). | YES (C/C++/BASIC) |
| Comes with externally callable data privacy functions | YES | NO |
| Works with native database load utilities | YES | YES |
| Pipelined processes (reduced masking server I/O) | NO | YES |
| Works on the concept of a complete business object (allows for efficient subset creation) | YES | NO |
| Built for symmetric multiprocessing (SMP), clustering, grid deployments, and massively parallel processing (MPP) | NO (but, there is some SMP support). | YES |
| Heterogeneous data source support (see the Resources section for lists of platforms) | YES | YES |
This article has explored the primary functions that are required for a data masking solution. These include extensive data masking algorithms that not only mask the data, but do so realistically while maintaining referential integrity in the data, and consistency over time and between databases. It also discussed how these functions are present in both IBM solutions for data masking: The InfoSphere Optim Data Masking option for Test Data Management, and the InfoSphere DataStage Pack for Data Masking.
The article then discussed the differences between the solutions. The InfoSphere Optim products excel at surgically removing small amounts of data for masking. The InfoSphere DataStage Pack for Data Masking was built on top of DataStage, an enterprise-class ETL tool that excels in parallelism and scalability. Finally, the article discussed the use of the data masking API provided by the InfoSphere Optim products. Using this API allows you to provide your own data movement engine, and can give additional flexibility for masking your data in non-production environments.
- Thank you to my wife, Erin Haldeman, for her constant encouragement and for her help in turning my fragments into sentences.
- Thank you to Polly Lau, Martin Dizon, and Alan Fischer e Silva from the InfoSphere Optim Technology Ecosystem team out of the IBM Canada Lab for reviewing the article and providing their valuable comments on its contents.
- Thank you to Aarti Borkar and Jim Lee from the IBM InfoSphere Optim Product Management Team for answering my questions on how the products are licensed.
- Thank you to my colleagues: Greg Marshall, David Slater, Doug Mogck, and DuQuay Allen at Information Insights for their feedback on the article.
- A special thank you to my colleague, Matt Simons, for his assistance in keeping me up to date with the developments in masking with the InfoSphere Optim products.
- Read an overview of parallel processing in InfoSphere Information Server and DataStage and how it uses parallelism for scalability.
- View the InfoSphere Optim Solutions Library.
- Learn about Optim complete business objects and how they help with test data extraction.
- Review the supported platforms for InfoSphere Optim Test Data Management 9.1, including supported database types. These are the same if you were to add the Data Masking option.
- Learn how to use Lua to mask data in the InfoSphere Optim Designer.
- Review the announcement letter for InfoSphere DataStage 8.5. The Connectivity section of the letter outlines which database types have native database connectors in DataStage.
- Read the InfoSphere DataStage Pack for Data Masking documentation in the Information center.