Are you into testing?

Test data is data that has been specifically identified for use in tests, typically of a computer program or application.

Some data may be used in a confirmatory way, typically to verify that a given set of inputs to a given function produces the expected result. Other data may be used in order to challenge the ability of the program to respond to unusual, extreme, exceptional, or unexpected input.

Test data may be produced in a focused or systematic way (as is typically the case in domain testing), or by using other, less-focused approaches (as is typically the case in high-volume randomized automated tests). Test data may be produced by the tester, or by a program or function that aids the tester. Test data may be recorded for re-use, or used once and then forgotten.
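To make the two styles concrete, here is a minimal sketch in Python. The function under test, `clamp_percentage`, and both generators are hypothetical names invented for this illustration. The focused set targets the boundaries of the valid range, while the randomized set is large and less targeted; seeding the random generator is one way to record the data for re-use.

```python
import random


def clamp_percentage(value: int) -> int:
    """Hypothetical function under test: force an integer into the range 0..100."""
    return max(0, min(100, value))


def boundary_values(low: int, high: int) -> list[int]:
    """Focused test data: values on and directly around each boundary."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


def random_values(count: int, seed: int) -> list[int]:
    """Less-focused test data: many random inputs. The seed makes the set reproducible."""
    rng = random.Random(seed)
    return [rng.randint(-10_000, 10_000) for _ in range(count)]


# Run both data sets through the same checks.
for value in boundary_values(0, 100) + random_values(1000, seed=42):
    result = clamp_percentage(value)
    assert 0 <= result <= 100      # output always stays in range
    if 0 <= value <= 100:
        assert result == value     # valid input passes through unchanged
```

Dropping the seed gives data that is used once and then forgotten; keeping it gives a data set you can replay when a failure needs to be reproduced.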

However, a very common situation is that test data is cloned from the production systems, so that it gives a true picture of the data and data models in an organization. It is not always possible to produce enough data for testing. The amount of data to be tested is determined or limited by considerations such as the time and cost to produce the test data, its quality, and efficiency.

Software testing is an important part of the Software Development Life Cycle today. It is labor-intensive and accounts for nearly half of the cost of system development. Hence, it is desirable to automate parts of testing. An important problem in testing is generating quality test data, and solving it is seen as an important step in reducing the cost of software testing.

Given time, cost and quality, would it not be an ideal solution to subset the test data down to exactly what is required to test the various applications?

Data subsetting lets you subset your production databases to create smaller sets of data for test or development databases, based on the Application Data Model.
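As a rough illustration of the idea, the sketch below uses Python's built-in `sqlite3` module with two in-memory databases. The `customers` and `orders` tables, the sample rows and the `subset_by_country` function are all assumptions made up for this example, not part of any product. The point is that child rows are selected through their parent, so the smaller database stays referentially intact.

```python
import sqlite3

# Hypothetical two-table data model shared by "production" and the test copy.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL
);
"""


def build_production() -> sqlite3.Connection:
    """Create a tiny stand-in for a production database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "DK"), (2, "SE"), (3, "DK")])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 99.0), (11, 2, 15.5), (12, 3, 42.0), (13, 1, 7.25)])
    conn.commit()
    return conn


def subset_by_country(source: sqlite3.Connection, country: str) -> sqlite3.Connection:
    """Copy only the customers of one country, plus the orders that belong to them."""
    target = sqlite3.connect(":memory:")
    target.executescript(SCHEMA)
    customers = source.execute(
        "SELECT id, country FROM customers WHERE country = ?", (country,)
    ).fetchall()
    # Orders are selected via their parent row, so no order arrives without its customer.
    orders = source.execute(
        "SELECT o.id, o.customer_id, o.amount FROM orders o "
        "JOIN customers c ON c.id = o.customer_id WHERE c.country = ?", (country,)
    ).fetchall()
    target.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()
    return target
```

Calling `subset_by_country(build_production(), "DK")` yields a test database with two customers and their three orders. A real data model has many more tables, and the same parent-first traversal has to follow every relationship.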

Given regulatory requirements such as GDPR, would it not be an ideal solution to mask your PI data when using production data?

The main reason for applying masking to a data field is to protect data that is classified as personally identifiable information, sensitive personal data, or commercially sensitive data. However, the data must remain usable for the purposes of undertaking valid test cycles. It must also look real and appear consistent. It is more common to have masking applied to data that is represented outside of a corporate production system, in other words, where data is needed for application development, building program extensions and conducting various test cycles. It is common practice in enterprise computing to take data from the production systems to fill the data component required for these non-production environments.
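One way to meet the "look real and appear consistent" requirement is deterministic substitution: the same real value always maps to the same realistic replacement, so a masked name matches across tables and test runs. The sketch below uses Python's standard `hmac` and `hashlib` modules. The replacement list, the `mask_name` function and the secret key are assumptions for illustration only, and a short list like this one will map several real names to the same replacement.

```python
import hashlib
import hmac

# Hypothetical pool of realistic replacement values.
REPLACEMENT_NAMES = ["Anna", "Bjorn", "Clara", "David", "Emma", "Frederik", "Greta", "Henrik"]


def mask_name(real_name: str, secret: bytes) -> str:
    """Map a real name to a replacement name, always the same one for the same input."""
    # A keyed hash means the mapping cannot be recomputed without the secret.
    digest = hmac.new(secret, real_name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


secret_key = b"keep-this-outside-the-test-environment"  # assumption: managed like any other secret
masked_in_customers = mask_name("Alice Jensen", secret_key)
masked_in_invoices = mask_name("Alice Jensen", secret_key)
assert masked_in_customers == masked_in_invoices  # consistent across tables
```

The key should stay with the team that performs the masking, not with the people who receive the masked data.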

The primary concern from a corporate governance perspective is that personnel conducting work in these non-production environments are not always security cleared to operate with the information contained in the production data. This practice represents a security hole where data can be copied by unauthorized personnel and security measures associated with standard production level controls can be easily bypassed. This represents an access point for a data security breach.

The overall practice of data masking at an organizational level should be tightly coupled with the test management practice and should incorporate processes for the distribution of masked test data subsets. However, coupling only with the test management practice is not enough. This is where data governance helps a great deal: classifying and defining data makes it possible to locate the PI data that must be passed through masking before it is used for production-like testing.

In short, subsetting is important so that you do not release more test data than the application requires for thorough testing, which keeps time and cost down, and masking is important to protect PI data and stay compliant with regulations.

Let’s look at three different ways to produce test data.

Test Data Management

Test Data Management is used to extract, copy, privatize, and move sets of relationally intact data from source tables to corresponding destination tables. You can also use the test data management solution to browse and edit data and compare test results with the original data.

Using the solution, you can work with complex data models of any number of tables and relationships heterogeneously and ensure a referentially intact set of data for use in application testing and data migration. In addition, using data privacy features, you can obfuscate or mask sensitive data while maintaining its validity for testing your applications. (For example, credit card numbers can be masked so that the resulting numbers have valid check digits and issuer identifiers).
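The credit card example can be sketched in a few lines of Python. This is an illustration of the principle only, not how any particular product implements it: the function names are hypothetical, the issuer identifier is assumed to be the first six digits, and the Luhn algorithm is used to compute the check digit.

```python
import hashlib
import hmac


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a digit string that does not yet include it."""
    total = 0
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:  # double every second digit, starting at the rightmost
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_luhn_valid(number: str) -> bool:
    """Check that the last digit of the number is the correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def mask_card_number(number: str, secret: bytes) -> str:
    """Keep the issuer identifier, replace the account digits, recompute the check digit."""
    issuer = number[:6]                       # assumption: 6-digit issuer identifier
    middle_length = len(number) - 7           # everything between issuer and check digit
    digest = hmac.new(secret, number.encode("ascii"), hashlib.sha256).hexdigest()
    middle = str(int(digest, 16))[-middle_length:]  # pseudo-random but repeatable digits
    payload = issuer + middle
    return payload + str(luhn_check_digit(payload))
```

For a well-known test number such as `4111111111111111`, the masked result still starts with `411111`, has sixteen digits and passes `is_luhn_valid`, so application-side validation keeps working during testing.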

Test Data Fabrication

Test Data Fabrication can help your organization address the challenges of creating high-quality test data. It leverages deep expertise and years of experience in constraint satisfaction and automatic test-case generation. The solution quickly and efficiently creates high-quality test data while minimizing the risks related to using sensitive production data, and it supports multiple use cases. For example, you can augment existing data sets or generate test data when no real data is available. You can use the solution to help protect private information while avoiding the need to mask or otherwise alter production data. It can also support an agile workflow, enabling you to iteratively modify data to improve its quality and address evolving requirements, and to quickly produce the data you need with rules-based fabrication.

The fabrication engine generates data by following rules that you set for it. You define the type of data, the volume of data, the relationships among different columns in databases, the resources for populating new data columns and the data transformations required. The solution can output flat files, data that can be used in relational databases and files with a variety of format extensions, such as XML, CSV, and DML. After the data is generated, you can use standard data loading utilities and access methods, such as JDBC, to insert the data into your target environments.
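To give a feel for what rules-based fabrication means, here is a small hand-written Python sketch. It is not the fabrication engine described above: the `fabricate_orders` and `to_csv` functions, the column names and the rules themselves are assumptions chosen for the example. The rules cover data type, volume, a relationship between two columns and a derived column, and the output is a CSV flat file.

```python
import csv
import io
import random
from datetime import date, timedelta


def fabricate_orders(volume: int, seed: int) -> list[dict]:
    """Generate `volume` synthetic order rows that obey a few simple rules."""
    rng = random.Random(seed)  # seeded so the same data set can be regenerated
    rows = []
    for order_id in range(1, volume + 1):
        order_date = date(2024, 1, 1) + timedelta(days=rng.randint(0, 364))
        # Relationship rule: an order ships 1 to 14 days after it was placed.
        ship_date = order_date + timedelta(days=rng.randint(1, 14))
        quantity = rng.randint(1, 5)
        unit_price = rng.choice([9.99, 19.99, 49.99])
        rows.append({
            "order_id": order_id,
            "order_date": order_date,
            "ship_date": ship_date,
            "quantity": quantity,
            "unit_price": unit_price,
            # Transformation rule: the total is derived from two other columns.
            "total": round(quantity * unit_price, 2),
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize the fabricated rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

No production data is involved at any point, so there is nothing to mask. The trade-off is that the rules have to capture enough of the real data's shape for the tests to be meaningful.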

Virtual Data Pipeline

Provision dozens of near-instant virtual copies of production databases, with minimal storage consumption and with self-service access. Once the administrator grants access rights, developers and QA engineers can log in to the VDP user interface with their own accounts. They will be able to browse only the ‘masked’ data sets of the databases that the administrator has granted them through role-based access controls. They select a masked virtual copy, mount it to their test server, and can start accessing the data set. This whole process, being self-service, is not only fast but also eliminates the burden on IT staff and DBAs.

Enable developers and testers to test on the most recent copies of production data sets with automated refresh, and provision masked copies of on-premises data sets to remote locations or cloud environments. This enables development and testing where it is required and at the speed it is required, removing the struggle of setting up test environments on-premises. Check out the Virtual Data Pipeline solution by clicking the link.

There is no doubt in my mind that DevOps, continuous engineering, development and speed to market demand quick turnaround of new applications or modernization of existing applications. Either way, testing is becoming vital, and it is not necessarily a question of selecting one approach or the other, but of finding the best fit to quickly deliver business value with quality.

It becomes crucial to be able to quickly spin up dev and test environments, and going forward, cloud instances are the perfect fit for development and test.

Govern your data, reach for the sky and test on the cloud.

If you have any further questions, please do not hesitate to contact me at

Technical Sales Manager Cloud, Data & AI at IBM
