My current customer has reminded me of a common problem many app development teams face: You want to test your app before putting it into production (always a good idea), but you don't have suitable test data. How do you develop an adequate set of test data?
You could test with production data, but the users won't appreciate their records being mangled, especially if the tests change the data. You could test with a copy of the production data, but this approach risks exposing private or competitively sensitive data to developers, testers, etc., and creates an opportunity for the data to be stolen. This risk is especially unacceptable for apps using classified military data and development teams whose members do not have security clearances. You could write a program to generate random data; while it would be of the proper schema and size (if you generated enough of it), it would not necessarily have the representative complexity needed to model the joins and unions in production data.
Many projects resort to producing a token amount of test data and using that. This is inadequate for testing typical performance and scaling issues. A query that locks a table and/or iterates over all rows may work fine for a 100-row table, but may fail in production when the table has 10,000 rows. Likewise, the code processing a single Customer record with one Order containing one Product may work for testing, but fail in production when a typical Customer has hundreds of Orders, each containing hundreds or thousands of Products that are parts of hundreds of other Orders.
What is often needed is a set of test data that contains generic values, but is comparable in size and complexity to production data. How can such data be created?
The solution: Use what I'll call a Data Obfuscator.
This is a program that someone with sufficient authorization runs on the production data to produce a set of generic test data. It is configured specifically for a particular data set and its schema. It knows which columns are confidential, such as name, address, social security number, and salary. Meanwhile, more generic columns like event timestamps and quantities can probably be considered non-confidential. For each confidential column, the obfuscator has a set of unique generic values at least as large as the set of unique values in that column. When converting the data, for each new confidential value, it substitutes a randomly selected, previously unused generic value, and consistently uses that generic value in place of that specific confidential value. How random this is may need to be tuned or configured by business rules and constraints, such as ensuring that an Order that contains a small number of items also has a small total price.
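To make the substitution step concrete, here is a minimal Python sketch of one way it might work. It is an illustration under assumptions, not a finished tool: the names ColumnObfuscator and obfuscate_rows, the sample rows, and the generic value pools are all invented for this example, and the rows are plain dictionaries rather than a real database. It also leaves out the business-rule tuning mentioned above.

```python
import random


class ColumnObfuscator:
    """Hypothetical helper: maps each distinct confidential value in one
    column to one distinct generic value, and reuses that mapping."""

    def __init__(self, generic_values):
        pool = list(generic_values)
        if len(set(pool)) != len(pool):
            raise ValueError("generic values must be unique")
        # Shuffle with OS-provided randomness so the assignment is not
        # reproducible from a known seed.
        random.SystemRandom().shuffle(pool)
        self._unused = pool
        self._mapping = {}

    def obfuscate(self, value):
        # First time we see a value: take an unused generic value.
        # Every later occurrence gets the same replacement.
        if value not in self._mapping:
            if not self._unused:
                raise ValueError("ran out of generic values")
            self._mapping[value] = self._unused.pop()
        return self._mapping[value]


def obfuscate_rows(rows, obfuscators):
    """Return copies of the rows with the confidential columns replaced.
    Columns without an obfuscator are copied through unchanged."""
    result = []
    for row in rows:
        new_row = dict(row)
        for column, obfuscator in obfuscators.items():
            new_row[column] = obfuscator.obfuscate(row[column])
        result.append(new_row)
    return result


# Made-up sample data: two orders belong to the same customer.
orders = [
    {"customer": "Alice Smith", "ssn": "111-11-1111", "quantity": 3},
    {"customer": "Bob Jones", "ssn": "222-22-2222", "quantity": 1},
    {"customer": "Alice Smith", "ssn": "111-11-1111", "quantity": 7},
]

# One obfuscator per confidential column, each with its own generic pool.
obfuscators = {
    "customer": ColumnObfuscator(f"Customer {n}" for n in range(1, 101)),
    "ssn": ColumnObfuscator(f"000-00-{n:04d}" for n in range(1, 101)),
}

test_orders = obfuscate_rows(orders, obfuscators)
for row in test_orders:
    print(row)
```

In the output, the first and third rows still share a customer and the second row has a different one, so the relationships between records survive while the real names and numbers do not. Discarding the in-memory mapping once the run finishes is what would make the conversion hard to reverse.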
In this way, the size of the data set and the complexity of the data are both preserved. But the new data set is generic and the conversion cannot be reversed (at least not easily). This produces a non-confidential set of data which still serves as a basis for very realistic testing.
What do you think: Does this solution seem reasonable? Have you seen this solution in practice? Do you know of other solutions to this problem? Let me know.
Pattern: Data Obfuscator