Generate test data using SQL
Generate a large amount of test data using SQL
Whether you are prototyping to prove a concept, developing a new application from the ground up, or simply studying SQL, you need test data to run your application against. To test an application's performance effectively, you need enough test data to expose potential performance problems. Real data is always preferable for testing whenever it is available; when there is none, you can often generate enough hypothetical data instead. Making up a large amount of data from scratch is usually an easy task that you can do quickly by yourself.
This article provides several examples of how to generate test data leveraging SQL scripts, which is a good exercise in SQL by itself. It also discusses some issues you should pay attention to in order to make generated data as realistic as possible.
Generate a large number of records
Even if the database is newly created and still empty, system tables and views are always present, so you can use them as follows:
CREATE TABLE DB2ADMIN.SALES(
  CUSTOMER_ID   INT      NOT NULL,
  ITEM_ID       INT      NOT NULL,
  SALE_QUANTITY SMALLINT NOT NULL,
  SALE_DATE     DATE     NOT NULL);

INSERT INTO DB2ADMIN.SALES
SELECT SYSFUN.RAND()*500 + 1 AS CUSTOMER_ID,
       SYSFUN.RAND()*100 + 1 AS ITEM_ID,
       1 + SYSFUN.RAND()*10  AS SALE_QUANTITY,
       DATE('01/01/2003') + (SYSFUN.RAND()*200) DAYS AS SALE_DATE
FROM SYSCAT.COLUMNS;
The SALES table now has exactly as many rows as SYSCAT.COLUMNS. Note that random values were used to populate several columns: for example, all values in the SALE_QUANTITY column fall between 1 and 10, with approximately 10% of the records taking each of the 10 distinct values. If you need more records, you can repeat this INSERT statement as many times as needed. You can also use a cross join to get more records per statement, like this:
INSERT INTO DB2ADMIN.SALES
SELECT SYSFUN.RAND()*500 + 1 AS CUSTOMER_ID,
       SYSFUN.RAND()*100 + 1 AS ITEM_ID,
       1 + SYSFUN.RAND()*10  AS SALE_QUANTITY,
       DATE('01/01/2003') + (SYSFUN.RAND()*2000) DAYS AS SALE_DATE
FROM SYSCAT.TABLES T1, SYSCAT.TABLES T2;
Note: In this case tables T1 and T2 are joined without any condition, so every row in T1 matches every row in T2. This type of join is called a CROSS JOIN. For more information on cross joins, refer to Joe Celko's book, SQL for Smarties.
Note: The transaction generated by this INSERT statement might be so large that your server cannot handle it. If you encounter a "log full" condition (SQL0964C The transaction log for the database is full), either increase the log space or reduce the transaction size by adding a WHERE clause on T1, T2, or both.
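For example, restricting one side of the cross join keeps each transaction smaller. This is a sketch; the TABSCHEMA filter value is an assumption, so adjust it to suit your catalog:

```sql
-- Smaller transaction: restrict T2 so the cross join produces fewer rows.
-- The TABSCHEMA predicate is illustrative; any selective filter will do.
INSERT INTO DB2ADMIN.SALES
SELECT SYSFUN.RAND()*500 + 1 AS CUSTOMER_ID,
       SYSFUN.RAND()*100 + 1 AS ITEM_ID,
       1 + SYSFUN.RAND()*10  AS SALE_QUANTITY,
       DATE('01/01/2003') + (SYSFUN.RAND()*2000) DAYS AS SALE_DATE
FROM SYSCAT.TABLES T1, SYSCAT.TABLES T2
WHERE T2.TABSCHEMA = 'SYSIBM';
```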
Using this method you can generate a lot of records, yet this approach is somewhat simplistic, since all the values are distributed evenly and there is no correlation between them.
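To confirm the even distribution, or to multiply the row count without rerunning the original statement, you can run statements like the following (a sketch, assuming the SALES table above is already populated):

```sql
-- Check how evenly the SALE_QUANTITY values are spread.
SELECT SALE_QUANTITY, COUNT(*)
FROM DB2ADMIN.SALES
GROUP BY SALE_QUANTITY
ORDER BY SALE_QUANTITY;

-- Double the row count, regenerating the random values
-- so that the new rows are not exact duplicates.
INSERT INTO DB2ADMIN.SALES
SELECT SYSFUN.RAND()*500 + 1,
       SYSFUN.RAND()*100 + 1,
       1 + SYSFUN.RAND()*10,
       DATE('01/01/2003') + (SYSFUN.RAND()*200) DAYS
FROM DB2ADMIN.SALES;
```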
Populate a child table
You will probably have many-to-one relationships in your database. The example below shows how to populate a child table so that every parent record has a random number of child records.
CREATE TABLE DB2ADMIN.PARENT_TABLE(
  PARENT_ID    INT NOT NULL,
  NUM_CHILDREN INT NOT NULL);

INSERT INTO DB2ADMIN.PARENT_TABLE
SELECT ROW_NUMBER() OVER(), SYSFUN.RAND()*5 + 1
FROM SYSCAT.TABLES;

ALTER TABLE DB2ADMIN.PARENT_TABLE ADD PRIMARY KEY(PARENT_ID);

CREATE TABLE DB2ADMIN.CHILD_TABLE(
  PARENT_ID INT NOT NULL,
  CHILD_NUM INT NOT NULL);

INSERT INTO DB2ADMIN.CHILD_TABLE
SELECT PARENT_ID, SEQUENCE_TABLE.NUM
FROM DB2ADMIN.PARENT_TABLE
JOIN (SELECT ROW_NUMBER() OVER() AS NUM FROM SYSCAT.TABLES) AS SEQUENCE_TABLE
  ON SEQUENCE_TABLE.NUM <= NUM_CHILDREN;
As a result of the last INSERT statement, every parent record has between 1 and 5 child records (SYSFUN.RAND()*5 + 1 truncates to an integer between 1 and 5 when assigned to the INT column). SEQUENCE_TABLE is a table expression; for more on table expressions, refer to Sheryl Larsen's article listed in the resources at the end of this article.
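A quick sanity check on the fan-out (a sketch, assuming the tables above have been populated):

```sql
-- Count the children actually generated for each parent.
SELECT P.PARENT_ID, P.NUM_CHILDREN, COUNT(*) AS ACTUAL_CHILDREN
FROM DB2ADMIN.PARENT_TABLE P
JOIN DB2ADMIN.CHILD_TABLE C ON C.PARENT_ID = P.PARENT_ID
GROUP BY P.PARENT_ID, P.NUM_CHILDREN;
```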
Imitate data skew using an auxiliary table
If some values for a column occur much more frequently than others, the data is skewed. For example:
SELECT CITY, COUNT(*) FROM CUSTOMER GROUP BY CITY ORDER BY COUNT(*) DESC;

CITY         COUNT(*)
----------- ---------
CHICAGO           236
MILWAUKEE          95
ROCKFORD            4
NAPERVILLE          3
SPRINGFIELD         3
(snip)
279 rows selected
Whenever you have a reason to expect data skew in production data, you might want to reproduce data skew in your test data. First, store the estimated frequencies in a table:
CREATE TABLE COLOR_FREQUENCY(COLOR CHAR(10), FREQUENCY SMALLINT);

INSERT INTO COLOR_FREQUENCY VALUES
  ('RED', 37), ('SILVER', 12), ('AMBER', 3), ('GREEN', 3),
  ('WHITE', 2), ('BLACK', 1), ('BLUE', 1);
Second, create an auxiliary table (to be more specific, a sequence table).
CREATE TABLE CONSECUTIVE_NUMBER(NUM INT NOT NULL);

INSERT INTO CONSECUTIVE_NUMBER
SELECT ROW_NUMBER() OVER() AS NUM FROM SYSCAT.COLUMNS;
Note: There is a chapter on auxiliary tables in Joe Celko's book SQL for Smarties.

Now let us join these two tables:
SELECT COLOR, FREQUENCY, NUM
FROM COLOR_FREQUENCY
JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY
ORDER BY FREQUENCY, COLOR;

COLOR      FREQUENCY   NUM
---------- --------- -----------
BLACK              1           1
BLUE               1           1
WHITE              2           1
WHITE              2           2
AMBER              3           1
AMBER              3           2
AMBER              3           3
(snip)
As you can see, every row in the COLOR_FREQUENCY table joins with exactly FREQUENCY rows in the CONSECUTIVE_NUMBER table. This is exactly what you need to get the required distribution of values:
CREATE TABLE T_SHIRT(COLOR VARCHAR(30) NOT NULL, SIZE CHAR(1) NOT NULL);

INSERT INTO T_SHIRT
SELECT COLOR, 'M' AS SIZE
FROM COLOR_FREQUENCY
JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY;

SELECT COLOR, COUNT(*) FROM T_SHIRT GROUP BY COLOR;

COLOR                          COUNT(*)
------------------------------ -----------
AMBER                          3
BLACK                          1
BLUE                           1
GREEN                          3
RED                            37
SILVER                         12
WHITE                          2
As a result, the T_SHIRT table now has 37+12+3+3+2+1+1 = 57 rows. The table has exactly the required distribution of values.
Generate data with given distribution of values for several columns
Using the table from the previous section, you can also specify the distribution for the SIZE column:
CREATE TABLE SIZE_FREQUENCY(SIZE CHAR(1), FREQUENCY SMALLINT);

INSERT INTO SIZE_FREQUENCY VALUES ('S', 5), ('M', 7), ('L', 9);
and populate the T_SHIRT table using two table expressions:
INSERT INTO T_SHIRT
SELECT COLOR, SIZE
FROM (SELECT COLOR FROM COLOR_FREQUENCY
      JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY) C,
     (SELECT SIZE FROM SIZE_FREQUENCY
      JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY) S;
The first table expression produces 57 rows, and the second produces 5+7+9 = 21 rows. Since no join condition is specified, each row from the first result set joins with each row from the second, producing 57*21 = 1197 rows.
Note: A cross join may generate too many rows, making the transaction too large for the server to handle. In that case, you can split the work into several smaller INSERT statements, for example by using this table expression:
(SELECT SIZE FROM SIZE_FREQUENCY
 JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY AND SIZE = 'L') S
in the first INSERT, and changing that table expression to
(SELECT SIZE FROM SIZE_FREQUENCY
 JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY AND SIZE <> 'L') S
in the second INSERT statement.
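Put together, the split might look like this (a sketch; the two statements insert disjoint slices of the same cross join):

```sql
-- First transaction: the rows for size 'L' only.
INSERT INTO T_SHIRT
SELECT COLOR, SIZE
FROM (SELECT COLOR FROM COLOR_FREQUENCY
      JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY) C,
     (SELECT SIZE FROM SIZE_FREQUENCY
      JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY AND SIZE = 'L') S;

-- Second transaction: the rows for all other sizes.
INSERT INTO T_SHIRT
SELECT COLOR, SIZE
FROM (SELECT COLOR FROM COLOR_FREQUENCY
      JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY) C,
     (SELECT SIZE FROM SIZE_FREQUENCY
      JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY AND SIZE <> 'L') S;
```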
Generate data with correlated columns
Assume that we need to generate some rows to populate the CAR table:
CREATE TABLE CAR(
  MAKE       VARCHAR(20) NOT NULL,
  MODEL      VARCHAR(20) NOT NULL,
  OTHER_DATA VARCHAR(20));
If we tried the approach from the previous section, we would end up with many impossible MAKE/MODEL combinations, such as "TOYOTA METRO" and "GEO CAMRY." This dependency between columns is called correlation between the MAKE and MODEL columns. The correct approach is to specify the valid (MAKE, MODEL) pairs and their frequencies:
CREATE TABLE MAKE_MODEL_FREQUENCY(
  MAKE      VARCHAR(20),
  MODEL     VARCHAR(20),
  FREQUENCY SMALLINT);

INSERT INTO MAKE_MODEL_FREQUENCY VALUES
  ('TOYOTA', 'CAMRY', 40), ('HONDA', 'ACCORD', 40), ('CHEVY', 'PRIZM', 5),
  ('GEO', 'PRIZM', 5), ('CHEVY', 'METRO', 5), ('GEO', 'METRO', 10);
Once that is done, we can join the CONSECUTIVE_NUMBER and MAKE_MODEL_FREQUENCY tables in exactly the same way as we did previously.
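For example, the INSERT might look like this (a sketch following the earlier pattern; OTHER_DATA is simply left NULL here):

```sql
-- Each (MAKE, MODEL) pair is repeated FREQUENCY times.
INSERT INTO CAR(MAKE, MODEL)
SELECT MAKE, MODEL
FROM MAKE_MODEL_FREQUENCY
JOIN CONSECUTIVE_NUMBER ON NUM BETWEEN 1 AND FREQUENCY;
```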
Manipulate cluster factor
The physical order of rows affects the performance of almost all queries running against a table, so it is important that generated data have a realistic physical order of rows. If you expect an index to have a high cluster factor, simply reorganize the table on that index. If, on the contrary, you expect the index to have a low cluster factor, it is just as easy to shuffle the table into random order, so that the index has a cluster factor close to 0:
CREATE TABLE NAMES(
  FULL_NAME  VARCHAR(50) NOT NULL,
  ORDERED_ON INT);

INSERT INTO NAMES(FULL_NAME, ORDERED_ON)
SELECT TABNAME || ', ' || COLNAME AS FULL_NAME,
       SYSFUN.RAND() * 10000      AS ORDERED_ON
FROM SYSCAT.COLUMNS;

CREATE INDEX NAMES_FULL_NAME ON NAMES(FULL_NAME);
CREATE INDEX NAMES_ORDER ON NAMES(ORDERED_ON);

REORG TABLE DB2ADMIN.NAMES INDEX DB2ADMIN.NAMES_ORDER;
RUNSTATS ON TABLE DB2ADMIN.NAMES AND DETAILED INDEXES ALL;
After the reorganization, the index NAMES_FULL_NAME will have a very low cluster factor (close to 0), because the rows are now stored in random order.
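You can inspect the collected statistics in the catalog (a sketch, assuming RUNSTATS has been run as above):

```sql
-- CLUSTERRATIO comes from basic statistics;
-- CLUSTERFACTOR is populated by detailed index statistics.
SELECT INDNAME, CLUSTERRATIO, CLUSTERFACTOR
FROM SYSCAT.INDEXES
WHERE TABSCHEMA = 'DB2ADMIN' AND TABNAME = 'NAMES';
```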
Note: It is also quite possible to reorganize the table so that the cluster factor of the index NAMES_FULL_NAME will be close to any given value between 0 and 1, but that is outside the scope of this article.
Conclusion
This article discussed how to construct a test data set that is large enough to be useful for testing and that has the expected distribution of values and correlation between columns.
Resources
- Read "Meet the Experts: Sheryl Larsen on the Power of Table Expressions" (developerWorks, April 2002) for examples of table expressions used for flexibility and performance.
- Get Joe Celko's SQL for Smarties: Advanced SQL Programming from Amazon.com.