Preparing data

Preparing data for use in a CP Optimizer application involves ensuring that you have realistic, representative data and that its format is specified.

Start with clean, realistic data. If you don't have access to real data, you should consider fabricating realistic data for this purpose.

For example, imagine that you are developing a rostering application for nurses in a hospital, where the roster covers six months for 30 nurses with different levels of skill: six nurses are highly qualified, 20 have standard qualifications, and four are beginners. The data set does not need to describe each nurse in detail, but it must reflect the number of nurses required per day for each service and the level of qualification of the nurses. Realistic data for this rostering application must involve the same proportion of qualified nurses, the same types of service requests, and so on.

Realistic data must also be representative, even when you are testing a reduced data set on a smaller version of the problem. Some data in a smaller version of a problem can simply be reduced. Other data must be reduced only in ways that respect the proportions of the original problem, because changing those proportions would effectively change the problem to solve.

To understand the difference between data that can be reduced arbitrarily and data that must respect proportions when it is reduced, consider the nurse rostering example again. One way to reduce the size of the problem is simply to consider a shorter period of time, for example, one month instead of six months. In other words, from a constraint programming point of view, the period of time can be reduced almost arbitrarily.

In contrast, if you reduce the number of nurses in your test data in order to work with a smaller problem, your test data must still respect the proportions among their levels of skill. For example, if you decide to test your application on half the number of nurses (15 instead of 30), then a representative data set must still include three highly qualified nurses, ten with standard qualifications, and two beginners in order to respect the proportions of the original problem.
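The proportional reduction described above can be sketched in a few lines of Python. This is only an illustration: the function name reduce_counts and the skill-level keys are hypothetical, and the largest-remainder rounding used when the proportions do not divide evenly is an assumption of this sketch, not something prescribed by CP Optimizer.

```python
from fractions import Fraction


def reduce_counts(counts, new_total):
    """Scale category counts down to new_total while keeping proportions.

    Hypothetical helper: when the exact shares are not whole numbers, the
    leftover units go to the categories with the largest fractional parts
    (largest-remainder rounding, an assumption of this sketch).
    """
    total = sum(counts.values())
    # Exact (possibly fractional) share of each category in the reduced set.
    exact = {k: Fraction(v * new_total, total) for k, v in counts.items()}
    # Round every share down first (int() floors non-negative fractions).
    reduced = {k: int(share) for k, share in exact.items()}
    leftover = new_total - sum(reduced.values())
    # Hand out the remaining units by decreasing fractional part.
    by_remainder = sorted(counts, key=lambda k: exact[k] - reduced[k], reverse=True)
    for k in by_remainder[:leftover]:
        reduced[k] += 1
    return reduced


# The 30 nurses of the example, grouped by level of skill.
original = {"highly_qualified": 6, "standard": 20, "beginner": 4}
half = reduce_counts(original, 15)
print(half)
```

With 15 nurses the shares are exact, so the result matches the three, ten, and two nurses given in the text. For a target such as 10 nurses the shares are fractional, and the rounding rule decides which level receives the extra nurse; you may prefer a different rule for your own data.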

The solution of a combinatorial problem is quite sensitive to variations in data, so you need to run, test, and optimize a CP Optimizer application with respect to multiple sets of data to obtain reliable results. In fact, the robustness of your application will depend heavily on tests run over several sets of data. This point is so important that if the client for your application cannot supply multiple sets of real data, then you should consider generating multiple sets of realistic data, for example, by random variation.
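As one possible way to generate several data sets by random variation, the following Python sketch perturbs a baseline of daily staffing demand. The function make_variants, the baseline values, and the size of the perturbation are all hypothetical choices made for illustration; a fixed seed is used so that the generated test sets are reproducible.

```python
import random


def make_variants(base_demand, n_sets, max_delta=1, seed=0):
    """Return n_sets randomly perturbed copies of base_demand.

    Hypothetical helper: each daily value moves by at most max_delta and is
    never allowed to drop below zero.
    """
    rng = random.Random(seed)  # seeded, so the same sets can be regenerated
    variants = []
    for _ in range(n_sets):
        variant = [max(0, d + rng.randint(-max_delta, max_delta))
                   for d in base_demand]
        variants.append(variant)
    return variants


# Assumed baseline: nurses required per day over one week for one service.
base_demand = [4, 5, 5, 4, 6, 3, 3]
test_sets = make_variants(base_demand, n_sets=5, max_delta=1, seed=42)
for data in test_sets:
    print(data)
```

Small perturbations keep the generated data close to the realistic baseline. Larger values of max_delta produce more varied tests but may drift away from what the real problem looks like.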

Early in development, you should also settle the format of data, including how much of its preparation happens before the data reaches the model. If, for example, it is straightforward and quick to sort an array by posing a few constraints, it will be even quicker to use a conventional sorting technique instead. This guideline can be generalized: most ordinary preprocessing of data (unrelated to constraint programming) can be handled more efficiently in your chosen programming language than in CP Optimizer.
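To illustrate this guideline, the sketch below sorts nurse records with an ordinary Python sort before any model is built, instead of expressing the ordering with constraints. The record layout, the names, and the SKILL_RANK table are assumptions made for the example.

```python
# Hypothetical input records: (name, level of skill).
nurses = [
    ("Dana", "standard"),
    ("Ali", "beginner"),
    ("Chris", "highly_qualified"),
    ("Bea", "standard"),
]

# Assumed ordering of skill levels, most qualified first.
SKILL_RANK = {"highly_qualified": 0, "standard": 1, "beginner": 2}


def prepare(records):
    """Sort by level of skill, then by name, using the language's own sort."""
    return sorted(records, key=lambda record: (SKILL_RANK[record[1]], record[0]))


sorted_nurses = prepare(nurses)
print(sorted_nurses)
```

The model then receives data that is already ordered, and no decision variables or constraints are spent on the sort.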

Use multiple sets of clean, valid, realistic data to validate the model that you design. After you have validated it, your first model itself will play the role of a reference. It will enable you to test new solutions that you get from the implementations you develop.
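One way to use a validated first model as a reference is a small comparison harness that runs the reference and a newer implementation on every data set and reports any disagreement. The sketch below is in Python; compare_with_reference and both toy solvers are hypothetical stand-ins, not CP Optimizer calls. In a real application each function would build and solve a model and return its objective value.

```python
def compare_with_reference(reference_solve, candidate_solve, datasets):
    """Return (index, expected, actual) for each data set where results differ."""
    mismatches = []
    for index, data in enumerate(datasets):
        expected = reference_solve(data)
        actual = candidate_solve(data)
        if expected != actual:
            mismatches.append((index, expected, actual))
    return mismatches


def reference_solve(demand):
    """Toy reference: smallest staff size that covers every day's demand,
    found by a deliberately simple search."""
    staff = 0
    while any(staff < d for d in demand):
        staff += 1
    return staff


def candidate_solve(demand):
    """Toy newer implementation that should give the same answer faster."""
    return max(demand, default=0)


datasets = [[4, 5, 3], [2, 2, 6], []]
mismatches = compare_with_reference(reference_solve, candidate_solve, datasets)
print(mismatches)
```

Comparing objective values rather than full solutions is a design choice of this sketch: two correct implementations can return different solutions of equal quality, so an exact comparison of assignments would raise false alarms.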

Later, multiple sets of data may also help you tune performance, as variations between data sets can highlight different aspects of your application that may allow improvement.