Case-control matching is a popular technique used to pair records in the "case" sample with similar records in a typically much larger "control" sample based on a set of key variables. This post discusses the FUZZY extension command for SPSS Statistics that implements this technique and some recent enhancements to it.
In this discussion I will refer to records rather than what SPSS usually calls cases in order to avoid confusion with case as in case control.
FUZZY takes two datasets as input (the demander and supplier datasets), matches the records according to a set of BY variables, and provides various ways of writing the output. It does not have a dialog box interface, but running FUZZY /HELP displays the complete syntax. Matches can be required to be exact on all variables, or a tolerance or "fuzz" factor, which could be zero, can be specified for each matching variable. (String matches can only have fuzz 0.)
Matches are not always possible: missing vaues or blank strings in a BY variable preclude matching. There might not be a close enough record in the supplier dataset to pair with a demander record. You might run out of eligible supplier records before the demander records are all paired. Unpaired output is set to the system missing value or, for strings, to blank.
FUZZY proceeds by finding for each demander record all of the supplier records that are close enough on the BY variables. This requires a lot of comparisons! It then proceeds through the demander records and picks a supplier record at random from all those eligible for that record. (You can request multiple supplier records for each demander if needed.) No attempt is made to find the closest eligible record, since there is no measure of closeness across the set of BY variables and for other reasons.
If sampling from the suppliers is done without replacement, which is the default behavior, then using a supplier record makes it unavailable for matching in later demander records and could result in a later record having no match. With fuzzy matches, it could be that both records would have been matched if the order were reversed.
While one would generally want to specify an exact match for categorical variables, at least those with nominal measurement level, continuous variables such as income or age might require some fuzz. New output from FUZZY can help to diagnose which BY specifications cause a record to go unmatched. Here is a table produced by FUZZY that shows how the BY criteria restrict the matches.
In this example, we are matching two datasets about vehicles on the variables origin and cylinder. We require an exact match on origin, but allow a difference of up to 3 in the number of cylinders as shown in the first column.
Th first row records the number of comparisons between demander and supplier records testing for an exact match. 95% of these comparisons were not exact matches. Comparing each demander record to each supplier record, only 5% matched exactly.
Next, the table shows that among comparisons after removing the 5% that matched exactly, 85% did not match on origin. Then, considering only the records where there was an exact match on origin, 75% of the comparisons did not match on cylinder. Each row of the table is based on the comparisons that passed, i.e., were within tolerance, on all of the preceding rows.
This table, which precedes the actual pairing step, can be useful in finding the variables that are most important in preventing matches. You may need to increase the tolerance, or you may just be out of luck if there are insufficient matches. The table does not tell you how successful the actual pairing step will be, because supplier records will be used up as the pairing pass proceeds, but it gives insight into how the variables filter the matches. The tries and rejection columns are only produced when you specify fuzz, but you could set all the fuzz values to 0 to see the results with exact matching.
The next table shows the distribution of eligible matches for the pairing pass (this example is based on a very small dataset). It shows how many eligible records there were for each demander record in the pairing pass. It shows that there were two demander records for which there were zero eligible supplier records, three where there was only one, and one where there were two to ten eligibles. This gives you a good idea of how rich the supplier dataset is in matchables, but it doesn't say anything about which variables have the biggest effects on pairing.
If you are lucky, all your demander records will find a match, but if they don't what can you do? Recall that the pairing stage is first come, first serve. With fuzzy matching, reordering the demander cases might work better. While FUZZY can't find an optimal order, one more output feature can help you improve the results. Specifying DRAWPOOLSIZE=varname will add a variable to the demander dataset that records how many eligible supplier records there were for each demander record. You can then study the characteristics of the demander records where suppliers are scarce to see where the supplier dataset is too thin. A good start, though, to improving the pairing percentage is to sort the demander dataset by the newly created poolsize variable. That puts the least paired demander records first in line for a match when you rerun the process and will generally reduce the number of unmatched cases if that is possible.
There are statistical issues regarding how you choose the BY variables that are not addressed here. Searching the web for something like "case control matching" will turn up numerous references.
FUZZY is available from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) and requires the Python Essentials for your version of SPSS Statistics.