Identifying Duplicate Cases
"Duplicate" cases may occur in your data for many reasons, including:
- Data entry errors in which the same case is accidentally entered more than once.
- Multiple cases share a common primary ID value but have different secondary ID values, such as family members who all live in the same house.
- Multiple cases represent the same case but with different values for variables other than those that identify the case, such as multiple purchases made by the same person or company for different products or at different times.
Identify Duplicate Cases allows you to define "duplicate" almost any way that you want and provides some control over the automatic determination of primary versus duplicate cases.
To Identify and Flag Duplicate Cases
- From the menus choose:
- Select one or more variables that identify matching cases.
- Select one or more of the options in the Variables to Create group.
Optionally, you can:
- Select one or more variables to sort cases within the groups defined by the matching variables. The sort order defined by these variables determines the "first" and "last" case in each group. If you select no sort variables, the original file order is used.
- Automatically filter duplicate cases so that they won't be included in reports, charts, or calculations of statistics.
Define matching cases by. Cases are considered duplicates if their values match for all selected variables. If you want to identify only cases that are a 100% match in all respects, select all of the variables.
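The matching rule above can be sketched outside the dialog as grouping cases by a key built from the selected variables. This is a minimal illustration in plain Python, not the product's implementation; the function name `group_matching_cases` and the sample variables `id`, `name`, and `amount` are hypothetical.

```python
from collections import defaultdict

def group_matching_cases(cases, match_vars):
    """Map each combination of matching-variable values to the
    positions (file order) of the cases that share it."""
    groups = defaultdict(list)
    for index, case in enumerate(cases):
        # Cases match only if ALL selected variables have equal values.
        key = tuple(case[var] for var in match_vars)
        groups[key].append(index)
    return dict(groups)

cases = [
    {"id": 101, "name": "Ann", "amount": 20},
    {"id": 101, "name": "Ann", "amount": 35},
    {"id": 102, "name": "Bob", "amount": 20},
]

# Matching on id alone: the first two cases form one group.
by_id = group_matching_cases(cases, ["id"])

# Matching on every variable: only 100% identical cases would group.
by_all = group_matching_cases(cases, ["id", "name", "amount"])
```

With `id` as the only matching variable, the two cases for 101 are duplicates of each other; when all variables are selected, no two cases in this sample match.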
Sort within matching groups by. Cases are automatically sorted by the variables that define matching cases. You can select additional sorting variables that will determine the sequential order of cases in each matching group.
- For each sort variable, you can sort in ascending or descending order.
- If you select multiple sort variables, cases are sorted by each variable within categories of the preceding variable in the list. For example, if you select date as the first sorting variable and amount as the second sorting variable, cases will be sorted by amount within each date.
- Use the up and down arrow buttons to the right of the list to change the sort order of the variables.
- The sort order determines the "first" and "last" case within each matching group, which determines the value of the optional primary indicator variable. For example, if you want to filter out all but the most recent case in each matching group, you could sort cases within the group in ascending order of a date variable, which would make the most recent date the last date in the group.
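The nested sort described above can be sketched with repeated stable sorts, applied from the lowest-priority variable to the highest. This is an illustrative sketch under assumptions: the function name `sort_cases`, the `(variable, descending)` pair format, and the sample variables are all hypothetical.

```python
def sort_cases(cases, match_vars, sort_specs):
    """Sort by the matching variables first, then by each sort variable
    within categories of the preceding one.

    sort_specs is a list of (variable, descending) pairs, highest
    priority first.
    """
    ordered = list(cases)
    # Python's sort is stable, so sorting by the lowest-priority key
    # first and the highest-priority key last yields a nested sort.
    for var, descending in reversed(sort_specs):
        ordered.sort(key=lambda case, v=var: case[v], reverse=descending)
    # Matching variables have the highest priority of all.
    ordered.sort(key=lambda case: tuple(case[v] for v in match_vars))
    return ordered

cases = [
    {"id": 2, "date": "2024-01-05", "amount": 10},
    {"id": 1, "date": "2024-03-01", "amount": 50},
    {"id": 1, "date": "2024-01-15", "amount": 30},
    {"id": 1, "date": "2024-01-15", "amount": 20},
]

# Ascending date, then ascending amount within each date.
ordered = sort_cases(cases, ["id"], [("date", False), ("amount", False)])
most_recent_for_id_1 = [c for c in ordered if c["id"] == 1][-1]
```

Because the dates are sorted in ascending order, the most recent case for each `id` ends up last in its group.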
Indicator of primary cases. Creates a variable with a value of 1 for all unique cases and for the case identified as the primary case in each group of matching cases, and a value of 0 for the nonprimary duplicates in each group.
- The primary case can be either the last or first case in each matching group, as determined by the sort order within the matching group. If you don't specify any sort variables, the original file order determines the order of cases within each group.
- You can use the indicator variable as a filter variable to exclude nonprimary duplicates from reports and analyses without deleting those cases from the data file.
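As a rough sketch of how such an indicator could be derived, the following plain-Python function flags the first or last case of each matching group, using whatever order the cases are already in. The function name `flag_primary` and its `primary` parameter are hypothetical, not part of the product.

```python
def flag_primary(cases, match_vars, primary="last"):
    """Return 1 for the primary case of each matching group (and for
    unique cases), 0 for the other cases in the group."""
    if primary not in ("first", "last"):
        raise ValueError("primary must be 'first' or 'last'")
    chosen = {}
    for index, case in enumerate(cases):
        key = tuple(case[v] for v in match_vars)
        # "last": keep overwriting; "first": keep only the first hit.
        if primary == "last" or key not in chosen:
            chosen[key] = index
    return [
        1 if chosen[tuple(case[v] for v in match_vars)] == index else 0
        for index, case in enumerate(cases)
    ]

cases = [{"id": 1}, {"id": 1}, {"id": 2}, {"id": 1}]
last_flags = flag_primary(cases, ["id"], primary="last")
first_flags = flag_primary(cases, ["id"], primary="first")

# Using the indicator as a filter leaves the original list untouched.
kept = [case for case, flag in zip(cases, last_flags) if flag == 1]
```

The unique case (`id` 2) is flagged 1 under either setting, and filtering on the flag keeps one case per group without deleting anything from `cases`.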
Sequential count of matching cases in each group. Creates a variable with a sequential value from 1 to n for cases in each matching group. The sequence is based on the current order of cases in each group, which is either the original file order or the order determined by any specified sort variables.
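A sequential count of this kind can be sketched as a running counter per matching key. The function name `sequence_within_groups` is hypothetical. The description above does not say how cases without any duplicate are numbered; this sketch simply gives them 1, which is an assumption.

```python
def sequence_within_groups(cases, match_vars):
    """Number the cases 1..n within each matching group, following the
    current order of the cases."""
    seen = {}
    sequence = []
    for case in cases:
        key = tuple(case[v] for v in match_vars)
        seen[key] = seen.get(key, 0) + 1
        # Assumption: a case with no duplicates is numbered 1 as well.
        sequence.append(seen[key])
    return sequence

cases = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 1}]
sequence = sequence_within_groups(cases, ["id"])
```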
Move matching cases to the top. Sorts the data file so that all groups of matching cases are at the top of the data file, making it easy to visually inspect the matching cases in the Data Editor.
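One way to picture this reordering is a sort whose first key is "does this case belong to a group of one?". The sketch below is illustrative only; the function name `move_matches_to_top` is hypothetical, and ordering the groups by their key values is an assumption made so that members of a group sit next to each other.

```python
from collections import Counter

def move_matches_to_top(cases, match_vars):
    """Place cases that have at least one match before unique cases,
    keeping members of each matching group adjacent."""
    def key_of(case):
        return tuple(case[v] for v in match_vars)

    sizes = Counter(key_of(case) for case in cases)
    # False sorts before True, so groups of two or more come first.
    return sorted(cases, key=lambda case: (sizes[key_of(case)] == 1, key_of(case)))

cases = [{"id": 3}, {"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}]
reordered = move_matches_to_top(cases, ["id"])
```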
Display frequencies for created variables. Displays frequency tables containing counts for each value of the created variables. For example, for the primary indicator variable, the table would show the number of cases with a value of 0 for that variable, which indicates the number of duplicates, and the number of cases with a value of 1 for that variable, which indicates the number of unique and primary cases.
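The counts in such a table amount to a tally of each value of the created variable. A minimal sketch, assuming a hypothetical list `primary_flags` that holds the indicator values:

```python
from collections import Counter

# Hypothetical indicator values: 1 = unique or primary, 0 = duplicate.
primary_flags = [1, 0, 1, 1, 0, 1]

frequencies = Counter(primary_flags)
for value in sorted(frequencies):
    print(value, frequencies[value])
```

Here the count for 0 is the number of duplicates and the count for 1 is the number of unique and primary cases.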
For numeric variables, the system-missing value is treated like any other value—cases with the system-missing value for an identifier variable are treated as having matching values for that variable. For string variables, cases with no value for an identifier variable are treated as having matching values for that variable.
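When reproducing this rule in other tools, note that missing values do not always compare equal by default; in Python, for example, `float("nan") == float("nan")` is false. A sketch of one workaround is to replace missing values with a shared placeholder before building the match key. The names `MISSING` and `match_key` are hypothetical, and representing a string with "no value" as an empty or blank string is an assumption.

```python
import math

MISSING = "<missing>"  # hypothetical placeholder for any missing value

def match_key(case, match_vars):
    """Build a match key in which missing values equal each other."""
    key = []
    for var in match_vars:
        value = case.get(var)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            value = MISSING  # numeric missing
        elif isinstance(value, str) and value.strip() == "":
            value = MISSING  # assumption: blank string means no value
        key.append(value)
    return tuple(key)

a = match_key({"id": float("nan"), "code": ""}, ["id", "code"])
b = match_key({"id": None, "code": "  "}, ["id", "code"])
c = match_key({"id": 5.0, "code": ""}, ["id", "code"])
```

Cases `a` and `b` produce the same key and would be treated as matching, while `c` differs on `id`.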
Filter conditions are ignored. Filtered cases are included in the evaluation of duplicate cases. If you want to exclude cases, define selection rules and select Delete unselected cases.