Overview (MATCH FILES command)

MATCH FILES combines variables from 2 up to 50 IBM® SPSS® Statistics data files. MATCH FILES can make parallel or nonparallel matches between different files or perform table lookups. Parallel matches combine files sequentially by case (they are sometimes referred to as sequential matches). Nonparallel matches combine files according to the values of one or more key variables. In a table lookup, MATCH FILES looks up variables in one file and transfers those variables to a case file.

The files specified on MATCH FILES can be IBM SPSS Statistics data files or open datasets in the current session. The combined file becomes the new active dataset.

In general, MATCH FILES is used to combine files containing the same cases but different variables. To combine files containing the same variables but different cases, use ADD FILES. To update existing IBM SPSS Statistics data files, use UPDATE.

Options

Variable Selection. You can specify which variables from each input file are included in the new active dataset using the DROP and KEEP subcommands.

Variable Names. You can rename variables in each input file before combining the files using the RENAME subcommand. This permits you to combine variables that are the same but whose names differ in different input files or to separate variables that are different but have the same name.

Variable Flag. You can create a variable that indicates whether a case came from a particular input file using IN. You can use the FIRST or LAST subcommands to create a variable that flags the first or last case of a group of cases with the same value for the key variable.

Variable Map. You can request a map showing all variables in the new active dataset, their order, and the input files from which they came using the MAP subcommand.

Basic specification

The basic specification is two or more FILE subcommands, each of which specifies a file to be matched. In addition, BY is required to specify the key variables for nonparallel matches. Both BY and TABLE are required to match table-lookup files.

  • All variables from all input files are included in the new active dataset unless DROP or KEEP is specified.

Subcommand order

  • RENAME and IN must immediately follow the FILE or TABLE subcommand to which they apply.
  • Any BY, FIRST, LAST, KEEP, DROP, and MAP subcommands must follow all of the TABLE, FILE, RENAME, and IN subcommands.

Syntax rules

  • RENAME can be repeated after each FILE or TABLE subcommand and applies only to variables in the file named on the immediately preceding FILE or TABLE.
  • IN can be used only for a nonparallel match or for a table lookup. (Thus, IN can be used only if BY is specified.)
  • BY can be specified only once. However, multiple variables can be specified on BY. When BY is used, all files must be sorted in ascending order of the key variables named on BY.
  • MAP can be repeated as often as desired.

Operations

  • MATCH FILES reads all files named on FILE or TABLE and builds a new active dataset.
    • If the current active dataset is included and is specified with an asterisk (FILE=*), the new merged dataset replaces the active dataset. If that dataset is a named dataset, the merged dataset retains that name. If the current active dataset is not included or is specified by name (for example, FILE=Dataset1), a new unnamed, merged dataset is created, and it becomes the active dataset. For information on naming datasets, see DATASET NAME.
  • The new active dataset contains complete dictionary information from the input files, including variable names, labels, print and write formats, and missing-value indicators. The new file also contains the documents from each of the input files. See DROP DOCUMENTS for information on deleting documents.
  • Variables are copied in order from the first file specified, then from the second file specified, and so on.
  • If the same variable name is used in more than one input file, data are taken from the file specified first. Dictionary information is taken from the first file containing value labels, missing values, or a variable label for the common variable. If the first file has no such information, MATCH FILES checks the second file, and so on, seeking dictionary information.
  • All cases from all input files are included in the combined file. Cases that are absent from one of the input files will be assigned system-missing values for variables unique to that file.
  • BY specifies that cases should be combined according to a common value on one or more key variables. All input files must be sorted in ascending order of the key variables.
  • If BY is not used, the program performs a parallel (sequential) match, combining the first case from each file, then the second case from each file, and so on, without regard to any identifying values that may be present.
  • If the active dataset is named as an input file, any N and SAMPLE commands that have been specified are applied to that file before files are matched.
  • Filter conditions defined by the FILTER command are preserved. Filtered cases are included in the merged file. If you want to exclude cases, use the SELECT IF command to specify selection criteria before running MATCH FILES.

Limitations

  • Maximum 50 files can be combined on one MATCH FILES command.
  • Maximum one BY subcommand. However, BY can specify multiple variables.
  • The TEMPORARY command cannot be in effect if the active dataset is used as an input file.

Examples

Example

MATCH FILES FILE='/data/part1.sav' 
  /FILE='/data/part2.sav' 
  /FILE=*.
  • MATCH FILES combines three files (the active dataset and two external IBM SPSS Statistics data files) in a parallel match. Cases are combined according to their order in each file.
  • The new active dataset contains as many cases as are contained in the largest of the three input files.

Example

GET FILE='/examples/data/mydata.sav'.
SORT CASES BY ID.
DATASET NAME mydata.
GET DATA /TYPE=XLS 
  /FILE='/examples/data/excelfile.xls'.
SORT CASES BY ID.
DATASET NAME excelfile.
GET DATA /TYPE=ODBC /CONNECT=
 'DSN=MS Access Database;DBQ=/examples/data/dm_demo.mdb;'+
 'DriverId=25;FIL=MS Access;MaxBufferSize=2048;PageTimeout=5;'
 /SQL='SELECT * FROM main'.
SORT CASES BY ID.
MATCH FILES
 /FILE='mydata'
 /FILE='excelfile'
 /FILE=*
 /BY ID.
  • A data file in IBM SPSS Statistics format and assigned the dataset name mydata. Since it has been assigned a dataset name, it remains available for subsequent use even after other data sources have been opened.
  • An Excel file is then read and assigned the dataset name excelfile. Like the IBM SPSS Statistics data file, since it has been assigned a dataset name, it remains available after other data sources have been opened.
  • Then a table from a database is read. Since it is the most recently opened or activated dataset, it is the active dataset.
  • The three datasets are then merged together with MATCH FILES command, using the dataset names on the FILE subcommands instead of file names.
  • An asterisk (*) is used to specify the active dataset, which is the database table in this example.
  • The files are merged together based on the value of the key variable ID, specified on the BY subcommand.
  • Since all the files being merged need to be sorted in the same order of the key variables, SORT CASES is performed on each dataset.