Record Operations nodes make changes to data at the record level.
These operations are important during the Data Understanding and Data Preparation
phases of data mining because they allow you to tailor the data to your particular business need.
For example, based on the results of the data audit conducted using the Data
Audit node (Output palette), you might decide to merge customer purchase records for the
past three months. Using a Merge node, you can merge records based on the values of a
key field, such as Customer ID. Or you might discover that a database containing information
about Web site hits is unmanageable, with over one million records. Using a Sample node, you can
select a subset of the data for use in modeling.
The Record Operations palette contains the following nodes:
The Select node selects or discards a subset of records from the data stream based on a
specific condition. For example, you might select the records that pertain to a particular sales
region.
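As an illustration of the idea behind the Select node, the following Python sketch keeps or discards records according to a condition. It is not SPSS Modeler code; the `select` function, its `mode` argument, and the field names are hypothetical and chosen only for this example.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"customer_id": 1, "region": "North", "sales": 120.0},
    {"customer_id": 2, "region": "South", "sales": 80.0},
    {"customer_id": 3, "region": "North", "sales": 45.5},
]

def select(rows, condition, mode="include"):
    """Keep rows where the condition is true (include) or false (discard)."""
    if mode not in ("include", "discard"):
        raise ValueError("mode must be 'include' or 'discard'")
    keep_when = (mode == "include")
    return [row for row in rows if bool(condition(row)) == keep_when]

# Select the records that pertain to one sales region.
north = select(records, lambda r: r["region"] == "North")

# Discard the same records instead.
not_north = select(records, lambda r: r["region"] == "North", mode="discard")
```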
The Sample node selects a subset of records. A variety of sample types are supported,
including stratified, clustered, and nonrandom (structured) samples. Sampling can be useful to
improve performance, and to select groups of related records or transactions for analysis.
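Two of the sample types mentioned above can be sketched in plain Python as follows. This is only an approximation of the concepts, not the Sample node's implementation; the function names, the fixed random seed, and the rule of drawing at least one record per stratum are assumptions made for the example.

```python
import random
from collections import defaultdict

def one_in_n(rows, n):
    """Structured (nonrandom) sample: pass every n-th record."""
    return [row for i, row in enumerate(rows, start=1) if i % n == 0]

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum defined by key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        # Assumption: always take at least one record from each stratum.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical Web site hits: 75 from browser "A" and 25 from browser "B".
hits = [{"hit_id": i, "browser": "A" if i % 4 else "B"} for i in range(1, 101)]

every_tenth = one_in_n(hits, 10)
by_browser = stratified_sample(hits, "browser", 0.2)
```

The stratified version preserves the relative share of each browser in the sample, which a simple random draw does not guarantee.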
The Balance node corrects imbalances in a dataset, so it conforms to a specified condition.
The balancing directive adjusts the proportion of records where a condition is true by the factor
specified.
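One plausible reading of a balancing directive is sketched below in Python: records that satisfy the condition are duplicated when the factor is greater than 1 and randomly thinned when it is less than 1. The `balance` function and the way fractional factors are handled are assumptions for illustration and may differ from what the Balance node actually does.

```python
import random

def balance(rows, condition, factor, seed=0):
    """Scale the number of records matching condition by roughly `factor`."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        if not condition(row):
            out.append(row)  # records outside the directive pass through unchanged
            continue
        copies = int(factor)  # whole part: guaranteed number of copies
        if rng.random() < factor - copies:
            copies += 1  # fractional part: one extra copy with that probability
        out.extend([row] * copies)
    return out

# Hypothetical dataset with 10 churners and 90 non-churners.
data = [{"churn": True}] * 10 + [{"churn": False}] * 90

# Boost the rare class by a factor of 3.
boosted = balance(data, lambda r: r["churn"], 3.0)
```

With a factor such as 0.5, the same function would keep each matching record with probability 0.5, so the resulting count varies with the random seed.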
The Aggregate node replaces a sequence of input records with summarized, aggregated output
records.
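To make the idea of aggregation concrete, the Python sketch below groups records by one or more key fields and emits one summary record per group. It is a simplified illustration rather than the Aggregate node itself; the function name and the output field names are assumptions.

```python
from collections import defaultdict

def aggregate(rows, key_fields, sum_field):
    """Replace the input records with one summary record per key value."""
    groups = defaultdict(lambda: {"total": 0.0, "count": 0})
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        groups[key]["total"] += row[sum_field]
        groups[key]["count"] += 1

    result = []
    for key, stats in groups.items():
        out = dict(zip(key_fields, key))
        # Output field names are illustrative only.
        out[sum_field + "_Sum"] = stats["total"]
        out[sum_field + "_Mean"] = stats["total"] / stats["count"]
        out["Record_Count"] = stats["count"]
        result.append(out)
    return result

# Hypothetical sales records.
sales = [
    {"region": "North", "amount": 100.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 50.0},
]
summary = aggregate(sales, ["region"], "amount")
```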
The Recency, Frequency, Monetary (RFM) Aggregate node enables you to take customers'
historical transactional data, strip away any unused data, and combine all of their remaining
transaction data into a single row that lists when they last dealt with you, how many transactions
they have made, and the total monetary value of those transactions.
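The three RFM measures can be computed from transaction data along the following lines. This Python sketch only illustrates the calculation; the `rfm_aggregate` function, the field names, and the choice to express recency as days before a reference date are assumptions, not the node's documented behavior.

```python
from datetime import date

def rfm_aggregate(transactions, as_of):
    """Combine each customer's transactions into a single RFM row."""
    summary = {}
    for t in transactions:
        row = summary.setdefault(
            t["customer_id"],
            {"customer_id": t["customer_id"], "last_date": t["date"],
             "frequency": 0, "monetary": 0.0},
        )
        row["last_date"] = max(row["last_date"], t["date"])
        row["frequency"] += 1
        row["monetary"] += t["amount"]

    for row in summary.values():
        # Recency expressed as days between the last transaction and as_of.
        row["recency_days"] = (as_of - row.pop("last_date")).days
    return list(summary.values())

# Hypothetical transaction history.
transactions = [
    {"customer_id": "c1", "date": date(2024, 1, 10), "amount": 50.0},
    {"customer_id": "c2", "date": date(2024, 2, 15), "amount": 100.0},
    {"customer_id": "c1", "date": date(2024, 3, 1), "amount": 25.0},
]
rfm = rfm_aggregate(transactions, as_of=date(2024, 3, 31))
```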
The Sort node sorts records into ascending or descending order based on the values of one or
more fields.
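Sorting on several fields with a different direction for each can be sketched in Python by applying stable sorts from the lowest-priority key to the highest. The `sort_records` function and its `keys` argument are hypothetical and serve only to illustrate the behavior described above.

```python
def sort_records(rows, keys):
    """Sort by several fields.

    keys is a list of (field, direction) pairs in priority order,
    where direction is "asc" or "desc".
    """
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a correct multi-key order.
    for field, direction in reversed(keys):
        result.sort(key=lambda r: r[field], reverse=(direction == "desc"))
    return result

# Hypothetical records.
rows = [
    {"region": "North", "sales": 10},
    {"region": "South", "sales": 30},
    {"region": "North", "sales": 20},
    {"region": "South", "sales": 5},
]
ordered = sort_records(rows, [("region", "asc"), ("sales", "desc")])
```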
The Merge node takes multiple input records and creates a single output record containing
some or all of the input fields. It is useful for merging data from different sources, such as
internal customer data and purchased demographic data.
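A merge on a key field can be pictured with the Python sketch below, which combines internal customer records with purchased demographic records that share a customer ID. It assumes an inner join and a key that is unique in the second source; the function and field names are hypothetical, and the Merge node may offer other options not shown here.

```python
def merge_by_key(left, right, key):
    """Inner join: one output record per left record with a match in right."""
    # Assumption: each key value appears at most once in `right`.
    right_index = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_index.get(row[key])
        if match is not None:
            merged.append({**row, **match})  # fields from both inputs
    return merged

# Hypothetical internal customer data.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Cy"},
]
# Hypothetical purchased demographic data.
demographics = [
    {"customer_id": 1, "income_band": "high"},
    {"customer_id": 3, "income_band": "low"},
]
merged = merge_by_key(customers, demographics, "customer_id")
```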
The Append node concatenates sets of records. It is useful for combining datasets with
similar structures but different data.
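In contrast to a merge, an append stacks records rather than widening them. The Python sketch below shows one way to do this; matching fields by name, taking the field list from the first dataset, and filling missing values with `None` are assumptions made for the example.

```python
def append(main, *others):
    """Concatenate datasets, using the fields of the main dataset."""
    if not main:
        raise ValueError("the main dataset must contain at least one record")
    fields = list(main[0])  # field structure taken from the main dataset
    out = []
    for dataset in (main, *others):
        for row in dataset:
            # Fields missing from a record become None; extra fields are dropped.
            out.append({f: row.get(f) for f in fields})
    return out

# Hypothetical datasets with similar structures but different data.
january = [{"id": 1, "amount": 10.0}]
february = [{"id": 2, "amount": 5.0, "channel": "web"}, {"id": 3}]
combined = append(january, february)
```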
The Distinct node removes duplicate records, either by passing the first distinct record to
the data stream or by discarding the first record and passing any duplicates to the data stream
instead.
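The two behaviors described for the Distinct node can be sketched in Python as follows. The `distinct` function and its mode names are hypothetical; the sketch treats records as duplicates when they share the same values in the chosen key fields.

```python
def distinct(rows, key_fields, mode="include_first"):
    """Pass either the first record per key or only the later duplicates."""
    if mode not in ("include_first", "discard_first"):
        raise ValueError("mode must be 'include_first' or 'discard_first'")
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        is_first = key not in seen
        seen.add(key)
        if is_first == (mode == "include_first"):
            out.append(row)
    return out

# Hypothetical records with repeated IDs.
rows = [{"id": i} for i in (1, 2, 1, 3, 2, 1)]

firsts = distinct(rows, ["id"])
duplicates = distinct(rows, ["id"], mode="discard_first")
```

The second mode is useful for inspecting the duplicates themselves, for example to check why they occur before removing them.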
The Streaming Time Series node builds and scores time series models in one step. You can use
the node with data in either a local or distributed environment; in a distributed environment you
can harness the power of IBM® SPSS® Analytic Server.
The Space-Time-Boxes node works with Space-Time-Boxes (STB), which are an extension of geohashed
spatial locations. More specifically, an STB is an alphanumeric string that represents a regularly
shaped region of space and time.
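To give a feel for how a string can represent a region of space and time, the Python sketch below combines a standard geohash with a fixed-width time interval. The `space_time_box` function, the `|` separator, the timestamp format, and the default precision and interval are all assumptions for illustration; they are not taken from the Space-Time-Boxes node.

```python
from datetime import datetime, timedelta

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat, lon, precision):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:  # bits alternate between longitude and latitude
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = value * 2 + int(bit)
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

def space_time_box(lat, lon, when, precision=6, minutes=15):
    """Hypothetical STB string: geohash plus the enclosing time interval."""
    width = timedelta(minutes=minutes)
    start = when - (when - datetime(1970, 1, 1)) % width  # floor to interval
    fmt = "%Y-%m-%d %H:%M:%S"
    return "|".join([geohash(lat, lon, precision),
                     start.strftime(fmt), (start + width).strftime(fmt)])

box = space_time_box(57.64911, 10.40744, datetime(2013, 1, 1, 0, 7, 30), precision=2)
```

Two events fall into the same box when they share both the geohash cell and the time interval, which makes the string convenient as a grouping key.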
The Streaming TCM node builds and scores temporal causal models in one step.
The CPLEX Optimization node provides the ability to use complex mathematical
(CPLEX) based optimization via an Optimization Programming Language (OPL) model file. This
functionality was previously available in the IBM Analytical Decision Management product, which is
no longer supported. You can use the CPLEX Optimization node in SPSS Modeler without IBM Analytical
Decision Management.
Many of the nodes in the Record Operations palette require you to use a CLEM expression. If you are
familiar with CLEM, you can type an expression in the field. However, all expression fields provide
a button that opens the CLEM Expression Builder, which helps you create such expressions
automatically.