Extension Transform node

With the Extension Transform node, you can take data from an SPSS Modeler flow and apply transformations to the data by using scripts written in R, Python, or Python for Spark.

When the data has been modified, it's returned to the flow for further processing, model building, and model scoring. The Extension Transform node makes it possible to transform data by using algorithms written in any of these languages, and you can use the node to develop data transformation methods that are tailored to a particular problem.

After adding the node to your canvas, double-click the node to open its properties.

Syntax tab

Select your type of syntax – R, Python, or Python for Spark. Then enter or paste your custom script for transforming data. When your syntax is ready, you can run the node. The following options are available for R syntax:
  • Convert flag fields. Specifies how flag fields are treated. There are two options: Strings to factor, Integers and Reals to double; and Logical values (True, False). If you select Logical values (True, False), the original values of the flag fields are lost. For example, if a field has values Male and Female, these are changed to True and False.
  • Convert missing values to the R 'not available' value (NA). When selected, any missing values are converted to the R NA value. The value NA is used by R to identify missing values. Some R functions that you use might have an argument that can control how the function behaves when the data contains NA. For example, the function might allow you to choose to automatically exclude records that contain NA. If this option isn't selected, any missing values are passed to R unchanged, and might cause errors when your R script runs.
  • Convert date/time fields to R classes with special control for time zones. When selected, variables with date or datetime formats are converted to R date/time objects. You must select one of the following options:
    • R POSIXct. Variables with date or datetime formats are converted to R POSIXct objects.
    • R POSIXlt (list). Variables with date or datetime formats are converted to R POSIXlt objects.
    Note: The POSIX formats are advanced options. Use these options only if your R script specifies that datetime fields are treated in ways that require these formats. The POSIX formats don't apply to variables with time formats.

Console Output tab

The Console Output tab contains any output that's received when the R or Python script runs (for example, for an R script, it shows the output received from the R console when the script in the R Syntax field on the Syntax tab is executed). This output might include R or Python error messages or warnings that are produced when the script is executed. You can use the output primarily to debug the script. The Console Output tab also contains the script from the R Syntax or Python Syntax field.

Every time the Extension Transform script runs, the content of the Console Output tab is overwritten with the output received from the R or Python console. You can't edit the output.

Control theory transformations, signal processing, and feature mappings

SPSS Modeler supports advanced data preprocessing, time series analysis, and feature engineering through control theory transformations, signal processing filters, and feature mappings. These techniques are essential for preparing data, extracting meaningful patterns from signals, and creating informative features.

To see samples of these techniques in an SPSS Modeler flow, you can download the sample extension-transform-node.zip. It contains several stream files that you can import into SPSS Modeler. For more information about importing, see Importing an SPSS Modeler stream. Then, open the Extension Transform node properties to see the example syntax.

Simple transformations

Simple transformations are mathematical transformations that modify data distributions, reduce skewness, stabilize variance, and prepare features for modeling. These transformations are fundamental preprocessing steps.

Log transformation (logarithmic)
Description

A log transformation applies a natural logarithm (ln) or log base 10 to data values. This transformation can be useful for right-skewed data with positive values.

Example scenario

A retailer analyzes customer purchase behavior across their platform. They notice that order values are heavily right-skewed, with most orders between $20 and $100 but some orders exceeding $10,000. To better understand purchasing patterns and build predictive models, they apply a log transformation to the order value data. This transformation normalizes the distribution, making it easier to identify meaningful patterns and reducing the influence of extreme high-value orders on the analysis.
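
For illustration only (not the sample flow's exact script), a minimal Python sketch of this idea, assuming a hypothetical order_value field and using the numpy and pandas libraries, might look like this:

    import numpy as np
    import pandas as pd

    # Hypothetical order data; in a flow, the data frame would come from the node's input.
    orders = pd.DataFrame({"order_value": [25.0, 49.99, 80.0, 310.0, 12500.0]})

    # log1p computes ln(1 + x), which also handles zero values safely.
    orders["order_value_log"] = np.log1p(orders["order_value"])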

Square root transformation
Description

A square root transformation takes the square root of data values. This transformation is suitable for count data and data with moderate skewness.

Example scenario

A manufacturing plant tracks the number of defects found during daily inspections. The defect counts follow a Poisson-like distribution with moderate right skewness, ranging from 0 to 50 defects per day. By applying a square root transformation to the defect count data, the plant stabilizes the variance and creates a more symmetric distribution, which enables more accurate control charts and better detection of unusual patterns that might indicate process problems.
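
As a minimal sketch (with invented defect counts), the same transformation in Python with numpy and pandas:

    import numpy as np
    import pandas as pd

    # Hypothetical daily defect counts.
    defects = pd.DataFrame({"defect_count": [0, 3, 7, 12, 28, 50]})

    # The square root transformation stabilizes the variance of count data.
    defects["defect_sqrt"] = np.sqrt(defects["defect_count"])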

Reciprocal transformation (inverse)
Description

A reciprocal transformation computes 1/x for each value. It is useful for data with hyperbolic relationships or when smaller values need more weight.

Example scenario

A delivery company analyzes delivery times for their fleet of vehicles. Instead of working with time-to-delivery (which can vary widely), they transform the data by using a reciprocal transformation to create a "delivery rate" metric. This transformation allows them to compare efficiency across different routes and helps identify bottlenecks where small improvements in time can yield significant gains in throughput.
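
A minimal Python sketch of this idea, assuming a hypothetical delivery_hours field with strictly nonzero values:

    import pandas as pd

    # Hypothetical delivery times in hours (values must be nonzero for the reciprocal).
    deliveries = pd.DataFrame({"delivery_hours": [0.5, 1.2, 2.0, 4.5, 8.0]})

    # Reciprocal transformation: deliveries per hour ("delivery rate").
    deliveries["delivery_rate"] = 1.0 / deliveries["delivery_hours"]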

Power transformation
Description

A power transformation raises values to a specified power (x^p). This transformation can increase or decrease skewness depending on the power value.

Example scenario

A scientist is researching the relationship between pollution levels and the distance from industrial sites. The relationship appears to follow a power law, where pollution decreases rapidly near the source but more gradually at greater distances. By applying a power transformation with carefully selected exponents to the distance measurements, the scientist linearizes the relationship, making it suitable for standard regression analysis and making the rate of pollution decay easier to interpret.
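
For illustration (the exponent and data are invented), a power transformation in Python with numpy:

    import numpy as np
    import pandas as pd

    # Hypothetical distances (km) from an industrial site.
    sites = pd.DataFrame({"distance_km": [0.5, 1.0, 2.0, 5.0, 10.0]})

    # Raise distance to a chosen power; in practice the exponent is selected to match
    # the suspected power-law relationship. The value here is illustrative only.
    p = -0.5
    sites["distance_pow"] = np.power(sites["distance_km"], p)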

Standardization (normalization)
Description

This transformation rescales features to have specific statistical properties, typically a mean of 0 and a standard deviation of 1.

Example scenario

A customer segmentation model for a retail company uses features with vastly different scales: annual income ($20,000-$500,000), age (18-80 years), and number of purchases (0-200). Without standardization, distance-based algorithms like k-means clustering would be dominated by the income feature due to its large numeric range. When standardization is used to transform each feature, all the features contribute equally to the clustering algorithm. This transformation enables the model to identify meaningful customer segments based on the combined patterns across all features, rather than being skewed by scale differences.
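
As a rough sketch using the sklearn.preprocessing library listed later (the feature values are invented), standardization could look like this:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical customer features on very different scales.
    customers = pd.DataFrame({
        "income": [25000, 80000, 150000, 480000],
        "age": [22, 35, 51, 67],
        "purchases": [3, 40, 120, 15],
    })

    # Rescale each column to mean 0 and standard deviation 1.
    scaled = pd.DataFrame(StandardScaler().fit_transform(customers),
                          columns=customers.columns)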

Winsorization (clipping)
Description

This transformation limits extreme values by capping values at specified percentiles (for example, the 1st and 99th percentiles). You can use this transformation to reduce outlier impact without removing data points.

Example scenario

A financial analyst examines daily stock returns for portfolio optimization. The dataset includes several extreme outliers from market crashes and flash crashes that could distort risk calculations. Rather than removing these data points, the analyst applies winsorization at the 1st and 99th percentiles, capping extreme values while preserving the overall sample size. This approach provides more robust estimates of volatility and correlation for portfolio construction.
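
A minimal Python sketch of winsorization at the 1st and 99th percentiles, with invented return values:

    import numpy as np
    import pandas as pd

    # Hypothetical daily returns, including extreme outliers.
    returns = pd.Series([0.01, -0.02, 0.005, -0.35, 0.40, 0.003])

    # Cap values at the 1st and 99th percentiles instead of dropping them.
    low, high = np.percentile(returns, [1, 99])
    winsorized = returns.clip(lower=low, upper=high)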

Polynomial (interaction terms)
Description

You can use polynomial/interaction terms to create new features by raising existing features to powers or multiplying features together. You can use this transformation to capture non-linear relationships and feature interactions.

Example scenario

A marketing firm wants to know how advertising spend across different channels (TV, digital, print) affects sales. They suspect that the channels are synergistic. For example, TV ads might amplify the effectiveness of digital campaigns. By creating interaction terms (TV × Digital, TV × Print, Digital × Print) and polynomial terms (TV², Digital²), they can capture these non-linear relationships and interaction effects, which improves the predictive accuracy of their marketing mix model.
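
For illustration (with invented spend figures), sklearn.preprocessing.PolynomialFeatures can generate both the squared and the interaction terms:

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    # Hypothetical advertising spend per channel.
    spend = pd.DataFrame({"tv": [100, 200, 150],
                          "digital": [80, 60, 120],
                          "print": [20, 10, 30]})

    # A degree-2 expansion adds squared terms (tv^2, ...) and pairwise interactions (tv*digital, ...).
    poly = PolynomialFeatures(degree=2, include_bias=False)
    expanded = pd.DataFrame(poly.fit_transform(spend),
                            columns=poly.get_feature_names_out(spend.columns))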

Libraries for simple transformations

You can implement these simple transformations with the following R and Python libraries.

Python
  • numpy
  • scipy
  • sklearn.preprocessing
  • pandas
R
  • Base R functions, such as scale()
  • poly()
  • DescTools::Winsorize()

Control theory filters

You can use advanced techniques for analyzing time series data, filtering noise, extracting frequency components, and modeling dynamic systems. These methods are useful for sensor data, financial time series, and control systems.

Kalman filter
Description

A Kalman filter is an optimal recursive algorithm that estimates the state of a dynamic system from noisy measurements. It combines predictions from a mathematical model with noisy observations to produce optimal estimates.

Example scenario

A navigation system is being developed for a vehicle that uses GPS sensors to track position. However, GPS signals are noisy and occasionally unavailable. The navigation system can use a Kalman filter that combines GPS measurements with predictions from the vehicle's motion model (based on speed and steering angle). The filter optimally weights these information sources, providing smooth, accurate position estimates even when GPS signals are degraded, and it can maintain reasonable position estimates during brief GPS outages.

Libraries
Python
  • filterpy library (comprehensive Kalman filter implementations)
R
  • KFAS (Kalman Filter and Smoother for Exponential Family State Space Models)
  • dlm (Dynamic Linear Models)
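
As a rough sketch (with invented noise parameters and measurements, not the sample flow's script), the filterpy library listed above can track a 1-D position from noisy readings like this:

    import numpy as np
    from filterpy.kalman import KalmanFilter

    # 1-D constant-velocity model: state = [position, velocity], measurement = noisy position.
    kf = KalmanFilter(dim_x=2, dim_z=1)
    dt = 1.0
    kf.x = np.array([0.0, 0.0])               # initial state estimate
    kf.F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition (constant velocity)
    kf.H = np.array([[1.0, 0.0]])             # we observe position only
    kf.P *= 500.0                             # initial state uncertainty
    kf.R = np.array([[5.0]])                  # measurement noise
    kf.Q = np.eye(2) * 0.01                   # process noise

    measurements = [1.2, 2.1, 2.9, 4.2, 5.1]  # noisy GPS-like positions
    estimates = []
    for z in measurements:
        kf.predict()
        kf.update(z)
        estimates.append(kf.x[0])             # filtered position estimate
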
Low-pass filter
Description

A low-pass filter is a digital filter that allows frequencies below a cutoff to pass while attenuating higher frequencies. You can use it for noise reduction and signal conditioning.

Example scenario

A technologist analyzes electrocardiogram (ECG) signals from patients to detect heart abnormalities. The raw ECG signals contain high-frequency noise from muscle movements, electrical interference, and sensor artifacts that obscure the underlying heart rhythm. By applying a low-pass Butterworth filter with a cutoff frequency of 40 Hz, the technologist removes high-frequency noise while preserving the essential cardiac waveforms, which enables accurate heart rate measurement and arrhythmia detection.

Libraries
Python
  • scipy.signal (butter, cheby1, cheby2, bessel, ellip functions)
R
  • signal package
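
A minimal sketch with scipy.signal, using a synthetic signal in place of real ECG data and a 40 Hz cutoff as in the scenario:

    import numpy as np
    from scipy import signal

    # Synthetic "ECG-like" signal: slow waveform plus high-frequency noise, sampled at 500 Hz.
    fs = 500.0
    t = np.arange(0, 2.0, 1.0 / fs)
    raw = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.random.randn(t.size)

    # 4th-order Butterworth low-pass filter with a 40 Hz cutoff; filtfilt avoids phase shift.
    b, a = signal.butter(N=4, Wn=40.0, btype="low", fs=fs)
    filtered = signal.filtfilt(b, a, raw)
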
High-pass filter
Description

A high-pass filter is a digital filter that allows frequencies above a cutoff to pass while attenuating lower frequencies. You can use it for noise reduction and signal conditioning.

Example scenario

An audio engineer processes recordings that contain low-frequency rumble from air conditioning systems and traffic noise. These low-frequency components (below 80 Hz) don't contribute to speech intelligibility, but they consume dynamic range and can cause distortion. By applying a high-pass filter with an 80 Hz cutoff, the engineer removes the rumble while preserving the full clarity of human speech (typically 100-8,000 Hz). Using the filter gives cleaner, more professional-sounding recordings.

Libraries
Python
  • scipy.signal (butter, cheby1, cheby2, bessel, ellip functions)
R
  • signal package
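
The same scipy.signal functions can build a high-pass filter; here is an illustrative sketch with a synthetic audio signal and the 80 Hz cutoff from the scenario:

    import numpy as np
    from scipy import signal

    # Synthetic audio: a 440 Hz tone plus 50 Hz low-frequency rumble, sampled at 44.1 kHz.
    fs = 44100.0
    t = np.arange(0, 1.0, 1.0 / fs)
    audio = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 50.0 * t)

    # 4th-order Butterworth high-pass filter with an 80 Hz cutoff removes the rumble.
    b, a = signal.butter(N=4, Wn=80.0, btype="high", fs=fs)
    cleaned = signal.filtfilt(b, a, audio)
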
Band-pass filter
Description

A band-pass filter is a digital filter that selectively allows certain frequency ranges to pass while attenuating frequencies outside of that range. You can use it for noise reduction and signal conditioning.

Example scenario

A seismologist analyzes earthquake data to distinguish between different types of seismic waves. Primary waves (P-waves) typically occur in the 1-5 Hz range. Surface waves appear at lower frequencies (0.1-1 Hz). The seismologist can apply a band-pass filter centered on the 1-5 Hz range to isolate P-wave arrivals from the complex seismogram.

Libraries
Python
  • scipy.signal (butter, cheby1, cheby2, bessel, ellip functions)
R
  • signal package
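
A sketch of the band-pass case with scipy.signal, using a synthetic trace and the 1-5 Hz band from the scenario:

    import numpy as np
    from scipy import signal

    # Synthetic seismic trace sampled at 100 Hz: a 3 Hz "P-wave" plus a 0.3 Hz surface wave.
    fs = 100.0
    t = np.arange(0, 60.0, 1.0 / fs)
    trace = np.sin(2 * np.pi * 3.0 * t) + 2.0 * np.sin(2 * np.pi * 0.3 * t)

    # Band-pass Butterworth filter keeping the 1-5 Hz band where P-waves appear.
    b, a = signal.butter(N=4, Wn=[1.0, 5.0], btype="bandpass", fs=fs)
    p_wave_band = signal.filtfilt(b, a, trace)
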
Exponential smoothing (moving average)
Description

This filter uses smoothing techniques on a time series to reduce noise while preserving underlying trends. It uses weighted averages to give more importance to recent observations.

Example scenario

A retail chain forecasts weekly sales for inventory planning. Historical sales data shows both underlying trends and significant week-to-week volatility due to promotions, weather, and random factors. They apply exponential smoothing with α=0.3, which gives more weight to recent observations while smoothing out short-term fluctuations. This smoothing produces stable forecasts that capture genuine trends without overreacting to temporary spikes.

Libraries
Python
  • pandas (ewm() exponentially weighted functions)
  • statsmodels.tsa.holtwinters
R
  • HoltWinters() in the stats package
  • forecast package
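
For illustration (with invented weekly sales and the α=0.3 value from the scenario), exponential smoothing is available directly in pandas:

    import pandas as pd

    # Hypothetical weekly sales figures.
    sales = pd.Series([120, 135, 128, 160, 150, 190, 170, 210])

    # Exponentially weighted moving average with smoothing factor alpha = 0.3.
    smoothed = sales.ewm(alpha=0.3, adjust=False).mean()
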
Wavelet transforms
Description

Wavelet transforms use multi-resolution analysis techniques to decompose signals into time-frequency representations by using wavelets (localized wave-like functions). They provide both time and frequency information.

Example scenario

A technologist analyzes MRI scans to detect brain abnormalities. The images contain important features at multiple scales, for example, large-scale anatomical structures and fine-scale tissue variations. The technologist can use a discrete wavelet transform with Daubechies wavelets to decompose the images into multiple resolution levels. This multi-scale analysis reveals subtle patterns that might be missed in standard spatial analysis.

Libraries
Python
  • pywt (PyWavelets) library
R
  • wavelets package
  • waveslim
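
As a minimal sketch with the PyWavelets library listed above (a synthetic 1-D signal stands in for an image row; 2-D images would use wavedec2 instead):

    import numpy as np
    import pywt

    # Synthetic 1-D signal standing in for one image row or a time series.
    signal_1d = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.2 * np.random.randn(256)

    # Three-level discrete wavelet decomposition with a Daubechies-4 wavelet.
    coeffs = pywt.wavedec(signal_1d, wavelet="db4", level=3)
    approx, details = coeffs[0], coeffs[1:]

    # The signal can be rebuilt from the coefficients (for example, after thresholding details).
    reconstructed = pywt.waverec(coeffs, wavelet="db4")
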
State-space models
Description

A state-space model is a mathematical framework that represents dynamic systems as sets of input, output, and state variables related by differential or difference equations. It provides a unified approach to time series analysis and control.

Example scenario

An economic model of a country's GDP dynamics uses multiple interrelated factors: consumption, investment, government spending, and net exports. These components evolve over time and influence each other in complex ways. By formulating the problem as a state-space model, economists can represent the economic state of the country and its evolution, while accounting for measurement errors in the observed economic indicators. The model enables sophisticated forecasting that properly handles missing data and provides uncertainty quantification.

Libraries
Python
  • statsmodels.tsa.statespace module
R
  • KFAS
  • dlm packages
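
A rough sketch (with invented data) of a simple local-level state-space model in the statsmodels.tsa.statespace framework listed above; the underlying Kalman filter handles the missing observations:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical quarterly GDP growth series with noise and missing observations.
    y = np.array([2.1, 2.3, np.nan, 2.8, 3.0, 2.7, np.nan, 3.2, 3.4, 3.1])

    # Local-level model: a random-walk trend observed with noise.
    model = sm.tsa.UnobservedComponents(y, level="local level")
    result = model.fit(disp=False)
    forecast = result.forecast(steps=4)   # forecasts from the fitted state-space model
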

Signal processing

You can use signal processing techniques to detect or pinpoint components of interest in a measured signal.

Fast Fourier Transform (FFT)
Description

You can use a Fast Fourier Transform to convert time-domain signals into frequency-domain representations, which can reveal periodic components and spectral characteristics.

Example scenario

A manufacturer wants to diagnose problems in industrial machinery by analyzing vibration sensor data. Different types of faults produce characteristic vibration patterns at specific frequencies. By applying FFT to the time-series vibration data, they can transform it into the frequency domain, which shows the amplitude of vibrations at each frequency. Peaks at specific frequencies indicate particular fault types, which the manufacturer can use for targeting maintenance before catastrophic failures occur.

Libraries
Python
  • numpy.fft module
  • scipy.fft
R
  • fft() function (base R)
  • spectrum()
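
For illustration (with a synthetic vibration signal rather than real sensor data), an amplitude spectrum with numpy.fft:

    import numpy as np

    # Synthetic vibration signal sampled at 1 kHz: 50 Hz and 120 Hz components plus noise.
    fs = 1000.0
    t = np.arange(0, 1.0, 1.0 / fs)
    vibration = (np.sin(2 * np.pi * 50 * t)
                 + 0.5 * np.sin(2 * np.pi * 120 * t)
                 + 0.2 * np.random.randn(t.size))

    # FFT of the real-valued signal; peaks in the amplitude spectrum mark fault frequencies.
    spectrum = np.fft.rfft(vibration)
    freqs = np.fft.rfftfreq(vibration.size, d=1.0 / fs)
    amplitude = np.abs(spectrum) / vibration.size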

Mappings and feature transformations

You can also use advanced feature engineering techniques that create informative features, capture complex relationships, and improve model performance.

Box-Cox transformation
Description

A Box-Cox transformation is a family of power transformations that automatically determines the optimal transformation parameter (lambda) to make data more normally distributed and to stabilize variance.

Example scenario

A market analyst builds a regression model to predict house prices based on features like square footage, number of bedrooms, and lot size. The price distribution is right-skewed and exhibits non-constant variance. The analyst applies a Box-Cox transformation to automatically determine the optimal lambda parameter (λ=0.23) that best normalizes the distribution and stabilizes variance. The transformation improves the regression model's assumptions and prediction accuracy.

Libraries
Python
  • scipy.stats.boxcox()
  • sklearn.preprocessing.PowerTransformer
R
  • boxcox() from MASS package
  • car::powerTransform()
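
A minimal sketch with scipy.stats.boxcox (invented prices; Box-Cox requires strictly positive values):

    import numpy as np
    from scipy import stats

    # Hypothetical right-skewed house prices.
    prices = np.array([150000, 185000, 210000, 260000, 340000, 900000], dtype=float)

    # scipy estimates the lambda that best normalizes the data and returns the transformed values.
    transformed, fitted_lambda = stats.boxcox(prices)
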
Rank and quantile mapping
Description

Rank and quantile mapping transforms data based on rank order or quantile position. Quantile mapping transforms the data to make it more uniform or to match a normal distribution based on quantiles. Rank mapping transforms data onto a standardized scale by using percentile rank.

Example scenario

A credit scoring model combines multiple risk indicators with different distributions: some are normally distributed, others are highly skewed, and some have heavy tails. The model applies a quantile transformation to each indicator to create a robust composite score. This approach is not sensitive to outliers and ensures that each indicator contributes equally to the final score, regardless of its original scale or distribution shape.

Libraries
Python
  • scipy.stats.rankdata()
  • pandas.qcut()
  • sklearn.preprocessing.QuantileTransformer
R
  • rank()
  • quantile()
  • cut()
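
A brief sketch of both mappings in Python (the indicator names and values are invented):

    import pandas as pd
    from sklearn.preprocessing import QuantileTransformer

    # Hypothetical risk indicators with very different distributions.
    indicators = pd.DataFrame({
        "utilization": [0.1, 0.4, 0.9, 0.95, 0.2],
        "late_payments": [0, 0, 12, 1, 3],
    })

    # Percentile-rank mapping: each value becomes its rank scaled to the range 0-1.
    ranked = indicators.rank(pct=True)

    # Quantile mapping to a normal distribution (n_quantiles limited to the sample size).
    qt = QuantileTransformer(output_distribution="normal", n_quantiles=5)
    normal_scores = pd.DataFrame(qt.fit_transform(indicators), columns=indicators.columns)
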
Feature interaction
Description

You can use feature interaction to create new features by combining existing features through mathematical operations. You can use feature interactions to capture synergistic effects and non-linear relationships that individual features cannot represent.

Example scenario

A car insurance company builds a model to predict claim amounts. They have features for driver age, vehicle value, and annual mileage. Analysis reveals that the relationship between these factors and claims is not simply additive. For example, high mileage is particularly risky for young drivers, but it is less risky for experienced drivers. By creating interaction features (Age × Mileage, Age × Vehicle_Value, Mileage × Vehicle_Value), the model captures these synergistic effects, which improves the prediction accuracy and enables more nuanced risk-based pricing.

Libraries
Python
  • sklearn.preprocessing.PolynomialFeatures
  • Custom functions
R
  • Formula interface (x1:x2)
  • poly()
  • Custom functions
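
For illustration (with invented feature names and values), the interaction terms from the scenario can be built directly in pandas:

    import pandas as pd

    # Hypothetical insurance features.
    policies = pd.DataFrame({
        "age": [19, 45, 62],
        "mileage": [25000, 12000, 8000],
        "vehicle_value": [15000, 30000, 22000],
    })

    # Pairwise interaction terms capture effects that are not simply additive.
    policies["age_x_mileage"] = policies["age"] * policies["mileage"]
    policies["age_x_value"] = policies["age"] * policies["vehicle_value"]
    policies["mileage_x_value"] = policies["mileage"] * policies["vehicle_value"]
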
Custom nonlinear mappings
Description

You can define custom mappings that apply arbitrary mathematical functions or domain-specific transformations to features.

Example scenario

A telecommunications company wants to analyze customer patterns to predict churn. Their data shows a non-linear relationship between churn and the time since a customer last contacted customer service. Customers who contacted customer service very recently have a high churn risk. Customers who contacted customer service in the recent past have low risk because their issues were likely resolved, and those who haven't contacted in months have an increasing risk. The company creates a custom U-shaped transformation that captures this complex relationship, which improves predictions of churn.

Libraries

Implemented through custom scripts and functions
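
As one possible sketch of such a custom mapping (the field name, the functional form, and the low point of the curve are all invented for illustration):

    import pandas as pd

    # Hypothetical days since the last customer-service contact.
    customers = pd.DataFrame({"days_since_contact": [1, 5, 20, 60, 120, 240]})

    # Illustrative U-shaped mapping: risk is high for very recent contacts, drops for
    # recently resolved issues, then rises again the longer a customer stays silent.
    def u_shaped_risk(days, low_point=45.0):
        return ((days - low_point) / low_point) ** 2

    customers["contact_risk"] = u_shaped_risk(customers["days_since_contact"])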