Caching options for nodes

To optimize the running of flows, you can set up a cache on any nonterminal node. When you set up a cache on a node, the cache is filled with the data that passes through the node the next time you run the data flow. From then on, the data is read from the cache (which is stored temporarily) rather than from the data source.

Caching is most useful following a time-consuming operation such as a sort, merge, or aggregation. For example, suppose that you have a Data Asset node that imports sales data from a database and an Aggregate node that summarizes sales by location. You can set up a cache on the Aggregate node rather than on the Data Asset node because you want the cache to store the aggregated data rather than the entire dataset.

Note: Caching at import nodes, like a Data Asset node, only stores a copy of the original data as it is read into SPSS Modeler, so it does not improve performance in most circumstances.

Nodes with caching enabled are displayed with a special circle-backslash icon. When the data is cached at the node, the icon changes to a checkmark.

Figure 1. Node with empty cache versus node with full cache

If you change the flow after data is cached, the flow does not automatically run again to refresh the cache. For example, suppose that after you cache data at the Aggregate node, you change a flow parameter that affects the data in several nodes. If you then run the flow, the cached data for the Aggregate node is still used. To see the changes that are related to the flow parameter, disable caching or flush the caches.
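The fill, reuse, and flush behavior described above can be illustrated with a small Python model. This is a minimal sketch and not an SPSS Modeler API: the CachedNode class, its run and flush methods, and the multiplier parameter are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional


class CachedNode:
    """Hypothetical model of a flow node with an optional cache."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self.compute = compute                    # stands in for the upstream work
        self.cache_enabled = False
        self.cache: Optional[List[int]] = None    # None means the cache is empty

    def run(self) -> List[int]:
        if not self.cache_enabled:
            return self.compute()
        if self.cache is None:                    # first run after enabling: fill the cache
            self.cache = self.compute()
        return self.cache                         # later runs: read from the cache

    def flush(self) -> None:
        self.cache = None                         # the next run refills the cache


params = {"multiplier": 2}                        # stands in for a flow parameter
node = CachedNode(lambda: [x * params["multiplier"] for x in (1, 2, 3)])
node.cache_enabled = True

first = node.run()                                # fills the cache
params["multiplier"] = 10                         # change the flow parameter
stale = node.run()                                # still served from the cache
node.flush()
fresh = node.run()                                # recomputed with the new parameter
```

In this model, stale equals first even though the parameter changed, and only the run after the flush reflects the new value.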

To enable a cache

Hover over the node in your flow, then click the overflow menu and select Cache > Enable.

You can turn off caching at any time by disabling it.

Caching nodes in a database

For flows that run in a database, you can cache data mid-flow to a temporary table in the database rather than the file system. When combined with SQL optimization, this caching can result in significant gains in performance. For example, the output from a flow that merges multiple tables to create a data mining view can be cached and reused as needed. Automatically generating SQL for all downstream nodes can improve performance further.

To take advantage of database caching, both SQL optimization and database caching must be enabled.

With caching enabled on a database, you can cache data at any nonterminal node. The cache is created automatically in the database the next time the flow runs. If database caching or SQL optimization is not enabled, the cache is written to the file system instead.

Note: The following databases support temporary tables for caching: Db2, Oracle, SQL Server, and Teradata. Other databases, such as Netezza, use a normal table for database caching.
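The rules in this section for where a cache ends up can be summarized as a short decision function. This is only a sketch of the documented behavior under the stated conditions; the function name cache_destination and its parameters are hypothetical and do not correspond to any SPSS Modeler setting or API.

```python
# Databases that the note above lists as supporting temporary tables for caching.
TEMP_TABLE_DATABASES = {"Db2", "Oracle", "SQL Server", "Teradata"}


def cache_destination(database: str, sql_optimization: bool, database_caching: bool) -> str:
    """Hypothetical helper: describe where a node cache would be stored."""
    # Both options must be enabled for the cache to be created in the database.
    if not (sql_optimization and database_caching):
        return "file system"
    if database in TEMP_TABLE_DATABASES:
        return "temporary table"
    # Other databases, such as Netezza, use a normal table.
    return "normal table"


print(cache_destination("Db2", True, True))
print(cache_destination("Netezza", True, True))
print(cache_destination("Oracle", True, False))
```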

To flush a cache

A circle-backslash icon by a node indicates that its cache is empty. When the cache is full, the icon becomes a checkmark. If you want to replace the contents of the cache, you must first flush the cache and then rerun the data flow to refill it.

Hover over the node in your flow, then click the overflow menu and select Cache > Flush.