Stream performance

Many factors can impact how your SPSS® Modeler streams perform.

Keep these general tips in mind:

  • Where possible, reduce the size of your data by processing only the fields you need. Use Filter nodes and the Filter tab in source nodes to drop the rest.
  • Use in-database processing whenever possible to reduce the amount of data pulled into SPSS Modeler.
  • Minimize the network distance between your IBM® SPSS Modeler Server and the source data.
  • Certain data sources require more overhead than others. For example, reading data through the Excel source node takes longer than reading the same data from a CSV file. XML is verbose and adds overhead, so avoid using it to store large amounts of data.
  • Python-based nodes and R-based nodes require internal data transfers, which can sometimes slow processing.
  • Accomplishing a task with fewer nodes is usually preferable.
  • Use Type nodes only when necessary. This is especially true when Hadoop is the data source because each Type node processes the entire data flow. See "What is instantiation?"
  • Certain statistical modeling nodes might be slow, especially with data sets that have many categorical fields.
  • Changing the order of nodes can influence processing speed, so experiment with node order. For example, if you have a stream with nodes that reduce data by subsetting or reducing the number of fields, move them as early in the stream as possible.
  • If a modeling node you're using has a corresponding -AS version, use the -AS node instead because it's multi-threaded and can improve processing speed.
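
The tip about moving data-reducing nodes early can be illustrated outside SPSS Modeler with a minimal Python sketch. It does not use the SPSS Modeler scripting API. The record layout, the function names, and the `values_touched` counter are all hypothetical, chosen only to show that when a row-reducing and field-reducing step runs before a costly step, the costly step has less data to process while the final result stays the same.

```python
# Illustrative only: plain Python, not the SPSS Modeler scripting API.
# All names and data below are hypothetical.

def make_records(n):
    """Build n sample records with three fields each."""
    return [
        {"id": i, "region": "north" if i % 4 == 0 else "other", "amount": i * 1.5}
        for i in range(n)
    ]

def select_rows(records):
    """Row reduction: keep only the 'north' region (analogous to subsetting)."""
    return [r for r in records if r["region"] == "north"]

def keep_fields(records, fields):
    """Field reduction: keep only the named fields (analogous to filtering fields)."""
    return [{f: r[f] for f in fields} for r in records]

def expensive_step(records, counter):
    """Stand-in for a costly downstream step; counts the field values it handles."""
    out = []
    for r in records:
        counter["values_touched"] += len(r)
        out.append(dict(r, amount=round(r["amount"] * 1.1, 2)))
    return out

def reduce_late(records):
    """Run the costly step on everything, then reduce rows and fields."""
    counter = {"values_touched": 0}
    processed = expensive_step(records, counter)
    result = keep_fields(select_rows(processed), ["id", "amount"])
    return result, counter["values_touched"]

def reduce_early(records):
    """Reduce rows and fields first, then run the costly step on what remains."""
    counter = {"values_touched": 0}
    reduced = keep_fields(select_rows(records), ["id", "amount"])
    result = expensive_step(reduced, counter)
    return result, counter["values_touched"]

records = make_records(1000)
late_result, late_cost = reduce_late(records)
early_result, early_cost = reduce_early(records)

print("same result:", late_result == early_result)
print("values handled (reduce late): ", late_cost)
print("values handled (reduce early):", early_cost)
```

In this sketch both orderings produce identical output, but the costly step handles far fewer values when the reduction comes first. Actual gains in a stream depend on the nodes involved and the data source, so it is still worth experimenting with node order.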