Toolkit com.ibm.streamsx.sparkmllib 1.3.1
General Information
Apache Spark is a fast, general-purpose cluster computing system that is well suited for machine learning algorithms. MLlib is a machine learning library provided with Spark that supports common machine learning algorithms, including classification, regression, collaborative filtering, and others. This toolkit allows Spark's MLlib library to be used for real-time scoring of data in InfoSphere Streams.
The toolkit is based on org.apache.spark.spark-mllib 2.4.4 (Scala 2.11; Hadoop 2.6.5).
This toolkit provides a set of operators for real-time scoring based on Apache Spark MLlib. Currently, each operator initializes a SparkContext, which makes it behave as a Spark driver. Hence there is one Spark job for every operator, under which scoring occurs. By default, the job runs locally, in the operator's own process, but you can optionally pass a Spark master URL (parameter sparkMaster) to run it on a cluster, as shown in the sketch below.
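As a minimal sketch, an operator invocation might look like the following. The sparkMaster parameter is the one described above; the operator name SparkClusteringKMeans, the output schema, and the modelPath parameter are illustrative assumptions, so check the operator reference for the exact names.

    // Minimal sketch: score incoming feature vectors against a saved
    // MLlib model. Output schema and modelPath are illustrative assumptions.
    stream<list<float64> features, int32 cluster> Scored =
        SparkClusteringKMeans(Features)
    {
        param
            modelPath   : "/path/to/saved/kmeans/model"; // assumed parameter name
            // Omit sparkMaster to run the Spark job locally; pass a
            // master URL to run the scoring job on a cluster instead.
            sparkMaster : "spark://master-host:7077";
    }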
Some models, such as the collaborative filtering model, store their internal state as RDDs, so scoring runs as an action against these internal-state RDDs. In most cases, RDDs are also created while the model is loaded. If you pass a remote Spark master, the operation may be performed on one or more remote worker machines known to the master node, and the results are passed back to the driver (which, in our case, is the operator).
A single JVM cannot host multiple SparkContexts, so you cannot fuse multiple operators into one PE. To ensure that multiple sparkmllib operators are placed in different PEs, use the operator configuration placement clause (partitionExlocation), as shown below.
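For example, giving two sparkmllib operator invocations the same ex-location tag tells the compiler to place them in different partitions (PEs), and therefore in separate JVMs. The operator names, schemas, and modelPath parameter below are illustrative assumptions; partitionExlocation is standard SPL placement syntax.

    // Both invocations share the ex-location tag "sparkContext", so they
    // are guaranteed to land in different PEs (separate JVMs).
    stream<list<float64> features, int32 cluster> Clusters =
        SparkClusteringKMeans(Features)
    {
        param
            modelPath : "/models/kmeans"; // assumed parameter name
        config
            placement : partitionExlocation("sparkContext");
    }

    stream<int32 user, list<int32> products> Recommendations =
        SparkCollaborativeFilteringALS(Users)
    {
        param
            modelPath : "/models/als"; // assumed parameter name
        config
            placement : partitionExlocation("sparkContext");
    }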
- Version: 1.3.1
- Required Product Version: 4.0.1.0