Extracting features from relational data with OneButtonMachine
Features are essential for training machine learning models, but engineering meaningful features from complex relational data is often tedious and time-consuming. Consider using OneButtonMachine, the WML for z/OS automated feature engineering tool for relational data, to construct and extract features automatically.
Before you begin
Sample relational data sets are shipped with WML for z/OS. You can download and use them to try out the OneButtonMachine tool. Make sure that you have completed the following tasks:
- Install and configure WML for z/OS as described in Roadmap for installing and configuring WML for z/OS.
- Obtain the IP address of your WML for z/OS repository service.
- Ensure that you are authorized to access WML for z/OS as described in Managing users and privileges.
Procedure
- Sign in to the WML for z/OS web user interface with your user name and password.
- Download the "group_customer" sample data sets from the mlz-samples project.
  - From the sidebar, navigate to Projects - View all Projects.
  - From the Project List, click mlz-samples to open the project.
  - Click Assets (number + text) to display all the assets for the project, including notebooks, models, and data sets.
  - Download the following data files from the Data Sets list:
    - group_customer_transactions.csv
    - group_customer_purchase.csv
    - group_customer_products.csv
    - group_customer_main.csv
    - group_customer_customers.csv
    Click the ACTION menu for each data set and select Export to download it.
  These five data files contain the transaction data of a group of customers who purchase products together, including members' demographics, their purchase history, and the products that are purchased. The transaction data will be used to extract features for predicting future purchases by the group.
  The group_customer_main.csv file contains the main table, while the other four files contain contextual tables. The main table includes a prediction target column called next_purchase and connects to the contextual tables through foreign keys.
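To make the relationship between the main table and a contextual table concrete, the following plain-Scala sketch follows a shared key from each main row to its contextual rows and aggregates them into per-row features. It is only an illustration of the idea: the case classes, the column names (`groupId`, `amount`), and the values are hypothetical, not the actual schema of the sample files, and the sketch does not describe how OneButtonMachine works internally.

```scala
// Hypothetical, simplified rows; the real sample files have their own schemas.
final case class MainRow(id: Int, groupId: Int, nextPurchase: Double)
final case class PurchaseRow(groupId: Int, amount: Double)

object ForeignKeySketch {
  val mainTable: Seq[MainRow] = Seq(
    MainRow(id = 1, groupId = 10, nextPurchase = 25.0),
    MainRow(id = 2, groupId = 20, nextPurchase = 40.0)
  )

  val purchases: Seq[PurchaseRow] = Seq(
    PurchaseRow(groupId = 10, amount = 5.0),
    PurchaseRow(groupId = 10, amount = 15.0),
    PurchaseRow(groupId = 20, amount = 30.0)
  )

  // Index the contextual table by its foreign key.
  val purchasesByGroup: Map[Int, Seq[PurchaseRow]] = purchases.groupBy(_.groupId)

  // For each main row, follow the key and aggregate the matching rows.
  // Each result is (main id, purchase count, total purchase amount).
  val features: Seq[(Int, Int, Double)] = mainTable.map { row =>
    val related = purchasesByGroup.getOrElse(row.groupId, Seq.empty)
    (row.id, related.size, related.map(_.amount).sum)
  }

  def main(args: Array[String]): Unit =
    features.foreach { case (id, count, total) =>
      println(s"id=$id purchase_count=$count purchase_total=$total")
    }
}
```

Counts and sums are only two of many possible aggregations. The point is that each contextual table contributes columns to the main table through the key it shares with it.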
- Locate or create the AutoFE project.
  - From the sidebar, navigate to Projects - View all Projects and locate the AutoFE project.
  - If the project does not exist yet, create a new one by clicking New Project.
  - Select the Blank tab, enter AutoFE as the project name, and click Create.
  The new project opens to its main workspace page with links to the Assets, Environments, Data Sources, and Collaborators pages.
- Create a new notebook in Scala.
  - Click Assets on the AutoFE project page and then click add notebook on the Assets page.
  - Select the Blank tab to create a notebook from scratch.
  - Enter groupCustomerFeatures as the notebook name. Select IBM_Open_Data_Analytics_for_zOS for Environment and Scala for Language.
  - Click Create to create the new notebook.
  The groupCustomerFeatures notebook is saved and opened in the Notebook editor.
- Add the "group_customer" sample data sets that you downloaded to the AutoFE project.
  - Click the pull-down menu (v) of the Create new (+) tool and select Add data set.
  - Drag and drop the sample data sets into the Local File box and upload them to the AutoFE project.
- Prepare the project environment by clicking the pull-down menu (v) of the Create new (+) tool and selecting Insert project context.
- Define your repository service by its IP address as shown in the following example:
    val repositoryIP = pc.repositoryIp
- Create a new OneButtonMachine object from the automl package and configure it with a JSON string as shown in the following example:
    import com.ibm.analytics.automl.OneButtonMachine
    val json = """{
      "Entity_Graph":{
        "nodes":[
          {"table_name": "main"},
          {"table_name": "customers"},
          {"table_name": "transactions"},
          {"table_name": "purchases"},
          {"table_name": "products"}
        ],
        "edges":[
          {"from":"main", "to": "customers", "from_column": ["group_customer_id"], "to_column": ["group_customer_id"]},
          {"from":"main", "to": "transactions", "from_column": ["transaction_id"], "to_column": ["transaction_id"]},
          {"from":"main", "to": "purchases", "from_column": ["group_id"], "to_column": ["group_id"]},
          {"from":"transactions", "to": "products", "from_column": ["product_id"], "to_column": ["product_id"]}
        ]
      },
      "Tables":{
        "main":{
          "column_format":{"time":"posix"},
          "primary_key": "id",
          "timestamp_column_name": "time",
          "table_source": {"project_name": "AutoFE", "table_name": "group_customer_main.csv", "data_type": "local"}
        },
        "customers":{
          "table_source": {"project_name": "AutoFE", "table_name": "group_customer_customers.csv", "data_type": "local"}
        },
        "transactions":{
          "table_source": {"project_name": "AutoFE", "table_name": "group_customer_transactions.csv", "data_type": "local"}
        },
        "products":{
          "primary_key": "product_id",
          "table_source": {"project_name": "AutoFE", "table_name": "group_customer_products.csv", "data_type": "local"}
        },
        "purchases":{
          "column_format":{"time":"posix"},
          "timestamp_column_name": "time",
          "table_source": {"project_name": "AutoFE", "table_name": "group_customer_purchase.csv", "data_type": "local"}
        }
      },
      "OneButtonMachine":{
        "main_table": "main",
        "target_column": "next_purchase",
        "max_depth": 2,
        "data_source": "mlz",
        "number_partitions": 100,
        "problem_type": "regression"
      }
    }"""
    // create a runner
    val onebm = new OneButtonMachine()
    // parse the runner configuration from a JSON string
    onebm.parse(json)
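Because the entity graph and the table definitions are entered by hand, a mismatched table name is easy to introduce. As a possible sanity check before calling `parse`, the sketch below mirrors the node, edge, and table names from the example configuration in plain Scala collections. It confirms that every edge endpoint is a declared node, that every node has an entry under "Tables", and that every table can be reached from the main table. The `Edge` case class and the `GraphCheck` object are hypothetical helpers written for this illustration; they are not part of the OneButtonMachine API.

```scala
// Hypothetical helper types; these are not part of the automl package.
final case class Edge(from: String, to: String)

object GraphCheck {
  // Names copied by hand from the example configuration.
  val nodes: Set[String] = Set("main", "customers", "transactions", "purchases", "products")
  val tables: Set[String] = Set("main", "customers", "transactions", "products", "purchases")
  val edges: Seq[Edge] = Seq(
    Edge("main", "customers"),
    Edge("main", "transactions"),
    Edge("main", "purchases"),
    Edge("transactions", "products")
  )

  // Edge endpoints that are not declared as nodes.
  def danglingEndpoints: Set[String] =
    edges.flatMap(e => Seq(e.from, e.to)).toSet -- nodes

  // Nodes that have no matching entry under "Tables".
  def nodesWithoutTable: Set[String] = nodes -- tables

  // Tables reachable from `start` by following edges in their stated direction.
  def reachableFrom(start: String): Set[String] = {
    @annotation.tailrec
    def loop(frontier: Set[String], seen: Set[String]): Set[String] =
      if (frontier.isEmpty) seen
      else {
        val next = edges.collect { case Edge(f, t) if frontier(f) => t }.toSet -- seen
        loop(next, seen ++ next)
      }
    loop(Set(start), Set(start))
  }

  def main(args: Array[String]): Unit = {
    println(s"dangling endpoints: $danglingEndpoints")
    println(s"nodes without table: $nodesWithoutTable")
    println(s"reachable from main: ${reachableFrom("main")}")
  }
}
```

For the example configuration, both problem sets are empty and all five tables are reachable from main. The sketch does not model `max_depth`, whose exact semantics are not described in this topic.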
- Authorize the OneButtonMachine object to access your repository service and data through the project context:
    val metaServicePath = "https://" + repositoryIP + ":443"
    // authenticate with the given authToken, then set repositoryIP and metaServicePath
    onebm.authenticate(authToken)
    onebm.set_repository_ip(repositoryIP)
    onebm.set_meta_service_path(metaServicePath)
- Load data into the runner and run automated feature extraction:
    // load data
    onebm.load_data(spark)
    // extract features
    var features = onebm.extract_features(spark, is_feature_selection = true, is_categorical_transformation = true)
  The OneButtonMachine tool will create 66 features from the specified relational data.
- Evaluate the automatically extracted features by using them to train a Random Forest model:
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
    import org.apache.spark.ml.feature.VectorAssembler
    // use every non-string column except the target and the primary key as a feature
    val column_types = features.dtypes.toMap
    val feature_columns = features.columns.filter(c => {c != "next_purchase" && c != "id" && column_types(c) != "StringType"})
    // fill missing values in the features with -9999.0
    features = features.na.fill(feature_columns.map(c => (c, -9999.0)).toMap)
    // assemble the feature columns into a single vector column
    val vectorAssembler_features = new VectorAssembler().
      setInputCols(feature_columns).
      setOutputCol("features")
    // split the data into training and test sets (30% held out for testing)
    val Array(trainingData, testData) = features.randomSplit(Array(0.7, 0.3))
    // define a RandomForest model
    val rf = new RandomForestRegressor().
      setLabelCol("next_purchase").
      setFeaturesCol("features")
    // chain the vector assembler and the forest in a Pipeline
    val pipeline = new Pipeline().setStages(Array(vectorAssembler_features, rf))
    // train the model; this also runs the vector assembler
    val model = pipeline.fit(trainingData)
    // make predictions
    val predictions = model.transform(testData)
    // select example rows to display
    predictions.select("prediction", "next_purchase", "features").show(5)
    // the forest is the second pipeline stage
    val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
- Print the RMSE and the top 10 features:
    // select (prediction, true label) and compute the test error
    val evaluator = new RegressionEvaluator().
      setLabelCol("next_purchase").
      setPredictionCol("prediction").
      setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    // output the value of the Root Mean Squared Error (RMSE)
    println("RMSE on test data = " + rmse)
    // print the top features discovered by OneBM, most important first
    val feature_importances = feature_columns.zip(rfModel.featureImportances.toArray).sortBy(-_._2)
    // take(10) also works when fewer than 10 features are available
    feature_importances.take(10).foreach { case (name, importance) =>
      println(name + " " + importance)
    }
The name of a feature indicates how it was created and extracted. See OneButtonMachine API for details on feature naming conventions.
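As a reference for reading the evaluation output, the sketch below computes RMSE by hand and ranks features by importance in plain Scala, without Spark. The predictions, labels, and feature names are made-up values for illustration only, and `EvaluationSketch` is a hypothetical helper. The formula is the standard root mean squared error, which is what the `rmse` metric of `RegressionEvaluator` reports.

```scala
object EvaluationSketch {
  // Root mean squared error: sqrt(mean((prediction - label)^2)).
  def rmse(predictions: Seq[Double], labels: Seq[Double]): Double = {
    require(predictions.nonEmpty && predictions.size == labels.size,
      "predictions and labels must be non-empty and the same length")
    val squaredErrors = predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }

  // Pair each feature name with its importance and keep the k largest.
  def topFeatures(names: Seq[String], importances: Seq[Double], k: Int): Seq[(String, Double)] =
    names.zip(importances).sortBy { case (_, importance) => -importance }.take(k)

  def main(args: Array[String]): Unit = {
    // Made-up values, not output from the sample data.
    val error = rmse(Seq(3.0, 5.0, 8.0), Seq(1.0, 5.0, 10.0))
    println(s"RMSE = $error")
    topFeatures(Seq("f_a", "f_b", "f_c"), Seq(0.2, 0.5, 0.3), k = 2)
      .foreach { case (name, importance) => println(s"$name $importance") }
  }
}
```

Because RMSE is expressed in the same unit as the target column, a lower value means that predictions are, on average, closer to the observed next_purchase values.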