Extracting features from relational data with OneButtonMachine

Features are essential for training machine learning models, but engineering meaningful features from complex relational data is often tedious and time-consuming. OneButtonMachine, the WML for z/OS tool for automated feature engineering on relational data, constructs and extracts features automatically.

Before you begin

Sample relational data sets are shipped with WML for z/OS. You can download and use them to try out the OneButtonMachine tool. Make sure that you have completed the following tasks:

Install and configure WML for z/OS as described in Roadmap for installing and configuring WML for z/OS.
Obtain the IP address of your WML for z/OS repository service.
Ensure that you are authorized to access WML for z/OS as described in Managing users and privileges.

Procedure

Sign in to the WML for z/OS web user interface with your user name and password.
Download the "group_customer" sample data sets from the mlz-samples project.
1. From the sidebar, navigate to Projects - View all Projects.
2. From the Project List, select and click mlz-samples to open the project.
3. Click Assets (number + text) to display all the assets for the project, including notebooks, models, and data sets.
4. Download the following data files from the Data Sets list:
  - group_customer_transactions.csv
  - group_customer_purchase.csv
  - group_customer_products.csv
  - group_customer_main.csv
  - group_customer_customers.csv
  Click the ACTION menu for each data set and select Export to download it.
  
  These five data files contain the transaction data of a group of customers who purchase products together, including members' demographics, their purchase history, and the products that are purchased. The transaction data will be used to extract features for predicting future purchases by the group.
  
  The group_customer_main.csv file contains the main table while the other four contain contextual tables. The main table includes a prediction target column called next_purchase and connects to the contextual tables through foreign keys.
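
  To make the split between main and contextual tables concrete, the following minimal sketch in plain Scala (no Spark required) shows the general idea of deriving features by following a foreign key from a main table into a contextual table and aggregating the matching rows. The case classes, column names, and values are hypothetical and are not taken from the sample files, and the sketch does not claim to reproduce how OneButtonMachine works internally.

```scala
// Hypothetical rows; the real sample files have their own columns.
final case class MainRow(id: Int, groupId: Int)
final case class PurchaseRow(groupId: Int, amount: Double)

object ForeignKeyFeatureSketch {
  // For each main row, follow the groupId foreign key into the purchases
  // table and aggregate the matching rows into (purchase count, total amount).
  def purchaseFeatures(
      main: Seq[MainRow],
      purchases: Seq[PurchaseRow]): Map[Int, (Int, Double)] = {
    val byGroup: Map[Int, Seq[PurchaseRow]] = purchases.groupBy(_.groupId)
    main.map { row =>
      val matched = byGroup.getOrElse(row.groupId, Seq.empty[PurchaseRow])
      row.id -> ((matched.size, matched.map(_.amount).sum))
    }.toMap
  }

  // Made-up sample data: group 30 has no purchases at all.
  val mainRows: Seq[MainRow] = Seq(MainRow(1, 10), MainRow(2, 20), MainRow(3, 30))
  val purchaseRows: Seq[PurchaseRow] =
    Seq(PurchaseRow(10, 5.0), PurchaseRow(10, 7.5), PurchaseRow(20, 2.0))

  // Derived features keyed by the main table's id column.
  val derived: Map[Int, (Int, Double)] = purchaseFeatures(mainRows, purchaseRows)
}
```

  Each main row ends up with one value per aggregate, including rows whose group has no matching contextual rows, which is why missing or empty values need handling before a model is trained.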
Locate or create a new AutoFE project.
1. From the sidebar, navigate to Projects - View all Projects and locate the AutoFE project.
2. If the project does not exist yet, create a new one by clicking New Project.
3. Select the Blank tab, enter AutoFE as the project name, and click Create.
  The new project opens to its main workspace page with links to Assets, Environments, Data Sources, and Collaborators pages.
Create a new notebook in Scala.
1. Click Assets on the AutoFE project page and then click add notebook on the Assets page.
2. Select the Blank tab to create a notebook from scratch.
3. Enter groupCustomerFeatures as the notebook name. Select IBM_Open_Data_Analytics_for_zOS for Environment and Scala for Language.
4. Click Create to create the new notebook.
  The groupCustomerFeatures notebook is saved and opened in the Notebook editor.
Add the "group_customer" sample data sets you downloaded to the AutoFE project.
1. Click the pull-down menu (v) of the Create new (+) tool and select Add data set.
2. Drag and drop the sample data sets into the Local File box and upload them to the AutoFE project.
Prepare the project environment by clicking the pull-down menu (v) of the Create new (+) tool and selecting Insert project context.
Define your repository service by its IP address as shown in the following example:
```
val repositoryIP = pc.repositoryIp
```

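Import OneButtonMachine from the automl package, describe the tables and their relationships in a JSON configuration string, and create a OneButtonMachine object that parses the configuration, as shown in the following example: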


import com.ibm.analytics.automl.OneButtonMachine

val json = """{
    "Entity_Graph":{
        "nodes":[
            {"table_name": "main"},
            {"table_name": "customers"},
            {"table_name": "transactions"},
            {"table_name": "purchases"},
            {"table_name": "products"}
        ],
        "edges":[
            {"from":"main", "to": "customers", "from_column":
                ["group_customer_id"], "to_column": ["group_customer_id"]},
            {"from":"main", "to": "transactions", "from_column":
                ["transaction_id"], "to_column": ["transaction_id"]},
            {"from":"main", "to": "purchases", "from_column":
                ["group_id"], "to_column": ["group_id"]},
            {"from":"transactions", "to": "products", "from_column":
                ["product_id"], "to_column": ["product_id"]}
        ]
    },
    "Tables":{
        "main":{
            "column_format":{"time":"posix"},
            "primary_key": "id",
            "timestamp_column_name": "time",
            "table_source": {"project_name": "AutoFE", "table_name":
                "group_customer_main.csv", "data_type": "local"}
        },
        "customers":{
            "table_source": {"project_name": "AutoFE", "table_name":
                "group_customer_customers.csv", "data_type": "local"}
        },
        "transactions":{
            "table_source": {"project_name": "AutoFE", "table_name":
                "group_customer_transactions.csv", "data_type": "local"}
        },
        "products":{
            "primary_key": "product_id",
            "table_source": {"project_name": "AutoFE", "table_name":
                "group_customer_products.csv", "data_type": "local"}
        },
        "purchases":{
            "column_format":{"time":"posix"},
            "timestamp_column_name": "time",
            "table_source": {"project_name": "AutoFE", "table_name":
                "group_customer_purchase.csv", "data_type": "local"}
        }
    },
    "OneButtonMachine":{
        "main_table": "main",
        "target_column": "next_purchase",
        "max_depth": 2,
        "data_source": "mlz",
        "number_partitions": 100,
        "problem_type": "regression"
    }
}"""

// create a runner
val onebm = new OneButtonMachine()

// parse runner configuration from a JSON string
onebm.parse(json)
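
Because the configuration is hand-written JSON, a misspelled table name in an edge may go unnoticed until later steps fail. The following minimal sketch, in plain Scala, mirrors the nodes and edges of the example as ordinary values and checks that every edge refers to a declared table and pairs the same number of columns on both sides. The `Edge` class and the `EntityGraphCheck` object are hypothetical helpers for illustration only; they are not part of the OneButtonMachine API.

```scala
// Hypothetical mirror of one entry in the "edges" array of the configuration.
final case class Edge(
    from: String,
    to: String,
    fromColumns: Seq[String],
    toColumns: Seq[String])

object EntityGraphCheck {
  // Table names and edges copied from the example configuration.
  val nodes: Set[String] =
    Set("main", "customers", "transactions", "purchases", "products")

  val edges: Seq[Edge] = Seq(
    Edge("main", "customers", Seq("group_customer_id"), Seq("group_customer_id")),
    Edge("main", "transactions", Seq("transaction_id"), Seq("transaction_id")),
    Edge("main", "purchases", Seq("group_id"), Seq("group_id")),
    Edge("transactions", "products", Seq("product_id"), Seq("product_id"))
  )

  // Returns one message per problem found; an empty result means the graph
  // passed these two basic checks (it does not validate the CSV contents).
  def problems(declared: Set[String], graphEdges: Seq[Edge]): Seq[String] =
    graphEdges.flatMap { e =>
      val unknown = Seq(e.from, e.to)
        .filterNot(name => declared.contains(name))
        .map(name => s"unknown table '$name' in edge ${e.from} -> ${e.to}")
      val mismatch =
        if (e.fromColumns.size != e.toColumns.size)
          Seq(s"column count mismatch in edge ${e.from} -> ${e.to}")
        else Seq.empty[String]
      unknown ++ mismatch
    }
}
```

Calling `EntityGraphCheck.problems(EntityGraphCheck.nodes, EntityGraphCheck.edges)` on the values above returns an empty sequence; an edge that names an undeclared table would produce a message naming that table.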

Authorize the OneButtonMachine object to access your repository service and data through the project context:


val metaServicePath = "https://"+ repositoryIP +":443"

// authenticate with given authToken, set repositoryIP and metaServicePath
onebm.authenticate(authToken)
onebm.set_repository_ip(repositoryIP)
onebm.set_meta_service_path(metaServicePath)

Load data into the runner and run automated feature extraction:


// load data
onebm.load_data(spark)

// extract features
var features = onebm.extract_features(spark, is_feature_selection = true, 
    is_categorical_transformation = true)

For the sample data, the OneButtonMachine tool creates 66 features from the specified relational tables.

Evaluate the automatically extracted features by using them to train a Random Forest model:


import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
     RandomForestRegressor}
import org.apache.spark.ml.feature.VectorAssembler

// select the non-string feature columns, excluding the target and the key
val column_types = features.dtypes.toMap
val feature_columns = features.columns.filter(c =>
    c != "next_purchase" && c != "id" && column_types(c) != "StringType")

// fill missing values in the features with -9999.0
features = features.na.fill(feature_columns.map(c => (c, -9999.0)).toMap)

val vectorAssembler_features = new VectorAssembler().
    setInputCols(feature_columns).setOutputCol("features")

// split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = features.randomSplit(Array(0.7, 0.3))

// train a RandomForest model.
val rf = new RandomForestRegressor().setLabelCol("next_purchase").
    setFeaturesCol("features")

// chain the vector assembler and the forest in a Pipeline.
val pipeline = new Pipeline().setStages(Array(vectorAssembler_features, rf))

// train the model. This also runs the vector assembler.
val model = pipeline.fit(trainingData)

// make predictions.
val predictions = model.transform(testData)

// select example rows to display.
predictions.select("prediction", "next_purchase", "features").show(5)

// keep the fitted forest (pipeline stage 1) to read its feature importances.
val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]

Print the RMSE and the top 10 features:


// select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator().setLabelCol("next_purchase").
    setPredictionCol("prediction").setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)

// output the value of Root Mean Squared Error (RMSE)
println("RMSE on test data = " + rmse)

// print the top 10 features discovered by OneBM, most important first
// (take(10) is safe even if fewer than 10 feature columns exist)
val feature_importances = feature_columns.zip(rfModel.
    featureImportances.toArray).sortBy(-_._2)
feature_importances.take(10).foreach { case (name, importance) =>
    println(name + " " + importance)
}
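
For reference, RMSE is the square root of the mean squared difference between predictions and true labels, so a lower value is better and the value is expressed in the units of next_purchase. The following minimal sketch computes it in plain Scala; the `RmseSketch` object and the sample values are hypothetical, and in the notebook the RegressionEvaluator remains the way to obtain the metric.

```scala
object RmseSketch {
  // pairs holds (prediction, label) values; RMSE = sqrt(mean((p - l)^2)).
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, label) pair")
    val meanSquaredError =
      pairs.map { case (p, l) => (p - l) * (p - l) }.sum / pairs.size
    math.sqrt(meanSquaredError)
  }

  // Made-up values: errors of -1 and 7 give sqrt((1 + 49) / 2) = 5.0.
  val example: Double = rmse(Seq((1.0, 2.0), (10.0, 3.0)))
}
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.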

The name of a feature indicates how it was created and extracted. See OneButtonMachine API for details on feature naming conventions.