Offloading your Informix data in Spark, Part 5

Machine Learning will help you extrapolate future orders

Add a little intelligence

Where are we? Where are we going?

In the first four parts of this series, you learned how to offload your Informix data into Spark and work with it there.

You might be wondering what else there is to do. It's true that this tutorial series has covered many aspects of Apache Spark, but there is so much more to discover with this analytics platform. It's time to explore one of the key features of Spark: its support for machine learning.

Through the previous parts of this series, you've used sales data for a store as an example. The idea is simply to show you what you can do with machine learning: Forecast future orders based on previous orders.

What you'll need:

  • The Spark ML (for machine learning) library, which is in the project on GitHub.
  • The stores_demo data set included with every Informix® database.
    Note: Don't worry if you don't have Informix knowledge. You do not need it to read and understand this tutorial. Nevertheless, feel free to consider IBM Informix as the RDBMS for your next project.
  • The code you used for parts 1-4.
  • For this part, the labs are in the net.jgp.labs.informix2spark.l5x0 package on GitHub.

Mathematics

I love math. Coming from the French centralized education system, you'd better love math if you want to get into the best engineering schools. This is also true in many other places, but France seems to be an extreme case. Unfortunately for me, the mathematics behind machine learning (ML) is mostly statistics and probability. As much as I enjoyed statistics, I am not the greatest fan of probability. Perhaps this is because of its random aspects in some scenarios.

Therefore, I always try to minimize the role of math in ML. I find this makes ML understandable to most of us.

Linear regression

Linear regression is the concept you will implement. Imagine the following graph: your x-axis (abscissa) is the week number, and your y-axis (ordinate) is the total amount of orders for that week. It should look like the image below.

Image shows orders plot chart

The idea behind linear regression is to draw a straight line, which is the least distant from all the points on the chart.
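
For the math lovers, here is a minimal formulation of what "least distant from all the points" means, in standard least-squares notation, where x_i is the week number, y_i the weekly order total, and the algorithm searches for the intercept and slope:

\hat{y}_i = \beta_0 + \beta_1 x_i
\qquad
\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2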

In this context, the regression line is:

Image shows orders plot chart with regression line

You can now imagine extending this line to see where it goes. But first, you need to know how to get the data.

Getting the data with Spark

Use the examples in the previous parts of this series and adapt them to get the orders, then group the sales amount by week. The output should look like this:

+----------+----------------+
|order_week|sum(total_price)|
+----------+----------------+
|        21|         4387.00|
|        22|         2144.00|
|        23|          940.00|
|        24|          450.00|
|        25|         1366.80|
|        26|         2544.00|
|        28|         3652.97|
|        30|         2670.00|
+----------+----------------+

Note: In this part of the series, I'm not explaining every part of the code. By this time, you should feel comfortable reading the code without breaking it into small chunks. That said, if you have issues, please ask questions in the comments.

Your code should look like this:

package net.jgp.labs.informix2spark.l500;

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.weekofyear;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

import net.jgp.labs.informix2spark.utils.Config;
import net.jgp.labs.informix2spark.utils.ConfigManager;
import net.jgp.labs.informix2spark.utils.InformixJdbcDialect;
import net.jgp.labs.informix2spark.utils.K;

public class OrdersPerWeekApp {

  public static void main(String[] args) {
    OrdersPerWeekApp app = new OrdersPerWeekApp();
    app.start();
  }

  private void start() {
    SparkSession spark;

    spark = SparkSession
        .builder()
        .appName("Sales per week")
        .master("local")
        .getOrCreate();

    // List of all tables we want to work with
    List<String> tables = new ArrayList<>();
    tables.add("orders");
    tables.add("items");

    // Specific Informix dialect
    JdbcDialect dialect = new InformixJdbcDialect();
    JdbcDialects.registerDialect(dialect);

    // Let's connect to the database
    Config config = ConfigManager.getConfig(K.INFORMIX);

    // Let's build our datalake
    Map<String, Dataset<Row>> datalake = new HashMap<>();
    for (String table : tables) {
      System.out.print("Loading table [" + table + "] ... ");
      Dataset<Row> df = spark.read()
          .format("jdbc")
          .option("url", config.getJdbcUrl())
          .option("dbtable", table)
          .option("user", config.getUser())
          .option("password", config.getPassword())
          .option("driver", config.getDriver())
          .load();

      datalake.put(table, df);
      System.out.println("done");
    }

    System.out.println("We have loaded " + datalake.size()
        + " table(s) in our data lake");

    // Let's look at the content
    Dataset<Row> ordersDf = datalake.get("orders");
    Dataset<Row> itemsDf = datalake.get("items");

    // Builds the datasets in 2 steps, first with the week number...
    Dataset<Row> allDf = ordersDf
        .join(
            itemsDf,
            ordersDf.col("order_num").equalTo(itemsDf.col("order_num")),
            "full_outer")
        .drop(ordersDf.col("customer_num"))
        .drop(itemsDf.col("order_num"))
        .withColumn("order_week", lit(weekofyear(ordersDf.col("order_date"))));
    // ... then with the total price summed and grouped by week
    allDf = allDf
        .groupBy(allDf.col("order_week"))
        .sum("total_price")
        .orderBy(allDf.col("order_week"));

    allDf.show(50);
  }
}

The "meat" of this app starts when you create the allDF dataframe. First, create a column named order_week based on order_date. Use the weekofyear() static method to determine the week number from a date and the lit() static method to create a column from scratch in your dataframe. Both methods are statically imported at the beginning of the code.

Data quality

It is always important to look at the data. You might not catch every anomaly by doing so, especially with big data. But by looking at it, you can see that weeks 27 and 29 are missing. From this observation, you have (at least) two decisions to make:

  1. Ignore the missing data. Perhaps the central system has not been updated yet (it would not be the first time), or maybe it's the intern who crashed the system the other day.
  2. Assume there weren't any orders, which means you would have to insert two rows with an amount of 0 (see the sketch below).

I recommend you go with the first solution: don't blame it on the interns, but keep track of your decision.
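
If you ever go with the second option, here is a minimal sketch of what it could look like, assuming the aggregated dataframe is still named allDf as in the listing above (you would also need RowFactory, Arrays, List, and BigDecimal in your imports):

// Hypothetical sketch: append the missing weeks 27 and 29 with a zero total,
// reusing allDf's own schema so that the union types line up.
List<Row> missingWeeks = Arrays.asList(
    RowFactory.create(27, new BigDecimal("0.00")),
    RowFactory.create(29, new BigDecimal("0.00")));
Dataset<Row> missingDf = spark.createDataFrame(missingWeeks, allDf.schema());
allDf = allDf.union(missingDf).orderBy("order_week");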

The two-second introduction to machine learning

ML algorithms can be complex. However, the principle is really simple: you build (or train) a model, then you apply this model to a data set to predict an outcome. In this scenario, you will only execute step 2 once, but you can easily imagine scenarios where the model does not change and is applied again and again in steps 3, 4, and so on.

Image shows step 1: learning phase

Of course, as a data professional, you can imagine the full spectrum of lifecycle activities deriving from this model: validating, refining, testing, etc. However, these activities are slightly outside the scope of this primer.

Building the model

You have the data, and you have the theory. Now you can practice.

Your first step is to prepare the data for the ML trainer to digest. The preparation varies depending on the type of algorithm, but linear regression expects features and labels.

In essence, the label is what you are studying, and the features define it. So, if you look at the orders of week 28 where you made $3,652.97, the label is 3652.97, and one of its features is 28.

Image shows the orders plot with the regression line, highlighting the week 28 order total (label) and the week number (feature)

You can add more features, such as:

  • Temperature
  • Precipitation level
  • Total of orders during the same week of previous years
  • Number of days before or after a holiday, etc.

I remember a friend of mine who sold swimming pools. He had a lead time of roughly six months. He sold more pools when it was sunny, so adding the amount of sunshine to his model made sense.

A common mistake is to confuse the label and the features, especially in a case like this one, where you have only one feature.

To use a linear regression, Spark expects a vector of features, even if your vector contains only one element. Basically, Spark expects the following dataframe:

+----------+----------------+-------+--------+
|order_week|sum(total_price)|  label|features|
+----------+----------------+-------+--------+
|        21|         4387.00|4387.00|  [21.0]|
|        22|         2144.00|2144.00|  [22.0]|
|        23|          940.00| 940.00|  [23.0]|
|        24|          450.00| 450.00|  [24.0]|
|        25|         1366.80|1366.80|  [25.0]|
|        26|         2544.00|2544.00|  [26.0]|
|        28|         3652.97|3652.97|  [28.0]|
|        30|         2670.00|2670.00|  [30.0]|
+----------+----------------+-------+--------+

You could have simply renamed the sum(total_price) column to label, but because both the label and features columns are merely technical constraints for the linear regression algorithm, I prefer to keep the data separate from the technical constraints.
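
For the record, the renaming approach mentioned above would be a one-liner; this sketch is not what the rest of the tutorial uses:

// Rename the aggregate column in place instead of keeping a separate copy.
Dataset<Row> labeledDf = allDf.withColumnRenamed("sum(total_price)", "label");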

To build the vector, you can use a user-defined function (UDF). This extension creates a vector from the original value.

package net.jgp.labs.informix2spark.l520;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilderInteger implements UDF1<Integer, Vector> {
  private static final long serialVersionUID = -2991355883253063841L;

  @Override
  public Vector call(Integer t1) throws Exception {
    double d = t1.doubleValue();
    return Vectors.dense(d);
  }
}

The UDF implements UDF1<Integer, Vector>: Integer is the input type and Vector is the return type. Vectors expect double values, so you need to transform the integer into a double.

Before using the UDF, you have to register it in the Spark session. Make sure to register the UDF right after you create the Spark session.

    // VectorUDT comes from the org.apache.spark.ml.linalg package
    spark.udf().register("vectorBuilder", new VectorBuilderInteger(), new VectorUDT());

In this scenario:

  • vectorBuilder is the name of the function you are adding to Spark SQL.
  • VectorBuilderInteger is the class implementing the UDF.
  • VectorUDT is the return type.

In your transformation code, you can simply call the vectorBuilder() function to create the column.

    // callUDF() and col() are static imports from org.apache.spark.sql.functions
    Dataset<Row> df = allDf
        .withColumn("values_for_features", allDf.col("order_week"))
        .withColumn("label", allDf.col("sum(total_price)"))
        .withColumn("features", callUDF("vectorBuilder", col("values_for_features")))
        .drop(col("values_for_features"));
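
As a side note, Spark also ships a VectorAssembler transformer (org.apache.spark.ml.feature.VectorAssembler) that can build the features column from numeric columns without writing a UDF. Here is a minimal sketch of that alternative, not used in the rest of this tutorial:

// Assemble the numeric order_week column into a one-element features vector.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] { "order_week" })
    .setOutputCol("features");
Dataset<Row> df = assembler.transform(
    allDf.withColumn("label", allDf.col("sum(total_price)")));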

Now that you have the data in the correct form, creating your model only takes two lines of code.

    LinearRegression lr = new LinearRegression().setMaxIter(20);
    LinearRegressionModel model = lr.fit(df);
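
The trainer looks for the label and features columns by name. The defaults are exactly the names you just created, but you could make them explicit; a sketch, assuming the imports org.apache.spark.ml.regression.LinearRegression and LinearRegressionModel:

LinearRegression lr = new LinearRegression()
    .setMaxIter(20)               // cap the number of optimization iterations
    .setLabelCol("label")         // default column name, shown for clarity
    .setFeaturesCol("features");  // default column name, shown for clarity
LinearRegressionModel model = lr.fit(df);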

What about a little introspection?

This section is optional. Imagine that I added it to build suspense before the end goal of discovering future orders, but I also added it for those math lovers who want to understand that there really is no crystal ball, just some methodology and science.

Spark provides the tools needed to inspect your model. First, apply the model to the full dataframe you had:

    model.transform(df).show();

This adds a prediction column (the value on the linear regression line).

+----------+----------------+-------+--------+------------------+
|order_week|sum(total_price)|  label|features|        prediction|
+----------+----------------+-------+--------+------------------+
|        21|         4387.00|4387.00|  [21.0]|2101.3694797687876|
|        22|         2144.00|2144.00|  [22.0]|2144.7183236994233|
|        23|          940.00| 940.00|  [23.0]|2188.0671676300585|
|        24|          450.00| 450.00|  [24.0]|2231.4160115606937|
|        25|         1366.80|1366.80|  [25.0]|2274.7648554913294|
|        26|         2544.00|2544.00|  [26.0]|2318.1136994219646|
|        28|         3652.97|3652.97|  [28.0]|2404.8113872832355|
|        30|         2670.00|2670.00|  [30.0]| 2491.509075144506|
+----------+----------------+-------+--------+------------------+

Here's a look at the different mathematical computations associated with the model:

    LinearRegressionTrainingSummary trainingSummary = model.summary();
    System.out.println("numIterations: " + trainingSummary.totalIterations());
    System.out.println("objectiveHistory: "
        + Vectors.dense(trainingSummary.objectiveHistory()));
    trainingSummary.residuals().show();
    System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
    System.out.println("r2: " + trainingSummary.r2());

This code returns:

numIterations: 1
objectiveHistory: [0.0]
+-------------------+
|          residuals|
+-------------------+
| 2285.6305202312124|
|-0.7183236994233084|
|-1248.0671676300585|
|-1781.4160115606937|
| -907.9648554913294|
|  225.8863005780354|
| 1248.1586127167643|
|  178.4909248554941|
+-------------------+

RMSE: 1246.0139337359603
r2: 0.009719742211204974

Let's look at one of those criteria. The root-mean-square error (RMSE), also called root-mean-square deviation (RMSD), measures the differences between the values predicted by a model or an estimator and the values observed. Because it is a distance, the smaller the number, the better. And when you compare it to the values of the labels here, it means that the model is pretty far off, which is not good.
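
As a reminder, this is the standard definition over n observations, where \hat{y}_i is the prediction and y_i the observed label:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}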

While this is not good, the explanation is easy: there is a great disparity in the labels, and you have only a handful of data points and a single feature. This is definitely not big data.

The other values you can inspect are the intercept of the line, the regularization parameter, and the convergence tolerance of the iterations.

double intercept = model.intercept();
System.out.println("Intersection: " + intercept);
double regParam = model.getRegParam();
System.out.println("Regression parameter: " + regParam);
double tol = model.getTol();
System.out.println("Tol: " + tol);

And the results are:

Intersection: 1191.0437572254443
Regression parameter: 0.0
Tol: 1.0E-6
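
If you also want the slope of the line, that is, the coefficient applied to the week number, the model exposes it too; a quick addition to the code above:

// One coefficient per feature; here, the slope applied to the week number.
System.out.println("Coefficients: " + model.coefficients());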

The not-that-magic crystal ball

You are now ready to predict orders for the next three weeks. You are about to discover the complex code that does it. To let the suspense grow, here is the code first:

for (double feature = 31.0; feature < 34; feature++) {
  Vector features = Vectors.dense(feature);
  double p = model.predict(features);

  System.out.printf("Total orders prediction for week #%d is $%4.2f.\n",
      Double.valueOf(feature).intValue(), 
      p);
}

Remember what you saw before: features are stored in a vector. Even if you have only one feature (the week number), the model still expects a vector, so you need to build that vector from the feature.

Then you can call the predict() method from the model, using the vector. That's it; you just made your first prediction!

It takes seven lines of code to apply the model to the new features (here, the week number). Among these seven lines, three are for display and two are for the loop. The results:

Total orders prediction for week #31 is $2534.86.
Total orders prediction for week #32 is $2578.21.
Total orders prediction for week #33 is $2621.56.

Give yourself a high-five! You followed this long, and hopefully not too painful, tutorial series, and you even discovered that your company's orders are on the rise.

What you learned

This fifth part of the Offloading your Informix Data in Spark series taught you:

  • A use case of ML based on RDBMS data, showing that ML does not always need tons of data.
  • A little bit of mathematics, such as how the RMSE measures the quality of your model.
  • The importance of data quality (and I should probably teach more DQ).
  • That linear regression is an easy form of ML.
  • That ML follows a simple process: train the model, then reuse it on new data.

Farewell

This is also the last part of this series. I sincerely hope you enjoyed it. I took great pleasure in writing each part, and I want to thank the support team at IBM, especially Robin Wood, who showed patience and tolerance for my Frenglish, and provided a lot of help. Thanks, Robin.

Let's keep in touch via Twitter (@jgperrin), email jgp@jgp.net (I reply to all emails), or in the comments below. See you for more Spark content and at IBM Think in 2018!

