UPDATE: Click here to get sample code and data file
In this blog, I'm going to describe the steps you can take to have an IBM Data Science Experience with TensorFlow. You can use these steps to create a Jupyter Python notebook that
- reads training data from a BigSQL table into a Pandas dataframe
- uses TensorFlow to train a simple machine learning model with the data
- saves the machine learned model to a local dataset within IBM Data Science Experience
- restores the machine learned model from the local dataset
- performs inferences with the restored machine learned model
As a preliminary, I downloaded the California housing data from the same source used by the scikit-learn fetch_california_housing() method. This gave me a CSV file with a sample dataset that maps house prices to several predictor variables such as house age, number of bedrooms, and municipal population. That scikit-learn method does a lot of work to download the data and parse it into the NumPy array needed as input to TensorFlow, but rather than take that shortcut, I'd like to describe the process as if we were getting the data from an Enterprise Data Lake or similar data source.
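If you want to inspect the raw file before loading it anywhere, a minimal sketch like the following works. Note the assumptions: the raw file has no header row, the column order matches the table definition shown below, and the two data rows here are made up purely for illustration.

```python
import io
import pandas as pd

# Column names matching the HOUSINGDATA table definition (an assumption
# about the column order in the raw CSV file).
columns = ["longitude", "latitude", "housingMedianAge", "totalRooms",
           "totalBedrooms", "population", "households",
           "medianIncome", "medianHouseValue"]

# Two made-up rows standing in for cal_housing_data.csv
raw_csv = io.StringIO(
    "-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0\n"
    "-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0\n"
)

df = pd.read_csv(raw_csv, header=None, names=columns)
print(df.shape)  # (2, 9)
```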
Once I had the raw CSV file (no post-processing by fetch_california_housing()), I used the internal IBM Enterprise Data Lake web browser interface to upload it to HDFS. Then, I used my Eclipse database development tool to create a Hadoop table called HOUSINGDATA in a shared BigSQL database within the internal IBM Enterprise Data Lake and loaded the table with the CSV file content. The SQL code I ran looks like this:
CREATE HADOOP TABLE IF NOT EXISTS BOYERJ_CA_IBM_COM.HOUSINGDATA (
longitude DOUBLE, latitude DOUBLE, housingMedianAge DOUBLE,
totalRooms DOUBLE, totalBedrooms DOUBLE, population DOUBLE,
households DOUBLE, medianIncome DOUBLE, medianHouseValue DOUBLE)
stored as parquet location '/opt/sandboxes/boyerj_ca_ibm_com/housingdata';
LOAD HADOOP USING FILE URL '/opt/sandboxes/boyerj_ca_ibm_com/csvBackup/cal_housing_data.csv'
WITH SOURCE PROPERTIES('field.delimiter' = ',' )
INTO TABLE BOYERJ_CA_IBM_COM.HOUSINGDATA OVERWRITE;
Alongside the IBM Enterprise Data Lake, we have a deployment of IBM Data Science Experience Local, but you can replicate these experiences using the public cloud IBM Data Science Experience.
You can get started by pressing "Create Project" to create a workspace that you can work in and, if you like, share with others. Projects are backed by Git, so you have a lightweight collaboration method in which you can commit and accept changes with others. You can also "Export" the project as a zip file containing all assets. Once you click on a project to get into it, you can create various kinds of assets, including data sources, data sets, and Jupyter and Zeppelin notebooks in Python, Scala, and R.
The first thing I did was go to the project's "Data Sources" list so I could create a connection to the BigSQL database containing my HOUSINGDATA table. This is where I set just the "JDBC URL" and my user credentials. The JDBC URL looks something like this "jdbc:db2://SharedServer.ibm.com:52000/BIGSQL:sslConnection=true". My JDBC URL also contains a parameter that tells the path to an additional trust store containing more certificate authority certificates, but this is because I needed to point to a shared internal database. And my user credentials authenticate me with that database. I named the data source "MY SHARED DATABASE" and then hit "Create".
The next thing I did was go to the project's "Assets" list, and scrolled down to the "Data Sets" list so I could "add data set". I switched to "Remote Data Set". I gave the name "HOUSING DATA from MY SHARED DATABASE" to the dataset. Then, for the remote settings, I chose the "MY SHARED DATABASE" data source and set the table name to "HOUSINGDATA". In my case, my tables are stored in a schema associated with my user ID, to keep them separate from other users of the same shared database, so I also entered that schema name, then hit "Save" to create the data set.
Now that the data source and data set have been created in the project, we can use them in Python. Again in the project's "Assets" list, I hit "add notebook". I gave the notebook a name of "TensorFlow Sample" and chose Jupyter Python for the tool and language, then hit "Create".
Once I clicked on the notebook to get into it, I was able to insert automatically generated code to get my HOUSINGDATA into a Pandas dataframe. One can also choose a Spark DataFrame, but the Pandas dataframe was sufficient for this sample. I clicked the insert-code ("10"/"01") menu icon and switched to the "Remote" list. On the "HOUSING DATA from MY SHARED DATABASE" data set, I clicked "Insert to code" and then "Insert Pandas DataFrame". This inserts Python code that performs a few relevant imports, makes the database connection, and runs a "select * from" query on the "HOUSINGDATA" table into a Pandas DataFrame. The important bits for understanding the TensorFlow code look like this:
import pandas as pd
# conn is a jaydebeapi.connect() connection to the shared server BigSQL data source
# fqTableName is equal to BOYERJ_CA_IBM_COM.HOUSINGDATA
df1 = pd.read_sql('select * from ' + fqTableName, con=conn)
Now, we start getting into interesting code. This first snippet just imports NumPy and then extracts data from the Pandas dataframe into NumPy arrays, which is what TensorFlow needs as input. Because we'll be training a simple model with only 20,640 rows of data, we're loading it all at once; for larger training sets, you'd feed the data in smaller batches. The "housing_data" array holds the 20,640 values for each of the 8 predictor variables, and "housing_target" is the vector of 20,640 house values that we will be learning how to predict. The remaining two lines are just a little housekeeping.
import numpy as np
housing_data = df1.as_matrix(columns=[u'LONGITUDE', u'LATITUDE', u'HOUSINGMEDIANAGE', u'TOTALROOMS', u'TOTALBEDROOMS', u'POPULATION', u'HOUSEHOLDS', u'MEDIANINCOME'])
housing_target = df1.as_matrix(columns=[u'MEDIANHOUSEVALUE'])
m, n = housing_data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing_data]
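To see what that last housekeeping line does, here is a tiny self-contained sketch (the 3x2 array is made up purely for illustration): np.c_ prepends a column of ones, which acts as the bias (intercept) term in the regression.

```python
import numpy as np

# A made-up 3x2 "predictor" matrix standing in for housing_data
data = np.array([[2.0, 3.0],
                 [4.0, 5.0],
                 [6.0, 7.0]])

m, n = data.shape                      # m = 3 samples, n = 2 predictors
data_plus_bias = np.c_[np.ones((m, 1)), data]

print(data_plus_bias)
# [[1. 2. 3.]
#  [1. 4. 5.]
#  [1. 6. 7.]]
```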
So, now we're going to do the world's simplest machine learning model because we'd like to be able to see the elements of TensorFlow-based machine learning with as little problem complexity as possible getting in the way of understanding. The word 'tensor' just means n-dimensional array, and TensorFlow is a library that makes it easy to specify a computational 'flow' of tensors and then to execute that flow in the most efficient way possible given the compute power available to TensorFlow. In essence, the data scientist describes what computations must occur, and then TensorFlow determines how to do the computations efficiently.
We're going to start by defining the 'flow' or computation graph that TensorFlow will run. In this case, we're going to define the compute tree for training a multiple linear regression using the 8 predictor variables and the housing value variable that we'd like to learn how to predict. Here's what that looks like:
import tensorflow as tf
# Make the compute graph
X = tf.constant(housing_data_plus_bias, dtype=tf.float64, name="X")
XT = tf.transpose(X)
y = tf.constant(housing_target.reshape(-1, 1), dtype=tf.float64, name="y")
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)
The X variable is the matrix of training inputs: 20,640 sample rows, each consisting of a bias column of ones followed by the 8 predictor values. XT is the transpose needed in the linear regression computation. The 'y' variable is the dependent variable, and it is assigned the 20,640 housing values we have in the training data. The 'theta' variable is the vector of linear regression coefficients that results from the series of matrix operations on the right-hand side, which is the normal equation theta = (X^T X)^-1 X^T y.
It is important to note that the code above just specifies the compute graph, i.e. the tensor flow. To perform the flow, you then run the following code:
# Run the compute graph
with tf.Session() as sess:
    theta_value = theta.eval()
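As a sanity check on what this flow computes, the same normal equation can be reproduced in plain NumPy on a tiny synthetic dataset (all of the numbers below are made up for illustration; this isn't the housing data):

```python
import numpy as np

# Synthetic, noiseless data generated from a known linear equation:
# y = 2 + 3*x1 - 1*x2
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = 2.0 + 3.0 * X_raw[:, 0] - 1.0 * X_raw[:, 1]

# Add the bias column, then solve the normal equation:
# theta = (X^T X)^-1 X^T y
X = np.c_[np.ones((100, 1)), X_raw]
theta = np.linalg.inv(X.T @ X) @ X.T @ y.reshape(-1, 1)

print(theta.ravel())  # approximately [ 2.  3. -1.]
```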
If you then run a line of code to output theta_value, you will see the vector of nine learned coefficients.
For a linear regression, this is the machine learned model. It gives the coefficients of a linear equation that is the best fit to the training data. Given values for the 8 predictor variables like house age and number of bedrooms, these coefficients can be used to predict a house value. We'll see how to do that below, but first, we're going to see how to save and reload the model in TensorFlow because you would typically want to save a model trained in IBM Data Science Experience and then transport it to a production deployment environment, where you'd want to restore it before actually using it for inference (prediction).
The first time I ran my notebook in IBM Data Science Experience, I used this line to create a subdirectory in datasets where I could save the TensorFlow model from this notebook:
!mkdir "../datasets/Linear Regression"
Then, to save the model, I defined a second simple TensorFlow compute model that just assigned the theta_value vector to a variable called "model". The code below creates and then executes this simple tensor flow, and then saves the result in the subdirectory created above.
model = tf.Variable(tf.constant(theta_value, dtype=tf.float64), name="model")
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as saver_sess:
    saver_sess.run(init)  # initialize the "model" variable before evaluating it
    theta_value = model.eval()
    save_path = saver.save(saver_sess, "../datasets/Linear Regression/Linear Regression.ckpt")
The save method we're using here is useful to know about because it is the same "checkpoint" method that you would use if you were incrementally training a larger model in epochs. It's also useful to understand that what we're saving is the compute graph and the tf.Variable TensorFlow variables and values defined in the model we're checkpointing. In other words, what gets saved is specific to the type of model you're training because the type of model affects the compute graph, or tensor flow, that you specified. In a neural net, for example, you'd have to save the structure of the net in addition to the weights and biases. For a linear regression, we already know the structure is a linear equation, so we just save the coefficients. Regardless of what is being saved, TensorFlow actually saves four files, as shown by the line of code below and its output:
!ls "../datasets/Linear Regression"
checkpoint Linear Regression.ckpt.index
Linear Regression.ckpt.data-00000-of-00001 Linear Regression.ckpt.meta
Now, suppose you were to move these four files to a production deployment environment. Below is code that you could use to reload the model so that you could use it for inference:
sess_restore = tf.Session()
saver = tf.train.import_meta_graph('../datasets/Linear Regression/Linear Regression.ckpt.meta')
saver.restore(sess_restore, '../datasets/Linear Regression/Linear Regression.ckpt')  # load the saved variable values
theta_value = sess_restore.run('model:0')
At last, you can now perform inferences using the 'theta_value' vector. To simulate making a prediction in the code below, I've used the 0th row of housing_data for the values of the predictor variables. I initialize 'predicted_value' to the constant coefficient of the linear equation, and then the remaining coefficients of theta_value are placed in 'linear_coefficients' to make the loop easier to read. The loop then multiplies each predictor variable value housing_data[0][j] by the corresponding coefficient (each coefficient 'c' in the loop iteration of linear_coefficients is, unfortunately, an array of size 1, so c[0] is used to get the actual value of the coefficient).
predicted_value = theta_value[0]
linear_coefficients = theta_value[1:]
for j, c in enumerate(linear_coefficients):
    predicted_value += c[0] * housing_data[0][j]
If you now run a line of Python code to output predicted_value, you will see the predicted house value for that sample row.
Finally, it's worth noting that, for a larger kind of model, you can also use TensorFlow to perform the inference. But because this is a linear regression involving only 9 coefficients, using TensorFlow would probably just slow it down. Still, it is an easy tensor flow to write... an exercise for the reader!