Posted in: AI, Open Source

Probabilistic Programming with Edward in WML

Edward is a deep probabilistic programming language (DPPL), that is, a language for specifying both deep neural networks and probabilistic models. DPPLs draw upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful AI applications.

Probabilistic languages let the user express a probabilistic model as a program with an intuitive formalism and dedicated constructs. Probabilistic models are a powerful tool to represent and reason about uncertain behaviors, such as simplification of complex system (e.g., nature simulation) and prediction based on past observations (e.g., polls, weather forecast).

With the advent of machine learning, more and more advanced probabilistic models now involve deep learning networks. DPPLs such as Edward aim to combine the benefits of PPLs and deep learning frameworks (Tensorflow in the case of Edward).

This post illustrates, using a simple but complete example, how to run Edward code on the IBM Watson Machine Learning (WML) platform. WML is a cloud service that allows developers to efficiently train, deploy, and monitor machine learning models on fast GPUs. You can download the complete code here.

Quick start with WML

Edward is now available in WML with Tensorflow 1.7, enabling it to run Edward code as easily as any TensorFlow program, using GPUs. Before you start, you need a ready-to-use WML environment, that is, access to WML, and a Cloud Object Storage service. To launch an Edward job, first export the following environment variables with your Watson Machine Learning credentials:

export ML_ENV=xxxxxxxxxxxxxxx
export ML_INSTANCE=xxxxxxxxxxxxxxx
export ML_USERNAME=xxxxxxxxxxxxxxx
export ML_PASSWORD=xxxxxxxxxxxxxxx

Your configuration file should look like the following (filled with your object storage credentials):

model_definition:
  author:
    email: me@myself.com
    name: Me
  description: Simple MLP in Edward for classifying MNIST
  execution:
    command: python mlp_edward.py
    compute_configuration:
      name: k80
  framework:
    name: tensorflow
    version: '1.7'
  name: edward_mnist_mlp
training_data_reference:
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    endpoint_url: https://myobjectstorage.com
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_data_reference_name
  source:
    bucket: xxxxxxxxxxxxxxx
  type: s3
training_results_reference:
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    endpoint_url: https://myobjectstorage.com
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_results_reference_name
  target:
    bucket: xxxxxxxxxxxxxxx
  type: s3

Then simply run:

bx ml train code.zip manifest.yml

where code.zip is an archive containing all the Python source files (e.g., edward.py, data-loaders, etc…). This command returns an id (e.g., training-xxxxxxxxx) that you can use to monitor the running job:

bx ml monitor training-runs training-xxxxxxxxx

The model

As an example consider a simple Bayesian Neural Network, that is, a neural network where all the weights and biases are treated as random variables. We thus learn a complete distribution for each parameter of the network. These distributions can be used to measure the uncertainty associated to the output of the network which can be critical for decision-making systems.

The task is now to infer the posterior distribution of each of these weights and biases given a set of observed data.

More concretely consider the image classification task on the MNIST hand-written digits dataset. After the inference, we can sample a set of weights and biases from the learned distribution. The corresponding networks form an ensemble of predictors that can be used to compute a prediction distribution for unseen data.

The network is defined in TensorFlow. In this example the parameters are stored in a Python dictionary. The network is a simple multi-layer perceptron (MLP) with one hidden layer.

def mlp(theta, x):
  h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
  yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
  log_pi = tf.nn.log_softmax(yhat)
  return log_pi

Using Edward, we can specify the prior distributions of the weights and biases, here normal distributions centered on 0.

theta = {
    'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
    'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
    'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
    'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)),
}

The complete model is the following:

x = tf.placeholder(tf.float32, [batch_size, nx])
l = tf.placeholder(tf.int32, [batch_size])
lhat = Categorical(logits=mlp(theta, x))

The placeholders x and l correspond to the images and the labels of the training data. The result lhat is a categorical distribution over the possible labels for an image x.

Inference

In Edward it is possible to use variational inference instead of sampling techniques to infer the posterior distribution. Variational inference turns the inference into an optimization problem, by trying to find the member of a family of simpler distribution that is the closest to the true posterior.

We thus need to define the family or guide. In our case all the parameters follow a normal distribution.

def vrandn(*shape):
    return tf.Variable(tf.random_normal(shape))

qtheta = {
    'Wh': Normal(loc=vrandn(nx, nh), scale=tf.nn.softplus(vrandn(nx, nh))),
    'bh': Normal(loc=vrandn(nh), scale=tf.nn.softplus(vrandn(nh))),
    'Wy': Normal(loc=vrandn(nh, ny), scale=tf.nn.softplus(vrandn(nh, ny))),
    'by': Normal(loc=vrandn(ny), scale=tf.nn.softplus(vrandn(ny))),
}

inference.initialize()
tf.global_variables_initializer().run()

Note the use of tf.Variable to define the parameters of the family (here the means and scale of the normal distribution). After the training, these variables contain the parameters of the distribution that is the closest to the true posterior.

Data

Before starting the training, let us import the MNIST dataset.

mnist = input_data.read_data_sets('MNIST', one_hot=False)

Training

First we initialize the inference method with pairs of prior:posterior for all the parameters, the data, and the output of the model.

inference = ed.KLqp({theta["Wh"]: qtheta["Wh"], theta["bh"]: qtheta["bh"],
                     theta["Wy"]: qtheta["Wy"], theta["by"]: qtheta["by"]},
                    data={lhat: l})

The training then iterates through the dataset to update the inference.

for epoch in range(num_epochs):
    running_loss = 0.0
    for j in range(num_batches):
        X_batch, Y_batch = mnist.train.next_batch(batch_size)
        info_dict = inference.update(feed_dict={x: X_batch, l: Y_batch})
        loss = info_dict['loss']

Prediction

When the training is complete, we draw samples of the parameters from the posterior distribution and execute the MLP with each concrete model. The final score is the average of the scores returned by the MLPs.

def predict(x):
    theta_samples = [{"Wh": qtheta["Wh"].sample(), "bh": qtheta["bh"].sample(),
                      "Wy": qtheta["Wy"].sample(), "by": qtheta["by"].sample(), }
                     for _ in range(args.num_samples)]
    yhats = [mlp(theta_samp, x).eval()
             for theta_samp in theta_samples]
    mean = np.mean(yhats, axis=0)
    return np.argmax(mean, axis=1)


X_test = mnist.test.images
Y_test = mnist.test.labels
Y_pred = predict(X_test)
print("accuracy:", (Y_pred == Y_test).mean() * 100)

That’s it! You can now run:

zip code.zip mlp_edward.py
bx ml train code.zip manifest.yml

and monitor the job to see the result:

bx ml monitor training-runs training-xxxxxxxxx

More about Edward and DPPLs

If you are curious to find out more about probabilistic programming languages, check out this paper.