
# Probabilistic Programming with Edward in WML

November 2, 2018 | Written by: Guillaume Baudart and Benjamin Herta

Categorized: AI | Open Source


Edward is a deep probabilistic programming language (DPPL), that is, a language for specifying both deep neural networks and probabilistic models. DPPLs draw upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful AI applications.

Probabilistic programming languages (PPLs) let the user express a probabilistic model as a program with an intuitive formalism and dedicated constructs. Probabilistic models are a powerful tool to represent and reason about uncertain behaviors, such as simplified models of complex systems (e.g., simulations of nature) and predictions based on past observations (e.g., polls, weather forecasts).

With the advent of machine learning, more and more advanced probabilistic models now involve deep neural networks. DPPLs such as Edward aim to combine the benefits of PPLs and deep learning frameworks (TensorFlow in the case of Edward).

This post illustrates, using a simple but complete example, how to run Edward code on the IBM Watson Machine Learning (WML) platform. WML is a cloud service that allows developers to efficiently train, deploy, and monitor machine learning models on fast GPUs. You can download the complete code here.

# Quick start with WML

Edward is now available in WML with TensorFlow 1.7, so you can run Edward code on GPUs as easily as any TensorFlow program. Before you start, you need a ready-to-use WML environment, that is, access to WML and a Cloud Object Storage service. To launch an Edward job, first export the following environment variables with your Watson Machine Learning credentials:

    export ML_ENV=xxxxxxxxxxxxxxx
    export ML_INSTANCE=xxxxxxxxxxxxxxx
    export ML_USERNAME=xxxxxxxxxxxxxxx
    export ML_PASSWORD=xxxxxxxxxxxxxxx
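
A missing credential is easy to overlook before submitting a job. As a small sketch (the helper name `missing_wml_vars` is hypothetical and not part of any WML tooling), the four variables above can be checked from Python before launching:

```python
import os

# The four variables exported above.
REQUIRED_VARS = ("ML_ENV", "ML_INSTANCE", "ML_USERNAME", "ML_PASSWORD")


def missing_wml_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_wml_vars()
    if missing:
        print("Missing WML variables:", ", ".join(missing))
    else:
        print("All WML variables are set.")
```

The function accepts any mapping, so it can be exercised with a plain dictionary instead of the real environment.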

Your manifest file (`manifest.yml` in the command below) should look like the following, filled in with your own object storage credentials:

    model_definition:
      author:
        email: me@myself.com
        name: Me
      description: Simple MLP in Edward for classifying MNIST
      execution:
        command: python mlp_edward.py
        compute_configuration:
          name: k80
      framework:
        name: tensorflow
        version: '1.7'
      name: edward_mnist_mlp
    training_data_reference:
      connection:
        access_key_id: xxxxxxxxxxxxxxx
        endpoint_url: https://myobjectstorage.com
        secret_access_key: xxxxxxxxxxxxxxx
      name: training_data_reference_name
      source:
        bucket: xxxxxxxxxxxxxxx
      type: s3
    training_results_reference:
      connection:
        access_key_id: xxxxxxxxxxxxxxx
        endpoint_url: https://myobjectstorage.com
        secret_access_key: xxxxxxxxxxxxxxx
      name: training_results_reference_name
      target:
        bucket: xxxxxxxxxxxxxxx
      type: s3

Then simply run:

    bx ml train code.zip manifest.yml

where `code.zip` is an archive containing all the Python source files (e.g., `mlp_edward.py` and any data loaders). This command returns an id (e.g., `training-xxxxxxxxx`) that you can use to monitor the running job:

    bx ml monitor training-runs training-xxxxxxxxx

# The model

As an example, consider a simple Bayesian neural network, that is, a neural network where all the weights and biases are treated as random variables. We thus learn a complete distribution for each parameter of the network. These distributions can be used to measure the uncertainty associated with the output of the network, which can be critical for decision-making systems.

The task is now to infer the posterior distribution of each of these weights and biases given a set of observed data.
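
In symbols, writing $\theta$ for all the weights and biases, $x$ for the observed images, and $l$ for their labels (notation chosen here to match the variable names used in the code of this post), the target of the inference is given by Bayes' rule:

```latex
p(\theta \mid x, l)
  = \frac{p(l \mid x, \theta)\, p(\theta)}
         {\int p(l \mid x, \theta')\, p(\theta')\, d\theta'}
```

The numerator combines the likelihood of the labels under the network with the prior on the parameters. The integral in the denominator ranges over all possible parameter values and has no closed form for a neural network, which is why approximate inference is needed.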

More concretely, consider the image classification task on the MNIST handwritten digits dataset. After the inference, we can sample a set of weights and biases from the learned distribution. The corresponding networks form an ensemble of predictors that can be used to compute a prediction distribution for unseen data.

The network is defined in TensorFlow. In this example the parameters are stored in a Python dictionary. The network is a simple multi-layer perceptron (MLP) with one hidden layer.

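    import tensorflow as tf

    def mlp(theta, x):
        # Hidden layer with ReLU activation.
        h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
        # Output layer: unnormalized class scores (logits).
        yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
        # Log-probabilities over the classes.
        log_pi = tf.nn.log_softmax(yhat)
        return log_pi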
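
To check the shapes involved, the same forward pass can be sketched in plain NumPy. This is only an illustration under assumed sizes: the hidden width of 50, the batch of 4, and the helper names `log_softmax` and `mlp_numpy` are hypothetical and do not come from the original script.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def mlp_numpy(theta, x):
    """NumPy analogue of mlp: one ReLU hidden layer, then log-softmax."""
    h = np.maximum(x @ theta["Wh"] + theta["bh"], 0.0)
    yhat = h @ theta["Wy"] + theta["by"]
    return log_softmax(yhat)


# Assumed sizes: 784 pixels per image, 50 hidden units, 10 digit classes.
nx, nh, ny, batch_size = 784, 50, 10, 4
rng = np.random.default_rng(0)
theta = {
    "Wh": rng.standard_normal((nx, nh)),
    "bh": rng.standard_normal(nh),
    "Wy": rng.standard_normal((nh, ny)),
    "by": rng.standard_normal(ny),
}
x = rng.random((batch_size, nx))  # random stand-in for a batch of images

log_pi = mlp_numpy(theta, x)
print(log_pi.shape)  # one row of class log-probabilities per image
```

Each row of `log_pi` holds log-probabilities, so exponentiating a row and summing it gives 1 up to rounding.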

Using Edward, we can specify the prior distributions of the weights and biases, here normal distributions centered on 0.

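    from edward.models import Categorical, Normal

    # Standard normal priors; nx, nh, ny are the input, hidden, and output sizes.
    theta = {
        'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
        'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
        'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
        'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)),
    }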

The complete model is the following:

    x = tf.placeholder(tf.float32, [batch_size, nx])
    l = tf.placeholder(tf.int32, [batch_size])
    lhat = Categorical(logits=mlp(theta, x))

The placeholders `x` and `l` correspond to the images and the labels of the training data. The result `lhat` is a categorical distribution over the possible labels for an image `x`.

# Inference

In Edward it is possible to use variational inference instead of sampling techniques to infer the posterior distribution. Variational inference turns inference into an optimization problem: it searches a family of simpler distributions for the member that is closest to the true posterior.
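
The optimization problem can be written out as follows. The notation is introduced here for illustration: $\theta$ stands for the weights and biases, $x$ and $l$ for the images and labels, and $\mathcal{Q}$ for the chosen family of simpler distributions. Closeness is measured with the Kullback-Leibler divergence, which is the objective that the `KLqp` inference used later in this post is named after.

```latex
q^{*} = \arg\min_{q \in \mathcal{Q}}
        \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta \mid x, l)\bigr)

\log p(l \mid x)
  = \underbrace{\mathbb{E}_{q(\theta)}\bigl[\log p(l \mid x, \theta)\bigr]
      - \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta)\bigr)}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta \mid x, l)\bigr)
```

Since $\log p(l \mid x)$ does not depend on $q$, maximizing the evidence lower bound (ELBO) is equivalent to minimizing the divergence to the true posterior, and the ELBO only involves the likelihood and the prior, both of which the model defines explicitly.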

We thus need to define this family, also called the guide. In our case, each parameter follows a normal distribution.

    def vrandn(*shape):
        # Trainable variable initialized from a standard normal.
        return tf.Variable(tf.random_normal(shape))

    qtheta = {
        'Wh': Normal(loc=vrandn(nx, nh), scale=tf.nn.softplus(vrandn(nx, nh))),
        'bh': Normal(loc=vrandn(nh), scale=tf.nn.softplus(vrandn(nh))),
        'Wy': Normal(loc=vrandn(nh, ny), scale=tf.nn.softplus(vrandn(nh, ny))),
        'by': Normal(loc=vrandn(ny), scale=tf.nn.softplus(vrandn(ny))),
    }

    # These two calls need the `inference` object built in the Training
    # section, so in the script they run after `ed.KLqp(...)` is constructed.
    inference.initialize()
    tf.global_variables_initializer().run()

Note the use of `tf.Variable` to define the parameters of the family (here the means and scales of the normal distributions). After the training, these variables contain the parameters of the distribution that is the closest to the true posterior.
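
The scale of a normal distribution must be strictly positive, while an unconstrained variable can take any real value. Wrapping the variable in `tf.nn.softplus` maps it to a positive number. A minimal standard-library sketch of the same function (the local `softplus` helper below is written for illustration and is not the TensorFlow implementation):

```python
import math


def softplus(x):
    """softplus(x) = log(1 + exp(x)), arranged to avoid overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Any real input, negative or positive, yields a positive scale.
for raw in (-5.0, 0.0, 5.0):
    print(raw, softplus(raw))
```

For large positive inputs softplus behaves like the identity, and for large negative inputs it approaches zero from above, so the optimizer can move the raw variable freely without ever producing an invalid scale.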

# Data

Before starting the training, let us import the MNIST dataset.

    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('MNIST', one_hot=False)

# Training

First we initialize the inference method with a dictionary that maps each prior to its variational posterior, together with the data, which binds the output of the model `lhat` to the observed labels `l`.

    import edward as ed

    inference = ed.KLqp(
        {theta["Wh"]: qtheta["Wh"], theta["bh"]: qtheta["bh"],
         theta["Wy"]: qtheta["Wy"], theta["by"]: qtheta["by"]},
        data={lhat: l})

The training then iterates through the dataset to update the inference.

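    for epoch in range(num_epochs):
        running_loss = 0.0
        for j in range(num_batches):
            X_batch, Y_batch = mnist.train.next_batch(batch_size)
            # One optimization step on the current mini-batch.
            info_dict = inference.update(feed_dict={x: X_batch, l: Y_batch})
            loss = info_dict['loss']
            running_loss += loss
        # Report the mean loss over the epoch.
        print("epoch", epoch, "mean loss:", running_loss / num_batches)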

# Prediction

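When the training is complete, we draw samples of the parameters from the posterior distribution and run the MLP with each sampled set of parameters. The final score for each class is the average of the log-probabilities returned by the sampled MLPs, and the predicted label is the class with the highest average score.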

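    import numpy as np

    def predict(x):
        # Draw sets of weights and biases from the learned posterior
        # (args comes from the command-line options of the full script).
        theta_samples = [{"Wh": qtheta["Wh"].sample(),
                          "bh": qtheta["bh"].sample(),
                          "Wy": qtheta["Wy"].sample(),
                          "by": qtheta["by"].sample()}
                         for _ in range(args.num_samples)]
        # Evaluate the MLP once per sampled network.
        yhats = [mlp(theta_samp, x).eval() for theta_samp in theta_samples]
        # Average the log-probabilities and pick the most likely label.
        mean = np.mean(yhats, axis=0)
        return np.argmax(mean, axis=1)

    X_test = mnist.test.images
    Y_test = mnist.test.labels
    Y_pred = predict(X_test)
    print("accuracy:", (Y_pred == Y_test).mean() * 100)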
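
The averaging step can be followed on a toy example. The numbers below are made up for illustration (three sampled networks, two images, three classes) and are not results from the MNIST model. The sketch also computes how many sampled networks agree with the final prediction, which is one simple, assumed way to read off the uncertainty that the ensemble provides.

```python
import numpy as np

# Made-up class probabilities, shape (num_samples, num_images, num_classes),
# converted to log-probabilities as returned by the MLP.
log_probs = np.log(np.array([
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]],
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.8, 0.1, 0.1], [0.1, 0.3, 0.6]],
]))

mean_log_prob = log_probs.mean(axis=0)    # average over the sampled networks
pred = mean_log_prob.argmax(axis=1)       # most likely class per image

# Fraction of sampled networks whose own top class matches the prediction.
votes = log_probs.argmax(axis=2)          # shape (num_samples, num_images)
agreement = (votes == pred).mean(axis=0)  # one value per image

print(pred, agreement)
```

In this toy data all three networks agree on the first image, while one of them disagrees on the second, so the second prediction carries more uncertainty.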

That’s it! You can now run:

    zip code.zip mlp_edward.py
    bx ml train code.zip manifest.yml

and monitor the job to see the result:

    bx ml monitor training-runs training-xxxxxxxxx

# More about Edward and DPPLs

If you are curious to find out more about probabilistic programming languages, check out this paper.

**Guillaume Baudart**

Research Staff Member, IBM Research

**Benjamin Herta**

IBM Research
