Probabilistic Programming with Edward in WML

Edward is a deep probabilistic programming language (DPPL), that is, a language for specifying both deep neural networks and probabilistic models. DPPLs draw upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful AI applications.

Probabilistic languages let the user express a probabilistic model as a program, with an intuitive formalism and dedicated constructs. Probabilistic models are a powerful tool for representing and reasoning about uncertain behavior, such as simplifying complex systems (e.g., simulations of natural phenomena) and making predictions from past observations (e.g., polls, weather forecasts).

With the advent of machine learning, more and more advanced probabilistic models involve deep neural networks. DPPLs such as Edward aim to combine the benefits of PPLs and deep learning frameworks (TensorFlow in the case of Edward).

This post illustrates, using a simple but complete example, how to run Edward code on the IBM Watson Machine Learning (WML) platform. WML is a cloud service that allows developers to efficiently train, deploy, and monitor machine learning models on fast GPUs. You can download the complete code here.

Quick start with WML

Edward is now available in WML with TensorFlow 1.7, enabling you to run Edward code as easily as any TensorFlow program, including on GPUs. Before you start, you need a ready-to-use WML environment, that is, access to WML and a Cloud Object Storage service. To launch an Edward job, first export the following environment variables with your Watson Machine Learning credentials:

export ML_ENV=xxxxxxxxxxxxxxx
export ML_INSTANCE=xxxxxxxxxxxxxxx
export ML_USERNAME=xxxxxxxxxxxxxxx
export ML_PASSWORD=xxxxxxxxxxxxxxx

Your configuration file should look like the following (filled with your object storage credentials):

model_definition:
  framework:
    name: tensorflow
    version: '1.7'
  name: edward_mnist_mlp
  description: Simple MLP in Edward for classifying MNIST
  author:
    name: Me
  execution:
    command: python
    compute_configuration:
      name: k80
training_data_reference:
  name: training_data_reference_name
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    secret_access_key: xxxxxxxxxxxxxxx
  source:
    bucket: xxxxxxxxxxxxxxx
  type: s3
training_results_reference:
  name: training_results_reference_name
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    secret_access_key: xxxxxxxxxxxxxxx
  target:
    bucket: xxxxxxxxxxxxxxx
  type: s3

Then simply run:

bx ml train <model-archive.zip> manifest.yml

where <model-archive.zip> is an archive containing all the Python source files (e.g., the model definition, data loaders, etc.). This command returns an ID (e.g., training-xxxxxxxxx) that you can use to monitor the running job:

bx ml monitor training-runs training-xxxxxxxxx

The model

As an example, consider a simple Bayesian neural network, that is, a neural network where all the weights and biases are treated as random variables. We thus learn a complete distribution for each parameter of the network. These distributions can be used to measure the uncertainty associated with the output of the network, which can be critical for decision-making systems.

The task is now to infer the posterior distribution of each of these weights and biases given a set of observed data.

More concretely, consider the image classification task on the MNIST hand-written digits dataset. After inference, we can sample sets of weights and biases from the learned distributions. The corresponding networks form an ensemble of predictors that can be used to compute a prediction distribution for unseen data.
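One common way to quantify the uncertainty of such an ensemble, sketched below in plain numpy with a hypothetical `predictive_entropy` helper (not part of Edward), is to average the predictive distributions of the sampled networks and compute the entropy of the average: low entropy means the sampled networks agree, high entropy means they are unsure.

```python
import numpy as np

def predictive_entropy(log_probs):
    # log_probs: (num_samples, num_classes) log-softmax outputs,
    # one row per network sampled from the posterior.
    probs = np.exp(log_probs).mean(axis=0)          # average predictive distribution
    return -np.sum(probs * np.log(probs + 1e-12))   # entropy of the average

# An ensemble that agrees on one class is far less entropic than one
# that spreads its mass uniformly over all classes.
confident = np.log(np.tile([0.98, 0.01, 0.01], (5, 1)))
uniform = np.log(np.full((5, 3), 1.0 / 3.0))
print(predictive_entropy(confident), predictive_entropy(uniform))
```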

The network is defined in TensorFlow. In this example the parameters are stored in a Python dictionary. The network is a simple multi-layer perceptron (MLP) with one hidden layer.

def mlp(theta, x):
  # Hidden layer with ReLU activation.
  h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
  # Output logits, turned into log-probabilities over the classes.
  yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
  log_pi = tf.nn.log_softmax(yhat)
  return log_pi
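To make the shapes concrete, here is the same forward pass in plain numpy. The MNIST dimensions (nx=784 inputs, nh=100 hidden units, ny=10 classes) are illustrative assumptions, not values taken from the original code.

```python
import numpy as np

nx, nh, ny = 784, 100, 10  # input, hidden, and output sizes (illustrative)

def mlp_np(theta, x):
    # Mirrors the TensorFlow mlp() above: ReLU hidden layer, then log-softmax.
    h = np.maximum(0.0, x @ theta["Wh"] + theta["bh"])
    yhat = h @ theta["Wy"] + theta["by"]
    return yhat - np.log(np.exp(yhat).sum(axis=1, keepdims=True))

theta_np = {"Wh": np.zeros((nx, nh)), "bh": np.zeros(nh),
            "Wy": np.zeros((nh, ny)), "by": np.zeros(ny)}
log_pi = mlp_np(theta_np, np.zeros((2, nx)))
print(log_pi.shape)  # one log-distribution over the ten digits per image
```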

Using Edward, we can specify the prior distributions of the weights and biases, here normal distributions centered on 0.

theta = {
    'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
    'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
    'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
    'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)),
}

The complete model is the following:

x = tf.placeholder(tf.float32, [batch_size, nx])
l = tf.placeholder(tf.int32, [batch_size])
lhat = Categorical(logits=mlp(theta, x))

The placeholders x and l correspond to the images and the labels of the training data. The result lhat is a categorical distribution over the possible labels for an image x.
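To see what a categorical distribution parameterized by logits does, here is a minimal numpy sketch (a hand-rolled helper, not Edward's API): the logits are pushed through a softmax and a label is drawn from the resulting distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_label(logits):
    # Softmax over the logits (shifted for numerical stability),
    # then draw one class index from the resulting distribution.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# With logits overwhelmingly favoring class 0, the draw is class 0.
print(sample_label(np.array([100.0, 0.0, 0.0])))
```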


In Edward, it is possible to use variational inference instead of sampling techniques to infer the posterior distribution. Variational inference turns inference into an optimization problem: it tries to find the member of a family of simpler distributions that is closest to the true posterior.
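The notion of "closest" here is typically the Kullback-Leibler divergence. As a sketch (a hand-rolled helper, not part of Edward), for two univariate normals the divergence has a closed form, and it vanishes exactly when the two distributions coincide:

```python
import numpy as np

def kl_normal(m_q, s_q, m_p, s_p):
    # Closed-form KL(q || p) for univariate normals q = N(m_q, s_q^2)
    # and p = N(m_p, s_p^2); zero iff q and p are identical.
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2.0 * s_p**2) - 0.5

print(kl_normal(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(kl_normal(1.0, 1.0, 0.0, 1.0))  # shifted mean -> 0.5
```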

We thus need to define the family, or guide. In our case, all the parameters follow normal distributions.

def vrandn(*shape):
    return tf.Variable(tf.random_normal(shape))

qtheta = {
    'Wh': Normal(loc=vrandn(nx, nh), scale=tf.nn.softplus(vrandn(nx, nh))),
    'bh': Normal(loc=vrandn(nh), scale=tf.nn.softplus(vrandn(nh))),
    'Wy': Normal(loc=vrandn(nh, ny), scale=tf.nn.softplus(vrandn(nh, ny))),
    'by': Normal(loc=vrandn(ny), scale=tf.nn.softplus(vrandn(ny))),
}


Note the use of tf.Variable to define the parameters of the family (here the means and scales of the normal distributions). After training, these variables contain the parameters of the distribution that is closest to the true posterior.
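The tf.nn.softplus wrapper is there because a scale must be strictly positive while a tf.Variable is unconstrained. A quick numpy sketch of the same transformation:

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)): smooth and strictly positive for any real input, so an
    # unconstrained variable can safely parameterize a standard deviation.
    return np.log1p(np.exp(x))

print(softplus(-20.0))  # tiny but still positive
print(softplus(20.0))   # approximately x itself for large x
```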


Before starting the training, let us import the MNIST dataset.

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST', one_hot=False)


First, we initialize the inference method with prior/approximate-posterior pairs for all the parameters, and with the data binding the output of the model to the observed labels.

inference = ed.KLqp({theta["Wh"]: qtheta["Wh"], theta["bh"]: qtheta["bh"],
                     theta["Wy"]: qtheta["Wy"], theta["by"]: qtheta["by"]},
                    data={lhat: l})

The training then iterates through the dataset to update the inference.

for epoch in range(num_epochs):
    running_loss = 0.0
    for j in range(num_batches):
        X_batch, Y_batch = mnist.train.next_batch(batch_size)
        info_dict = inference.update(feed_dict={x: X_batch, l: Y_batch})
        running_loss += info_dict['loss']


When training is complete, we draw samples of the parameters from the posterior distributions and execute the MLP with each sampled set of weights. The final score is the average of the scores returned by these networks.

def predict(x):
    theta_samples = [{"Wh": qtheta["Wh"].sample(), "bh": qtheta["bh"].sample(),
                      "Wy": qtheta["Wy"].sample(), "by": qtheta["by"].sample(), }
                     for _ in range(args.num_samples)]
    yhats = [mlp(theta_samp, x).eval()
             for theta_samp in theta_samples]
    mean = np.mean(yhats, axis=0)
    return np.argmax(mean, axis=1)

X_test = mnist.test.images
Y_test = mnist.test.labels
Y_pred = predict(X_test)
print("accuracy:", (Y_pred == Y_test).mean() * 100)

That’s it! You can now run:

bx ml train <model-archive.zip> manifest.yml

and monitor the job to see the result:

bx ml monitor training-runs training-xxxxxxxxx

More about Edward and DPPLs

If you are curious to find out more about probabilistic programming languages, check out this paper.

Research Staff Member, IBM Research

Benjamin Herta

IBM Research
