Probabilistic Programming with Edward in WML


Edward is a deep probabilistic programming language (DPPL), that is, a language for specifying both deep neural networks and probabilistic models. DPPLs draw upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful AI applications.

Probabilistic languages let the user express a probabilistic model as a program, with an intuitive formalism and dedicated constructs. Probabilistic models are a powerful tool for representing and reasoning about uncertain behaviors, such as simplifying complex systems (e.g., simulating natural phenomena) or making predictions from past observations (e.g., polls, weather forecasts).

With the advent of machine learning, more and more advanced probabilistic models involve deep neural networks. DPPLs such as Edward aim to combine the benefits of PPLs and deep learning frameworks (TensorFlow, in the case of Edward).

This post illustrates, using a simple but complete example, how to run Edward code on the IBM Watson Machine Learning (WML) platform. WML is a cloud service that allows developers to efficiently train, deploy, and monitor machine learning models on fast GPUs. You can download the complete code here.

Quick start with WML

Edward is now available in WML with TensorFlow 1.7, so you can run Edward code as easily as any TensorFlow program, on GPUs. Before you start, you need a ready-to-use WML environment, that is, access to WML and a Cloud Object Storage service. To launch an Edward job, first export the following environment variables with your Watson Machine Learning credentials:

export ML_ENV=xxxxxxxxxxxxxxx
export ML_INSTANCE=xxxxxxxxxxxxxxx
export ML_USERNAME=xxxxxxxxxxxxxxx
export ML_PASSWORD=xxxxxxxxxxxxxxx

Your configuration file should look like the following (filled with your object storage credentials):

model_definition:
  author:
    name: Me
  description: Simple MLP in Edward for classifying MNIST
  execution:
    command: python
    compute_configuration:
      name: k80
  framework:
    name: tensorflow
    version: '1.7'
  name: edward_mnist_mlp
training_data_reference:
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_data_reference_name
  source:
    bucket: xxxxxxxxxxxxxxx
  type: s3
training_results_reference:
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_results_reference_name
  target:
    bucket: xxxxxxxxxxxxxxx
  type: s3

Then simply run:

bx ml train <model-archive>.zip manifest.yml

where <model-archive>.zip is an archive containing all the Python source files (e.g., the training script, data loaders, etc.). This command returns an id (e.g., training-xxxxxxxxx) that you can use to monitor the running job:

bx ml monitor training-runs training-xxxxxxxxx

The model

As an example, consider a simple Bayesian neural network, that is, a neural network in which all the weights and biases are treated as random variables. We thus learn a full distribution for each parameter of the network. These distributions can be used to measure the uncertainty associated with the output of the network, which can be critical for decision-making systems.

The task is now to infer the posterior distribution of each of these weights and biases given a set of observed data.

More concretely consider the image classification task on the MNIST hand-written digits dataset. After the inference, we can sample a set of weights and biases from the learned distribution. The corresponding networks form an ensemble of predictors that can be used to compute a prediction distribution for unseen data.
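Before diving into the Edward code, here is a plain-NumPy sketch of this ensemble idea, using a hypothetical one-parameter "network" (the names and numbers below are illustrative, not taken from the post's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the learned posterior over a single weight w is Normal(1.0, 0.1):
w_samples = rng.normal(loc=1.0, scale=0.1, size=1000)

x = 2.0                       # one input
predictions = w_samples * x   # one prediction per sampled "network"

# The ensemble gives both a point estimate and an uncertainty measure.
print(predictions.mean())  # close to 2.0
print(predictions.std())   # close to 0.2
```

Each sampled weight defines a concrete predictor; the spread of their outputs is exactly the prediction uncertainty the Bayesian treatment buys us.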

The network is defined in TensorFlow. In this example the parameters are stored in a Python dictionary. The network is a simple multi-layer perceptron (MLP) with one hidden layer.

def mlp(theta, x):
  h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
  yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
  log_pi = tf.nn.log_softmax(yhat)
  return log_pi

Using Edward, we can specify the prior distributions of the weights and biases, here normal distributions centered on 0.

theta = {
    'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
    'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
    'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
    'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)),
}

The complete model is the following:

x = tf.placeholder(tf.float32, [batch_size, nx])
l = tf.placeholder(tf.int32, [batch_size])
lhat = Categorical(logits=mlp(theta, x))

The placeholders x and l correspond to the images and the labels of the training data. The result lhat is a categorical distribution over the possible labels for an image x.
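To recall what a categorical distribution parameterized by logits means, here is a small NumPy sketch (independent of Edward; the softmax helper is our own):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)             # probabilities over the 3 labels
print(probs.sum())       # 1.0 (up to floating point)
print(np.argmax(probs))  # 0, the most likely label
```

The Categorical likelihood does the same normalization internally: the logits produced by the MLP become a probability for each of the ten digit classes.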


In Edward, it is possible to use variational inference instead of sampling techniques to infer the posterior distribution. Variational inference turns inference into an optimization problem: it tries to find the member of a family of simpler distributions that is closest to the true posterior.
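In symbols (this is the standard formulation of the objective, not code from the post): the variational parameters λ are chosen to minimize the KL divergence from the approximation q(θ; λ) to the posterior p(θ | x), which is equivalent to maximizing the evidence lower bound (ELBO):

```latex
\lambda^{*} = \arg\min_{\lambda} \; \mathrm{KL}\big(q(\theta;\lambda) \,\|\, p(\theta \mid x)\big)
            = \arg\max_{\lambda} \; \mathbb{E}_{q(\theta;\lambda)}\big[\log p(x,\theta) - \log q(\theta;\lambda)\big]
```

The ELBO is computable because it only involves the joint p(x, θ), not the intractable posterior itself; this is the quantity Edward's KLqp optimizes with stochastic gradients.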

We thus need to define this family of approximating distributions, also called the guide. In our case, all the parameters follow normal distributions.

def vrandn(*shape):
    return tf.Variable(tf.random_normal(shape))

qtheta = {
    'Wh': Normal(loc=vrandn(nx, nh), scale=tf.nn.softplus(vrandn(nx, nh))),
    'bh': Normal(loc=vrandn(nh), scale=tf.nn.softplus(vrandn(nh))),
    'Wy': Normal(loc=vrandn(nh, ny), scale=tf.nn.softplus(vrandn(nh, ny))),
    'by': Normal(loc=vrandn(ny), scale=tf.nn.softplus(vrandn(ny))),
}


Note the use of tf.Variable to define the parameters of the family (here the means and scales of the normal distributions). After training, these variables hold the parameters of the member of the family that is closest to the true posterior.
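The softplus wrapping deserves a word: a scale parameter must be strictly positive, while a raw tf.Variable can take any real value. Softplus maps the whole real line to positive numbers. A plain-NumPy sketch of the same function (our own helper, equivalent to tf.nn.softplus):

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + exp(x)), written in a numerically stable form.
    return np.logaddexp(0.0, x)

# Whatever real value the underlying variable takes, the resulting
# scale is strictly positive, so the Normal guide is always well-defined.
scales = softplus(np.array([-10.0, -1.0, 0.0, 5.0]))
print(scales)              # all entries > 0
print(np.all(scales > 0))  # True
```

This is why the guide can be optimized by unconstrained gradient descent over the raw variables.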


Before starting the training, let us import the MNIST dataset.

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST', one_hot=False)


First, we instantiate the inference method with the pairs of prior and approximate posterior for all the parameters, and bind the observed labels to the output of the model.

inference = ed.KLqp({theta["Wh"]: qtheta["Wh"], theta["bh"]: qtheta["bh"],
                     theta["Wy"]: qtheta["Wy"], theta["by"]: qtheta["by"]},
                    data={lhat: l})

The training then iterates through the dataset to update the inference.

# Build the training ops (required before calling update) and
# initialize all TensorFlow variables, including the variational parameters.
inference.initialize(n_iter=num_epochs * num_batches)
sess = ed.get_session()
tf.global_variables_initializer().run()

for epoch in range(num_epochs):
    running_loss = 0.0
    for j in range(num_batches):
        X_batch, Y_batch = mnist.train.next_batch(batch_size)
        info_dict = inference.update(feed_dict={x: X_batch, l: Y_batch})
        running_loss += info_dict['loss']


When training is complete, we draw samples of the parameters from the learned posterior and run the MLP with each concrete set of weights. The final score is the average of the scores returned by these networks.

def predict(x):
    theta_samples = [{"Wh": qtheta["Wh"].sample(), "bh": qtheta["bh"].sample(),
                      "Wy": qtheta["Wy"].sample(), "by": qtheta["by"].sample(), }
                     for _ in range(args.num_samples)]
    yhats = [mlp(theta_samp, x).eval()
             for theta_samp in theta_samples]
    mean = np.mean(yhats, axis=0)
    return np.argmax(mean, axis=1)

X_test = mnist.test.images
Y_test = mnist.test.labels
Y_pred = predict(X_test)
print("accuracy:", (Y_pred == Y_test).mean() * 100)

That’s it! You can now run:

bx ml train <model-archive>.zip manifest.yml

and monitor the job to see the result:

bx ml monitor training-runs training-xxxxxxxxx

More about Edward and DPPLs

If you are curious to find out more about probabilistic programming languages, check out this paper.

Research Staff Member, IBM Research

Benjamin Herta, IBM Research
