Probabilistic Programming with Edward in WML

Share this post:

Edward is a deep probabilistic programming language (DPPL), that is, a language for specifying both deep neural networks and probabilistic models. DPPLs draw upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful AI applications.

Probabilistic languages let the user express a probabilistic model as a program with an intuitive formalism and dedicated constructs. Probabilistic models are a powerful tool to represent and reason about uncertain behaviors, such as simplification of complex system (e.g., nature simulation) and prediction based on past observations (e.g., polls, weather forecast).

With the advent of machine learning, more and more advanced probabilistic models now involve deep learning networks. DPPLs such as Edward aim to combine the benefits of PPLs and deep learning frameworks (Tensorflow in the case of Edward).

This post illustrates, using a simple but complete example, how to run Edward code on the IBM Watson Machine Learning (WML) platform. WML is a cloud service that allows developers to efficiently train, deploy, and monitor machine learning models on fast GPUs. You can download the complete code here.

Quick start with WML

Edward is now available in WML with Tensorflow 1.7, enabling it to run Edward code as easily as any TensorFlow program, using GPUs. Before you start, you need a ready-to-use WML environment, that is, access to WML, and a Cloud Object Storage service. To launch an Edward job, first export the following environment variables with your Watson Machine Learning credentials:

export ML_ENV=xxxxxxxxxxxxxxx
export ML_INSTANCE=xxxxxxxxxxxxxxx
export ML_USERNAME=xxxxxxxxxxxxxxx
export ML_PASSWORD=xxxxxxxxxxxxxxx

Your configuration file should look like the following (filled with your object storage credentials):

    name: Me
  description: Simple MLP in Edward for classifying MNIST
    command: python
      name: k80
    name: tensorflow
    version: '1.7'
  name: edward_mnist_mlp
    access_key_id: xxxxxxxxxxxxxxx
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_data_reference_name
    bucket: xxxxxxxxxxxxxxx
  type: s3
    access_key_id: xxxxxxxxxxxxxxx
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_results_reference_name
    bucket: xxxxxxxxxxxxxxx
  type: s3

Then simply run:

bx ml train manifest.yml

where is an archive containing all the Python source files (e.g.,, data-loaders, etc…). This command returns an id (e.g., training-xxxxxxxxx) that you can use to monitor the running job:

bx ml monitor training-runs training-xxxxxxxxx

The model

As an example consider a simple Bayesian Neural Network, that is, a neural network where all the weights and biases are treated as random variables. We thus learn a complete distribution for each parameter of the network. These distributions can be used to measure the uncertainty associated to the output of the network which can be critical for decision-making systems.

The task is now to infer the posterior distribution of each of these weights and biases given a set of observed data.

More concretely consider the image classification task on the MNIST hand-written digits dataset. After the inference, we can sample a set of weights and biases from the learned distribution. The corresponding networks form an ensemble of predictors that can be used to compute a prediction distribution for unseen data.

The network is defined in TensorFlow. In this example the parameters are stored in a Python dictionary. The network is a simple multi-layer perceptron (MLP) with one hidden layer.

def mlp(theta, x):
  h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
  yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
  log_pi = tf.nn.log_softmax(yhat)
  return log_pi

Using Edward, we can specify the prior distributions of the weights and biases, here normal distributions centered on 0.

theta = {
    'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
    'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
    'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
    'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)),

The complete model is the following:

x = tf.placeholder(tf.float32, [batch_size, nx])
l = tf.placeholder(tf.int32, [batch_size])
lhat = Categorical(logits=mlp(theta, x))

The placeholders x and l correspond to the images and the labels of the training data. The result lhat is a categorical distribution over the possible labels for an image x.


In Edward it is possible to use variational inference instead of sampling techniques to infer the posterior distribution. Variational inference turns the inference into an optimization problem, by trying to find the member of a family of simpler distribution that is the closest to the true posterior.

We thus need to define the family or guide. In our case all the parameters follow a normal distribution.

def vrandn(*shape):
    return tf.Variable(tf.random_normal(shape))

qtheta = {
    'Wh': Normal(loc=vrandn(nx, nh), scale=tf.nn.softplus(vrandn(nx, nh))),
    'bh': Normal(loc=vrandn(nh), scale=tf.nn.softplus(vrandn(nh))),
    'Wy': Normal(loc=vrandn(nh, ny), scale=tf.nn.softplus(vrandn(nh, ny))),
    'by': Normal(loc=vrandn(ny), scale=tf.nn.softplus(vrandn(ny))),


Note the use of tf.Variable to define the parameters of the family (here the means and scale of the normal distribution). After the training, these variables contain the parameters of the distribution that is the closest to the true posterior.


Before starting the training, let us import the MNIST dataset.

mnist = input_data.read_data_sets('MNIST', one_hot=False)


First we initialize the inference method with pairs of prior:posterior for all the parameters, the data, and the output of the model.

inference = ed.KLqp({theta["Wh"]: qtheta["Wh"], theta["bh"]: qtheta["bh"],
                     theta["Wy"]: qtheta["Wy"], theta["by"]: qtheta["by"]},
                    data={lhat: l})

The training then iterates through the dataset to update the inference.

for epoch in range(num_epochs):
    running_loss = 0.0
    for j in range(num_batches):
        X_batch, Y_batch = mnist.train.next_batch(batch_size)
        info_dict = inference.update(feed_dict={x: X_batch, l: Y_batch})
        loss = info_dict['loss']


When the training is complete, we draw samples of the parameters from the posterior distribution and execute the MLP with each concrete model. The final score is the average of the scores returned by the MLPs.

def predict(x):
    theta_samples = [{"Wh": qtheta["Wh"].sample(), "bh": qtheta["bh"].sample(),
                      "Wy": qtheta["Wy"].sample(), "by": qtheta["by"].sample(), }
                     for _ in range(args.num_samples)]
    yhats = [mlp(theta_samp, x).eval()
             for theta_samp in theta_samples]
    mean = np.mean(yhats, axis=0)
    return np.argmax(mean, axis=1)

X_test = mnist.test.images
Y_test = mnist.test.labels
Y_pred = predict(X_test)
print("accuracy:", (Y_pred == Y_test).mean() * 100)

That’s it! You can now run:

bx ml train manifest.yml

and monitor the job to see the result:

bx ml monitor training-runs training-xxxxxxxxx

More about Edward and DPPLs

If you are curious to find out more about probabilistic programming languages, check out this paper.

Research Staff Member, IBM Research

Benjamin Herta

IBM Research

More AI stories

IBM Research and The Michael J. Fox Foundation Develop Modeling Methodology to Help Understand Parkinson’s Disease Using Machine Learning

In collaboration with The Michael J. Fox Foundation for Parkinson’s Research, our team of researchers at IBM is aiming to develop improved disease progression models that can help clinicians understand how the disease progresses in relation to the emergence of symptoms, even when those patients are taking symptom-modifying medications.

Continue reading

AI Could Help Enable Accurate Remote Monitoring of Parkinson’s Patients

In a paper recently published in Nature Scientific Reports, IBM Research and scientists from several other medical institutions developed a new way to estimate the severity of a person’s Parkinson’s disease (PD) symptoms by remotely measuring and analyzing physical activity as motor impairment increased. Using data captured by wrist-worn accelerometers, we created statistical representations of […]

Continue reading

Image Captioning as an Assistive Technology

IBM Research's Science for Social Good team recently participated in the 2020 VizWiz Grand Challenge to design and improve systems that make the world more accessible for the blind.

Continue reading