# Probabilistic Programming with Pyro in WML

November 7, 2018 | Written by: Guillaume Baudart and Benjamin Herta

In a previous post we explained how to write a probabilistic model using Edward and run it on the IBM Watson Machine Learning (WML) platform. In this post, we discuss the same example written in Pyro, a deep probabilistic programming language built on top of PyTorch.

Deep probabilistic programming languages (DPPLs) such as Edward and Pyro aim to combine the advantages of probabilistic programming languages (i.e., intuitive formalism and dedicated constructs to build probabilistic models) and deep learning frameworks (i.e., the ability to write, train, and deploy DL models) to build advanced probabilistic models.

This post illustrates, with a simple example, how to use Pyro and the IBM Watson Machine Learning (WML) platform to write and train a simple but complete probabilistic model involving a deep learning network. WML is a cloud service that allows developers to efficiently train, deploy, and monitor machine learning models on fast GPUs. You can download the complete code here.

# Quick start with WML

Pyro is now available in WML with PyTorch 0.4, so you can write Pyro code and run it on GPUs using WML. Before we start, you need a ready-to-use WML environment, that is, access to WML and a Cloud Object Storage service.

Your configuration file manifest.yml should look something like this (filled with your object storage credentials):

```yaml
model_definition:
  author:
    email: me@myself.com
    name: Me
  description: Simple MLP in Pyro for classifying MNIST
  execution:
    command: python mlp_pyro.py
    compute_configuration:
      name: k80
  framework:
    name: pytorch
    version: '0.4'
  name: pyro_mnist_mlp
training_data_reference:
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    endpoint_url: https://myobjectstorage.com
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_data_reference_name
  source:
    bucket: xxxxxxxxxxxxxxx
  type: s3
training_results_reference:
  connection:
    access_key_id: xxxxxxxxxxxxxxx
    endpoint_url: https://myobjectstorage.com
    secret_access_key: xxxxxxxxxxxxxxx
  name: training_results_reference_name
  target:
    bucket: xxxxxxxxxxxxxxx
  type: s3
```

# Bayesian MLP in Pyro

Our previous post explained how to write a simple probabilistic multi-layer perceptron (MLP) for classifying hand-written digits in Edward. The main idea is to treat all the weights and biases of the network as random variables. We thus learn a complete distribution for each parameter of the network. These distributions can be used to measure the uncertainty associated with the output of the network, which can be critical for decision-making systems. As in Edward, we will use variational inference to learn the distributions in Pyro.

In short, we need to define two main components:

- The probabilistic model: an MLP where all weights and biases are treated as random variables; and
- A family of guide distributions for the variational inference.

The corresponding Pyro code is the following:

```python
import torch
import torch.nn as nn
import pyro
from pyro.distributions import Categorical, Normal
from torch.nn.functional import log_softmax, softplus

# nx, nh, ny (input, hidden, and output sizes) and device are assumed
# to be defined beforehand, as in the complete code.

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.l1 = nn.Linear(nx, nh)
        self.l2 = nn.Linear(nh, ny)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Flatten each image into a vector of size nx.
        h = self.relu(self.l1(x.view((-1, nx))))
        yhat = self.l2(h)
        return yhat

mlp = MLP().to(device)

# Model
def normal(*shape):
    # Standard normal prior of the given shape.
    loc = torch.zeros(*shape).to(device)
    scale = torch.ones(*shape).to(device)
    return Normal(loc, scale)

def model(imgs, lbls):
    priors = {
        'l1.weight': normal(nh, nx),
        'l1.bias': normal(nh),
        'l2.weight': normal(ny, nh),
        'l2.bias': normal(ny)}
    # Lift the concrete MLP into a distribution over MLPs.
    lifted_module = pyro.random_module("mlp", mlp, priors)
    lifted_reg_model = lifted_module()
    lhat = log_softmax(lifted_reg_model(imgs), dim=1)
    pyro.sample("obs", Categorical(logits=lhat), obs=lbls)

# Inference guide
def vnormal(name, *shape):
    # Learnable mean and (unconstrained) scale of a normal distribution.
    loc = pyro.param(name + "m", torch.randn(*shape, requires_grad=True, device=device))
    scale = pyro.param(name + "s", torch.randn(*shape, requires_grad=True, device=device))
    return Normal(loc, softplus(scale))

def guide(imgs, lbls):
    dists = {
        'l1.weight': vnormal("W1", nh, nx),
        'l1.bias': vnormal("b1", nh),
        'l2.weight': vnormal("W2", ny, nh),
        'l2.bias': vnormal("b2", ny)}
    lifted_module = pyro.random_module("mlp", mlp, dists)
    return lifted_module()
```

The MLP network is defined in PyTorch. In the model, we first define the prior distributions for all the weights and biases and then lift the MLP definition from concrete to probabilistic using the `pyro.random_module` function. The result `lhat` parameterizes a categorical distribution over the possible labels for the images `imgs`. Note the `obs=lbls` argument of the `pyro.sample` statement, which matches the prediction of the network with the known labels `lbls` during inference.

The guide defines the family of distributions used for variational inference. In our case all the parameters follow a normal distribution. Note the use of `pyro.param` to define the parameters of the family (here the means and scales of the normal distributions). After the training, these variables contain the parameters of the distribution that is the closest to the true posterior.
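The guide passes the raw scale parameter through `softplus` before building the normal distribution, because the optimizer can push a raw parameter to any real value while a scale must be positive. As a minimal sketch of that mapping, here is a dependency-free version in plain Python (the helper below is illustrative and is not the PyTorch implementation):

```python
import math

def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    # max(x, 0) + log1p(exp(-|x|)) avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Unconstrained values, as an optimizer might produce for a raw scale parameter.
raw_scales = [-3.0, 0.0, 2.5]
scales = [softplus(r) for r in raw_scales]
print(scales)  # each of these entries is positive
```

For large positive inputs softplus is close to the identity, and for negative inputs it decays smoothly toward zero, so gradients remain defined everywhere.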

# Data

Before starting the training, let us import the MNIST dataset.

```python
from torch.utils.data import dataloader
from torchvision import transforms
from torchvision.datasets import MNIST

# batch_size is assumed to be defined beforehand, as in the complete code.
train = MNIST("MNIST", train=True, download=True,
              transform=transforms.Compose([transforms.ToTensor()]))
test = MNIST("MNIST", train=False, download=True,
             transform=transforms.Compose([transforms.ToTensor()]))

dataloader_args = dict(shuffle=True, batch_size=batch_size,
                       num_workers=1, pin_memory=False)
train_loader = dataloader.DataLoader(train, **dataloader_args)
test_loader = dataloader.DataLoader(test, **dataloader_args)
```

# Inference

We can now launch the inference.

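```python
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

# num_epochs is assumed to be defined beforehand, as in the complete code.
inference = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())

for epoch in range(num_epochs):
    for j, (imgs, lbls) in enumerate(train_loader, 0):
        # One gradient step on the current mini-batch.
        loss = inference.step(imgs.to(device), lbls.to(device))
```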
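For reference, the objective behind `Trace_ELBO` is the evidence lower bound (ELBO). The notation below is an assumption introduced for illustration and does not appear in the original code: $\theta$ stands for all weights and biases, $\phi$ for the guide parameters registered with `pyro.param`, $p(\theta)$ for the prior, and $q_\phi(\theta)$ for the guide.

```latex
\mathcal{L}(\phi)
  = \mathbb{E}_{q_\phi(\theta)}\big[\log p(y \mid x, \theta)\big]
  - \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta)\big),
\qquad
\log p(y \mid x)
  = \mathcal{L}(\phi)
  + \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta \mid x, y)\big)
```

Since $\log p(y \mid x)$ does not depend on $\phi$, maximizing $\mathcal{L}(\phi)$ is equivalent to minimizing the KL divergence between the guide and the true posterior. The value returned by `inference.step` is an estimate of the loss, that is, of $-\mathcal{L}(\phi)$, on the current mini-batch.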

# Prediction

When the training is complete, we can sample the guide containing the posterior distribution multiple times to obtain a set of MLPs. We can then combine the predictions of all the MLPs to compute a prediction.

```python
import torch

# num_samples is assumed to be defined beforehand, as in the complete code.
def predict(x):
    # Each call to the guide samples one concrete MLP from the posterior.
    sampled_models = [guide(None, None) for _ in range(num_samples)]
    yhats = [m(x).data for m in sampled_models]
    mean = torch.mean(torch.stack(yhats), 0)
    return torch.argmax(mean, dim=1)

correct = 0
total = 0
for j, data in enumerate(test_loader):
    images, labels = data
    predicted = predict(images.to(device))
    total += labels.size(0)
    # Move predictions back to the CPU before comparing with the labels.
    correct += (predicted.cpu() == labels).sum().item()
print("accuracy: %d %%" % (100 * correct / total))
```
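The sampled networks can also give a rough measure of how uncertain a prediction is. The sketch below is a dependency-free illustration in plain Python, not part of the original code: the helper names `softmax` and `predictive_summary` and the example logits are hypothetical. Unlike `predict` above, which averages raw network outputs, it averages class probabilities and reports the entropy of that average for a single image.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_summary(sampled_logits):
    """Average class probabilities over sampled networks for one image.

    sampled_logits: one list of logits per sampled network.
    Returns (predicted_class, entropy_in_nats).
    """
    probs = [softmax(logits) for logits in sampled_logits]
    n = len(probs)
    mean = [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    predicted = max(range(len(mean)), key=mean.__getitem__)
    return predicted, entropy

# Hypothetical logits for 3 classes from 3 sampled networks.
agree = [[5.0, 0.0, 0.0], [6.0, 0.0, 0.0], [5.5, 0.0, 0.0]]
disagree = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]

pred_a, ent_a = predictive_summary(agree)      # networks agree: low entropy
pred_d, ent_d = predictive_summary(disagree)   # networks disagree: high entropy
print(pred_a, ent_a)
print(pred_d, ent_d)
```

A low entropy means the sampled networks agree on the label, while an entropy close to the logarithm of the number of classes means they disagree, which a decision-making system could treat as a signal to defer.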

That’s it! You can now export the following environment variables with your WML credentials:

```shell
export ML_ENV=xxxxxxxxxxxxxxx
export ML_INSTANCE=xxxxxxxxxxxxxxx
export ML_USERNAME=xxxxxxxxxxxxxxx
export ML_PASSWORD=xxxxxxxxxxxxxxx
```

and run:

```shell
zip code.zip mlp_pyro.py
bx ml train code.zip manifest.yml
```

where code.zip is an archive containing all the Python source files (e.g., mlp_pyro.py, data loaders, etc.). The training command returns an id (e.g., training-xxxxxxxxx) that you can use to monitor the running job:

`bx ml monitor training-runs training-xxxxxxxxx`

# Further reading on Pyro and DPPLs

More examples in both Edward and Pyro can be found in this paper.

**Guillaume Baudart**

Research Staff Member, IBM Research

**Benjamin Herta**

IBM Research
