Compute Services

Process large data sets at massive scale with PyWren over IBM Cloud Functions

Share this post:

(Ed.–Josep Sampé–Universitat Rovira i Virgili–co-authored this post.)

Let’s say you write a function in Python to process and analyze some data. You successfully test the function using a small amount of data and now you want to run the function as a serverless action at massive scale, with parallelism, against terabytes of data.

What options do you have? Obviously, you don’t want to learn cloud IT tricks and setup VMs, for example. Nor do you necessarily want to become a serverless computing expert in scaling data inputs, processing outputs, and monitoring concurrent executions. The bottom line is you want to run your code against a large data set, get the results, and consider the value of insights gained.

PyWren to the rescue

PyWren is an open source project developed by Eric Jonas and RISElab. PyWren executes user’s Python code and its dependencies as serverless actions on a serverless platform.  Without requiring knowledge of how serverless actions are invoked and run, PyWren executes them at massively scale and then monitors the results.

PyWren includes a client that runs locally and a runtime that deploys in the cloud as a serverless action. PyWren uses object storage to pass information between client and server sides. On the client side, PyWren takes the Python code and relevant data, serializes them, and puts them into object storage. The client invokes the stored actions to run in parallel and then waits for the results. On the server side, for each function, PyWren takes the code and processes the relevant data from object storage, storing the results.

PyWren provides great value for uses cases like processing data on Cloud Object Storage, running embarrassingly parallel compute jobs (e.g. Monte-Carlo simulations), enriching data with additional attributes, and many others.

For now, consider this basic example in which “my_function(x)” does an addition of a number with 7.

 

 

 

 

 

 

 

 

In this case, the PyWren client first serializes “my_function(x)” with its input data, invoking three parallel actions, one per value of in “data” array. Each serverless action contains a PyWren runtime. The PyWren runtime is responsible to deserializing its function and executing it. PyWren also creates a main action to monitor all the executions and their progress, eventually returning the results back to object storage and the user’s client.

 

Optimizing PyWren for IBM Cloud Functions

The original version of PyWren did not support executions on the IBM Cloud. Myself and Josep Sampé (Universitat Rovira i Virgili) adapted PyWren both to run over IBM Cloud Functions and use IBM Cloud Object Storage.  Since IBM Cloud Functions is based on Apache OpenWhisk, which uses Docker containers, we also enabled PyWren to work with Docker. (The original PyWren uses Conda to deploy the PyWren runtime.)

If a user wants to add new packages to the PyWren runtime or to use different Python versions, these steps show how to create a new Docker image in place of the default.

We also extended PyWren to support a reduce function, which now enables PyWren to run complete map reduce flows. Our implementation of map is just a starting point, we are planning further enhancements.

In extending PyWren to work with IBM Cloud Object Storage, we also added a partition discovery component that allows PyWren to process large amounts of data stored in the IBM Cloud Object Storage. In short, the partition discovery module gets a COS bucket as an input and returns a list of virtual chunks of specific size.  Upon execution, each data chunk is passed as an input to the actions. As a bucket may contains millions of objects of different sizes, partition discovery allows processing data in parallel with many actions, where each action only processes a specific chunk size. Users can control chunk size via runtime configuration.

Try it yourself

You can easily get the code to experiment with PyWren on IBM Cloud Functions.  Install locally or get it from IBM Watson Studio. The project page contains working examples and details how to setup PyWren.

In the meantime, here is a video that demonstrates the following use case: use airbnb data for various cities in conjunction with a tone analyzer and then plot the results visually, on a map.

What’s next

Josep and I have contributed our code back to the PyrWren community under Apache License 2.0. Our top priorities going forward are to collaborate with the community on improving the fault tolerance of PyWren, sharing data between actions, and more powerfully integrating PyWren with IBM Cloud Object Storage.

Josep Sampé is a PhD student at the Department of Computer Engineering and Mathematics at Universitat Rovira i Virgili (URV), Spain. He is currently collaborating with the IBM Research in developing frameworks on top of serverless and object storage architectures.

IBM Research - IBM Storage Clouds, Security and Analytics

More Compute Services stories

Improving App Availability with Multizone Clusters

Downtime costs money and results in unhappy customers.  Whether you have developed a new cloud-native application or repackaged an existing app to run as a container, now you need to ensure your app and the infrastructure running it are highly available.  IBM is excited to announce the availability of multizone clusters, targeted for June 2018.  Now […]

Continue reading

What the stats say about container development

59% improved application quality and reduced defects. 57% reduced application downtime and costs. All adopted container development.   In 2017, IBM conducted an in-depth research study on the state of container adoption across all industries, startups to enterprises. The study reveals the most important solutions driving usage and highlights the key challenges that must be addressed by cloud providers. […]

Continue reading

Combine organizational transformation and IBM Cloud infrastructure for a successful AI journey

With our recent cloud infrastructure and Deep-Learning-as-a-Service (DLaaS) announcements, IBM Cloud is a key contributor to the push towards AI. We’ve delivered a comprehensive suite of AI tools, high performance bare metal servers, and NVIDIA® GPUs that enables companies of all sizes to analyze complex unstructured data faster, more thoroughly and accurately, and at a far less cost than ever before.

Continue reading