Process large data sets at massive scale with PyWren over IBM Cloud Functions
(Ed.–Josep Sampé–Universitat Rovira i Virgili–co-authored this post.)
Let’s say you write a function in Python to process and analyze some data. You successfully test the function using a small amount of data and now you want to run the function as a serverless action at massive scale, with parallelism, against terabytes of data.
What options do you have? Obviously, you don’t want to learn cloud IT tricks and setup VMs, for example. Nor do you necessarily want to become a serverless computing expert in scaling data inputs, processing outputs, and monitoring concurrent executions. The bottom line is you want to run your code against a large data set, get the results, and consider the value of insights gained.
PyWren to the rescue
PyWren is an open source project developed by Eric Jonas and RISElab. PyWren executes user’s Python code and its dependencies as serverless actions on a serverless platform. Without requiring knowledge of how serverless actions are invoked and run, PyWren executes them at massively scale and then monitors the results.
PyWren includes a client that runs locally and a runtime that deploys in the cloud as a serverless action. PyWren uses object storage to pass information between client and server sides. On the client side, PyWren takes the Python code and relevant data, serializes them, and puts them into object storage. The client invokes the stored actions to run in parallel and then waits for the results. On the server side, for each function, PyWren takes the code and processes the relevant data from object storage, storing the results.
PyWren provides great value for uses cases like processing data on Cloud Object Storage, running embarrassingly parallel compute jobs (e.g. Monte-Carlo simulations), enriching data with additional attributes, and many others.
For now, consider this basic example in which “my_function(x)” does an addition of a number with 7.
In this case, the PyWren client first serializes “my_function(x)” with its input data, invoking three parallel actions, one per value of in “data” array. Each serverless action contains a PyWren runtime. The PyWren runtime is responsible to deserializing its function and executing it. PyWren also creates a main action to monitor all the executions and their progress, eventually returning the results back to object storage and the user’s client.
Optimizing PyWren for IBM Cloud Functions
The original version of PyWren did not support executions on the IBM Cloud. Myself and Josep Sampé (Universitat Rovira i Virgili) adapted PyWren both to run over IBM Cloud Functions and use IBM Cloud Object Storage. Since IBM Cloud Functions is based on Apache OpenWhisk, which uses Docker containers, we also enabled PyWren to work with Docker. (The original PyWren uses Conda to deploy the PyWren runtime.)
If a user wants to add new packages to the PyWren runtime or to use different Python versions, these steps show how to create a new Docker image in place of the default.
We also extended PyWren to support a reduce function, which now enables PyWren to run complete map reduce flows. Our implementation of map is just a starting point, we are planning further enhancements.
In extending PyWren to work with IBM Cloud Object Storage, we also added a partition discovery component that allows PyWren to process large amounts of data stored in the IBM Cloud Object Storage. In short, the partition discovery module gets a COS bucket as an input and returns a list of virtual chunks of specific size. Upon execution, each data chunk is passed as an input to the actions. As a bucket may contains millions of objects of different sizes, partition discovery allows processing data in parallel with many actions, where each action only processes a specific chunk size. Users can control chunk size via runtime configuration.
Try it yourself
You can easily get the code to experiment with PyWren on IBM Cloud Functions. Install locally or get it from IBM Watson Studio. The project page contains working examples and details how to setup PyWren.
In the meantime, here is a video that demonstrates the following use case: use airbnb data for various cities in conjunction with a tone analyzer and then plot the results visually, on a map.
Josep and I have contributed our code back to the PyrWren community under Apache License 2.0. Our top priorities going forward are to collaborate with the community on improving the fault tolerance of PyWren, sharing data between actions, and more powerfully integrating PyWren with IBM Cloud Object Storage.
Josep Sampé is a PhD student at the Department of Computer Engineering and Mathematics at Universitat Rovira i Virgili (URV), Spain. He is currently collaborating with the IBM Research in developing frameworks on top of serverless and object storage architectures.