Cloud Functions loves embarrassingly parallel workloads
Serverless compute platforms like IBM Cloud Functions continue to grow in popularity. The most common use cases for serverless functions are RESTful microservices, simple request/response handlers, data processing, event-driven apps, AI chatbots, and ETL pipelines. Beyond that, recent interest in applying serverless technology to all kinds of “fan-out” or embarrassingly parallel workloads is driving continued adoption of serverless as a compute platform.
Embarrassingly parallel workloads can be split into many sub-tasks, all running independently from each other. For example, instead of trying to watermark 10,000 images sitting in object storage using a single machine, with serverless, it is possible to just run 10,000 watermarking operations in parallel. This pattern of splitting a data set into independent units to be processed applies to a variety of tasks. Background processes, Monte Carlo simulations, batch processing, video transcoding, processing objects on object storage, model scoring, web scraping, genetic sequence analysis, and financial risk modeling are all candidate workloads for this approach. In formal research, operations, scientific computing, and computer science, “embarrassingly parallel algorithms” refer to those algorithms that can be executed in a parallel, fan-out fashion.
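The fan-out pattern described above can be sketched locally in a few lines of Python. This is only an illustration under stated assumptions: `watermark` is a hypothetical stand-in for one unit of work, `fan_out` is a made-up helper name, and a thread pool plays the role that parallel function invocations would play on a serverless platform.

```python
from concurrent.futures import ThreadPoolExecutor


def watermark(image_name: str) -> str:
    """Hypothetical stand-in for one independent unit of work."""
    return f"{image_name}.watermarked"


def fan_out(items, worker, max_parallel=32):
    """Run worker on every item independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(worker, items))


if __name__ == "__main__":
    # 10,000 independent tasks, none of which depends on another.
    images = [f"image-{i:05d}.png" for i in range(10_000)]
    results = fan_out(images, watermark)
    print(len(results), results[0])
```

Because no task reads another task's output, the same worker could be invoked once per object on a serverless platform instead of once per thread on a single machine.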
Benefits of IBM Cloud Functions
Serverless compute is an attractive option for all kinds of fan-out workloads that run in a multi-tenant fashion or on isolated clusters. The benefits of using serverless functions for such “high-throughput computing” workloads include the following:
- No need to worry about managing the underlying infrastructure and operating system, including compliance
- Managed, rapid scaling based on the needs of the workload, with no capacity management required
- Provisioning measured in milliseconds
- Granular pricing
Cloud Functions delivers strong business benefits for fan-out workloads
A case study with SiteSpirit revealed that moving from a traditional PaaS infrastructure to IBM Cloud Functions gave them a 10x performance increase while saving 90% of infrastructure costs at the same time. SiteSpirit works with tour operators that have thousands of pictures to be auto-cropped, sharpened, resized, and so on via SiteSpirit's SaaS offering running on the IBM Cloud.
As another example, running Monte Carlo simulations for three-year stock predictions went from ~250 minutes on a powerful notebook to 90 seconds when run on IBM Cloud Functions. The developer did not need to provision and set up a cluster. They didn’t need to worry about tearing down the compute platform at the end of the simulation. They never had to touch a server. Functions managed all the provisioning and scaling for the simulation without any developer interaction.
IBM Cloud Functions supports this serverless execution model in a highly optimized fashion. For each request or task, it assigns or creates a container with dedicated memory and CPU. For example, if 1,000 parallel requests come in, it spins up 1,000 containers within seconds; they all do their job and go away immediately after the job is finished (or are cached for reuse by that particular customer). The user pays only for the exact capacity and time the function was running. The business logic that runs with each function invocation can be written in virtually any programming language.
If the number of parallel invocations exceeds the namespace quota (the default is 1,000), the IBM Event Streams service can act as a buffer that handles retries transparently. As an alternative, the client code can handle retries itself when concurrency is too high or server-side errors occur.
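Client-side retry handling could look roughly like the sketch below. It is a minimal example under assumptions, not the platform's API: `TooManyRequests` is a hypothetical exception standing in for whatever error the client receives when the concurrency limit is hit, and `invoke` is any callable that performs one function invocation.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical error: the platform rejected the call because of concurrency limits."""


def invoke_with_retry(invoke, payload, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call invoke(payload); on TooManyRequests, back off exponentially and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke(payload)
        except TooManyRequests:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            # Base delay doubles with each attempt (0.5 s, 1 s, 2 s, ...);
            # random jitter keeps many clients from retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so the waiting can be swapped out, for example in tests; in normal use the default `time.sleep` applies.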
In many ways, IBM Cloud Functions serves as a large, distributed, highly parallel computer. This "computer" scales transparently and very rapidly with the number of parallel function invocations, each of which has dedicated CPU and memory.
Pywren: Serverless functions for AI and Data analytics
IBM Cloud Functions also integrates deeply with Pywren. Pywren is an open source project that targets data scientists and analytics experts programming in Python. This project further simplifies running parallel, fan-out workloads and makes the underlying cloud service entirely transparent. The user writes the function in plain Python and adds two additional lines of Pywren-specific code. This is all it takes to make Pywren handle the fan-out, calling all function invocations transparently and aggregating the results. See this blog post for more details.
Good workloads for IBM Cloud Functions
The following kinds of parallel workloads are particularly well suited to using Functions:
- Processing docs/images/audio/video: OCRing images, sharpening a million images, converting audio files or PDFs, and processing hundreds or thousands of frames of a video in parallel are all examples of tasks that can be executed in a highly parallelized fashion. For example, each function invocation handles one file or a small set of files, with thousands of those invocations running in parallel.
- Batch processing: It is typically possible to break a batch-processing job into many smaller units of work, which are then processed in parallel. When run on Cloud Functions, such batch processing usually becomes much faster and more resource-efficient. This is due to the inherent and rapid scaling to many hundreds or thousands of parallel function containers. These batch-processing jobs might have run on the mainframe or other machines before, or they could be net-new workloads.
- Map/(reduce): Cloud Functions is very well-suited for executing any kind of data- and/or compute-centric map task at large scale, without bothering the user with infrastructure details.
- Model scoring (for images, data, etc.): Cloud Functions is perfectly suited to run model scoring at scale across any kind of data (e.g., object detection for a large number of frames extracted out of a video, doing signature validation, modeling players for a virtual reality game, etc.). If there is no work to do, there are no costs generated. However, when the workload spikes, there are almost instantaneously thousands of function containers available to do the parallel processing.
- OCR: If large numbers of documents need to be OCR’d, potentially by including machine-learning technologies, Cloud Functions is very well-suited. As documents come in in large numbers, each of them can be processed in parallel via a dedicated function invocation.
- Monte Carlo simulations: Monte Carlo simulations apply in many industries, including finance, automotive, healthcare, etc. For example, financial-risk modeling is a typical use case for Monte Carlo simulations. Given that Monte Carlo simulations are inherently parallel, thousands of those simulations can be run in a very cost-efficient fashion using IBM Cloud Functions.
- Parallel data processing: Cloud Functions is in many cases the fastest and most economical solution for processing large volumes of data residing on object storage or in a database.
- Any kind of parallel or "fan-out" workload: Anything that would be run on a local machine by starting multiple threads can be run on Cloud Functions by executing many function invocations in parallel, across thousands of cores.
- High throughput/scientific computing: Many compute-intensive tasks from the scientific computing space (e.g., molecule simulations) lend themselves extremely well to running on Cloud Functions, from both a cost and a processing performance perspective.
- Financial sector workloads: Post-trade analysis, risk modeling, and fraud surveillance are just a few examples of financial sector workloads that can be parallelized very well and can therefore often be executed most economically on Cloud Functions.
- Batch processing of object storage data: Processing larger numbers of objects residing on object storage fits the inherent parallelization in Cloud Functions very well, and is therefore in many cases the most economical approach.
- Healthcare workloads: Drug screening and DNA sequencing are typical healthcare-related workloads which can be parallelized very well.
- Web scraping: If millions of web pages need to be searched, parsed, etc., the job lends itself extremely well to IBM Cloud Functions. Each function would process one or a small set of web pages, with thousands of them running in parallel.
- Background tasks: Any kind of background processing or task is very well-suited to be realized via IBM Cloud Functions. Only the business logic of a background task needs to be defined; it can then be called in any volume without having to worry about scaling or capacity management.
- Hyperparameter tuning: This is a typical task in the machine-learning space, where a large number of parameter combinations is tested with a machine-learning model to see which set of parameters makes the model deliver the best possible results. This task is also inherently parallel, which makes it very attractive to run on Cloud Functions.
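Several of the workloads listed above, such as Monte Carlo simulations and map/(reduce) jobs, share one shape: many independent tasks followed by a cheap aggregation step. The sketch below illustrates that shape with a toy problem (estimating pi by random sampling) rather than any of the financial simulations mentioned in this post. The names `simulate_batch` and `estimate_pi` are made up for illustration, and a local thread pool stands in for parallel function invocations.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_batch(task):
    """One independent unit of work: count random points inside the unit quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # own generator per task, so tasks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks=100, samples_per_task=10_000, max_parallel=16):
    """Map: run the batches independently. Reduce: sum the hit counts."""
    tasks = [(seed, samples_per_task) for seed in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        total_hits = sum(pool.map(simulate_batch, tasks))
    return 4.0 * total_hits / (num_tasks * samples_per_task)


if __name__ == "__main__":
    print(estimate_pi())
```

Giving every task its own seed keeps the batches independent and the overall run reproducible, which is what allows each batch to be handed to a separate function invocation.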
Cloud Native HPC?
Workloads like Monte Carlo simulations, genome sequencing, financial risk modeling, etc. are traditionally thought of as high-performance computing (HPC) workloads. These workloads are often executed on dedicated on-prem compute platforms. Serverless functions represent a cloud-native approach to high-performance computing for embarrassingly parallel workloads. Coupled with new libraries like Pywren, scientists and analysts are able to completely focus on the problem domain and not worry themselves with server configuration and infrastructure.