Cloud Functions loves embarrassingly parallel workloads

Serverless compute platforms like IBM Cloud Functions continue to grow in popularity. The most common use cases for serverless functions are RESTful microservices, simple request/response handlers, data processing, event-driven apps, AI chatbots, and ETL pipelines. Beyond that, recent interest in applying serverless technology to all kinds of “fan-out” or embarrassingly parallel workloads is driving continued adoption of serverless as a compute platform.

Embarrassingly parallel workloads can be split into many sub-tasks, all running independently from each other. For example, instead of trying to watermark 10,000 images sitting in object storage using a single machine, with serverless, it is possible to just run 10,000 watermarking operations in parallel. This pattern of splitting a data set into independent units to be processed applies to a variety of tasks. Background processes, Monte Carlo simulations, batch processing, video transcoding, processing objects on object storage, model scoring, web scraping, genetic sequence analysis, and financial risk modeling are all candidate workloads for this approach. In formal research, operations, scientific computing, and computer science, “embarrassingly parallel algorithms” refer to those algorithms that can be executed in a parallel, fan-out fashion.
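
The fan-out pattern described above can be sketched in a few lines of Python. This is a minimal local illustration, not IBM Cloud Functions code: a thread pool stands in for the serverless platform, and `watermark` and `fan_out` are hypothetical names introduced only for this example. The point is that each item is processed independently, so the work can be spread across as many workers as are available.

```python
from concurrent.futures import ThreadPoolExecutor


def watermark(image_name: str) -> str:
    """Hypothetical stand-in for one watermarking operation on one image."""
    return f"{image_name}.watermarked"


def fan_out(task, items, max_workers=32):
    """Apply `task` to every item independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))


# 10,000 independent units of work, one per image (names are made up).
images = [f"image-{i:05d}.jpg" for i in range(10_000)]
results = fan_out(watermark, images)
```

On a serverless platform, each call to `watermark` would correspond to one function invocation, and the platform rather than a local pool would supply the parallelism.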

Benefits of IBM Cloud Functions

Serverless compute is an attractive option for all kinds of fan-out workloads that run in a multi-tenant fashion or on isolated clusters. The benefits of using serverless functions for such “high-throughput computing” workloads include the following:

  • No need to worry about managing the underlying infrastructure and operating system, including compliance
  • Managed, rapid scaling based on the need of the workload, no worries about capacity management
  • Provisioning measured in milliseconds
  • Granular pricing

Cloud Functions delivers strong business benefits for fan-out workloads

A case study with SiteSpirit revealed that moving from a traditional PaaS infrastructure to IBM Cloud Functions gave the company a 10x performance increase while cutting infrastructure costs by 90%. SiteSpirit works with tour operators who have thousands of pictures that can be auto-cropped, sharpened, resized, etc. via SiteSpirit’s SaaS offering running on the IBM Cloud.

As another example, running Monte Carlo simulations for three-year stock predictions went from ~250 minutes on a powerful notebook to 90 seconds when run on IBM Cloud Functions. The developer did not need to provision and set up a cluster. They didn’t need to worry about tearing down the compute platform at the end of the simulation. They never had to touch a server. Functions managed all the provisioning and scaling for the simulation without any developer interaction.

Technical details

IBM Cloud Functions supports this serverless execution model in a highly optimized fashion. For each request or task, it assigns or creates a container with dedicated memory and CPU. For example, if 1,000 parallel requests come in, it spins up 1,000 containers within seconds; they all do their job and go away again immediately after the job is finished (or are cached for reuse for that particular customer). The user pays only for the exact capacity and time the function was running. The business logic that runs with each function invocation can be written in virtually any programming language.

If the number of parallel invocations exceeds the namespace quota (the default is 1,000), the IBM Event Streams service can act as a buffer that handles retries transparently. Alternatively, the client code can handle retries itself when concurrency is too high or server-side errors occur.
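
Client-side retries of the kind mentioned above are often implemented with exponential backoff. The sketch below shows one way this could look. It is not tied to any IBM SDK: `TooManyRequests`, `invoke_with_retry`, and `make_flaky_invoke` are hypothetical names, and the `invoke` callable stands in for whatever client call actually triggers a function invocation.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical error raised when the concurrency limit is hit."""


def invoke_with_retry(invoke, payload, max_attempts=5, base_delay=0.1,
                      sleep=time.sleep):
    """Call `invoke(payload)`, retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return invoke(payload)
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))


def make_flaky_invoke(failures):
    """Build a fake invoke function that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def invoke(payload):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TooManyRequests("concurrency limit reached")
        return {"echo": payload}

    return invoke


# Two simulated rejections, then success; sleeping is disabled for the demo.
result = invoke_with_retry(make_flaky_invoke(2), {"n": 1}, sleep=lambda _: None)
```

Passing `sleep` as a parameter is a design choice made here so that the backoff can be switched off in tests; a real client would simply use the default.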

In many ways, IBM Cloud Functions serves as a large, distributed, highly parallel computer. This “computer” scales transparently and very rapidly with the number of parallel function invocations (each having dedicated CPU and memory).

Pywren: Serverless functions for AI and Data analytics

IBM Cloud Functions also integrates deeply with Pywren. Pywren is an open source project that targets data scientists and analytics experts programming in Python. This project further simplifies running parallel, fan-out workloads and makes the underlying cloud service entirely transparent. The user writes the function in plain Python and adds two additional lines of Pywren-specific code. This is all it takes to make Pywren handle the fan-out, calling all function invocations transparently and aggregating the results. See this blog post for more details.

Good workloads for IBM Cloud Functions

The following kinds of parallel workloads are particularly well suited to using Functions:

  • Processing docs/images/audio/video: OCRing images, sharpening a million images, converting audio files or PDFs, and processing hundreds or thousands of frames of a video in parallel are all examples of tasks that can be executed in a highly parallelized fashion: for example, one or a small set of files per function invocation, with thousands of those function invocations running in parallel.
  • Batch processing: It is typically possible to break a batch-processing job into many smaller units of work, which are then processed in parallel. When run on Cloud Functions, such batch processing usually becomes much faster and more resource-efficient. This is due to the inherent and rapid scaling to many hundreds or thousands of parallel function containers. These batch-processing jobs might have run on the mainframe or other machines before, or they could be net-new workloads.
  • Map/(reduce): Cloud Functions is very well-suited for executing any kind of data- and/or compute-centric map task at large scale, without bothering the user with infrastructure details.
  • Model scoring (for images, data, etc.): Cloud Functions is perfectly suited to run model scoring at scale across any kind of data (e.g., object detection for a large number of frames extracted out of a video, doing signature validation, modeling players for a virtual reality game, etc.). If there is no work to do, there are no costs generated. However, when the workload spikes, there are almost instantaneously thousands of function containers available to do the parallel processing.
  • OCR: If large numbers of documents need to be OCR’d, potentially by including machine-learning technologies, Cloud Functions is very well-suited. As documents come in in large numbers, each of them can be processed in parallel via a dedicated function invocation.
  • Monte Carlo simulations: Monte Carlo simulations apply in many industries, including finance, automotive, healthcare, etc. For example, financial-risk modeling is a typical use case for Monte Carlo simulations. Given that Monte Carlo simulations are inherently parallel, thousands of those simulations can be run in a very cost-efficient fashion using IBM Cloud Functions.
  • Parallel data processing: Cloud Functions is in many cases the fastest and most economical solution for processing large volumes of data residing on object storage or in a database.
  • Any kind of parallel or “fan-out” workload: Anything that would be run on a local machine by starting multiple threads can be run on Cloud Functions by executing many function invocations in parallel, across thousands of cores.
  • High throughput/scientific computing: Many compute-intensive tasks from the scientific computing space (e.g., molecule simulations) lend themselves extremely well to running on Cloud Functions, from both a cost and a processing-performance perspective.
  • Financial sector workloads: Post-trade analysis, risk modeling, and fraud surveillance are just a few examples of financial sector workloads that can be parallelized very well and can therefore often be executed most economically on Cloud Functions.
  • Batch processing of object storage data: Processing larger numbers of objects residing on object storage fits the inherent parallelization in Cloud Functions very well, and is therefore in many cases the most economical approach.
  • Healthcare workloads: Drug screening and DNA sequencing are typical healthcare-related workloads which can be parallelized very well.
  • Web scraping: If there are millions of web pages that need to be searched, parsed, etc., this lends itself extremely well to running it with IBM Cloud Functions. Each function would be processing one or a small set of web pages, with thousands of them running in parallel.
  • Background tasks: Any kind of background processing or task is very well-suited to be realized via IBM Cloud Functions. Only the business logic of a background task needs to be defined, and it can then be called in any volume without having to worry about scaling or capacity management.
  • Hyperparameter tuning: This is a typical task in the machine-learning space, where a large number of parameters is tested in combination with a machine-learning model to see which set of parameters makes the model deliver the best possible results. This task is also inherently parallel, which makes it very attractive to run on Cloud Functions.
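
To make the “inherently parallel” nature of Monte Carlo workloads from the list above concrete, here is a small sketch that splits a simulation into independent chunks and combines the partial results. It estimates pi as a simple stand-in for a real model such as financial risk. A local thread pool again stands in for parallel function invocations, and `simulate_chunk` and `estimate_pi` are hypothetical names used only for this illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(args):
    """Run one independent chunk: count random points inside the unit circle."""
    seed, samples = args
    rng = random.Random(seed)  # each chunk gets its own seeded generator
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(tasks=20, samples_per_task=10_000, max_workers=8):
    """Fan out `tasks` chunks, then reduce the partial counts into one estimate."""
    work = [(seed, samples_per_task) for seed in range(tasks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        total_hits = sum(pool.map(simulate_chunk, work))
    return 4.0 * total_hits / (tasks * samples_per_task)


pi_estimate = estimate_pi()
```

Because every chunk depends only on its own seed and sample count, the chunks share no state and can run in any order, which is what makes this kind of workload a natural fit for one function invocation per chunk.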

Cloud Native HPC?

Workloads like Monte Carlo simulations, genome sequencing, financial risk modeling, etc. are traditionally thought of as high-performance computing (HPC) workloads. These workloads are often executed on dedicated on-prem compute platforms. Serverless functions represent a cloud-native approach to high-performance computing for embarrassingly parallel workloads. Coupled with new libraries like Pywren, scientists and analysts are able to completely focus on the problem domain and not worry themselves with server configuration and infrastructure.

Learn more about how IBM Cloud Functions can be used for your workloads today.
