Object storage data (pre-)processing, hyperparameter optimization, searching and processing logs, heavy computational tasks (e.g., Monte Carlo simulations or genome analytics), downloading large volumes of data, web scraping and model scoring are just a few examples of scenarios where a lot of CPU-, memory- and/or network-intensive work needs to be done.

A common way to handle this programmatically is to run a “for” loop and kick off asynchronous processing within that loop. In Python, this typically means using multiprocessing.Pool or concurrent.futures, where the map operation is called with the function to be executed as one parameter and the list of (many) objects to be processed as the other. The remainder of this post focuses on multiprocessing.Pool for the sake of simplicity, but there is an equivalent approach for concurrent.futures.

Below is a conceptual example of how this works. The map operation receives two parameters:

import multiprocessing

def convert_image(image_url):
    # <logic downloading the image and converting it somehow>
    ...

with multiprocessing.Pool() as pool:
    # image1, image2, image3 are the inputs to be processed
    results = pool.map(convert_image, [image1, image2, image3])

# Pool.map returns the results themselves, so no .get() is needed
print(results[0])
print(results[1])
…
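For completeness, the equivalent approach with concurrent.futures mentioned above might look like the following sketch (reusing the hypothetical convert_image function and inputs):

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    # executor.map returns an iterator, so materialize it into a list
    results = list(executor.map(convert_image, [image1, image2, image3]))

print(results[0])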

The beauty of this approach is that it is entirely serverless from the developer’s point of view: to use it, a developer only has to pass in the operation to be executed n times and the n objects they’d like to have processed. However, with the original versions of these libraries, the restriction is that they can only take advantage of the CPU cores, memory, network bandwidth, etc. available to the (virtual) machine the Python process is running on.

Wouldn’t it be nice if, for each call of Pool.map with n objects as a parameter, n containers got spun up behind the scenes (or a smaller number, in case the elements are chunked)? Each container would handle its part of the work and vanish automatically once the work was completed. This would also demonstrate nicely how the often-discussed relevance of cloud to developers can be realized: the developer only has to write code, and when the code is executed, hundreds or thousands of CPU cores are (de-)provisioned transparently behind the scenes:

Lithops + IBM Cloud Code Engine

This foundational approach is implemented in the open source Lithops project as a client-side library.

The integration of Lithops with IBM Cloud Code Engine, our next-generation serverless offering, provides unprecedented flexibility. Among several other things, Code Engine allows you to allocate a large number of parallel containers behind the scenes. Each of them can be provisioned within seconds, with up to 32 GB of memory, 8 CPU cores and a maximum execution time of 2 hours (and these are just the defaults every user gets out of the box; we can raise them further on a per-user basis).

The code changes required to adopt Lithops in an existing Python program are minimal. In the ideal case, it’s literally a matter of following a so-called “drop-in library” approach: changing an import statement from multiprocessing.Pool to lithops.multiprocessing.Pool.
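As a sketch of what that swap looks like in practice (reusing the hypothetical convert_image function from above):

# Before: standard library, limited to the local machine
# from multiprocessing import Pool

# After: Lithops drop-in, fanning out to serverless containers
from lithops.multiprocessing import Pool

with Pool() as pool:
    results = pool.map(convert_image, [image1, image2, image3])

The rest of the program stays untouched: Pool.map keeps the same signature and still returns a list of results.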

The initial config instructions are also very minimal — you can find them documented here.
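For orientation only, a minimal configuration might look roughly like the sketch below, here passed as a plain Python dict to the FunctionExecutor. The exact keys depend on your Lithops version and chosen storage backend, so treat the key names as assumptions and rely on the linked instructions for the authoritative format:

import lithops

# Assumed key names for a Code Engine + IBM COS setup; verify against the docs
config = {
    'lithops': {'backend': 'code_engine', 'storage': 'ibm_cos'},
    'ibm': {'iam_api_key': '<IAM_API_KEY>'},
    'ibm_cos': {'storage_bucket': '<BUCKET_NAME>', 'region': '<REGION>'},
}

fexec = lithops.FunctionExecutor(config=config)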

Beyond that, Lithops allows for pure parallelization across a conceptually unlimited pool of resources, plus the application of a reduce/aggregation step at the end. That can be accomplished by simply passing in the reduce function as a third parameter, for example: map_reduce(business_logic, <list_of_data_elements>, reduce_operation).
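A minimal sketch of what this can look like with the Lithops FunctionExecutor API (the function names and data below are purely illustrative):

import lithops

def my_map_function(x):
    # Each invocation can run in its own container behind the scenes
    return x * 2

def my_reduce_function(results):
    # Receives the list of all map outputs and aggregates them
    return sum(results)

fexec = lithops.FunctionExecutor()
fexec.map_reduce(my_map_function, [1, 2, 3, 4], my_reduce_function)
print(fexec.get_result())  # prints 20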

The price doesn’t change: 10 cores for 100 seconds and 100 cores for 10 seconds both come to the same 1,000 core-seconds, so the bill is identical. Obviously, depending on the nature of the problem, there is a certain percentage of overhead for distributing the work, which needs to be taken into account.

Another advantage is that instead of an approach where capacity has to be allocated based on the most expensive operation in your Python program, each individual map operation in a longer Python program dynamically allocates exactly the capacity it needs. This allows for a significant performance boost in combination with significant cost savings (see diagram below).

Obviously, this approach can be leveraged in interactive scenarios (data science and others), where a data scientist wants to run some heavy processing operation while waiting in front of their screen for it to finish, whether in pure Python with an editor, in a Jupyter notebook (e.g., in Watson Studio or elsewhere), etc. It’s also applicable to continuously running backend applications written in Python (or, basically, any other piece of Python code that needs to do some heavy lifting).

As indicated in the diagrams above, from a developer perspective, this looks like a single program running on a single computer. In reality, a very large amount of distributed capacity is being (de-)allocated dynamically. All of this takes us one step further towards our vision of a serverless supercomputer, where we treat the cloud as a single computer in itself vs. a collection of VMs, containers, web servers, app servers and database servers. Watch this space.

Get started

Try it out for yourself by using IBM Cloud Code Engine and following the instructions captured here.

We would like to express our special thanks to URV, in particular Josep Sampe and Pedro Garcia-Lopez, who have been absolutely instrumental in developing Lithops and who contributed the multiprocessing API. We very much enjoy and appreciate the collaboration.
