Data management and analytics using serverless form factors
It is no coincidence that machine-generated data (such as audit trails, clickstream logs, or IoT sensor streams) tends to be stored in cloud platforms rather than in traditional on-premises data repositories. Such data often originates in the cloud, or at least uses the cloud as a medium to distribute or act on its data streams. It is therefore natural to also use the cloud for more complex analytics that span longer time horizons of this data. In this article, we look at the opportunity to establish a platform that manages and analyzes such data in a cloud-native way by using serverless form factors.
Traditional data and analytics
For decades, relational database management systems (RDBMS) have been a well-established mechanism to store data for flexible analytics with SQL. A dedicated class of RDBMS offerings even emerged that is specifically optimized for analytics: data warehouses. A data warehouse is a deeply integrated stack of components that facilitate analytics, with features like a tightly integrated scale-out storage and compute architecture, a query compiler and optimizer, a metadata catalog for table definitions, data loading functions, indexing components, and application client protocols and SDKs.
As the variety of data sources and producers grew over time, the need for more flexibility to analyze different data formats gained significance. At the same time, data volumes grew rapidly, and the need for elastic and scalable (yet affordable) analytic solutions also became vital. As a result, Hadoop emerged and quickly gained adoption about a decade ago. Besides openness to many data formats and flexible hardware deployments, Hadoop also introduced a broader set of analytic APIs beyond just SQL. SQL is nevertheless still the most popular analytic paradigm used with Hadoop, and it is supported by a multitude of popular Hadoop-based SQL engines, including Hive, Drill, Spark SQL, Impala, and Presto. Commercial SQL engines on Hadoop have also emerged, such as IBM BigSQL and Oracle Big Data SQL. As a result, many data warehouse workloads are served today by Hadoop-based solution stacks.
The cloud promise
The main benefit for users of the cloud consumption model is the ability to delegate responsibility for infrastructure to cloud platform vendors, allowing the users to focus entirely on their core business and value-add generation. When it comes to data and analytics, this means specifically focusing less on infrastructure and more on generating and acting on insights.
The serverless consumption model applies the principles of the cloud consumption model even more consistently. Customers of serverless offerings are not even involved in defining the resources to use. There are no standing costs: serverless implies pay-per-job pricing, so you use what you need and pay only for what you use. In addition, consumers don’t have to worry about making a workload highly available because they can rely on the serverless service to execute jobs in a highly available fashion. The most prominent example of a serverless service is Function-as-a-Service (FaaS), such as IBM Cloud Functions, AWS Lambda, Azure Functions, or Google Cloud Functions.
From a consumption perspective, serverless offerings are in many respects comparable to the sharing economy. Ride sharing is a good analogy: with Uber or Lyft, the consumer only orders and pays for a ride; with serverless, they only submit and pay for an individual job execution.
So, serverless is indeed highly topical, and, not surprisingly, it is growing rapidly in popularity. Serverless is delivering the cloud promise by eliminating the responsibility for infrastructure and allowing for 100% focus on insights.
How can data and analytics be conducted in a serverless fashion?
The traditional data and analytics stacks described in the first section are not well suited to being delivered in a serverless form factor as integrated wholes. Cloud delivery in general, and the serverless abstraction in particular, require us to take a step back and look at the core components required to conduct data and analytics.
A serverless data and analytics platform
We define a data and analytics platform as a platform that allows you to store, manage, life-cycle, and analyze data and develop analytic solutions on top of it. The following are the key ingredients for a comprehensive, serverless data and analytics platform:
Storage as a service.
A catalog service to handle metadata and governance.
Data ingest services.
Data transformation services.
Analytic and query services.
Analytic application runtimes, orchestration, and APIs.
Let’s look at these key elements from the perspective of a serverless form factor.
1. Serverless storage
Serverless storage sounds like a contradiction in terms at first. But cloud object storage, as provided by cloud vendors like IBM, AWS, and Google, is in fact serverless storage in the sense that the client can simply upload data into a logical container called a bucket. There are no externalized minimum or maximum data volume constraints. The user is charged only for the data that they store, based on the data volume and on how long they keep it before deleting it. In addition, there is a charge whenever the data is read, which depends on the volume of data read and the number of access requests made.
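The charging model described above can be written down as simple arithmetic. The following is a minimal sketch in Python; the rate names and any numbers used with it are hypothetical placeholders, not any vendor's actual prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRates:
    """Hypothetical price list; real vendor rates differ and change over time."""
    per_gb_month: float       # charge for keeping 1 GB stored for one month
    per_gb_read: float        # charge per GB of data read
    per_1000_requests: float  # charge per 1,000 access requests


def storage_cost(rates, stored_gb, months_stored, read_gb, requests):
    """Cost = volume stored over time + volume read + number of requests."""
    storage = stored_gb * months_stored * rates.per_gb_month
    reads = read_gb * rates.per_gb_read
    access = (requests / 1000) * rates.per_1000_requests
    return storage + reads + access
```

Every term is proportional to actual usage, so a bucket that holds no data and receives no requests costs nothing. That absence of a standing cost is what makes object storage fit the serverless model.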
2. Data catalog
A data catalog service is required to manage all information about the data stored in cloud object storage. It needs to maintain the list of data sets (we can also call them “tables”) along with their associated information. This includes technical metadata, such as the schema of each data set with column names and types, pointers to physical objects in cloud object storage (i.e., the physical partitions of the data set), and statistics about the cardinality and column value distribution in the partitions. It also includes frequently used projections of the actual data sets (we can also call them “views”), which are stored SQL queries. Furthermore, it includes access control lists for the data sets, information about the lineage of the data sets, and the semantic type of the data stored in them (e.g., sensitive data, such as credit card numbers). The latter is relevant for data governance specifically. To be compatible with serverless, the consumption externals of this catalog service must expose only the logical concepts of the catalog, as a namespace that can be created and protected for access. Also, there must be no base cost for merely having created a catalog; pricing must be pay-as-you-go on a per-data-set basis.
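To make the kinds of metadata listed above more concrete, here is a minimal sketch of a catalog as plain Python data classes. All class and field names are hypothetical illustrations and do not reflect the data model of any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Column:
    name: str
    data_type: str
    semantic_type: Optional[str] = None  # e.g. "credit_card_number", for governance


@dataclass
class Partition:
    object_key: str  # pointer to a physical object in cloud object storage
    row_count: int   # cardinality statistic for this partition


@dataclass
class DataSet:
    name: str
    columns: List[Column]
    partitions: List[Partition] = field(default_factory=list)
    readers: Set[str] = field(default_factory=set)         # access control list
    derived_from: List[str] = field(default_factory=list)  # data lineage

    def total_rows(self) -> int:
        return sum(p.row_count for p in self.partitions)

    def sensitive_columns(self) -> List[str]:
        return [c.name for c in self.columns if c.semantic_type is not None]


@dataclass
class Catalog:
    namespace: str  # the only thing the consumer creates and protects
    data_sets: Dict[str, DataSet] = field(default_factory=dict)
    views: Dict[str, str] = field(default_factory=dict)  # view name -> stored SQL text

    def register(self, data_set: DataSet) -> None:
        self.data_sets[data_set.name] = data_set

    def can_read(self, user: str, data_set_name: str) -> bool:
        data_set = self.data_sets.get(data_set_name)
        return data_set is not None and user in data_set.readers
```

Note that nothing in this model refers to servers or capacity: the consumer sees only a namespace, data sets, and views, which is the property the serverless form factor asks for.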
3. Data ingest services
Data ingest services are message hub services that, when connected to cloud object storage, enable the latter to become the primary landing point for machine-generated data (such as application log data or IoT sensor data). Data ingest services also include self-service bulk data upload and ETL services that allow the user to copy existing data from other repositories to cloud object storage. To be compatible with the serverless paradigm, the message hub service must expose only logical elements, such as a topic or message queue. The consumer must not be confronted with provisioning or maintaining a dedicated Kafka server with a dedicated base cost. Pricing may depend only on the volume of messages actually processed, the number of topics or queues, and the message buffer storage actually consumed. Similar requirements apply to the upload and ETL services, where only the upload or ETL jobs that are actually executed are charged.
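As an illustration of what "landing" a message stream in object storage means, the following sketch groups messages from a topic into batches and writes each batch as one object. It is a simplified stand-in under stated assumptions: the bucket is modeled as an in-memory dictionary, and the function name, key layout, and batch size are hypothetical rather than taken from any real connector.

```python
import json


def land_messages(messages, bucket, topic, batch_size=3):
    """Write messages to `bucket` as JSON-lines objects, one object per batch.

    `bucket` is a plain dict standing in for a cloud object storage bucket.
    Returns the list of object keys that were written.
    """
    keys = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        key = f"{topic}/batch-{start // batch_size:05d}.jsonl"
        bucket[key] = "\n".join(json.dumps(message) for message in batch)
        keys.append(key)
    return keys
```

The consumer of such a service names only a topic and a bucket. The number of objects written, and therefore the cost, follows from the number of messages that actually arrive.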
4. Data transformation services
Data transformation services allow data engineers to cleanse, filter, aggregate, enrich, compress, and re-layout data within cloud object storage in order to prepare or optimize it for the actual analytics. The transformation services can be used to build fully automated data transformation and preparation pipelines together with the orchestration capabilities described in item 6 below (i.e., analytic application runtimes, orchestration, and APIs). To be serverless, the only thing that the consumer needs to be concerned about is the volume and complexity of the individual transformation jobs that they run. There must be no standing costs for the mere existence of a transformation service instance.
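A single transformation job of the kind described above might look like the following sketch, which cleanses, filters, and aggregates a batch of sensor records. It assumes the records have already been read from object storage into Python dictionaries; the field names `device_id` and `temperature` and the plausibility range are hypothetical.

```python
from collections import defaultdict


def average_temperature(records, min_temp=-50.0, max_temp=150.0):
    """Return the mean temperature per device, ignoring unusable readings."""
    totals = defaultdict(lambda: [0.0, 0])  # device -> [sum, count]
    for record in records:
        device = record.get("device_id")
        raw = record.get("temperature")
        # Cleanse: skip records with missing fields.
        if device is None or raw is None:
            continue
        # Cleanse: skip values that cannot be parsed as numbers.
        try:
            temperature = float(raw)
        except (TypeError, ValueError):
            continue
        # Filter: drop readings outside the plausible range.
        if not (min_temp <= temperature <= max_temp):
            continue
        # Aggregate: accumulate sum and count per device.
        totals[device][0] += temperature
        totals[device][1] += 1
    return {device: total / count for device, (total, count) in totals.items()}
```

In a serverless setting, the consumer would submit a job like this and be charged according to the data volume it processes, with no cost while no job is running.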
5. Serverless analytics
The bread and butter of analytics is SQL analytics. Consequently, there needs to be a SQL query service that allows the client to run serverless SQL queries on data in cloud object storage, where only the individual SQL query executions matter for consumption and billing. But beyond SQL, we also must not lose the breadth of analytic APIs that Hadoop originally brought to the table. Consequently, we also require more advanced analytic services, such as machine learning as a service. These advanced analytics services must be consumable in a serverless fashion, where all that the consumer has to be concerned about is how many model trainings or scorings they run and on what volume of data. There must be no standing cost for the mere existence of a provisioned SQL or ML service instance.
6. Analytic application runtimes, orchestration, and APIs
A customer analytic solution typically requires a certain amount of custom code. Function-as-a-Service provides a serverless runtime for this custom code. It also allows the customer to build higher-level cloud services on top of the data and analytics cloud platform that expose their own APIs to their clients. The full platform synergy becomes reality, however, when cloud functions are combined and integrated with the other components of the data and analytics platform. The event-driven execution model of cloud functions makes them a natural orchestration layer to construct and automate entire data pipelines, starting with serverless ETL jobs for data ingest. Successful execution of these jobs then triggers data transformation jobs. These can in turn trigger other transformation jobs until, finally, they trigger analytic jobs, such as a serverless SQL query to produce a report or a serverless ML model update.
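The chaining described above, where the completion of one job is the event that triggers the next, can be sketched with a small in-process event bus. This is only an illustration of the event-driven pattern under simplifying assumptions: the `EventBus` class, the `chain` helper, and the event names are hypothetical and are not the API of any Function-as-a-Service offering.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process stand-in for a platform's event and trigger mechanism."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.log = []  # names of emitted events, in order

    def on(self, event, handler):
        self._handlers[event].append(handler)

    def emit(self, event, payload):
        self.log.append(event)
        for handler in list(self._handlers.get(event, ())):
            handler(payload)


def chain(bus, trigger, job, done):
    """Run `job` whenever `trigger` fires, then emit `done` with its result."""
    def handler(payload):
        bus.emit(done, job(payload))
    bus.on(trigger, handler)


def etl_job(payload):
    return {**payload, "stage": "ingested"}


def transform_job(payload):
    return {**payload, "stage": "transformed"}


def report_job(payload):
    return {**payload, "stage": "reported"}


def build_pipeline(bus):
    # Each stage starts only after the previous stage has completed.
    chain(bus, "data.arrived", etl_job, "etl.done")
    chain(bus, "etl.done", transform_job, "transform.done")
    chain(bus, "transform.done", report_job, "report.ready")
```

Emitting a single `data.arrived` event on a bus prepared with `build_pipeline` runs the three jobs in order. No component polls or stays running between events, which is what allows such a pipeline to be billed per execution.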
Serverless data and analytics in IBM Cloud
Where does IBM Cloud stand on delivering a serverless data and analytics platform as described above? In fact, it has a really compelling portfolio of relevant serverless services that makes it a prime player for such platforms:
IBM Cloud Object Storage for serverless storage.
IBM Event Streams for serverless Kafka message stream landing.
IBM Cloud SQL Query for serverless SQL on Cloud Object Storage, supporting serverless SQL ETL, serverless SQL transformations, and serverless SQL analytics.
Watson Knowledge Catalog for metadata catalog and governance for data on Cloud Object Storage.
IBM Cloud Functions for serverless runtimes and custom data transformation code (non-SQL).
IBM Cloud provides complete coverage of the serverless services that form an overall serverless data and analytics platform. All of these services have special mutual integration and optimization points. For instance, the IBM Event Streams service has optimized built-in connectors to store messages in Cloud Object Storage. IBM Cloud SQL Query is deeply integrated with and optimized for Cloud Object Storage, which acts as the source of tables for the SQL statements as well as the target of SQL result sets. Watson Knowledge Catalog relies on Cloud Object Storage as the default persistence for data. IBM Cloud Functions offers out-of-the-box SQL cloud functions.
As a next step, you should get your hands on these services. The following resources will help with your further evaluation: