Classifying edge data and analyzing potential storage options.
When using edge computing to analyze Internet of Things (IoT) data at the edge, we are faced with a deluge of data. Whether it is audio, video, sensor or telemetry data, every device emits data every second, and most of it is never stored or analyzed locally, nor transmitted northbound to enterprise data centers or public clouds. Storing and managing all that data costs money and time that enterprises do not have. So, what do we do with it all? By following some basic guidelines, enterprises can classify edge data effectively and separate the data they need from the data they can discard.
This blog post will attempt to both classify edge data and look at storage options. We need to ask ourselves two fundamental questions:
Do we really need all that data?
Do we have to store all that data?
To make those tasks easier, we recommend classifying the data for efficient analysis and storage.
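To make the idea concrete, here is a minimal sketch of edge-side triage: each telemetry reading is either forwarded northbound, stored locally, or discarded. The categories, field names and thresholds are illustrative assumptions, not part of any standard or product.

```python
# Hypothetical edge-side triage: decide per reading whether to forward it
# northbound, keep it locally, or discard it. All rules here are assumptions.

def triage(reading: dict) -> str:
    """Return 'forward', 'store' or 'discard' for one sensor reading."""
    value = reading["value"]
    low, high = reading.get("normal_range", (0.0, 100.0))
    if value < low or value > high:     # anomaly: worth sending upstream
        return "forward"
    if reading.get("audit_required"):   # compliance data must be retained
        return "store"
    return "discard"                    # routine telemetry: drop locally

readings = [
    {"sensor": "temp-01", "value": 121.5, "normal_range": (10.0, 90.0)},
    {"sensor": "temp-02", "value": 42.0, "normal_range": (10.0, 90.0)},
    {"sensor": "meter-07", "value": 55.0, "audit_required": True},
]
decisions = {r["sensor"]: triage(r) for r in readings}
```

Even a crude policy like this can eliminate most routine telemetry before it ever reaches a data center.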
Data is classified in many different ways, often determined by a specific discipline or practice. In statistics, there are two types of data: quantitative and qualitative. Data science goes a step further and classifies data as nominal, ordinal, discrete and continuous. Database management systems classify data into three types: short-term, long-term and useless data. Enterprises manage security and compliance by classifying data into four types: public, internal-only, confidential and restricted.
There are even methods to automatically classify data based on its content and its context. In practice, the data industry uses three methods of data classification: content-based, context-based and user-based.
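The three methods can be illustrated with a short sketch. The rules and labels below are assumptions for demonstration only; real classifiers are far more sophisticated.

```python
# Illustrative sketch of the three classification methods named above.
# All patterns, sources and labels are assumptions for demonstration.
import re
from typing import Optional

def content_based(payload: str) -> str:
    # Inspect the data itself, e.g. flag anything resembling a card number.
    return "confidential" if re.search(r"\b\d{16}\b", payload) else "public"

def context_based(source: str) -> str:
    # Classify by where the data came from, not by what it contains.
    return "restricted" if source.startswith("hr/") else "internal-only"

def user_based(label: Optional[str]) -> str:
    # Trust an explicit label applied by the data owner.
    return label or "unclassified"
```

In an edge solution, content-based rules like these would typically run close to the device, before any data is transmitted northbound.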
Edge solution data classification
An edge solution has two types of data:
System data required to configure and run the system components.
User data that is generated by the devices that are part of the solution.
System data — also referred to as operational data — is usually stored in small relational databases and/or NoSQL databases. The bulk of the data is User data, generated or captured by the devices and transmitted via the edge solution components. This is the data enterprises have to manage, deciding what to store, where, and for how long. To help make those decisions, edge User data can be classified into the categories shown below:
From an edge perspective, finer-grained classification of data isn’t practical because there are so many different types of devices, each sending data in a specific format and using a different protocol to get it across, as described in a prior blog, “Analytics at the Edge.”
Inferencing is typically done by models running on the devices or on edge servers on-premises, while dataset generation and model training are often still performed in the cloud. Apart from data privacy and protection considerations, transporting or streaming data to the traditional cloud is expensive in terms of both latency and cost. Things are changing with on-premises cloud solutions like IBM Cloud Satellite and AWS Outposts, and there are micro-modular data centers that are accessible within microseconds. No matter the environment, harsh or disconnected, data should be available and accessible at all times.
Like many edge platforms, IBM Edge Application Manager (IEAM) uses databases for storing and managing System data. It uses two PostgreSQL databases — one for the Exchange and another for the AgBot (agreement bot) — and a NoSQL MongoDB database that stores content for the Model Management Service, as shown in the figure below:
It is worth pointing out that IEAM implements a Control Plane and does not provide a Data Plane. Control Plane data is System data, while Data Plane data is User data. So, IEAM only manages System data. It is up to the applications that edge platforms deploy to manage the transmission of User data. See the reference to Eurotech’s ESF.
All Application Services running on edge devices that are configured to communicate with the edge hub are the source of all User data. That could be audio, video, image, telemetry and/or streaming data. Additional data (e.g., historical data) needed for retraining models can be injected or imported into the edge platform. All of this is critical data because it is used to provide insights and analytics, and it may need to be stored. The resources available in the cloud play a critical role in ML model creation and training, especially for deep learning models. Once trained, the model is then pushed to the edge.
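Under illustrative assumptions, that round trip can be sketched as follows: a trivial "model" (an anomaly threshold learned from historical telemetry) is trained, serialized into an artifact, and loaded on the device for local inference. Real pipelines would use an ML framework and a model management service; the names and data here are invented for the example.

```python
# Minimal sketch of the train-in-cloud, push-to-edge flow described above.
# The "model" is just a statistical threshold; all values are illustrative.
import json
import statistics

# "Cloud" side: train on historical telemetry (assumed sample values).
history = [10.2, 9.8, 10.5, 10.1, 30.7]
threshold = statistics.mean(history) + 2 * statistics.stdev(history)
artifact = json.dumps({"threshold": threshold})  # what gets pushed to the edge

# "Edge" side: load the pushed model and run inference locally.
edge_model = json.loads(artifact)

def is_anomaly(value: float) -> bool:
    return value > edge_model["threshold"]
```

The key point is the division of labor: the expensive, data-hungry training step runs where compute is plentiful, while only the small trained artifact travels to the device.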
As we have seen, edge data can and sometimes needs to be stored in the cloud. More often than not, the data is stored and analyzed on-premises. Products like IBM Cloud Pak® for Data claim that regardless of the deployment location, one can connect to data no matter where it lives. From a private cluster accessing data on the cloud to accessing data in an on-premises database, data can be cleansed, inferenced, modeled, analyzed and stored securely.
Given all the generated data, we can use what we want and discard everything else that is “useless” — but who decides what data is useless? Enterprises collect and store data because they think they need it or because compliance requirements force them to; enterprises in the financial services and healthcare domains are prime examples.
That brings us to the question: where do we store the data, and which database is best suited? The Database Journal lists over 100 databases. For an edge solution, a database should be small enough to run in a container yet robust enough to handle enterprise scale. Quite often, the requirement is for a NoSQL database. Many databases fit the bill, like Cloudant, CouchDB, Cassandra and even Db2. We picked Informix as the exemplar database; it is a component of the aforementioned IBM Cloud Pak for Data.
Data store for the edge
The Informix database engine, which is embeddable in edge devices, integrates time series, spatial, NoSQL, SQL and JSON data. Informix is viable in the following situations:
As an operational database that can support a rapidly changing data model
As lightweight, low-latency analytics integrated into an operational database
As a store for large amounts of data from edge/IoT devices or sensors
As a store that serves many different types of content
The latter two bullet points make it a good candidate for edge solutions. At one end of the spectrum, because of its small footprint (<100MB), it can be embedded in most edge devices, including ARM and Intel Quark devices with very low memory (<256MB). At the other end, Informix hosted in the cloud offers the same features as an on-premises deployment without the cost, complexity and risk of managing your own infrastructure. The figure below shows the breadth of Informix editions:
All Informix editions are available on-premises. That means it can be used as an edge operational database, at the edge gateway for limited analytics and also as the enterprise data repository in a hybrid cloud to store large amounts of data for AI model building and training.
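To make the "edge operational database" role concrete, here is a small sketch of what storing and querying device data in an embedded engine looks like. SQLite is used here purely as a stand-in for an embedded database such as Informix, since it ships with Python; the schema and queries are illustrative, not a recommended design.

```python
# Sketch of an embedded edge data store. SQLite stands in for an embedded
# engine like Informix; the schema and queries are illustrative only.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # on a device this would be a local file
conn.execute("""
    CREATE TABLE readings (
        device  TEXT NOT NULL,
        ts      INTEGER NOT NULL,   -- epoch seconds
        payload TEXT NOT NULL       -- JSON document, one per reading
    )
""")
rows = [
    ("cam-01", 1700000000, json.dumps({"motion": True, "frames": 12})),
    ("cam-01", 1700000060, json.dumps({"motion": False, "frames": 0})),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Time-series-style query: the latest reading for a given device.
row = conn.execute(
    "SELECT payload FROM readings WHERE device = ? ORDER BY ts DESC LIMIT 1",
    ("cam-01",),
).fetchone()
latest = json.loads(row[0])
```

An engine like Informix would add native time-series and JSON types on top of this pattern, so the same local store can serve both operational lookups and lightweight analytics.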
We have seen the need for edge data to be rapidly deployed, processed, stored and analyzed, while in other cases, the data needs to be stored for model building and training. Informix has the features and capabilities needed to help edge customers meet their GDPR (General Data Protection Regulation) responsibilities. To borrow a line from the State of the Edge paper referenced below, “the unprecedented convergence of AI, IoT, 5G, and Edge data centers makes it possible to assemble, store, and process vast amounts of data at the edge.”
This blog post talked about classifying edge data and explored data storage options. When deciding between an in-memory database, a NoSQL database, a cloud database or a hybrid model, usage and context are key.
Do let us know what you think.
Special thanks to Joe Pearson for reviewing the article, Peter Kohlmann for bringing IBM Cloud Pak for Data experience and Matt Trifiro for his insights.