Classifying edge data and analyzing potential storage options.

When we use edge computing to analyze Internet of Things (IoT) data, we face a deluge of data. Be it audio, video, sensor or telemetry data, every device emits data every second, and most of it is never stored or analyzed locally, nor transmitted northbound to enterprise data centers or public clouds. Storing and managing all that data costs money and time that enterprises do not have. So, what do we do with all that data? Enterprises can follow some basic guidelines to classify edge data and separate the data they need from the data they can discard.

This blog post will attempt to both classify edge data and look at storage options. We need to ask ourselves two fundamental questions:

  • Do we really need all that data?
  • Do we have to store all that data?

To make those questions easier to answer, we recommend classifying the data for efficient analysis and storage.

Classifying data

Data is classified in many different ways, often determined by a specific discipline or practice. In statistics, there are two types of data: quantitative and qualitative. Data science goes a step further and classifies data as nominal, ordinal, discrete and continuous. Database management systems classify data into three types: short-term data, long-term data and useless data. Enterprises manage security and compliance by classifying data into four types: public, internal-only, confidential and restricted.

There are even methods to classify data automatically based on its content and its context. Broadly, the data industry uses three methods of data classification: content-based, context-based and user-based.
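
As a rough illustration of how the three methods can work together, here is a minimal Python sketch that maps a record to one of the four security levels mentioned above. The email pattern, the `patient-monitor` source name, the `Record` fields and the "strictest level wins" rule are all assumptions made for this example, not part of any product or standard.

```python
import re
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Sensitivity(IntEnum):
    """The four enterprise security levels, ordered from least to most strict."""
    PUBLIC = 0
    INTERNAL_ONLY = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass
class Record:
    payload: str                                # raw content sent by a device
    source: str                                 # hypothetical context field: the producing app
    user_label: Optional[Sensitivity] = None    # optional label assigned by a person


# Hypothetical content rule: anything that looks like an email address is confidential.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Hypothetical context rule: data from these sources is always restricted.
RESTRICTED_SOURCES = {"patient-monitor"}


def classify(record: Record) -> Sensitivity:
    """Apply all three methods and return the strictest level that matched."""
    levels = []
    if EMAIL.search(record.payload):            # content-based
        levels.append(Sensitivity.CONFIDENTIAL)
    if record.source in RESTRICTED_SOURCES:     # context-based
        levels.append(Sensitivity.RESTRICTED)
    if record.user_label is not None:           # user-based
        levels.append(record.user_label)
    # Assumed default when nothing matched: treat the data as internal-only.
    return max(levels, default=Sensitivity.INTERNAL_ONLY)
```

For example, `classify(Record("hr=72", "patient-monitor"))` returns `Sensitivity.RESTRICTED` because the context rule matches, even though the payload itself looks harmless.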

Edge solution data classification

An edge solution has two types of data:

  • System data required to configure and run the system components.
  • User data that is generated by the devices that are part of the solution.

Figure 1: Edge components that deal with data.

System data, also referred to as operational data, is usually stored in small relational and/or NoSQL databases. The bulk of the data is User data that is generated or captured by the devices and transmitted via the edge solution components. This is the data that enterprises have to deal with, deciding what to store, where, and for how long. To help with those decisions, edge User data can be classified into the following categories, as shown below:

Figure 2: Edge data classifications.

From an edge perspective, a finer-grained classification of data isn’t practical because there are so many different types of devices, each sending data in a specific format and using different protocols to get it across, as described in a prior blog, “Analytics at the Edge.”

We know that inferencing is typically done by models running on the devices or by on-premises edge servers, while dataset generation and model training are often still performed in the cloud. Apart from data privacy and protection considerations, transporting or streaming data to the traditional cloud is expensive from both a latency and a cost perspective. Things are changing with on-premises cloud solutions like IBM Cloud Satellite and AWS Outposts. Then there are micro-modular data centers that are accessible within microseconds. No matter the environment, whether it is a harsh one or a disconnected one, data should be available and accessible at all times.

Data everywhere

A product like the IBM Edge Application Manager (IEAM) would be at the core of an edge solution. The hub or the platform would communicate with the devices, ingesting data and providing insights. As noted earlier, in the case of artificial intelligence (AI) and machine learning (ML) apps running on edge devices, data is inferenced at the source and may not even travel to the edge hub. An earlier blog — “Security at the Edge” — described IEAM components.

Like many edge platforms, IEAM uses databases for storing and managing System data. It uses two PostgreSQL databases, one for the Exchange and another for the AgBot (agreement bot), and a MongoDB NoSQL database that stores content for the Model Management Service, as shown in the figure below:

Figure 3: Databases used by IBM Edge Application Manager.

It is worth pointing out that IEAM implements a Control Plane and does not provide a Data Plane. Control Plane data is System data, while Data Plane data is User data. So, IEAM only manages System data. It is up to the applications that edge platforms deploy to manage the transmission of User data. See the reference to Eurotech’s ESF.

All Application Services running on the edge devices that are configured to communicate with the edge hub would be the source of all User data. That could be audio, visual, image, telemetry and/or streaming data. Additional data (e.g., historical data) that is needed for retraining of models could be injected or imported into the edge platform. All this is critical data because it is used for providing insights and analytics, and it may need to be stored. The resources available in the cloud play a critical role in ML model creation and training, especially for deep learning models. Once trained, the model is then pushed to the edge.

Data anywhere

As we have seen, edge data can and sometimes needs to be stored in the cloud. More often than not, the data is stored and analyzed on-premises. Products like IBM Cloud Pak® for Data claim that regardless of the deployment location, one can connect to data no matter where it lives. From a private cluster accessing data on the cloud to accessing data in an on-premises database, data can be cleansed, inferenced, modeled, analyzed and stored securely.

Given all the generated data, we can use what we want and throw away everything else that is “useless.” But who decides what data is useless? Enterprises collect and store data because they think they need it or because compliance requirements force them to. Enterprises in the financial services and healthcare domains are examples.
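
One way to make that decision explicit is to write it down as a rule. The sketch below is only an illustration: the `DataStream` flags and the three retention outcomes (borrowed from the short-term, long-term and useless categories mentioned earlier) are assumptions for this example, and a real policy would be set by the enterprise and its compliance teams.

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    LONG_TERM = "long-term"
    SHORT_TERM = "short-term"
    DISCARD = "discard"


@dataclass(frozen=True)
class DataStream:
    name: str
    compliance_required: bool = False   # e.g. regulated financial or healthcare data
    needed_for_training: bool = False   # kept to build or retrain models
    used_for_inference: bool = False    # only needed while producing insights


def retention_for(stream: DataStream) -> Retention:
    """Hypothetical rule: compliance and training needs outrank everything else."""
    if stream.compliance_required or stream.needed_for_training:
        return Retention.LONG_TERM
    if stream.used_for_inference:
        return Retention.SHORT_TERM
    return Retention.DISCARD
```

With a rule like this, a stream is discarded only when nobody has claimed a use for it, which turns "useless" into a recorded decision rather than a guess.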

That brings us to the questions of where to store the data and which database is best suited. The Database Journal lists over 100 databases. For an edge solution, a database should be small enough to run in a container and robust enough to handle enterprise scale. Quite often, the requirement is for a NoSQL database. There are many databases that fit the bill, like Cloudant, CouchDB, Cassandra and even Db2. We picked Informix, a component of the aforementioned IBM Cloud Pak for Data, as the exemplar database.

Data store for the edge

The Informix database engine, which is embeddable in edge devices, integrates time series, spatial, NoSQL, SQL and JSON data. Informix is viable in the following situations:

  • As an operational database that can support a rapidly changing data model
  • As an operational database with lightweight, low-latency analytics integrated into it
  • As a store for large amounts of data from Edge/IoT devices or sensors
  • As a store that serves many different types of content

The last two bullet points make it a good candidate for edge solutions. At one end of the spectrum, because of its small footprint (<100MB), it can be embedded in most edge devices, including ARM and Intel Quark devices with very little memory (<256MB). At the other end, Informix hosted in the cloud offers the same features as an on-premises deployment without the cost, complexity and risk of managing your own infrastructure. The figure below shows the breadth of Informix editions:

Figure 4: Informix as a data store from far edge to cloud.

All Informix editions are available on-premises. That means Informix can be used as an edge operational database, at the edge gateway for limited analytics, and as the enterprise data repository in a hybrid cloud to store large amounts of data for AI model building and training.
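
To give a feel for what an embedded edge data store does with time-stamped, loosely structured device data, here is a minimal Python sketch. It deliberately uses SQLite from the Python standard library as a stand-in, not Informix, and the table layout and function names are assumptions made for this example. The idea it illustrates is the same: one small embedded engine holding JSON payloads that can be queried by device and time window.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the embedded store and create the readings table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS readings (
            device_id TEXT    NOT NULL,
            ts        INTEGER NOT NULL,  -- Unix time in seconds
            payload   TEXT    NOT NULL,  -- JSON document; shape may vary per device
            PRIMARY KEY (device_id, ts)
        )
        """
    )
    return conn


def store_reading(conn: sqlite3.Connection, device_id: str, ts: int, payload: dict) -> None:
    """Persist one reading as a JSON document keyed by device and timestamp."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO readings (device_id, ts, payload) VALUES (?, ?, ?)",
            (device_id, ts, json.dumps(payload)),
        )


def readings_between(conn: sqlite3.Connection, device_id: str, start: int, end: int):
    """Return (ts, payload) pairs for one device with start <= ts < end, oldest first."""
    rows = conn.execute(
        "SELECT ts, payload FROM readings "
        "WHERE device_id = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (device_id, start, end),
    )
    return [(ts, json.loads(payload)) for ts, payload in rows]
```

Because the payload is stored as JSON text, two devices can send differently shaped readings into the same table, which is the "rapidly changing data model" situation described above. A purpose-built time series engine would add compression, aggregation and retention features that this sketch does not attempt.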

Wrap-up

We have seen the need for edge data to be rapidly deployed, processed, stored and analyzed, while in other cases, the data needs to be stored for model building and training. Informix has the features and capabilities needed to help edge customers meet their GDPR (General Data Protection Regulation) responsibilities. To borrow a line from the State of the Edge paper referenced below, “the unprecedented convergence of AI, IoT, 5G, and Edge data centers makes it possible to assemble, store, and process vast amounts of data at the edge.”

The IBM Cloud architecture center offers up many hybrid cloud and multicloud reference architectures, including AI frameworks. Look for the IBM Edge Computing reference architecture and the Data and AI architecture.

This blog post discussed classifying edge data and explored data storage options. When deciding between an in-memory database, a NoSQL database, a cloud-hosted database or a hybrid model, usage and context are key.

Do let us know what you think.

Special thanks to Joe Pearson for reviewing the article, Peter Kohlmann for bringing IBM Cloud Pak for Data experience and Matt Trifiro for his insights.
