Storage vendors are coming out with better and faster data storage options — we’ll look at some of these storage options as they pertain to data at the edge.

We have all heard about the massive amounts of data being generated at the edge by a plethora of devices. From videos to sensor data to posts to emails, it is estimated that 2.5 exabytes of data are produced each day. Those bytes need to be stored somewhere; otherwise, they are discarded, since many devices can store little or no data themselves.

Many are of the opinion that the data should be analyzed at the source and acted upon. While true, what about raw data, intermittent data or even data that needs to be stored for auditability? Whether raw or post-analysis, there is a definite need to store a lot of that data someplace.

A previous blog in the series — “Data at the Edge” — talked about classifying data at the edge and the different options to deal with it. Again, the variations are the result of all the different reasons for and types of data flowing through an edge topology. And, similar to computing resources, you will see storage resources distributed closer to the micro, metro and macro edge locations where data is generated and consumed. This past year, storage vendors have come out with better and faster data storage options. This blog post will look at some of these storage options as they pertain to data at the edge.

Please make sure to check out all the installments in this series of blog posts on edge computing:

Classifying edge data

From the blog post referenced earlier, there are two types of edge data: system data and user data. One might argue that logging data should also be considered. Suffice it to say, an edge solution should integrate with any existing logging service. Edge solutions may maintain logs on lower-capacity edge devices with tiny amounts of storage or, in many cases, no storage whatsoever. Indeed, there is a real danger that lower-capacity devices will quickly run out of local storage from logs and other data unless they are constantly monitored and managed or alternate logging storage is configured. This is why there is such a need for remote logging services. Our focus in this blog post is on user data, and Figure 1 shows how it is classified:

Figure 1: Edge data classification.

There are enterprise data-classification and storage products like IBM Spectrum Discover, Azure Data Catalog and AWS Glue that can help identify, tag and catalog data, but they are beyond the scope of this post. Nevertheless, classification of data is important for triaging, where decisions are made to keep, delete, archive or move the data to pre-processing.
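To make the triage step concrete, here is a minimal sketch in Python of how classified edge data might be routed to one of those four outcomes. The category names, fields and thresholds are purely illustrative assumptions, not part of any product mentioned above:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"                # retain locally for immediate use
    DELETE = "delete"            # discard (e.g., redundant sensor noise)
    PRE_PROCESS = "pre-process"  # filter/aggregate before forwarding
    ARCHIVE = "archive"          # ship to long-term storage for audit


@dataclass
class EdgeRecord:
    category: str               # e.g., "sensor", "video", "transaction"
    required_for_audit: bool
    useful_after_seconds: int   # how long the data stays relevant


def triage(record: EdgeRecord) -> Action:
    """Route an edge record to a storage action (illustrative rules only)."""
    if record.required_for_audit:
        return Action.ARCHIVE
    if record.useful_after_seconds == 0:
        return Action.DELETE
    if record.category == "video":
        return Action.PRE_PROCESS   # typically too large to forward raw
    return Action.KEEP


print(triage(EdgeRecord("sensor", False, 300)))   # Action.KEEP
```

In practice, rules like these would be driven by the classification metadata produced by the cataloging tools mentioned above rather than hard-coded values.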

Distributed data

More and more data will live in edge environments rather than in traditional data centers, for reasons ranging from data gravity to data sovereignty. Questions remain about how to handle remote data storage and how to improve data governance and compliance at the edge. Retrieving the right data at the right time is often an expensive and challenging proposition for data-storage firms.

The new way of storing large amounts of data in the cloud uses a distributed model. In the distributed model, instead of storing data in one location, data is replicated across multiple physical servers called nodes. These nodes can be located in the same region or even across continents. Distributed storage systems offer significant advantages over the centralized model and can provide any of the three types of storage: block, file and object.

There are many new storage options, but the old adage about storing data holds true even today — what type of data do you want to store, and how often do you want to access it? There are new startup companies like Anylog.co that are focusing on federated data querying. The current world of Kubernetes containers, a key enabler of edge architecture, effectively pushes storage toward disaggregated, object-based approaches. To that end, Ceph, MinIO object storage and Hadoop are three of the popular options:

  • Ceph is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage.
  • Hadoop from Apache is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
  • MinIO is an object storage server that is compatible with Amazon S3 and licensed under the Apache 2.0 License (see the sketch after this list).
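Because MinIO speaks the Amazon S3 API, a standard S3 client can store and retrieve objects from a MinIO server running at an edge location. Below is a minimal sketch using the boto3 library; the endpoint URL, credentials, bucket name and object key are placeholder values, not references to a real deployment:

```python
import boto3

# Point a standard S3 client at an S3-compatible MinIO endpoint at the edge.
# Endpoint and credentials below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.edge.example.com:9000",
    aws_access_key_id="EDGE_ACCESS_KEY",
    aws_secret_access_key="EDGE_SECRET_KEY",
)

s3.create_bucket(Bucket="sensor-readings")

# Store a batch of pre-processed sensor data as an object.
s3.put_object(
    Bucket="sensor-readings",
    Key="line-3/2023-09-01/batch-0001.json",
    Body=b'{"temperature": 21.4, "vibration": 0.02}',
)

# Read it back.
response = s3.get_object(Bucket="sensor-readings", Key="line-3/2023-09-01/batch-0001.json")
print(response["Body"].read())
```

The same client code works unchanged against public-cloud object storage, which is part of the appeal of object-based storage in a distributed topology.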

In the telco or network edge and enterprise edge domains, we are seeing technologies and products that facilitate the handling and storing of data. In the previous blog post, we mentioned IBM Cloud Satellite providing distributed cloud services; storage would be one example of such a service. Satellite locations operate in close proximity to the edge, where the data is generated, which satisfies low-latency and data-sovereignty requirements. And with cloud object storage or databases physically located on-premises, compliance and other security requirements are easier to implement:

Figure 2: Data storage points in a distributed cloud topology.

Figure 2 shows the five potential data storage points in an edge or distributed cloud topology. From the left, far edge devices may or may not have storage capability. Edge clusters or edge nodes should have a decent amount of storage capacity for quick analytics and inferencing.

The sweet spot is at the enterprise edge or the network edge, which could very well be satellite locations (in IBM Cloud Satellite parlance). Some of the products that are offered as appliances are mentioned in the next section. It would follow that the largest storage locations would reside in the cloud region within a hybrid cloud model.

Another option would be to have a data platform that stretches from public cloud to any on-premises location — or even edge locations. That might entail the use of software-defined storage (SDS), a storage architecture that separates storage software from its hardware, complemented by a physical storage appliance in an enterprise edge location. SDS is part of a larger ecosystem called hyperconverged infrastructure (loosely defined as software-defined everything), where the software layer is decoupled from the underlying hardware.

One such product is OpenShift Data Foundation (ODF). ODF is software-defined storage for containers engineered as the data and storage services for Red Hat OpenShift.
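As a rough illustration of how a containerized edge workload consumes storage from a platform like ODF, the sketch below uses the Kubernetes Python client to request a persistent volume claim. The namespace, claim name and storage class are assumptions for illustration; actual storage class names depend on how ODF is deployed in a given cluster:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes cluster access is configured).
config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="edge-analytics-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        # Storage class name is an example; ODF typically exposes Ceph-backed classes.
        storage_class_name="ocs-storagecluster-ceph-rbd",
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="edge-apps", body=pvc
)
```

The workload itself never needs to know whether the claim is ultimately backed by Ceph, a storage array or cloud block storage; that indirection is what makes SDS attractive in a distributed topology.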

Edge storage options

The cloud deployment model will also be dictated by the industry. Highly regulated industries and defense-related use cases will typically not allow the use of public cloud and will have restrictions on where sensitive data — like clinical records or banking transactions — is stored.

Some enterprises will choose to manage their own data, while others might opt for managed data storage services across a hybrid cloud topology. With managed storage services, the storage can be located onsite or offsite, depending on the company’s needs and the service provider it selects. Cloud storage is a type of managed storage.

There are new on-premises storage offerings like HPE GreenLake, IBM Spectrum Protect and IBM Spectrum Fusion that provide terabyte and petabyte storage capacities. Along with ease of use, they are particularly useful for advanced analytics, machine learning and artificial intelligence (AI) applications across the edge spectrum.

If we follow the distributed model, in a product like IBM Cloud Satellite, we find that there are three main storage consumption models:

  • Existing or new on-premises storage systems like NetApp, Dell EMC, IBM FlashSystem, IBM Spectrum Fusion or PureSystems.
  • Public-cloud-provided storage like AWS EBS, Azure File and IBM Cloud Object Storage.
  • Container-native storage (CNS) and software-defined storage (SDS) solutions like Red Hat OpenShift Data Foundation (ODF), Spectrum Fusion SDS, NetApp, Portworx, Robin.io and WekaFS.

This flexibility to utilize any storage type and natively support distributed cloud environments makes IBM Cloud Satellite a compelling storage option at the edge. Furthermore, using container-native storage solutions like Red Hat OpenShift Data Foundation in satellite locations gets you a consistent set of capabilities for workloads across on-premises, public cloud and edge devices.

While beyond the scope of this blog post, it is worth noting that IBM Cloud Satellite provides storage templates that simplify the management of storage resources across edge topologies. There are templates for many of the most popular storage products.

The end goal is to complete machine-learning model training and deliver data analytics as quickly as possible, at the place where the data is sourced or located. This can be on a shop floor, in an operating room, between connected vehicles, at the ATM (Automated Teller Machine) or in the store.

Lastly, how long data should be stored becomes more of a business decision. Some Cloud Service Providers (CSPs) offer storage tiers like hot, cool, archive, etc. These tiers carry minimum retention periods — anywhere from 30 days to 90+ days — and differ in how quickly the data can be accessed. Data storage ultimately boils down to capacity, which costs money. Hence, it is worth thinking about using a similar tiered paradigm for storing edge data and adding a duration dimension to the data classification graphic in Figure 1.
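As a sketch of what adding that duration dimension to the Figure 1 classification might look like, the snippet below maps hypothetical data classes to storage tiers and retention periods. The class names, tiers and day counts are illustrative assumptions; the right values are a business decision:

```python
from typing import NamedTuple


class StoragePolicy(NamedTuple):
    tier: str            # hot, cool or archive
    retention_days: int


# Illustrative mapping only; actual tiers and durations vary by business need.
POLICIES = {
    "real-time-telemetry": StoragePolicy(tier="hot", retention_days=30),
    "post-analysis-results": StoragePolicy(tier="cool", retention_days=90),
    "audit-records": StoragePolicy(tier="archive", retention_days=365),
}


def policy_for(data_class: str) -> StoragePolicy:
    """Look up the storage tier and retention period for a data class."""
    return POLICIES.get(data_class, StoragePolicy(tier="cool", retention_days=90))


print(policy_for("audit-records"))  # StoragePolicy(tier='archive', retention_days=365)
```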

Wrap-up

One line of thinking is installing micro-data center servers at remote locations to help replicate cloud/data services locally. That will improve performance and allow connected devices to act upon perishable data in milliseconds. Hence the need for very lightweight storage services that an edge computing platform should provide. As an example, IBM Cloud Pak® for Data can be wrapped as services and installed via a script by the IBM Edge Application Manager hub. These services would then be ready for deployment and consumption at the far edge.

This blog post reiterated the classification of edge data and explored data storage options. Concepts like data fabric, data governance, data rule sets, etc., aren’t addressed.

Data movement and data storage are key components of edge computing. We cannot assume storage on far edge devices; rather, it should be on edge clusters. Edge clusters supporting edge devices provide many important services, with storage being one of them. Those storage services are backed up by different physical storage options.

The IBM Cloud Architecture Center offers many hybrid and multicloud reference architectures, including edge computing and data frameworks. Look for the IBM edge computing reference architecture and the data architecture.

Special thanks to Sandy Amin, Joe Pearson and Frank Lee for reviewing the article.

Please make sure to check out all the installments in this series of blog posts on edge computing:
