Storage vendors are coming out with better and faster data storage options — we’ll look at some of these storage options as they pertain to data at the edge.
We have all heard about the massive amounts of data being generated at the edge by a plethora of devices. From videos to sensor data to posts to emails, it is estimated that 2.5 exabytes of data are produced each day. Those bytes need to be stored somewhere; otherwise they are discarded, since many devices can store little or no data.
Many are of the opinion that the data should be analyzed at the source and acted upon. That is true, but what about raw data, intermittent data or even data that needs to be stored for auditability? Whether raw or post-analysis, there is a definite need to store a lot of that data somewhere.
A previous blog post in the series, “Data at the Edge,” talked about classifying data at the edge and the different options to deal with it. Again, the variations are the result of all the different reasons and types of data flowing through an edge topology. And, similar to computing resources, you will hear about the distribution of storage resources closer to the micro, metro and macro edge locations where data is generated and consumed. This past year, storage vendors have come out with better and faster data storage options. This blog post will look at some of these storage options as they pertain to data at the edge.
Please make sure to check out all the installments in this series of blog posts on edge computing:
- Part 1: “Cloud at the edge”
- Part 2: “Rounding out the edges”
- Part 3: “Architecting at the edge”
- Part 4: “DevOps at the edge”
- Part 5: “Policies at the edge”
- Part 6: “Models deployed at the edge”
- Part 7: “Security at the edge”
- Part 8: “Analytics at the edge”
- Part 9: “5G at the edge”
- Part 10: “Clusters at the edge”
- Part 11: “Automation at the edge”
- Part 12: “Network slicing at the edge”
- Part 13: “Data at the edge”
- Part 14: “Architectural decisions at the edge”
- Part 15: “GitOps at the edge”
- Part 17: “Storage services at the edge”
- Part 18: “Cloud services at the edge”
- Part 19: “Distributed cloud: Empowerment at the edge”
- Part 20: “Data sovereignty at the edge”
- Part 21: “Solutioning at the edge”
- Part 22: “Connected products at the edge”
- Part 23: “Foundational models at the edge”
Classifying edge data
From the blog post referenced earlier, there are two types of edge data: system data and user data. One might argue that logging data should also be considered. Suffice it to say, an edge solution should integrate with any existing logging service. Edge solutions may maintain logs on lower-capacity edge devices with tiny amounts of storage or, in many cases, no storage whatsoever. Indeed, there is a real danger that lower-capacity devices will quickly fill their local storage with logs and other data unless these devices are constantly monitored and managed or alternate logging storage is configured. This is why there is such a need for remote logging services. Our focus in this blog post is on user data, and Figure 1 shows how it is classified:
There are enterprise data-classification and storage products like IBM Spectrum Discover, Azure Data Catalog and AWS Glue that can help to identify, tag and catalog data, but that is beyond the scope of this post. Nevertheless, classification of data is important for triaging, where decisions are made to keep, delete, move to pre-processing or archive.
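As a rough illustration of the triage step, the sketch below maps a classified record to one of the four outcomes named above. The `Record` fields and the order of the rules are assumptions made for this example only; real classification tags and policies will differ from one enterprise to the next.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    DELETE = "delete"
    PREPROCESS = "pre-process"
    ARCHIVE = "archive"


@dataclass
class Record:
    # Hypothetical tags that a classification step might attach to a record.
    name: str
    is_raw: bool            # not yet analyzed
    needed_for_audit: bool  # must be retained for auditability
    still_in_use: bool      # read by a local application


def triage(record: Record) -> Action:
    """Pick one of the four triage outcomes (assumed rule order)."""
    if record.needed_for_audit:
        return Action.ARCHIVE      # retention wins over everything else
    if record.is_raw:
        return Action.PREPROCESS   # raw data goes to pre-processing first
    if record.still_in_use:
        return Action.KEEP         # analyzed data that is still being read
    return Action.DELETE           # nothing requires this record anymore
```

Putting the audit check first is a deliberate choice in this sketch: a record that must be retained is never routed to deletion, whatever its other tags say.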
Distributed data
More and more data will live in edge environments rather than in traditional data centers due to reasons ranging from data gravity to data sovereignty. The questions remain on how to handle remote data storage and how to improve data governance and compliance at the edge. Retrieving the right data at the right time is often an expensive and challenging proposition for data-storage firms.
The new way of storing large amounts of data in the cloud uses a distributed model. In the distributed model, instead of storing data in one location, data is replicated across multiple physical servers called nodes. These nodes can be located in the same region or even across continents. Distributed storage systems offer significant advantages over the centralized model and can provide any of the three types of storage: block, file and object.
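To make the idea of spreading copies across nodes concrete, here is a minimal sketch that chooses which nodes hold an object. It uses rendezvous (highest-random-weight) hashing, picked here only because it is short and deterministic; it is not the placement algorithm of any particular product. The function name, node names and default of three copies are all assumptions.

```python
import hashlib


def place_replicas(key: str, nodes: list[str], copies: int = 3) -> list[str]:
    """Return the nodes that should hold a copy of the object named `key`."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")

    def score(node: str) -> int:
        # Hash the (key, node) pair; every client computes the same score.
        digest = hashlib.sha256(f"{key}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring nodes win (rendezvous hashing).
    return sorted(nodes, key=score, reverse=True)[:copies]


# Hypothetical node names spanning regions.
nodes = ["us-east-1", "us-east-2", "eu-west-1", "ap-south-1"]
replicas = place_replicas("sensor/2024/reading-001.json", nodes)
```

Because the choice depends only on the key and the node list, any client can work out where an object lives without consulting a central directory.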
There are many new storage options, but the old adage about storing data holds true even today: what type of data do you want to store, and how often do you want to access it? There are new startup companies like Anylog.co that are focusing on federated data querying. The current world of Kubernetes containers, which is a key enabler of edge architecture, effectively shifts the storage requirement to be disaggregated and object-based. To that end, Ceph storage, MinIO object storage and Hadoop are three of the popular options:
- Ceph is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage.
- Hadoop from Apache is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
- MinIO is an object storage server compatible with Amazon S3. Originally released under the Apache License 2.0, it has been licensed under GNU AGPLv3 since 2021.
In the telco or network edge and enterprise edge domains, we are seeing technologies and products that facilitate the handling and storing of data. In the previous blog post, we mentioned IBM Cloud Satellite providing distributed cloud services. Storage is one such service: satellite locations operate in close proximity to the edge, where the data is generated. This approach would satisfy low-latency and data-sovereignty requirements. And with the cloud object storage or databases being physically located on-premises, compliance and other security requirements are easier to implement:
Figure 2 shows the five potential data storage points in an edge or distributed cloud topology. From the left, far edge devices may or may not have storage capability. Edge clusters or edge nodes should have a decent amount of storage capacity for quick analytics and inferencing.
The sweet spot is at the enterprise edge or the network edge, which could very well be satellite locations (in IBM Cloud Satellite parlance). Some of the products that are offered as appliances are mentioned in the next section. It would follow that the largest storage locations would reside in the cloud region within a hybrid cloud model.
Another option would be to have a data platform that stretches from public cloud to any on-premises location, or even to edge locations. That might entail the use of software-defined storage (SDS), which is a storage architecture that separates storage software from its hardware, complemented by a physical storage appliance in an enterprise edge location. SDS is part of a larger ecosystem called hyperconverged infrastructure (loosely defined as software-defined everything), where all software is separated from all hardware.
One such product is OpenShift Data Foundation (ODF). ODF is software-defined storage for containers engineered as the data and storage services for Red Hat OpenShift.
Edge storage options
The cloud deployment model will also be dictated by the industry. Highly regulated industries and defense-related use cases will typically not allow the use of public cloud and will have restrictions on where sensitive data, such as clinical records or banking transactions, is stored.
Some enterprises will choose to manage their data, while others might opt for managed data storage services across a hybrid cloud topology. With managed storage services, the storage can be located onsite or offsite, depending on the company’s needs and the service provider they select. Cloud storage is a type of managed storage.
There are new on-premises storage appliances like HPE GreenLake, IBM Spectrum Protect and IBM Spectrum Fusion that offer terabyte and petabyte storage capacities. Along with ease of use, they are particularly useful for advanced analytics, machine learning and artificial intelligence (AI) applications across the edge spectrum.
If we follow the distributed model, in a product like IBM Cloud Satellite, we find that there are three main storage consumption models:
- Existing or new on-premises storage systems like NetApp, Dell EMC, IBM FlashSystem, IBM Spectrum Fusion or Pure Storage.
- Public-cloud-provided storage like Amazon EBS, Azure Files and IBM Cloud Object Storage.
- Container-native storage (CNS) and software-defined storage (SDS) solutions like Red Hat OpenShift Data Foundation (ODF), Spectrum Fusion SDS, NetApp, Portworx, Robin.io and WekaFS.
This flexibility to utilize any storage type and natively support distributed cloud environments makes IBM Cloud Satellite a compelling storage option at the edge. Furthermore, using container-native storage solutions like Red Hat OpenShift Data Foundation in satellite locations gets you a consistent set of capabilities for workloads across on-premises, public cloud and edge devices.
While beyond the scope of this blog post, it is worth noting that IBM Cloud Satellite has a concept of storage templates that simplify the management of storage resources across edge topologies. There are templates for many of the most popular storage products.
The end goal is to get machine-learning model training done and deliver data analytics where data is sourced or located as quickly as possible. This can be on a shop floor, in an operating room, between connected vehicles, at the ATM (Automated Teller Machine) or in the store.
Lastly, how long data should be stored becomes more of a business decision. Some Cloud Service Providers (CSPs) offer storage tiers like hot, cool, archive, etc. These tiers relate to storage duration — anywhere from 30 days to 90+ days. Data storage boils down to capacity, which costs money. Hence, it is worth thinking about using a similar tiered paradigm for storing edge data and adding a duration dimension to the data classification graphic in Figure 1.
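One way to picture that duration dimension is a small function that assigns a tier from the age of the data. The 30-day and 90-day boundaries come from the range mentioned above; mapping them to hot, cool and archive in exactly this way is an assumption for illustration, since each provider defines its own tiers and minimum retention periods.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(last_accessed: datetime, now: datetime) -> str:
    """Suggest a tier from how long ago the data was last accessed."""
    age = now - last_accessed
    if age < timedelta(days=30):
        return "hot"       # recent data stays close to the edge
    if age < timedelta(days=90):
        return "cool"      # older data can move to cheaper storage
    return "archive"       # 90+ days: long-term retention


# Hypothetical timestamps for a single sensor reading.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tier = storage_tier(datetime(2024, 4, 1, tzinfo=timezone.utc), now)
```

In the usage lines, the reading was last accessed 61 days before `now`, so it falls between the two boundaries and lands in the cool tier.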
Wrap-up
One line of thinking is installing micro-data center servers at remote locations to help replicate cloud/data services locally. That will improve performance and allow connected devices to act upon perishable data in milliseconds. This creates the need for very lightweight storage services, which an edge computing platform should provide. As an example, IBM Cloud Pak® for Data can be wrapped as services and installed via a script by the IBM Edge Application Manager hub. These services would then be ready for deployment and consumption at the far edge.
This blog post reiterated the classification of edge data and explored data storage options. Concepts like data fabric, data governance, data rule sets, etc., aren’t addressed.
Data movement and data storage are key components of edge computing. We cannot assume storage on far edge devices; rather, it should be on edge clusters. Edge clusters supporting edge devices provide many important services, with storage being one of them. Those storage services are backed by different physical storage options.
The IBM Cloud Architecture Center offers many hybrid and multicloud reference architectures, including edge computing and data frameworks. Look for the IBM edge computing reference architecture and the data architecture.
Special thanks to Sandy Amin, Joe Pearson and Frank Lee for reviewing the article.
Learn more
- IBM Edge Application Manager
- IBM Cloud Satellite
- IBM Cloud Pak for Data
- IBM Cloud Satellite Storage
- Red Hat Software-Defined Storage