Storage for the exabyte future

6 minute read | July 10, 2019

“There is no AI without IA (information architecture)” is a common phrase here at IBM. It describes the business and operations platform every business needs to connect and manage the lifecycle of its AI applications. Data scientists, analytics teams, and lines of business need access to the data that drives innovation, insight, and ultimately competitive advantage.

Data accessibility is at the heart of the IBM Storage focus on solutions for artificial intelligence (AI) and big data. It underscores the importance of this week’s announcements of numerous enhancements to IBM Spectrum Discover and IBM Cloud Object Storage, solutions intended to help transform the explosion of unstructured data into the fuel that drives 21st century business.

Storage administrators and data scientists often find that system metadata (the information they have about their data) doesn’t provide the view of storage consumption and data quality needed to manage data effectively for AI, big data, and analytics applications, workloads, and use cases. Basic system-level metadata can also be inadequate for data scientists, business analysts, and knowledge workers, who may spend up to 80 percent of their time finding and preparing data, leaving only 20 percent for actual data analysis.[1]

IBM Storage helps improve data science productivity across data pipelines with a portfolio designed and optimized to serve the unique requirements of different stages—from ingest to insights. IBM Spectrum Discover serves the important role of classifying and tagging (or labeling) data with custom metadata that not only makes it easier to find and recall data for analysis but may increase the value of data by imbuing it with additional semantics, such as which projects, departments or users have accessed particular data sets.
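
The tagging-and-recall idea described above can be illustrated with a minimal Python sketch. This is a hypothetical, simplified model (it does not use the actual IBM Spectrum Discover API): each catalog record pairs system metadata with user-defined tags, and a search matches on those tags. All names and paths here are invented for illustration.

```python
# Hypothetical sketch of custom metadata tagging and tag-based search.
# Not the IBM Spectrum Discover API; paths and tag names are invented.
catalog = [
    {"path": "/data/trials/scan_001.dcm", "size": 52_428_800, "tags": {}},
    {"path": "/data/trials/scan_002.dcm", "size": 41_943_040, "tags": {}},
]

def tag(record, **custom_tags):
    """Attach custom key/value tags (project, department, etc.) to a record."""
    record["tags"].update(custom_tags)

def find(catalog, **criteria):
    """Return records whose tags match every given key/value pair."""
    return [r for r in catalog
            if all(r["tags"].get(k) == v for k, v in criteria.items())]

# Label both scans with the project and department that use them.
for rec in catalog:
    tag(rec, project="oncology", department="radiology")

matches = find(catalog, project="oncology")
print(len(matches))  # 2
```

The point of the sketch is the added semantics: once data carries tags such as project or department, curation becomes a tag query rather than a crawl of raw file paths.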

AI in particular offers enormous opportunity, but it also brings very real challenges for both data scientists and IT infrastructure. Spending on storage systems and software to support AI initiatives is already approaching five billion dollars a year[2], yet enterprises report that dealing with data volumes and data quality, implementing effective advanced data management, and finding knowledge workers with the requisite skills is hampering AI adoption[3]. IBM Spectrum Discover helps organizations address these challenges by providing sophisticated metadata management for exabyte-scale unstructured data stores.

This week, IBM Storage announced significant enhancements to the already powerful IBM Spectrum Discover capabilities:

  • First, in a move that expands the universe of unstructured data that can be identified and classified by IBM Spectrum Discover, we’ve added support for heterogeneous storage with data connectors for S3- and NFSv3-compliant data sources, including Dell EMC Isilon, NetApp, Amazon Web Services S3, and Ceph, in addition to existing IBM Spectrum Scale and IBM Cloud Object Storage support.
  • Next, we’ve added new content-based data classification and search functionality. IBM Spectrum Discover can now examine the content of over a thousand different file types, enabling users to apply custom metadata tags based on keywords found in the content. This new capability enables subsequent low-latency, highly granular searches across billions of files to speed data curation and preparation.
  • Finally, IBM Spectrum Discover helps organizations work toward compliance with GDPR and other regulations by automatically detecting and labeling certain types of sensitive and personally identifiable information, such as Social Security numbers, credit card numbers, and many other patterns. Users can also define custom patterns to detect and label data based on criteria unique to their own business or domain.
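
To make the sensitive-data detection concrete, here is an illustrative Python sketch of pattern-based labeling. These regexes are simplified examples of the kind of patterns such a tool might apply, not IBM’s actual detectors; production systems add validation (for example, Luhn checks on card numbers) to reduce false positives.

```python
import re

# Illustrative patterns only: simplified stand-ins for the kind of
# sensitive-data detection described above, not IBM's implementation.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def label_sensitive(text):
    """Return the set of sensitive-data labels whose patterns match text."""
    return {name for name, pat in PATTERNS.items() if pat.search(text)}

print(label_sensitive("SSN 123-45-6789 on file"))   # {'ssn'}
print(label_sensitive("card 4111 1111 1111 1111"))  # {'credit_card'}
```

A user-defined pattern, in this model, is just another named entry in the dictionary, which mirrors the blog’s point that custom business- or domain-specific criteria can sit alongside the built-in ones.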

“IBM Spectrum Discover is part of the new breed of software-defined solutions that can help our university deploy effective AI-driven applications,” notes Kevin Shinpaugh, Director of IT, HPC, and Computing Services at the Biocomplexity Institute of Virginia Tech. “We welcome the announcement this week of support for additional data sources and enhanced metadata search and tagging functionality. We look forward to making IBM Spectrum Discover a foundational element of our AI-driven research workflows.”

Enhancements to IBM Spectrum Discover aren’t the only big news for AI and Big Data; this week IBM Storage is also announcing an entirely new generation of IBM Cloud Object Storage solutions. This well-known object storage system is a leader in its industry sector[4] and offers many advantages over other types of storage:

  • IBM Cloud Object Storage delivers 1.6x more write operations per second[5] and costs 30 percent less than it did a year ago[6], and it can now store 10.2PB in a single rack and 1.27PB in a single node.
  • It can scale in a mixed environment with previous generations of Cloud Object Storage without a forklift upgrade or loss of access to data. It can be deployed with 1EB of storage in a single cluster, and multiple customers already run over 1.1EB of capacity.
  • The system can distribute data across one site or many while maintaining a single, geo-dispersed copy designed for eight nines of availability, and it has been recognized as the single largest on-premises object storage deployment in the world, with more large deployments than any other vendor.[7]
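
“Eight nines” (99.999999 percent) of availability translates into a strikingly small downtime budget. A quick back-of-envelope calculation:

```python
# Back-of-envelope: expected downtime per year at "eight nines"
# (99.999999%) availability.
availability = 0.99999999
seconds_per_year = 365 * 24 * 60 * 60        # 31,536,000 seconds
downtime_s = seconds_per_year * (1 - availability)
print(f"{downtime_s:.3f} seconds/year")      # 0.315 seconds/year
```

In other words, at that design point the system’s expected unavailability works out to roughly a third of a second per year.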

This week, IBM Storage is introducing a new generation of object storage system called IBM Cloud Object Storage Generation 2, or Gen2 for short. Designed around a cloud native S3 interface, the new architecture provides a modern solution for building hybrid cloud environments. The Gen2 systems consist of three new storage capacity enclosures, along with updated Gen2 accessor nodes for increased performance and an updated Gen2 manager that consolidates administration in a simple-to-use solution. The new solutions are designed to scale up easily for additional efficiency, as well as to scale out massively to an exabyte or more of data.

“As data continues to grow and more and more enterprises move to massively scalable storage like object storage, efficiency, density, and cost become very critical components,” notes Lynda Stadtmueller, Vice President of Cloud Computing Services at Frost & Sullivan. “IBM is on the right track with its new Gen2 systems for IBM Cloud Object Storage. Businesses already trust IBM Cloud Object Storage to provide the capacity, accessibility, and scale their data-intensive applications require. With the additional storage efficiencies promised by the Gen2 systems, customers will realize even greater value.”

IBM Cloud Object Storage is a great repository for secondary storage such as backup and archive or remote file services, but some of the fastest growing use cases now involve AI and big data workloads. With the capability to centralize PBs of data with efficiency, we expect the movement of data to object storage and, specifically, IBM Cloud Object Storage to accelerate in many more data centers worldwide.

When your company looks toward the future and sees petabytes and even exabytes of data–filled with business advantage and IT challenges–it’s time to look at IBM Spectrum Discover and IBM Cloud Object Storage.

Read the blog post “IBM drives innovation in storage for AI and big data, modern data protection, and hybrid multicloud” for more information about these and other announcements from IBM Storage.

[1] InfoWorld: The 80/20 data science dilemma, September 2017

[2] IDC: Worldwide Storage for Cognitive/AI Workloads Forecast, 2018–2022, April 2018

[3] IDC: Cognitive, ML, and AI Workloads Infrastructure Market Survey, January 2018

[4] Gartner: 2018 Gartner Magic Quadrant for Distributed File Systems and Object Storage

[5] Comparing IBM COS Gen 2 SJ12 to IBM COS Gen 1 2548 in IBM internal testing

[6] Comparing $/usable TB of standard configuration Express Bundles of the same usable terabytes available in June 2018 v. in July 2019

[7] Gartner: Critical Capabilities for Object Storage, January 2019