Meeting the data needs of artificial intelligence


Artificial intelligence (AI) is playing an increasingly critical role in business. According to one report, by 2020, 30 percent of organizations that fail to apply AI will not be operationally and economically viable[1]. And in a survey, 91 percent of infrastructure and operations leaders cited “data” as a main inhibitor of AI initiatives[2]. What does a data professional need to know about AI and its data requirements in order to support their organization’s AI efforts?

Many factors have converged in recent years to make AI viable, including the growth of processing power and advances in AI techniques, notably in deep learning (DL). Unlike traditional programming, in which a programmer gives the computer each step it must take to accomplish a task, deep learning requires the computer to learn for itself. In visual object recognition, for example, there is no way to program a computer with the steps needed to recognize a given object, which may appear in different locations, at different angles, in different lighting conditions, perhaps partially obscured by another object, and so forth. Instead, the computer is trained on thousands of example images containing the object until it can consistently recognize it.
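The contrast between explicit programming and learning from examples can be sketched in miniature. The toy below is not object recognition and not any IBM product code; it is a hypothetical single-neuron (perceptron) classifier that, instead of being hand-coded with a rule, learns one from labeled examples:

```python
# Toy illustration: rather than hand-coding a decision rule, the
# program learns one from labeled examples. A simple perceptron
# learns to label points as 1 when x2 > x1, 0 otherwise.

def train_perceptron(examples, epochs=20, lr=0.1):
    """Learn weights and bias from (features, label) pairs; labels are 0 or 1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            pred = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            err = label - pred          # 0 when the prediction is correct
            w[0] += lr * err * x1       # nudge weights toward the label
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, x1, x2):
    """Apply the learned rule to a new point."""
    return 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0

# These six labeled points stand in for the thousands of training
# images a real deep learning system would need.
examples = [((0.0, 1.0), 1), ((1.0, 0.0), 0),
            ((0.2, 0.9), 1), ((0.9, 0.2), 0),
            ((0.1, 0.5), 1), ((0.5, 0.1), 0)]

w, b = train_perceptron(examples)
```

A real deep learning model differs in scale (millions of parameters, many layers, vastly more data) but follows the same principle: the behavior comes from the training examples, not from explicitly programmed steps.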

This kind of training requires a lot of data. One recommendation is to start with at least 100,000 examples, and each example can be large: an image or a voice recording, for instance. Different stages in the training and deployment of a deep learning system have different data and processing requirements. The training stage may need to process years of accumulated data and can take weeks or even months to complete. By contrast, once deployed, the system may need to respond in seconds.

Given the data volumes involved, storage capacity is clearly an important consideration during the training stage. The data may also be stored in different formats on different systems, so multi-protocol capability may be needed, and it may be geographically dispersed, an additional factor the storage system must handle. Once the system is deployed, fast access to the data becomes particularly important to meet the response-time requirements of users and applications.

A system such as IBM Spectrum Scale is well suited to these requirements. It is a high-performance system that can scale out to handle petabytes or even exabytes of data, and it supports a wide variety of protocols for accessing files and objects. For Hadoop applications, it provides direct access to data without the usual requirement of copying the data into HDFS. Avoiding the overhead of copying data between systems lowers costs by saving space and also speeds time to results.

IBM Spectrum Scale is a software-defined solution that can be deployed on a customer’s choice of platform, or delivered as a complete solution in the form of IBM Elastic Storage Server (ESS). The capacity and performance capabilities of IBM Spectrum Scale and ESS are well illustrated by the US Department of Energy’s CORAL project, currently on track to build the world’s fastest supercomputer. ESS will provide the 250 PB of storage the system requires, with performance requirements that include 2.5 TB/second single-stream IOR bandwidth and the creation of 2.6 million 32K files per second.

IBM Spectrum Scale and IBM Elastic Storage Server undergo constant improvement. The latest version of IBM Spectrum Scale includes enhancements to the installation and upgrade process, the GUI, and system health capabilities, along with file audit logging enhancements and scalability and performance tuning that extends Transparent Cloud Tiering to one billion files.

Meanwhile, ESS now offers models incorporating the performance improvements of IBM Spectrum Scale version 5.0, designed to meet the requirements of the CORAL supercomputer. ESS is also introducing its first hybrid models, combining flash and disk storage in a single unit for improved handling of different kinds of data, such as video and analytics, within a single environment.

Constant improvements, along with decades of experience in the most challenging customer environments, ensure that IBM Spectrum Scale and IBM Elastic Storage Server will continue to lead the way in managing the data that is a key element in the success of any deep learning project. Visit our website to learn more about IBM Spectrum Scale and IBM Elastic Storage Server.

[1] Gartner Predicts 2018: Compute Infrastructure 

[2] Gartner AI State of The Market – and Where HPC intersects
