Building your AI data pipeline

5 minute read | December 10, 2018

IBM Research, IBM Flash, Data Protection

Artificial intelligence, the erstwhile fascination of sci-fi aficionados and the perennial holy grail of computer scientists, is now ubiquitous in the lexicon of business. It is now more a modern business imperative than fiction, and the world is moving fast toward AI adoption.

According to Forrester Research, AI adoption is ramping up: 63 percent[1] of business technology decision makers are implementing, have implemented, or are expanding their use of AI.

The stakes are high. AI promises to help businesses accurately predict changing market dynamics, improve the quality of offerings, increase efficiency, enrich customer experiences and reduce organizational risk by making business, processes and products more intelligent. Such competitive benefits present a compelling enticement to adopt AI sooner rather than later.

AI is finding its way into all manner of applications from AI-driven recommendations, to autonomous vehicles, virtual assistants, predictive analytics and products that adapt to the needs and preferences of users. But as many and varied as AI-enabled applications are, they all share an essentially common objective at their core—to ingest data from many sources and derive actionable insights or intelligence from it.

AI data pipeline

Still, for all the promise AI holds to accelerate innovation, increase business agility, improve customer experiences and deliver a host of other benefits, some companies are adopting it faster than others. For some, there is uncertainty because AI seems too complicated, and getting from here to there (or, more specifically, from ingest to insights) may seem too daunting a challenge. That may be because no other business or IT initiative promises more in terms of outcomes or is more demanding of the infrastructure on which it runs.

AI done well looks simple from the outside in. Hidden from view behind every great AI-enabled application, however, lies a data pipeline that moves data, the fundamental building block of artificial intelligence, from ingest through several stages of data classification, transformation, analytics, machine learning and deep learning model training, and retraining through inference to yield increasingly accurate decisions or insights.

The AI data pipeline is neither linear nor fixed, and even to informed observers, it can seem that production-grade AI is messy and difficult. And as organizations move from experimentation and prototyping to deploying AI in production, their first challenge is to embed AI into their existing analytics data pipeline and build a data pipeline that can leverage existing data repositories. In the face of this imperative, concerns about integration complexity may loom as one of the greatest challenges to adoption of AI in their organizations. But it doesn’t have to be so.


Different stages of the data pipeline exhibit unique I/O characteristics and benefit from complementary storage infrastructure. For example, ingest, or data collection, benefits from the flexibility of software-defined storage at the edge and demands high throughput. The data classification and transformation stages, which involve aggregating, normalizing and classifying data and enriching it with useful metadata, require extremely high throughput with both small and large I/O. Model training requires a performance tier that can support the highly parallel processes involved in training machine learning and deep learning models, with extremely high throughput and low latency.

Retraining of models with inference doesn’t require as much throughput but still demands extremely low latency. Archive demands a highly scalable capacity tier for cold and active archive data, one that is throughput-oriented and supports large-I/O, streaming, sequential writes. Any of these stages may run on premises or in private or public clouds, depending on requirements.
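To make the stage-by-stage requirements above easier to compare, here is a minimal Python sketch that records them as data. It is only an illustration: the names `StageProfile`, `PIPELINE` and `latency_sensitive_stages` are hypothetical, and the labels are qualitative summaries of the descriptions above, not measured values or product specifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StageProfile:
    """Qualitative storage needs of one pipeline stage (illustrative only)."""
    throughput: str          # label paraphrased from the article
    latency: Optional[str]   # None where the article does not mention latency
    io_notes: str            # other I/O or deployment characteristics


# Hypothetical summary of the stages described in the article.
PIPELINE: Dict[str, StageProfile] = {
    "ingest": StageProfile(
        "high", None, "software-defined storage at the edge"),
    "classify_transform": StageProfile(
        "extremely high", None, "both small and large I/O"),
    "train": StageProfile(
        "extremely high", "low", "highly parallel processes"),
    "retrain_inference": StageProfile(
        "lower than training", "extremely low", "not described"),
    "archive": StageProfile(
        "throughput-oriented", None,
        "large I/O, streaming, sequential writes; scalable capacity"),
}


def latency_sensitive_stages(pipeline: Dict[str, StageProfile]) -> List[str]:
    """Return the stages for which a latency requirement is stated."""
    return [name for name, p in pipeline.items() if p.latency is not None]


if __name__ == "__main__":
    for name, profile in PIPELINE.items():
        latency = profile.latency or "not stated"
        print(f"{name}: throughput={profile.throughput}, "
              f"latency={latency}, notes={profile.io_notes}")
    print("Latency-sensitive:", latency_sensitive_stages(PIPELINE))
```

Laid out this way, it is easy to see that only the training and retraining stages carry a stated latency requirement, which is one reason a single storage tier rarely fits the whole pipeline.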

These varying requirements for scalability, performance, deployment flexibility, and interoperability are a tall order. But data science productivity depends on the efficacy of the overall data pipeline, not just on the performance of the infrastructure that hosts the ML/DL workloads. That productivity requires a portfolio of software and system technologies that can satisfy these requirements along the entire data pipeline.

Many vendors are racing to answer the call for high-performance ML/DL infrastructure. IBM does more by offering a portfolio of sufficient breadth to address the varied needs at every stage of the AI data pipeline, from ingest to insights. Its comprehensive portfolio of software-defined storage products enables customers to build or enhance their data pipelines with capabilities and cost characteristics that are optimal for each stage, bringing performance, agility and efficiency to the entire data pipeline.

  • IBM Cloud Object Storage provides geographically dispersed object repositories that support global ingest, transient storage and cloud archive of object data.
  • Multiple IBM Elastic Storage Server models powered by IBM Spectrum Scale deliver high-performance file storage and scalable common data lakes.
  • IBM Spectrum Discover automatically indexes and helps manage metadata and operationalizes data preparation tasks to speed classification and curation of data.
  • IBM Spectrum Archive enables direct file access to data stored on tape for cost-effective and highly scalable active archives.

IBM Storage is a proven AI performance leader with top benchmarks on common AI workloads, tested data throughput that is several times greater than the competition, and sustained random read of over 90GB/s in a single rack. Add to that unmatched scalability already deployed for AI workloads—Summit and Sierra, the #1 and #2 fastest supercomputers in the world with 2.5TB/s of data throughput to feed data-hungry GPUs—and multiple installations of more than an exabyte and billions of objects and files, and IBM emerges as a clear leader in AI performance and scalability.

Continual innovation from IBM Storage gets clients to insights faster with industry-leading performance plus hybrid and multicloud support that spans public clouds, private cloud, and the latest in containers. With well-tested reference architectures already in production, IBM solutions for AI are real-world ready.

Customers who take an end-to-end data pipeline view when choosing storage technologies can benefit from higher performance, easier data sharing and integrated data management. The result is improved data governance and faster time to insight.

Learn more in the IBM Systems Reference Architecture for AI and in the IDC Technology Spotlight: Accelerating and Operationalizing AI Deployments using AI-Optimized Infrastructure.

[1] Forrester Infographic: Business-Aligned Tech Decision Makers Drive Enterprise AI Adoption, January 2018