IBM Systems Lab Services

Optimizing data lake infrastructure

In today’s world, data is the new oil, and there’s a great need to preserve that data for exploration and to derive value. A “data lake” acts as a repository that consolidates an organization’s data into a governed and well-managed environment that supports both analytics and production workloads. It embraces multiple data platforms, such as relational data warehouses, Apache Hadoop clusters and analytical appliances, and manages them together.

Most companies aspire to become more data-driven, but many organizations are struggling to deliver on their strategy. This is primarily because of how data has been traditionally managed, with several point-to-point connections, new interfaces built over time as quick solutions, and a large amount of data that needs to be stored and moved across these systems. Addressing these challenges requires innovative solutions.

In this blog post, we’ll talk about infrastructure solutions from IBM Spectrum Scale and IBM Power Systems that can help you address these challenges and optimize your data lake infrastructure.

The challenge with traditional Hadoop

Hadoop technology is the basis for many data lake solutions. Traditional Hadoop uses a shared-nothing architecture in which nodes with direct-attached disks form a cluster. The compute from each node runs Hadoop jobs, and the storage forms a Hadoop Distributed File System (HDFS). The YARN resource manager reschedules tasks in the case of node failure, and HDFS maintains redundancy by typically retaining three copies of each block across nodes. If you need more storage capacity, irrespective of compute capacity requirements, you must add whole nodes. This architecture of directly attaching storage to compute typically results in underutilized compute farms and makes it difficult to scale compute and storage independently.
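
To make that coupling concrete, here is a back-of-the-envelope sizing sketch. All of the node sizes and workload figures below are illustrative assumptions, not measurements: in a shared-nothing cluster, the node count must satisfy whichever of compute or storage demands more nodes, leaving the other resource idle.

```python
# Illustrative sizing for a shared-nothing Hadoop cluster.
# All figures below are hypothetical assumptions for the sketch.

REPLICATION = 3          # HDFS default: three copies of every block
NODE_RAW_TB = 48         # raw disk per node (assumed)
NODE_CORES = 32          # cores per node (assumed)

usable_data_tb = 2000    # data the lake must hold
cores_needed = 400       # cores the job mix actually needs

# Three-way replication triples the raw capacity requirement.
raw_needed_tb = usable_data_tb * REPLICATION
nodes_for_storage = -(-raw_needed_tb // NODE_RAW_TB)  # ceiling division
nodes_for_compute = -(-cores_needed // NODE_CORES)

# Shared-nothing: one node count must satisfy both dimensions.
cluster_nodes = max(nodes_for_storage, nodes_for_compute)
idle_cores = cluster_nodes * NODE_CORES - cores_needed

print(f"nodes needed for storage: {nodes_for_storage}")  # 125
print(f"nodes needed for compute: {nodes_for_compute}")  # 13
print(f"cluster size:             {cluster_nodes}")      # 125
print(f"idle cores:               {idle_cores}")         # 3600
```

Under these assumptions the storage requirement forces a 125-node cluster while the workload needs only 13 nodes' worth of compute, which is exactly the underutilization that independent scaling avoids.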

IBM Spectrum Scale offers the capability to independently scale compute and storage. Here’s how it works:

  • IBM Spectrum Scale emulates the HDFS API through its HDFS Transparency connector, allowing Hadoop ecosystem applications to run seamlessly without any application-level code changes. Its software-defined storage enables an efficient shared storage model, so Hadoop workloads can scale storage independently of compute. It can be deployed either as a horizontally scalable appliance (IBM Elastic Storage Server) or as a customized IBM Spectrum Scale implementation with varied storage servers in the back end.
  • IBM Elastic Storage Server includes RAID software, eliminating the need for three-way replication and reducing raw storage capacity requirements by up to 60 percent compared with traditional Hadoop architecture.
  • IBM Spectrum Scale's storage tiering capability seamlessly moves data to the right storage tier.
  • IBM Spectrum Scale is certified with the popular Hadoop distribution from Hortonworks and is compatible with the Hortonworks Data Platform (HDP) and Hortonworks Data Flow (HDF) products.
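
A quick calculation shows where a saving of that magnitude can come from. The 8+2 erasure-code geometry below is an illustrative assumption (actual stripe widths vary by configuration); the point is the overhead ratio of parity-protected storage versus three full copies.

```python
# Raw capacity needed for 1000 TB of usable data under two
# protection schemes. The 8+2 geometry (8 data strips + 2 parity
# strips) is an illustrative assumption for this sketch.

usable_tb = 1000

# Traditional HDFS: three full copies of every block -> 3x raw.
raw_replication = usable_tb * 3

# Erasure coding: (8 + 2) / 8 = 1.25x raw.
raw_erasure = usable_tb * (8 + 2) / 8

savings = 1 - raw_erasure / raw_replication
print(f"raw with 3x replication: {raw_replication} TB")    # 3000 TB
print(f"raw with 8+2 erasure:    {raw_erasure:.0f} TB")    # 1250 TB
print(f"raw capacity saved:      {savings:.0%}")           # 58%
```

Roughly 58 percent less raw capacity for the same usable data, consistent with the "up to 60 percent" figure above.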

Simplified infrastructure for data movement 

Organizations typically need to maintain separate copies of the same data for traditional and analytics applications due to disparate storage technologies and interfaces. This results in additional storage requirements and time spent copying data across systems. What’s desired is a unified storage layer that can virtualize storage across different technologies and support different interfaces such as a Unix file system, HDFS, Object interface and so forth.

IBM Spectrum Scale can virtualize different storage technologies, such as flash, SSD, spinning disk and tape, into a unified file system namespace with advanced tiering and replication capabilities. Its comprehensive support for data access protocols and APIs, including NFS, SMB, Object, POSIX file system and HDFS, enables in-place analytics that simplify the infrastructure and minimize data movement.

As an example, one could host common data on the IBM Spectrum Scale file system, mount it as a POSIX file system on an RDBMS server and access the same data from Hadoop through the HDFS API. With this approach, in-place analytics brings the analytics to where the data resides, rather than moving data around, reducing the infrastructure footprint and speeding data ingest.
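
The in-place pattern can be sketched in a few lines. Here a temporary directory stands in for a Spectrum Scale mount point such as /gpfs/datalake (a hypothetical path): the same file a Hadoop job would read through the HDFS Transparency connector is read directly through ordinary POSIX file I/O, with no copy step in between.

```python
import tempfile
from pathlib import Path

# A temp dir stands in for a Spectrum Scale POSIX mount (e.g.
# /gpfs/datalake, a hypothetical path). On a real cluster the RDBMS
# or ingest pipeline would write here directly.
with tempfile.TemporaryDirectory() as mount:
    data_file = Path(mount) / "sales" / "2019" / "orders.csv"
    data_file.parent.mkdir(parents=True)
    data_file.write_text("order_id,amount\n1001,250.00\n")

    # POSIX access: any application reads the file in place.
    rows = data_file.read_text().splitlines()
    print(rows[1])  # 1001,250.00

    # The very same file would be visible to Hadoop jobs through the
    # HDFS API (e.g. `hdfs dfs -cat /sales/2019/orders.csv` via the
    # Transparency connector) -- no second copy of the data is made.
```

The design point is that the file system, not an export/import job, is the integration layer between the transactional and analytics worlds.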

Lack of cognitive infrastructure poses roadblocks

An organization’s existing infrastructure can pose roadblocks to scalability and efficiency. This new era of data requires a different approach to computing. Moore’s law no longer delivers the processor technology innovation needed to keep up with the demands of big data, and traditional infrastructure cannot handle high data volumes or process compute-intensive AI algorithms to derive insights. Building cognitive systems requires a collaborative approach, with innovation at various layers: the processor/chip, system boards, I/O interconnects, accelerators, I/O adapters and software.

IBM Power Systems addresses various requirements for such workloads with cognitive systems. Here are some key highlights:

  • The OpenPOWER Foundation brings together collaborators across the industry to optimize and innovate on the Power processor and system platform, building custom systems for large-scale data centers and evolving from compute systems to cognitive systems
  • High-speed PCIe Gen4 interconnect on POWER9 systems, among the first in the industry, moves data efficiently across the system
  • The innovative Coherent Accelerator Processor Interface (CAPI) removes the overhead and complexity of the I/O subsystem, allowing an accelerator (such as an FPGA) to operate as an extension of an application
  • High-speed NVIDIA NVLink connectivity (a total of 600 GB/s) between processor and GPU accelerators caters to high-performance deep learning AI workloads
  • Workload-optimized processor and server architecture, with separate server lineups for scale-up and scale-out workloads, supporting the AIX, IBM i and Linux operating systems
  • Superior price-performance over competitor platforms, resulting in lower infrastructure costs

IBM Spectrum Scale and IBM Power Systems technologies bring the innovation needed to make data lake infrastructure optimized and efficient for new-age workload requirements.

With constant pressure to optimize IT costs, large investments such as data lakes are under scrutiny for alternate architectures and optimization options. Building data lakes requires an end-to-end approach from infrastructure to software stack. IBM Spectrum Scale and Power Systems provide a strong infrastructure alternative and choice to clients looking at constructing and optimizing their data lake solution.

IBM Systems Lab Services offers a wide range of services on Cognitive Solutions. If you’re interested in talking to Lab Services about your data lake infrastructure optimization, contact us today.
