March 28, 2023 | By Hebert W. Pereyra and Jonathan Sloan | 3 min read

The real-world challenges organizations face with big data today are multi-faceted. Every IT organization knows there is more raw data than ever before; data sources sit in multiple locations, in varying forms and of questionable quality. Adding to the complexity, both the business users of data and the use cases for it have become more varied. The same data used for decision support and business intelligence might also be used to develop machine learning (ML) models. Semi-structured data, such as JSON documents and IoT log files, might need to be combined with transactional data to get a complete picture of a customer's buying experience, while emails and social media content might need to be interpreted to gauge customer sentiment and enrich operational decisions, ML models or decision-support applications. Choosing the right technology for your organization can help solve many of these challenges.
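To make the idea concrete, semi-structured event data can be joined with structured transactional records once both share a common key. Here is a minimal sketch in Python; the field names, events and records are entirely hypothetical and stand in for real IoT or clickstream feeds:

```python
import json

# Hypothetical semi-structured events, arriving as JSON strings.
raw_events = [
    '{"customer_id": 101, "event": "viewed_product", "sentiment": "positive"}',
    '{"customer_id": 102, "event": "abandoned_cart", "sentiment": "negative"}',
]

# Hypothetical structured transactional records.
transactions = [
    {"customer_id": 101, "order_total": 59.99},
    {"customer_id": 103, "order_total": 12.50},
]

# Parse the JSON events and index them by customer for the join.
events_by_customer = {e["customer_id"]: e for e in (json.loads(r) for r in raw_events)}

# Enrich each transaction with behavioral context where it exists.
enriched = [
    {**txn, "last_event": events_by_customer.get(txn["customer_id"], {}).get("event")}
    for txn in transactions
]

for row in enriched:
    print(row)
```

At production scale this same join happens inside a query engine or pipeline rather than in application code, but the shape of the problem, flattening semi-structured data to a key and enriching transactional rows with it, is the same.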

Addressing these data integration issues might appear relatively easy: land all enterprise data in a centralized data lake and process it there from start to finish. But that is too simplistic, because real-time data must simultaneously be processed for decision-support access, and the curated data inputs often reside in a data warehouse. Keeping copies of data synchronized across the physical platforms supporting Hadoop-based data lakes, data warehouses and data marts can be challenging.

Warehouses are known for high-performance processing of terabytes of structured data for business intelligence, but they can quickly become expensive for new, evolving workloads. When it comes to price-performance, the reality is that organizations are running data engineering pipelines and ML model-building workflows in data warehouses that are not optimized for the scale or shape of these workloads, degrading pipeline performance and driving up costs. It is this web of dependencies, requiring continuous movement of interdependent data sets across platforms, that makes these challenges so hard to solve.


Rethinking data analytics architecture

Software architects at vendors understand these challenges, and several companies have tried to address them in their own way. New workload requirements led to new functionality being bolted onto software platforms that were never optimized for those workloads, reducing efficiency and deepening data silos within many organizations. In addition, each platform must hold its own overlapping copies of data, creating data management problems (governance, privacy and security) and higher storage costs.

For these reasons, the limitations of the traditional data warehouse and data lake architectures have led businesses to operate complex environments, with data siloed and copied across data warehouses, data marts, data lakes and other relational databases throughout the organization. Given the prohibitive costs of high-performance on-premises and cloud data warehouses, and the performance challenges of legacy data lakes, neither repository satisfies the need for analytical flexibility and price-performance.

Instead of having each new technology solve the same problem, what is needed is a fresh, new architectural style.

Fortunately, the IT landscape is changing thanks to a mix of cloud platforms, open source and traditional software vendors. Cloud vendors, leading with object storage, have helped drive down the cost of storage. But data in object storage cannot readily be updated in place, and object storage does not offer the query performance to which business users have become accustomed. Open-source table formats such as Apache Iceberg, combined with open-source engines such as Presto and Apache Spark, provide the cost advantage of object storage along with better SQL performance and the ability to update large structured and semi-structured data sets in place. But there is still a gap to be filled before all these technologies work together as a coordinated, integrated platform.
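Why does a table format matter here? Objects in object storage are immutable, so formats like Iceberg achieve "updates in place" by rewriting only the affected data files and committing a new metadata snapshot (a copy-on-write approach), while older snapshots remain readable. The following toy Python sketch illustrates that idea; every class, name and structure is invented for illustration and greatly simplifies what a real table format does:

```python
# Toy model of copy-on-write updates over immutable storage.
# Data files are never modified; an update writes replacement files,
# and the table's latest snapshot simply points at a new file list.

class ToyTable:
    def __init__(self):
        self.files = {}        # immutable "data files": name -> rows
        self.snapshots = []    # each snapshot is a list of file names
        self._counter = 0

    def _write_file(self, rows):
        name = f"data-{self._counter}.parquet"
        self._counter += 1
        self.files[name] = tuple(rows)  # frozen once written
        return name

    def append(self, rows):
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + [self._write_file(rows)])

    def update(self, predicate, change):
        """Copy-on-write: rewrite only the files containing matching rows."""
        new_files = []
        for name in self.snapshots[-1]:
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                rewritten = [change(r) if predicate(r) else r for r in rows]
                new_files.append(self._write_file(rewritten))
            else:
                new_files.append(name)  # untouched files are reused as-is
        self.snapshots.append(new_files)  # commit the new snapshot

    def scan(self, snapshot=-1):
        return [r for name in self.snapshots[snapshot] for r in self.files[name]]


table = ToyTable()
table.append([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
table.update(lambda r: r["id"] == 2, lambda r: {**r, "status": "shipped"})
print(table.scan())   # latest snapshot reflects the update
print(table.scan(0))  # the older snapshot is still readable
```

Because every engine that understands the snapshot metadata sees the same consistent table, this is also what lets multiple engines share one copy of the data rather than each maintaining its own.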

To truly solve these challenges, query and reporting engines such as Presto need to work alongside the Spark framework, which supports advanced analytics and complex data transformations. And both Presto and Spark need to work readily with existing and modern data warehouse infrastructures.

The industry is waiting for a breakthrough approach that allows organizations to optimize their analytics ecosystem by selecting the right engine for the right workload at the right cost — without having to copy data to multiple platforms and while taking advantage of integrated metadata. Whichever vendor gets there first will allow organizations to reduce cost and complexity and drive the greatest return on investment from their analytics workloads while also helping to deliver better governance and data security.

