The real-world challenges organizations face with big data today are multi-faceted. Every IT organization knows there is more raw data than ever before; data sources sit in multiple locations, in varying forms and of questionable quality. To add complexity, the business users of, and use cases for, data have become even more varied. The same data used for decision support and business intelligence might also feed machine learning models. In addition, semi-structured data, such as JSON and IoT log files, might need to be combined with transactional data to build a complete picture of a customer's buying experience, while emails and social media content might need to be interpreted to understand customer sentiment (i.e., emotion) to enrich operational decisions, machine learning (ML) models or decision-support applications. Choosing the right technology can help an organization solve many of these challenges.

Addressing these data integration issues might appear relatively easy: land all enterprise data in a centralized data lake and process it from start to finish. But that is too simplistic, because real-time data must simultaneously be processed for decision-support access, and the curated data inputs often already reside in a data warehouse. Keeping data copies synchronized across the physical platforms that support Hadoop-based data lakes, data warehouses and data marts can itself be challenging.

Warehouses are known for high-performance processing of terabytes of structured data for business intelligence, but they can quickly become expensive for new, evolving workloads. When it comes to price-performance, the reality is that organizations are running data engineering pipelines and data science model-building workflows in data warehouses that are not optimized for scalability or for these demanding workloads, which hurts pipeline performance and drives up costs. It is this web of data dependencies, requiring continuous movement of interdependent data sets across platforms, that makes these challenges so hard to solve.

Rethinking data analytics architecture

Software architects at vendor companies understand these challenges, and several have tried to address them in their own way. New workload requirements led to new functionality being bolted onto software platforms that were never optimized for those workloads, reducing efficiency and deepening data silos within many organizations. And because each platform must hold overlapping copies of the data, organizations face data management issues (data governance, privacy and security) along with higher storage costs.

For these reasons, the shortcomings of the traditional data warehouse and data lake architectures have led businesses to operate complex environments, with data siloed and copied across data warehouses, data marts, data lakes and other relational databases throughout the organization. Given the prohibitive costs of high-performance on-premises and cloud data warehouses, and the performance challenges of legacy data lakes, neither repository satisfies the need for analytical flexibility and price-performance.

Instead of having each new technology solve the same problem in isolation, what is needed is a fresh architectural style.

Fortunately, the IT landscape is changing thanks to a mix of cloud computing platforms, open source and traditional software vendors. Cloud vendors, leading with object storage, have driven down the cost of storage. But data in object storage cannot readily be updated, and object storage does not offer the query performance business users have come to expect. Open-source table formats such as Apache Iceberg, combined with open-source engines such as Presto and Apache Spark, deliver the economics of object storage along with better SQL performance and the ability to update large structured and semi-structured data sets in place. But a gap remains: these technologies still need to work together as a coordinated, integrated platform.
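
To make that concrete, here is a minimal sketch of the update-in-place capability, assuming a Spark 3.x session with the Apache Iceberg runtime JAR on the classpath and an S3-compatible object store; the catalog name (lake), bucket path and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-update-in-place")
    # Enable Iceberg's SQL extensions for row-level UPDATE/MERGE statements.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog whose warehouse lives in object storage.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.crm")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.crm.customers (
        id BIGINT,
        email STRING,
        sentiment STRING)
    USING iceberg
""")

# Row-level update: Iceberg writes new data files and commits a new table
# snapshot, so the "update in place" works even though the underlying
# object storage itself is immutable.
spark.sql("UPDATE lake.crm.customers SET sentiment = 'positive' WHERE id = 42")
```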

To truly solve these challenges, the query and reporting capabilities provided by engines such as Presto need to work alongside the Spark framework that supports advanced analytics and complex data transformations. And both Presto and Spark need to work readily with existing and modern data warehouse infrastructures.
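
The other half of the picture is the reporting engine reading the very same tables the pipeline maintains. The sketch below uses the open-source presto-python-client package against a Presto coordinator assumed to have an Iceberg connector configured; the host, user and table names are illustrative assumptions, not a specific product's setup.

```python
import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto.example.com",   # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",           # the same tables Spark wrote
    schema="crm",
)
cur = conn.cursor()

# A reporting-style query against the table the Spark pipeline keeps
# up to date; no separate copy of the data is made for the BI workload.
cur.execute("""
    SELECT sentiment, count(*) AS customers
    FROM customers
    GROUP BY sentiment
""")
for sentiment, customers in cur.fetchall():
    print(sentiment, customers)
```

Because both engines resolve the table through shared metadata, the interactive query and the Spark transformation operate on a single copy of the data rather than on synchronized duplicates.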

The industry is waiting for a breakthrough approach that allows organizations to optimize their analytics ecosystem by selecting the right engine for the right workload at the right cost — without having to copy data to multiple platforms and while taking advantage of integrated metadata. Whichever vendor gets there first will allow organizations to reduce cost and complexity and drive the greatest return on investment from their analytics workloads while also helping to deliver better governance and data security.
