The real-world challenges organizations face with big data today are multi-faceted. Every IT organization knows there is more raw data than ever before; data sources sit in multiple locations, in varying forms, and of questionable quality. To add complexity, the business users of data, and the use cases for it, have become even more varied. New data used for decision support and business intelligence might also be used for developing machine learning (ML) models. In addition, semi-structured data, such as JSON and IoT log files, might need to be mixed with transactional data to get a complete picture of a customer buying experience, while emails and social media content might need to be interpreted to understand customer sentiment (i.e., emotion) to enrich operational decisions, ML models, or decision-support applications. Choosing the right technology for your organization can help solve many of these challenges.
Addressing these data integration issues might appear relatively easy: land all enterprise data in a centralized data lake and process it from start to finish. But that is too simplistic, because real-time data must be processed for decision-support access at the same time, and the curated data inputs often reside in a data warehouse repository. Keeping data copies synchronized across the physical platforms that support Hadoop-based data lakes, data warehouses, and data marts can also be challenging.
Warehouses are known for high-performance processing of terabytes of structured data for business intelligence, but they can quickly become expensive for new, evolving workloads. When it comes to price-performance, the reality is that organizations are running data engineering pipelines and machine learning model-building workflows in data warehouses that are not necessarily optimized for scalability or for these demanding workloads, which impacts pipeline performance and drives up costs. It is this web of data dependencies, requiring continuous movement of interdependent data sets across platforms, that makes these challenges so difficult to solve.
Software architects at vendors understand these challenges, and several companies have tried to address them in their own way. New workload requirements led to new functionality being added to software platforms that were not specifically optimized for those workloads, reducing the platforms' efficiency and worsening data silos within many organizations. Additionally, each platform must hold its own overlapping copy of the data, which complicates data management (data governance, privacy, and security) and raises data storage costs.
For these reasons, the limitations of the traditional data warehouse and data lake architecture have left businesses operating complex architectures, with data siloed and copied across data warehouses, data marts, data lakes, and other relational databases throughout the organization. Given the prohibitive costs of high-performance on-premises and cloud data warehouses, and the performance challenges of legacy data lakes, neither of these repositories satisfies the need for analytical flexibility and price-performance.
Instead of having each new technology solve the same problem, what is needed is a fresh architectural style.
Fortunately, the IT landscape is changing due to a mix of cloud computing platforms, open source, and traditional software vendors. Cloud vendors, leading with object storage, have helped drive down the cost of disk storage. But data stored in object storage cannot readily be updated, and object storage does not offer the type of query performance to which business users have become accustomed. Open-source table formats such as Apache Iceberg, combined with open-source engines such as Presto and Apache Spark, provide the advantages of object storage along with better SQL performance and the ability to update large structured and semi-structured data sets in place. But a gap remains: these technologies still need to work together as a coordinated, integrated platform.
To truly solve these challenges, query and reporting, provided by engines such as Presto, need to work alongside the Spark framework to support advanced analytics and complex data transformations. Presto and Spark also need to work readily with both existing and modern data warehouse infrastructures.
The industry is waiting for a breakthrough approach that allows organizations to optimize their analytics ecosystem by selecting the right engine for the right workload at the right cost, without having to copy data to multiple platforms and while taking advantage of integrated metadata. Whichever vendor gets there first will allow organizations to reduce cost and complexity and drive the greatest return on investment from their analytics workloads, while also helping to deliver better governance and data security.