What is a data platform?

A data platform is a technology solution that enables the collection, storage, cleaning, transformation, analysis and governance of data. Data platforms can include both hardware and software components. They make it easier for organizations to use their data to improve decision making and operations.

Today, many organizations rely on complex data pipelines to support data analytics, data science and data-driven decisions. A modern data platform provides the tools that organizations need to safeguard data quality and unlock the value of their data.

Specifically, data platforms can help surface actionable insights, reduce data silos, enable self-service analytics, streamline automation and power artificial intelligence (AI) applications.

A data platform, also referred to as a “data stack,” is composed of five foundational layers: data storage and processing; data ingestion; data transformation; business intelligence (BI) and analytics; and data observability.

Types of data platforms

Data platforms can be built and configured to serve specific business functions. Some of the most common types of data platforms include:

Enterprise data platform (EDP)
Big data platform (BDP)
Cloud data platform (CDP)
Customer data platform (CDP)

Enterprise data platform (EDP)

Enterprise data platforms were originally developed to serve as central repositories to make data more accessible across an organization. These platforms typically housed data on-premises, in operational databases or data warehouses. They often handled structured customer, financial and supply chain data.

Modern data platforms expand the capabilities of traditional enterprise data platforms to make sure that data is accurate and timely, reduce data silos and enable self-service. They are often built on a suite of cloud-native software, which supports more flexibility and cost-effectiveness.

The two fundamental principles that govern enterprise data platforms are:

Availability: Data is readily available in a data lake, data warehouse or data lakehouse, which separate storage and compute. Splitting these functions makes it possible to store large amounts of data relatively inexpensively.
Elasticity: Compute functions are cloud-based, which enables autoscaling. For example, if most data and analytics workloads run at a particular day and time, processing can be automatically scaled up for a better customer experience and scaled back down as workload needs decrease.
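
The elasticity principle can be sketched as a simple scaling rule. This is a minimal illustration only: the function name, the per-worker capacity and the worker limits are assumptions chosen for the example, not settings from any particular cloud service.

```python
import math


def workers_needed(queued_queries: int, queries_per_worker: int = 50,
                   min_workers: int = 1, max_workers: int = 20) -> int:
    """Return how many compute workers to run for the current load.

    All parameters are hypothetical; a real platform would expose its own
    autoscaling policy settings.
    """
    if queries_per_worker <= 0:
        raise ValueError("queries_per_worker must be positive")
    # Scale with demand, rounding up so no query is left without capacity.
    wanted = math.ceil(queued_queries / queries_per_worker)
    # Clamp to the allowed range: never below the floor, never above the cap.
    return max(min_workers, min(max_workers, wanted))
```

With these assumed defaults, an idle system stays at one worker, 120 queued queries call for three workers and a very large spike is capped at 20.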

Big data platform (BDP)

A big data platform is designed to gather, process and store large volumes of data, often in real time. Given the huge volumes of data they handle, big data platforms often use distributed computing, with the data spread across many servers.
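
One common way to spread data across servers is hash partitioning, where a stable hash of each record's key decides which server holds it. The sketch below is an assumption about how such a scheme might look, not a description of any specific product; the function names are illustrative.

```python
import hashlib


def assign_server(record_key: str, server_count: int) -> int:
    """Map a record key to a server index using a stable hash."""
    if server_count <= 0:
        raise ValueError("server_count must be positive")
    # hashlib gives the same result on every run, unlike Python's built-in
    # hash() for strings, which is randomized per process.
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % server_count


def partition(keys, server_count: int) -> dict:
    """Group keys by the server that would store them."""
    shards = {index: [] for index in range(server_count)}
    for key in keys:
        shards[assign_server(key, server_count)].append(key)
    return shards
```

Because the same key always maps to the same server, a query for one record only needs to visit one machine, while a full scan can run on all servers in parallel.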

Other types of data platforms might also manage large volumes of data, but a big data platform is specially designed to process that data at high speeds. An enterprise-grade BDP is able to run complex queries against massive datasets, whether structured, semistructured or unstructured. Typical BDP uses include big data analytics, fraud detection, predictive analytics and recommendation systems.

Big data platforms are often available as software-as-a-service (SaaS) products, as part of a data as a service (DaaS) offering or in a cloud computing suite.

Cloud data platform (CDP)

As the name implies, the defining feature of a cloud data platform is that it is cloud-based, which can provide multiple benefits:

A cloud data platform is often available on a pay-as-you-go basis.
Total storage space is flexible, for scaling up or down as needed.
Staff is not needed to maintain an on-premises hardware platform.
A cloud data platform can house platforms for big data, enterprise data or customer data.
Many CDPs offer supplemental capabilities such as advanced analytics, machine learning (ML) and visualization tools.

Customer data platform (CDP)

A customer data platform collects and unifies customer data from multiple sources to build a single, coherent and complete view of every customer.

Input to the CDP might be received from an organization’s customer relationship management (CRM) system, social media activity, touchpoints with the organization, transactional systems or website analytics.

A unified, 360-degree view of customers can give an organization greater insight into their behavior and preferences, enabling more targeted marketing, better user experiences and new revenue opportunities.
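
As a rough illustration of unification, the sketch below merges records from several sources into one profile per customer. It assumes that every record carries an email address that can serve as the matching key, which is a simplification; real customer data platforms typically use more sophisticated identity resolution. The function and field names are hypothetical.

```python
def unify_customers(*sources):
    """Merge records that share an email address into one profile each.

    Each source is a (source_name, list_of_record_dicts) pair.
    """
    profiles = {}
    for source_name, records in sources:
        for record in records:
            # Normalize the key so "Ana@Example.com " and "ana@example.com" match.
            key = record["email"].strip().lower()
            profile = profiles.setdefault(key, {"email": key, "sources": []})
            for field, value in record.items():
                if field != "email" and value is not None:
                    # The first non-empty value seen for a field wins.
                    profile.setdefault(field, value)
            profile["sources"].append(source_name)
    return profiles
```

For example, a CRM record holding a customer's name and a web analytics record holding the last page they visited would end up in the same profile, with both source systems listed.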

Layers in a data platform

Data platforms can come in all shapes and sizes, depending on the needs of the organization. A typical platform includes at least these five layers:

Data storage
Data ingestion
Data transformation
Business intelligence and analytics
Data observability

1. Data storage

The first layer in many data platforms is the data storage layer. The type of data storage used depends on the needs of the organization and can include both on-premises and cloud storage. Common data stores include:

Data warehouses

A data warehouse—or enterprise data warehouse (EDW)—aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, AI and machine learning. Data warehouses are most often used for managing structured data with clearly defined analytics use cases.

Data lakes

A data lake is a lower-cost storage environment, which typically houses petabytes of raw data. A data lake can store both structured and unstructured data in various formats, allowing researchers to more easily work with a broad range of data.

Data lakes were often originally built in the Hadoop ecosystem, an open-source framework for distributed storage and processing of large data sets. Starting around 2015, many data lakes began shifting to the cloud. A typical data lake architecture now might store data on an object storage platform, such as Amazon S3 from Amazon Web Services (AWS), and use a tool such as Spark to process the data.

Data lakehouses

A data lakehouse combines the capabilities of data warehouses and data lakes into a single data management solution.

While data warehouses offer better performance than data lakes, they are often more expensive and limited in their ability to scale. Data lakes optimize for storage costs but lack the structure for useful analytics.

A data lakehouse is designed to address these challenges by using cloud object storage to store a broader range of data types—that is, structured data, unstructured data and semistructured data. A data lakehouse architecture combines this storage with tools to support advanced analytics efforts, such as business intelligence and machine learning.

2. Data ingestion

The process of collecting data from various sources and moving the data into a storage system is called data ingestion. When ingested, data can be used for record-keeping purposes or further processing and analysis.

The effectiveness of an organization’s data infrastructure depends largely on how well data is ingested and integrated. If there are problems during ingestion, such as missing or outdated data sets, every step of the downstream analytical workflows might suffer.

Ingestion can use different data processing models, depending on the needs of an organization and its overarching data architecture.

Batch processing is the most common form of data ingestion. It does not process data in real time, but instead collects and groups data into batches, which are then sent to storage. Batch processing might be initiated by using a simple schedule or activated when certain predetermined conditions exist. It is typically used when real-time data is not necessary, because it requires less work and is less expensive than real-time processing.
Real-time processing, also called streaming or stream processing, does not group data. Instead, data is obtained, transformed and loaded as soon as it is recognized at the source. Real-time processing is more expensive because it requires constant monitoring of data sources.
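
The two processing models can be contrasted in a few lines. This is a minimal sketch, assuming events arrive as Python dictionaries; the function names and the `handle` callback are illustrative, not part of any ingestion tool.

```python
from typing import Callable, Iterable, Iterator, List


def batch_ingest(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Collect events into fixed-size groups and emit each group as one unit."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch


def stream_ingest(events: Iterable[dict], handle: Callable[[dict], None]) -> int:
    """Hand every event to the handler as soon as it is seen; return the count."""
    count = 0
    for event in events:
        handle(event)
        count += 1
    return count
```

In the batch version, downstream storage sees a few large writes. In the streaming version, it sees one write per event, which lowers latency but means the ingestion process has to run continuously.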

3. Data transformation

The third layer, data transformation, deals with changing the structure and format of data to make it usable for data analytics and other projects. For example, unstructured data can be parsed into a structured, tabular form that can be searched with SQL queries. Data can be transformed either before or after arriving at the storage destination.
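
As a small illustration of this kind of transformation, the sketch below parses free-text log lines into rows and loads them into an in-memory SQLite table so that they can be searched with SQL. The log line layout (date, level, message) and the table name are assumptions made up for the example.

```python
import re
import sqlite3

# Assumed line layout: "YYYY-MM-DD LEVEL free-text message"
LINE_PATTERN = re.compile(
    r"^(?P<day>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)$"
)


def parse_lines(lines):
    """Turn raw text lines into (day, level, message) rows, skipping bad lines."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append((match["day"], match["level"], match["message"]))
    return rows


def load_rows(rows):
    """Load parsed rows into an in-memory SQLite table named 'logs'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (day TEXT, level TEXT, message TEXT)")
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    return conn
```

Once loaded, a question such as "which errors occurred?" becomes a one-line query: `SELECT message FROM logs WHERE level = 'ERROR'`.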

Until recently, most data ingestion models used an extract, transform, load (ETL) procedure to take data from its source, reformat it and transport it to its destination. This makes sense when businesses use in-house analytics systems. Doing the prep work before delivering data to its destination can help lower costs. Organizations that still use on-premises data warehouses normally use an ETL process.

However, many organizations today prefer cloud-based data warehouses, such as IBM Db2 Warehouse, Azure Synapse Analytics from Microsoft, Snowflake or BigQuery from Google Cloud. Cloud scalability enables organizations to use an extract, load, transform (ELT) model, which bypasses preload transformations to send raw data directly to the data warehouse more quickly. The data is then transformed as needed after arriving, typically when running a query.
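
The difference between ETL and ELT can be sketched with Python's built-in sqlite3 module standing in for a warehouse. The sample rows, table names and cleaning rules are assumptions made up for the example; the point is only where the transformation happens.

```python
import sqlite3

# Hypothetical raw export: (id, country, amount), with messy text values.
RAW = [("a1", " us ", "19.99"), ("a2", "DE", "5"), ("a3", "us", "0.01")]

QUERY = "SELECT country, COUNT(*) FROM orders GROUP BY country ORDER BY country"


def etl(raw_rows):
    """ETL: clean the data in application code, then load only cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT, country TEXT, amount REAL)")
    cleaned = [(i, c.strip().upper(), float(a)) for i, c, a in raw_rows]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
    return conn


def elt(raw_rows):
    """ELT: load raw rows unchanged, then transform inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (id TEXT, country TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)
    # The view applies the same cleaning rules, but only when it is queried.
    conn.execute(
        "CREATE VIEW orders AS "
        "SELECT id, UPPER(TRIM(country)) AS country, "
        "CAST(amount AS REAL) AS amount FROM raw_orders"
    )
    return conn
```

Both functions answer `QUERY` with the same result. The ELT version also keeps the untouched raw rows available, so the cleaning rules can be changed later without re-extracting the data.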

4. Business intelligence and analytics

The fourth data platform layer includes business intelligence (BI) and analytics tools that enable users to leverage data for business analytics and big data analytics efforts. For example, BI and analytics tools might let users query data, transform it into visualizations or otherwise manipulate it.

For many departments in an organization, this layer is the face of the data platform, where users directly interact with the data.

Researchers and data scientists can work with data to derive actionable intelligence and insights. Marketing departments might use BI and analytics tools to learn more about their customers and find valuable initiatives. Supply chain teams might use data analytics insights to streamline processes or find superior vendors.

The insight delivered through this layer is the primary reason organizations gather data in the first place.

5. Data observability

Data observability is the practice of monitoring, managing and maintaining data to promote data quality, availability and reliability. Data observability covers several activities and technologies, including tracking, logging, alerting and anomaly detection.

These activities, when combined and viewed on a dashboard, enable users to identify and resolve data difficulties in near real time. For example, the observability layer helps data engineering teams answer specific questions about what is taking place behind the scenes in distributed systems. It can show how data flows through the system, where data is moving slowly and what is broken.

Observability tools can also alert managers, data teams and other stakeholders about potential problems so that they can proactively address issues.
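
A few of these checks can be sketched in plain Python. The example below assumes each row is a dictionary with a `loaded_at` timestamp, and the thresholds are arbitrary illustrative values; dedicated observability tools offer far richer monitoring than this.

```python
from datetime import datetime, timedelta


def check_table(rows, column, now, max_age=timedelta(hours=1),
                min_rows=1, max_null_rate=0.05):
    """Return a list of alert messages for volume, freshness and quality."""
    alerts = []
    # Volume: did we receive roughly as much data as expected?
    if len(rows) < min_rows:
        alerts.append(f"volume: {len(rows)} rows, expected at least {min_rows}")
    if rows:
        # Freshness: how old is the most recently loaded row?
        newest = max(row["loaded_at"] for row in rows)
        if now - newest > max_age:
            alerts.append("freshness: newest row is older than allowed")
        # Quality: what share of values in the watched column is missing?
        null_rate = sum(1 for row in rows if row.get(column) is None) / len(rows)
        if null_rate > max_null_rate:
            alerts.append(f"quality: {null_rate:.0%} of '{column}' values are missing")
    return alerts
```

An empty list means the table passed all three checks. Any returned message could be forwarded to a dashboard or an alerting channel.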

Additional data platform layers

In addition to those five foundational layers, other layers that are common in a modern data stack include:

Data discovery

Inaccessible data is useless data. Data discovery helps make sure that data doesn’t just sit out of sight. Specifically, data discovery is about collecting, evaluating and exploring data from disparate sources, with the goal of bringing together data from siloed or previously unknown sources for analysis.

Data governance

Modern data platforms often emphasize data governance and data security to protect sensitive information, drive regulatory compliance, facilitate access and manage data quality. Tools supporting this layer include access controls, encryption, auditing and data lineage tracking.
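
Two of these tools, access controls and auditing, can be illustrated together. The sketch below is a simplified assumption of column-level access control: the roles, the column names and the shape of the audit log are all hypothetical.

```python
# Hypothetical policy: which columns each role is allowed to read.
ROLE_COLUMNS = {
    "analyst": {"order_id", "country", "amount"},
    "support": {"order_id", "email"},
}


def read_record(record: dict, role: str, audit_log: list) -> dict:
    """Return only the columns the role may see and log the access."""
    # Unknown roles get an empty set, so they see nothing by default.
    allowed = ROLE_COLUMNS.get(role, set())
    visible = {key: value for key, value in record.items() if key in allowed}
    # Record who read which columns so the access can be audited later.
    audit_log.append((role, sorted(visible)))
    return visible
```

Denying access by default for unknown roles is a deliberate choice here: a missing policy entry results in less exposure, not more.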

Data cataloging and metadata management

Data catalogs use metadata—data that describes or summarizes data—to create an informative and searchable inventory of all data assets in an organization. For example, a data catalog can help people more quickly locate unstructured data, including documents, images, audio, video and data visualizations.
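
At its simplest, a catalog is a searchable collection of metadata records. The sketch below assumes a small in-memory catalog; the class names, fields and matching rules are illustrative and do not reflect any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data asset (all fields are illustrative)."""
    name: str
    location: str
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """A minimal in-memory inventory of data assets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list:
        """Return names of assets whose name, description or tags match."""
        term = term.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or term in {tag.lower() for tag in entry.tags}
        )
```

Note that the search only touches metadata. The underlying documents, images or recordings are never opened, which is what makes otherwise hard-to-search unstructured assets quick to locate.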

Machine learning and AI

Some enterprise-grade data platforms incorporate machine learning and AI capabilities to help users extract valuable insights from data. For example, platforms might feature predictive analytics algorithms, machine learning models for anomaly detection and automated insights powered by generative AI tools.
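
To give a feel for anomaly detection, the sketch below flags values that sit far from the mean of a series. It is a simple statistical stand-in (a z-score test), not a trained machine learning model, and the default threshold of 2.5 standard deviations is an arbitrary assumption.

```python
import statistics


def find_anomalies(values, threshold: float = 2.5) -> list:
    """Return the indexes of values more than `threshold` standard
    deviations away from the mean of the series."""
    if not values:
        return []
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [
        index
        for index, value in enumerate(values)
        if abs(value - mean) / stdev > threshold
    ]
```

For a series of daily row counts that hover around 100 with one day at 500, only the index of the 500 is returned. One known weakness of this approach is that a large outlier inflates the standard deviation, so more robust methods are often preferred in practice.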

Why data platforms matter

A robust data platform can help an organization get more value from its data by enabling greater control over data by technical staff and faster self-service for everyday users.

Data platforms can help knock down data silos, one of the biggest barriers to data usability. Separate departments—such as HR, production and supply chain—might maintain separate data stores in separate environments, creating inconsistencies and overlaps. When data is unified on a data platform, it creates an organization-wide single source of truth (SSoT).

Analytics and business decisions can be improved by the removal of silos and improved data integration. In this way, data platforms are key components of a robust data fabric, which helps decision-makers get a more cohesive view of organizational data. This cohesive view can help organizations draw new connections between data and harness big data for data mining and predictive analytics.

A data platform can also enable an organization to study end-to-end data processes and find new efficiencies. An enterprise-grade data platform can also speed access to information, which can boost efficiency for both internal decision-making and customer-facing efforts.

Finally, a well-managed data platform can offer diversified and redundant data storage, improving organizational resilience in the face of cyberattacks or natural disasters.

Authors

Jim Holdsworth

Matthew Kosinski
