Date: 2024-07-03

Modern data management solutions provide an efficient way to manage data and metadata across diverse datasets. Modern systems are built with the latest data management software and reliable databases or data stores. This can include transactional data lakes, data warehouses or data lakehouses, combined with a data fabric architecture including data ingestion, governance, lineage, observability and master data management. Together, this trusted data foundation can feed quality data to data consumers as data products, business intelligence (BI) and dashboarding, and AI models, both traditional ML and generative AI.

A strong data management strategy typically includes multiple components to streamline strategy and operations throughout an organization.

The right databases and data lakehouse architecture

While data can be stored before or after data processing, the type of data and its purpose will usually dictate the storage repository that is used. Relational databases organize data into a tabular format, whereas nonrelational databases do not have as rigid a database schema.

Relational databases are also typically associated with transactional databases, which run a group of commands collectively as a single transaction, so that either all of them take effect or none do. An example is a bank transfer: a defined amount is withdrawn from one account and then deposited into another. But for enterprises to support both structured and unstructured data types, they require purpose-built databases. These databases must also cater to various use cases across analytics, AI and applications. They must span both relational and nonrelational databases, such as key-value, document, wide-column, graph and in-memory. These multimodal databases provide native support for different types of data and the latest development models, and can run many kinds of workloads, including IoT, analytics, ML and AI.
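
The bank transfer can be sketched as a single database transaction. This is a minimal illustration that uses Python's built-in sqlite3 module; the accounts table, the account names and the transfer helper are assumptions made for the example, not part of any particular banking system.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

# Hypothetical accounts used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)         # withdrawal and deposit both apply
rejected = transfer(conn, "alice", "bob", 500)  # withdrawal is rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The second call fails after the withdrawal has been issued, yet the balances stay unchanged because the whole transaction is rolled back.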

Data management best practices suggest that data warehousing be optimized for high-performance analytics on structured data. This requires a defined schema to meet specific data analytics requirements for specific use cases, such as dashboards, data visualization and other business intelligence tasks. These data requirements are usually directed and documented by business users in partnership with data engineers, who will ultimately run queries against the defined data model.

The underlying structure of a data warehouse is typically organized as a relational system that uses a structured data format, sourcing data from transactional databases. Data lakes, by contrast, incorporate data from both relational and nonrelational systems, so they can hold unstructured and semistructured data as well as structured data. Data lakes are often preferred to the other storage options because they are normally a low-cost storage environment that can house petabytes of raw data.

Data lakes benefit data scientists in particular, as they enable them to incorporate both structured and unstructured data into their data science projects. However, data warehouses and data lakes have their own limitations. Proprietary data formats and high storage costs limit AI and ML model collaboration and deployments within a data warehouse environment.

In contrast, data lakes are challenged with extracting insights directly in a governed and performant manner. An open data lakehouse addresses these limitations by handling multiple open formats over cloud object storage and combines data from multiple sources, including existing repositories, to ultimately enable analytics and AI at scale.

Hybrid cloud database strategy

Multicloud and hybrid strategies are steadily becoming more popular. AI technologies are powered by massive amounts of data that require modern data stores that reside on cloud-native architectures to provide scalability, cost optimization, enhanced performance and business continuity. According to Gartner [2], by the end of 2026, "90% of data management tools and platforms that fail to support multi-cloud and hybrid capabilities will be set for decommissioning."

While existing tools aid database administrators (DBAs) in automating numerous conventional management duties, manual involvement remains necessary due to the typically large and intricate nature of database setups. Whenever manual intervention becomes necessary, the likelihood of errors rises. Minimizing the necessity for manual data management stands as a primary goal in operating databases as fully managed services.

Fully managed cloud databases automate time-consuming tasks such as upgrades, backups, patching and maintenance. This approach frees DBAs from that manual work so they can spend more time on higher-value tasks such as schema optimization, new cloud-native apps and support for new AI use cases. Unlike on-premises deployments, cloud storage providers also enable users to spin up large clusters as needed, often requiring only payment for the storage specified. This means that if an organization needs more compute power to run a job in a few hours (versus a few days), it can do this on a cloud platform by purchasing more compute nodes.

This shift to cloud data platforms is also facilitating the adoption of streaming data processing. Tools such as Apache Kafka enable more real-time data processing, so that consumers can subscribe to topics to receive data in a matter of seconds. However, batch processing still has its advantages as it is more efficient at processing large volumes of data. Because batch processing abides by a set schedule, such as daily, weekly or monthly, it is ideal for business performance dashboards, which typically do not require real-time data.
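
The difference between the two processing styles can be shown with a toy example. The InMemoryBroker class below is a hypothetical stand-in written for this sketch and is not the Apache Kafka API; it only illustrates that a streaming consumer updates its result as each event arrives, while a batch job processes the collected events in one scheduled run.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy publish/subscribe broker (illustrative only, not Apache Kafka)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler subscribed to the topic.
        for handler in self._subscribers[topic]:
            handler(message)

# Streaming: the running total is updated as soon as each event is published.
running_total = {"value": 0}

def on_sale(event):
    running_total["value"] += event["amount"]

broker = InMemoryBroker()
broker.subscribe("sales", on_sale)

events = [{"amount": 40}, {"amount": 25}, {"amount": 35}]  # hypothetical sales
totals_seen = []
for event in events:
    broker.publish("sales", event)
    totals_seen.append(running_total["value"])

# Batch: the same events are collected first and summed in a single run.
batch_total = sum(event["amount"] for event in events)
```

Both approaches end with the same total; the streaming consumer simply has an up-to-date figure after every event, whereas the batch result is available only after the scheduled run.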

Data fabric architecture

More recently, data fabrics have emerged to assist with the complexity of managing these data systems. Data fabrics use intelligent and automated systems to facilitate end-to-end integration of data pipelines and cloud environments. A data fabric also simplifies delivery of quality data and provides a framework for enforcing data governance policies to help ensure that the data used is compliant. This facilitates self-service access to trustworthy data products by connecting to data residing across organizational silos, so that business leaders gain a more holistic view of business performance. The unification of data across HR, marketing, sales, supply chain and other functions gives leaders a better understanding of their customers.



A data mesh might also be useful. A data fabric is an architecture that facilitates the end-to-end integration of data pipelines and cloud environments. In contrast, a data mesh is a decentralized data architecture that organizes data by a specific business domain, such as marketing, sales or customer service. This approach provides more ownership to the producers of a dataset.

Data integration and processing

Within this stage of the data management lifecycle, raw data is ingested from a range of data sources, such as web APIs, mobile apps, Internet of Things (IoT) devices, forms, surveys and more. After data collection, the data is usually processed or loaded by using data integration techniques, such as extract, transform, load (ETL) or extract, load, transform (ELT). While ETL has historically been the standard method to integrate and organize data across different datasets, ELT has been growing in popularity with the emergence of cloud data platforms and the increasing demand for real-time data.
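
The difference between ETL and ELT is mainly where the transformation runs. The following sketch uses Python's built-in sqlite3 module as a stand-in for a target data store; the order records, table names and cleaning rules are assumptions made for illustration.

```python
import sqlite3

raw_orders = [  # hypothetical source records
    {"id": 1, "amount": "19.90", "country": "de"},
    {"id": 2, "amount": "5.00", "country": "US"},
]

conn = sqlite3.connect(":memory:")

# ETL: transform in the pipeline first, then load the cleaned rows.
conn.execute(
    "CREATE TABLE orders_etl (id INTEGER, amount_cents INTEGER, country TEXT)"
)
cleaned = [
    (r["id"], round(float(r["amount"]) * 100), r["country"].upper())
    for r in raw_orders
]
conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

# ELT: load the raw rows as they are, then transform inside the target with SQL.
conn.execute("CREATE TABLE orders_raw (id INTEGER, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO orders_raw VALUES (:id, :amount, :country)", raw_orders
)
conn.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT id,
           CAST(ROUND(CAST(amount AS REAL) * 100) AS INTEGER) AS amount_cents,
           UPPER(country) AS country
    FROM orders_raw
    """
)

etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY id").fetchall()
```

Both paths produce the same cleaned table here. With ELT the raw table also remains available in the target, which is one reason the pattern suits platforms where storage and in-database compute are readily available.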

In addition to batch processing, data replication is an alternative method of integrating data and consists of synchronizing data from a source location to one or more target locations, helping ensure data availability, reliability and resilience. Technology such as change data capture (CDC) uses log-based replication to capture changes made to data at the source and propagate those changes to target systems, helping organizations make decisions based on current information.
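
The replay side of change data capture can be sketched in a few lines. The example below assumes a simplified change log in which every event carries a log position, an operation and a key; the ChangeEvent type and apply_changes function are hypothetical and do not correspond to any specific CDC product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical change log, ordered by log position."""
    position: int
    op: str                      # "insert", "update" or "delete"
    key: str
    row: Optional[dict] = None   # new row contents; None for deletes

def apply_changes(target, events, last_applied=0):
    """Replay events newer than `last_applied` onto `target`.

    Returns the position of the last event applied, which serves as a checkpoint.
    """
    for event in sorted(events, key=lambda e: e.position):
        if event.position <= last_applied:
            continue  # already replicated, so replaying the log is safe
        if event.op == "delete":
            target.pop(event.key, None)
        else:  # inserts and updates are both treated as upserts
            target[event.key] = dict(event.row)
        last_applied = event.position
    return last_applied

log = [
    ChangeEvent(1, "insert", "c1", {"name": "Ada", "tier": "basic"}),
    ChangeEvent(2, "insert", "c2", {"name": "Lin", "tier": "basic"}),
    ChangeEvent(3, "update", "c1", {"name": "Ada", "tier": "gold"}),
    ChangeEvent(4, "delete", "c2"),
]

replica = {}
checkpoint = apply_changes(replica, log[:2])                       # first sync
checkpoint = apply_changes(replica, log, last_applied=checkpoint)  # catch up
```

Tracking the checkpoint means the second call applies only events 3 and 4, even though it is handed the full log.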

Regardless of the data integration technique used, the data is usually filtered, merged or aggregated during the data processing stage to meet the requirements for its intended purpose. That purpose can range from a business intelligence dashboard to a predictive machine learning algorithm.
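
As a small illustration of those three operations, the sketch below filters a list of orders, merges in customer attributes and aggregates revenue by region. The records and field names are invented for the example.

```python
from collections import defaultdict

orders = [  # hypothetical order records
    {"order_id": 1, "customer_id": "c1", "amount": 120, "status": "completed"},
    {"order_id": 2, "customer_id": "c2", "amount": 80, "status": "cancelled"},
    {"order_id": 3, "customer_id": "c2", "amount": 50, "status": "completed"},
    {"order_id": 4, "customer_id": "c3", "amount": 30, "status": "completed"},
]
customers = {  # hypothetical customer attributes keyed by customer_id
    "c1": {"region": "EMEA"},
    "c2": {"region": "APAC"},
    "c3": {"region": "EMEA"},
}

# Filter: keep only completed orders.
completed = [o for o in orders if o["status"] == "completed"]

# Merge: enrich each order with the region of its customer.
enriched = [
    {**o, "region": customers[o["customer_id"]]["region"]} for o in completed
]

# Aggregate: total revenue per region, ready for a dashboard or a model feature.
totals = defaultdict(int)
for row in enriched:
    totals[row["region"]] += row["amount"]
revenue_by_region = dict(totals)
```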

Putting code and data assets under version control, and pairing it with continuous integration and continuous deployment (CI/CD), enables data teams to track changes to those assets. Version control also enables data teams to collaborate more effectively, as they can work on different parts of a project simultaneously and merge their changes with fewer conflicts.

Data governance and metadata management

Data governance promotes the availability and usage of data. To help ensure compliance, governance generally includes processes, policies and tools around data quality, data access, usability and data security. For instance, data governance councils tend to align taxonomies to help ensure that metadata is added consistently across various data sources. A taxonomy can also be further documented through a data catalog to make the data more accessible to users, facilitating data democratization across an organization.

Enriching data with the right business context is critical for the automated enforcement of data governance policies and data quality. This is where service level agreement (SLA) rules come into effect, helping ensure that data is protected and of the required quality. It is also important to understand the provenance of the data and gain transparency into the journey of the data as it moves through pipelines. This calls for robust data lineage capabilities to drive visibility as organizational data makes its way from data sources to the end users. Data governance teams also define roles and responsibilities to help ensure that data access is provided appropriately. This controlled access is particularly important to maintain data privacy.
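
One simple way to picture data lineage is as a graph that records which datasets each dataset is derived from. The sketch below walks such a graph to find everything a dashboard ultimately depends on; the dataset names and the upstream_sources helper are hypothetical.

```python
# Hypothetical lineage graph: dataset -> datasets it is derived from.
lineage = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
}

def upstream_sources(dataset, graph):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:  # guard against revisiting shared parents
                seen.add(parent)
                stack.append(parent)
    return seen

dashboard_inputs = upstream_sources("sales_dashboard", lineage)
```

A lookup like this is what allows a team to answer provenance questions, for example which reports are affected when a source such as crm_export changes.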



Data security

Data security sets guardrails in place to protect digital information from unauthorized access, corruption or theft. As digital technology becomes an increasing part of our lives, more scrutiny is placed upon the security practices of modern businesses. This scrutiny is important to help protect customer data from cybercriminals or to help prevent incidents that need disaster recovery. While data loss can be devastating to any business, data breaches, in particular, can result in costly consequences from both a financial and brand standpoint. Data security teams can better secure their data by using encryption and data masking within their data security strategy.
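
Data masking can be illustrated with standard-library Python. The helpers below are assumptions made for this sketch: two display masks and a keyed hash for pseudonymization. Encryption itself is deliberately left out, because in practice it would rely on a vetted cryptography library and managed keys rather than hand-written code.

```python
import hashlib
import hmac

def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * max(len(local) - 1, 0)}@{domain}"

def mask_card(number):
    """Show only the last four digits of a card number."""
    digits = [c for c in number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

def pseudonymize(value, secret_key):
    """Keyed hash: the same input always maps to the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical values used only for illustration.
masked_email = mask_email("ada.lovelace@example.com")
masked_card = mask_card("4111 1111 1111 1111")
token = pseudonymize("ada.lovelace@example.com", b"demo-key-not-for-production")
```

Masked values are suitable for display or test environments, while the pseudonymized token still lets records be joined on the same customer without exposing the original value.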

Data observability

Data observability refers to the practice of monitoring, managing and maintaining data in a way that helps ensure its quality, availability and reliability across various processes, systems and pipelines within an organization. Data observability is about truly understanding the health of an organization’s data and its state across a data ecosystem. It includes various activities that go beyond traditional monitoring, which only describes a problem. Data observability can help identify, troubleshoot and resolve data issues in near-real time.
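
A very small observability check might look at freshness and completeness. The function below is a sketch under assumed thresholds and field names (loaded_at, customer_id); real observability tooling typically tracks many more signals, such as volume, schema changes and distribution drift.

```python
from datetime import datetime, timedelta, timezone

def check_table_health(rows, now, max_age, max_null_rate, required_field):
    """Return a list of human-readable issues; an empty list means healthy."""
    if not rows:
        return ["table is empty"]
    issues = []

    # Freshness: how old is the most recently loaded row?
    newest = max(r["loaded_at"] for r in rows)
    if now - newest > max_age:
        issues.append(f"stale: newest row is {now - newest} old")

    # Completeness: share of rows missing the required field.
    null_rate = sum(r.get(required_field) is None for r in rows) / len(rows)
    if null_rate > max_null_rate:
        issues.append(
            f"{required_field}: null rate {null_rate:.0%} exceeds {max_null_rate:.0%}"
        )
    return issues

# Hypothetical table snapshot and thresholds.
now = datetime(2024, 7, 3, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "customer_id": "c1", "loaded_at": now - timedelta(hours=30)},
    {"order_id": 2, "customer_id": None, "loaded_at": now - timedelta(hours=26)},
]
issues = check_table_health(
    rows, now, max_age=timedelta(hours=24), max_null_rate=0.1,
    required_field="customer_id",
)
```

Running a check like this on a schedule, and alerting when the list is not empty, is one way to notice stale or incomplete data before it reaches a dashboard.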

Master data management

Master data management (MDM) focuses on the creation of a single, high-quality view of core business entities including products, customers, employees and suppliers. By delivering accurate views of master data and their relationships, MDM enables faster insights, improved data quality and compliance readiness. With a single 360-degree view of master data across the enterprise, MDM gives businesses the right data to drive business analytics and to identify their most successful products and markets and their highest-value customers.
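
The idea of a single view can be sketched as merging duplicate customer records from several systems into one "golden record". The matching rule (normalized email) and the survivorship rule (most recently updated non-empty value wins) are assumptions chosen for this example; real MDM tools usually support far richer matching.

```python
from collections import defaultdict

def normalize_email(email):
    return email.strip().lower()

def build_golden_records(records):
    """Group source records by normalized email and merge each group.

    Survivorship rule (an assumption): for every field, the most recently
    updated non-empty value wins.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_email(rec["email"])].append(rec)

    golden = {}
    for key, group in groups.items():
        merged = {"email": key, "sources": sorted({r["source"] for r in group})}
        # Oldest first, so newer non-empty values overwrite older ones.
        for rec in sorted(group, key=lambda r: r["updated"]):
            for field in ("name", "phone"):
                if rec.get(field):
                    merged[field] = rec[field]
        golden[key] = merged
    return golden

records = [  # hypothetical records from three source systems
    {"source": "crm", "email": "Ada@Example.com ", "name": "Ada Lovelace",
     "phone": "", "updated": "2024-01-10"},
    {"source": "billing", "email": "ada@example.com", "name": "A. Lovelace",
     "phone": "555-0100", "updated": "2023-11-02"},
    {"source": "support", "email": "lin@example.com", "name": "Lin Chen",
     "phone": None, "updated": "2024-03-01"},
]
golden = build_golden_records(records)
```

Here the CRM and billing entries collapse into one customer whose name comes from the newer CRM record and whose phone number comes from billing, since the CRM value is empty.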