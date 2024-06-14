Before data flows into a data repository, it usually undergoes some processing. This includes transformations such as filtering, masking, and aggregation, which help integrate and standardize the data. Such processing is particularly important when the destination for the dataset is a relational database. This type of repository has a defined schema, so incoming data must align with it (matching column names and data types) before new data can update existing records.
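To make the transformations and the schema alignment concrete, here is a minimal sketch using only Python's standard library. The records, the `orders` table, and every function name are hypothetical and chosen for illustration. The masking step is a truncated, unsalted hash, which is pseudonymization at best and not a recommendation for production use.

```python
import hashlib
import sqlite3
from collections import defaultdict

# Hypothetical raw records; every value arrives as a string.
RAW_ORDERS = [
    {"order_id": "1", "email": "ana@example.com", "region": "EU", "amount": "19.90"},
    {"order_id": "2", "email": "bo@example.com", "region": "US", "amount": "5.00"},
    {"order_id": "3", "email": "ana@example.com", "region": "EU", "amount": "-1"},
    {"order_id": "4", "email": "cy@example.com", "region": "EU", "amount": "30.10"},
]


def mask_email(email: str) -> str:
    """Masking: replace the email with a short one-way SHA-256 digest."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def transform(rows):
    """Filter invalid rows, mask emails, and cast values to the target types."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount <= 0:  # filtering: drop non-positive amounts
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),          # TEXT -> INTEGER
            "customer_key": mask_email(row["email"]),  # raw email never stored
            "region": row["region"],
            "amount": amount,                          # TEXT -> REAL
        })
    return cleaned


def total_by_region(rows):
    """Aggregation: sum order amounts per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return {region: round(total, 2) for region, total in totals.items()}


def load(conn, rows):
    """Write rows whose columns and types match the table's schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer_key TEXT NOT NULL, "
        "region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    # INSERT OR REPLACE updates an existing row that has the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, customer_key, region, amount) "
        "VALUES (:order_id, :customer_key, :region, :amount)",
        rows,
    )
    conn.commit()
```

Calling `load(sqlite3.connect(":memory:"), transform(RAW_ORDERS))` twice leaves three rows in the table, because the second load replaces rows by primary key rather than duplicating them.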

As the name suggests, data pipelines act as the “piping” for data science projects or business intelligence dashboards. Data can be sourced from a wide variety of places (APIs, SQL and NoSQL databases, files, and so on), but unfortunately that data usually isn’t ready for immediate use. During sourcing, data lineage is tracked to document the relationship between enterprise data in various business and IT applications: for example, where the data currently resides and how it is stored in an environment, such as on-premises, in a data lake, or in a data warehouse.
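One simple way to picture lineage tracking during sourcing is to attach a small provenance record to every extracted row. The sketch below assumes two made-up sources, a CSV export and an API response held in memory, and a hypothetical `Lineage` record. Real deployments typically rely on dedicated lineage or catalog tooling rather than hand-rolled metadata like this.

```python
import csv
import io
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Lineage:
    source: str        # system or file the record came from
    location: str      # where it is stored, e.g. "on-premises" or "data lake"
    extracted_at: str  # ISO 8601 timestamp of the extraction


def with_lineage(records, source, location):
    """Wrap each record together with a description of where it came from."""
    stamp = datetime.now(timezone.utc).isoformat()
    lineage = Lineage(source, location, stamp)
    return [{"data": dict(record), "lineage": lineage} for record in records]


# Hypothetical payloads standing in for a file export and an API response.
CSV_EXPORT = "id,name\n1,Ana\n2,Bo\n"
API_RESPONSE = '[{"id": 3, "name": "Cy"}]'


def extract_all():
    """Read both sources and tag every row with its lineage."""
    csv_rows = csv.DictReader(io.StringIO(CSV_EXPORT))
    api_rows = json.loads(API_RESPONSE)
    return (
        with_lineage(csv_rows, "crm_export.csv", "on-premises")
        + with_lineage(api_rows, "customers_api", "data lake")
    )
```

Each element returned by `extract_all()` keeps the original fields under `"data"` and its origin under `"lineage"`, so later stages can still tell which source and storage environment a row came from.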

Data preparation tasks usually fall on the shoulders of data scientists or data engineers, who structure the data to meet the needs of business use cases and who handle large volumes of it. The type of data processing that a data pipeline requires is usually determined through a mix of exploratory data analysis and defined business requirements. Once the data has been appropriately filtered, merged, and summarized, it can then be stored and surfaced for use. Well-organized data pipelines provide the foundation for a range of data projects, including exploratory data analyses, data visualizations, and machine learning tasks.
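The merge and summarize steps mentioned above can be sketched in a few lines. The two datasets and the `merge` and `summarize` helpers are hypothetical, and the join shown is an inner join, which is only one of several reasonable choices.

```python
from statistics import mean

# Hypothetical inputs: a customer dimension and an order fact list.
CUSTOMERS = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "wholesale"},
]
ORDERS = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 100.0},
    {"customer_id": 3, "amount": 15.0},  # no matching customer record
]


def merge(orders, customers):
    """Inner join: keep only orders whose customer_id has a customer record."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**order, "segment": by_id[order["customer_id"]]["segment"]}
        for order in orders
        if order["customer_id"] in by_id
    ]


def summarize(rows):
    """Count orders and average the amount within each customer segment."""
    groups = {}
    for row in rows:
        groups.setdefault(row["segment"], []).append(row["amount"])
    return {
        segment: {"orders": len(amounts), "mean_amount": mean(amounts)}
        for segment, amounts in groups.items()
    }
```

The summary produced by `summarize(merge(ORDERS, CUSTOMERS))` is the kind of compact table that a dashboard or a model-training job might read, while the unmatched order is dropped by the join.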