Orchestration by using Apache Airflow

Apache Airflow is an open source platform that you can use to create, schedule, and monitor workflows. Workflows are defined as directed acyclic graphs (DAGs), which consist of multiple tasks that are written in Python. Each task represents a discrete unit of work, such as running a script, querying a database, or calling an API. The Airflow architecture supports scaling and parallel execution, which makes it suitable for managing complex, data-intensive pipelines.

Applies to:

Spark engine

Apache Airflow supports the following use cases:
  • ETL or ELT pipelines: Extracting data from various sources, transforming it, and loading it into a data warehouse.
  • Data warehousing: Scheduling regular updates and data transformations in a data warehouse.
  • Data processing: Orchestrating distributed data processing tasks across different systems.
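
To make the DAG idea concrete, the following is a minimal sketch of an ETL-style workflow in plain Python. It does not use the Airflow API. It relies only on the standard library module graphlib (Python 3.9 or later) to run tasks in dependency order, which is the same ordering guarantee that a DAG gives a scheduler. The task names (extract, transform, load), the sample data, and the run_pipeline helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter


# Hypothetical tasks: each one is a discrete unit of work.
def extract():
    # Stand-in for reading rows from a source system.
    return [1, 2, 3]


def transform(rows):
    # Stand-in for a data transformation step.
    return [row * 10 for row in rows]


def load(rows):
    # Stand-in for writing rows to a data warehouse.
    return f"loaded {len(rows)} rows"


tasks = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline():
    # static_order() yields every task after all of its upstream tasks
    # and raises CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        # Pass the outputs of the upstream tasks to the current task.
        upstream = [results[dep] for dep in sorted(dependencies[name])]
        results[name] = tasks[name](*upstream)
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline()
    print(order)
    print(results["load"])
```

In Airflow itself, the tasks would be declared inside a DAG definition and the scheduler would resolve the run order. This sketch only shows why the acyclic dependency graph makes that order well defined.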