Apache Airflow
Introduction
Agenda
●   What is Airflow
●   What is a workflow
●   Example of an Airflow workflow
●   Background and the world before Airflow
●   Purpose
●   Terminologies
●   Core Components
●   Usages
●   Demo
What is Airflow
● Apache Airflow is an open-source platform for programmatically
  authoring, scheduling, and monitoring workflows.
● It allows you to define and manage complex data pipelines as
  directed acyclic graphs (DAGs) of tasks and automate the process of
  creating and updating data pipelines. It provides a rich web-based
  interface for setting up, monitoring and managing workflow
  execution, and an API for triggering and monitoring workflows.
What is a Workflow?
● A sequence of tasks
● Started on a schedule or triggered by an event
● Frequently used to handle big data processing pipelines
Example of an Airflow workflow
1.   Download data
2.   Send data to processing
3.   Monitor processing
4.   Generate report
5.   Send email
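The five steps above can be sketched as plain Python functions run in order. This is only an illustration of the idea of a workflow, not Airflow code: the function names, the `state` dictionary, and the sample data are all hypothetical stand-ins for real downloads, processing jobs, and email calls.

```python
# Hypothetical stand-ins for the five steps; each one reads and updates
# a shared state dictionary instead of calling real external systems.
def download_data(state):
    state["raw"] = [3, 1, 2]  # pretend this came from a remote source
    return state

def send_to_processing(state):
    state["job"] = sorted(state["raw"])  # pretend processing step
    return state

def monitor_processing(state):
    state["status"] = "done" if "job" in state else "failed"
    return state

def generate_report(state):
    state["report"] = f"processed {len(state['job'])} records, status={state['status']}"
    return state

def send_email(state):
    state["email_sent"] = True  # a real task would send the report here
    return state

# The workflow is simply the ordered sequence of tasks.
WORKFLOW = [download_data, send_to_processing, monitor_processing,
            generate_report, send_email]

def run_workflow(steps):
    """Run each step in order, passing the state from one step to the next."""
    state = {}
    for step in steps:
        state = step(state)
    return state

result = run_workflow(WORKFLOW)
print(result["report"])
```

A plain list like this only supports a strict sequence; it has no scheduling, retries, or monitoring, which is the gap a workflow platform fills.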
Background
A developer wants to run a job on a schedule
 ● Cron job (Job scheduling)
 ● Python or bash script
 Start -> Extract data from data source A to storage B -> End
Background
Business demands more data extractions from various sources
Solution: develop more cron jobs
 1.   Start -> Extract data from data source A to storage B -> End
 2.   Start -> Extract data from data source C to storage D -> End
 ...
 n.   Start -> Extract data from data source E to storage F -> End
Challenges with cron jobs
●   Hard to scale
●   Hard to monitor
●   Hard to maintain
●   Hard to manage dependencies
●   Hard to manage job failures and timeouts
●   Hard to manage deployments
Airflow advantages
Developers can programmatically:
 ●   Author workflows
 ●   Schedule workflows
 ●   Monitor workflows
 ●   Debug
 ●   Scale easily
Airflow Terms
●   Task
     1.   Start -> Extract data from data source A to storage B -> End
     2.   Start -> Extract data from data source A to storage B -> End
Airflow DAG
● A DAG (Directed Acyclic Graph) is used to define a workflow as a series of
  tasks and how they interact with each other.
● Each task in a DAG represents a single operation in your workflow, such as
  running a query, sending an email, or uploading a file.
● The relationships between tasks are defined by dependencies, where one task
  can only run after another task has completed. Because the graph is acyclic,
  no task can depend, directly or indirectly, on itself.
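To make the dependency idea concrete, here is a minimal sketch that uses Python's standard-library `graphlib` module rather than Airflow itself. The task names are taken from the example workflow earlier in the deck; the `dependencies` mapping and the helper functions are assumptions made for illustration and are not part of Airflow's API.

```python
from graphlib import CycleError, TopologicalSorter

# Maps each task to the set of tasks that must complete before it can run.
dependencies = {
    "download_data": set(),
    "send_to_processing": {"download_data"},
    "monitor_processing": {"send_to_processing"},
    "generate_report": {"monitor_processing"},
    "send_email": {"generate_report"},
}

def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def is_acyclic(deps):
    """Return True if the dependency graph is a valid DAG (has no cycles)."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False

print(execution_order(dependencies))

# Making the first task wait for the last one creates a cycle,
# so the graph is no longer a DAG and no valid run order exists.
cyclic = dict(dependencies, download_data={"send_email"})
print(is_acyclic(cyclic))
```

The cycle check shows why the "acyclic" part matters: if tasks could depend on each other in a loop, there would be no task that is safe to start first.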
Airflow Core components
●   Webserver
●   Web UI
●   Scheduler
●   Workers
●   Metadata database
●   Task execution logs
Airflow usage
●   Run and automate ETL pipelines
●   Data ingestion pipelines
●   Machine learning pipelines
●   Predictive data pipelines
●   General purpose scheduling
Airflow architecture
● Scheduler: Triggers scheduled workflows and submits tasks to the executor
  to run
● Executor: Decides how and where queued tasks run and hands them to workers
● Worker: Runs the tasks
● Webserver: Serves the user interface
● Metadata database: Stores the state of DAGs and tasks
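How these components hand work to one another can be sketched with a toy model in plain Python. This is not how Airflow is implemented: the class names mirror the components above, but every method, the in-memory queue, and the state strings are simplifying assumptions chosen for illustration.

```python
from collections import deque

class MetadataDB:
    """Toy stand-in for the metadata database: records each task's state."""
    def __init__(self):
        self.task_states = {}

    def set_state(self, task_id, state):
        self.task_states[task_id] = state

class Executor:
    """Toy executor: queues submitted tasks and runs them one at a time."""
    def __init__(self, db):
        self.db = db
        self.queue = deque()

    def submit(self, task_id, fn):
        self.db.set_state(task_id, "queued")
        self.queue.append((task_id, fn))

    def run_queued(self):
        while self.queue:
            task_id, fn = self.queue.popleft()
            self.db.set_state(task_id, "running")
            try:
                fn()  # this call plays the role of the worker
                self.db.set_state(task_id, "success")
            except Exception:
                self.db.set_state(task_id, "failed")

class Scheduler:
    """Toy scheduler: submits the tasks of a triggered workflow to the executor."""
    def __init__(self, executor):
        self.executor = executor

    def trigger(self, tasks):
        for task_id, fn in tasks:
            self.executor.submit(task_id, fn)
        self.executor.run_queued()

def failing_task():
    raise RuntimeError("simulated failure")

db = MetadataDB()
scheduler = Scheduler(Executor(db))
scheduler.trigger([("extract", lambda: None), ("load", failing_task)])
print(db.task_states)
```

In this model the web UI would only need to read `db.task_states`, which is why the metadata database sits at the center of the architecture: every other component reports to it or reads from it.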