Apache Airflow - A Python Hands-On Guide
1. Writing a DAG in Python
A Directed Acyclic Graph (DAG) is the core abstraction in Airflow. It defines the
workflow and task dependencies.
Basic DAG Structure
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
# Define default arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}
# Initialize DAG
with DAG(
    dag_id='example_dag',
    default_args=default_args,
    description='A simple example DAG',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    def print_hello():
        print("Hello, Airflow!")

    task = PythonOperator(
        task_id='print_hello',
        python_callable=print_hello,
    )
Key Notes:
DAG : The container for your workflow.
PythonOperator : Executes Python functions.
schedule_interval : Defines the schedule (e.g., @daily, @hourly).
catchup : When set to False, prevents Airflow from backfilling runs for past dates between start_date and now.
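Additional tasks go inside the same with DAG(...) as dag: block, and dependencies between them are declared with the >> bitshift operator. A minimal sketch extending the example above (the second task and its callable are illustrative additions):

    def print_goodbye():
        print("Goodbye, Airflow!")

    task_2 = PythonOperator(
        task_id='print_goodbye',
        python_callable=print_goodbye,
    )

    task >> task_2  # print_hello runs before print_goodbye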
2. Options for Python Operators
Airflow provides several operators for executing Python code directly:
PythonOperator
Executes Python callables.
PythonOperator(
    task_id='process_data',
    python_callable=process_data_function,
    op_kwargs={'param': 'value'},  # Pass arguments to the callable
)
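The keys in op_kwargs are passed to the callable as keyword arguments. A minimal sketch of what process_data_function might look like (the function body is illustrative):

def process_data_function(param):
    # 'param' receives the value supplied via op_kwargs
    print(f"Processing data with param={param}")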
BranchPythonOperator
Allows branching based on a condition.
from airflow.operators.python import BranchPythonOperator

def choose_branch(**kwargs):
    # Return the task_id of the branch to follow
    return 'branch_1' if kwargs['some_condition'] else 'branch_2'

branch_task = BranchPythonOperator(
    task_id='branching',
    python_callable=choose_branch,
    op_kwargs={'some_condition': True},  # merged into kwargs; the Airflow context is passed automatically
)
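The string returned by the callable must match the task_id of a task directly downstream of the branch operator; the branch that is not selected is skipped. A sketch of the wiring, assuming branch_1 and branch_2 are tasks defined elsewhere in the DAG:

branch_task >> [branch_1, branch_2]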
PythonVirtualenvOperator
Executes Python code within a virtual environment.
from airflow.operators.python import PythonVirtualenvOperator

def venv_callable():
    # Must be a self-contained function: its source is serialized and
    # executed inside the freshly created virtual environment
    print("Running in a virtualenv!")

virtualenv_task = PythonVirtualenvOperator(
    task_id='venv_task',
    python_callable=venv_callable,
    requirements=["numpy", "pandas"],
    system_site_packages=False,
)
3. Working with Providers
Providers are integrations for various platforms. Here's how to use five popular ones
with Python:
1. AWS Provider
Install: pip install apache-airflow-providers-amazon
Example: S3 File Upload
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

upload_task = S3CreateObjectOperator(
    task_id='upload_to_s3',
    aws_conn_id='my_aws_conn',
    s3_bucket='my_bucket',
    s3_key='path/to/file.txt',
    data="Sample Data",
)
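To consume the uploaded object later in the workflow, the same provider also ships an S3Hook that can be called from a PythonOperator. A hedged sketch (the task name and callable are illustrative):

from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def read_from_s3():
    # Read back the object uploaded above; read_key returns its content as a string
    hook = S3Hook(aws_conn_id='my_aws_conn')
    content = hook.read_key(key='path/to/file.txt', bucket_name='my_bucket')
    print(content)

read_task = PythonOperator(
    task_id='read_from_s3',
    python_callable=read_from_s3,
)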
2. Google Cloud Provider
Install: pip install apache-airflow-providers-google
Example: BigQuery Query Execution
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

bq_task = BigQueryExecuteQueryOperator(
    task_id='bq_query',
    sql='SELECT * FROM my_dataset.my_table',
    gcp_conn_id='my_gcp_conn',
    use_legacy_sql=False,
)
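Note that newer releases of the Google provider deprecate BigQueryExecuteQueryOperator in favour of BigQueryInsertJobOperator. A roughly equivalent sketch, assuming the same connection and dataset:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

bq_job_task = BigQueryInsertJobOperator(
    task_id='bq_insert_job',
    gcp_conn_id='my_gcp_conn',
    configuration={
        "query": {
            "query": "SELECT * FROM my_dataset.my_table",
            "useLegacySql": False,
        }
    },
)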
3. PostgreSQL Provider
Install: pip install apache-airflow-providers-postgres
Example: Run SQL on PostgreSQL
from airflow.providers.postgres.operators.postgres import PostgresOperator

sql_task = PostgresOperator(
    task_id='run_postgres_query',
    postgres_conn_id='my_postgres_conn',
    sql='SELECT * FROM my_table;',
)
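PostgresOperator also accepts a parameters argument, which is handed to the database driver so values are not interpolated into the SQL string by hand. A hedged sketch (the task name and values are illustrative):

param_task = PostgresOperator(
    task_id='run_parametrized_query',
    postgres_conn_id='my_postgres_conn',
    sql='SELECT * FROM my_table WHERE id = %(id)s;',
    parameters={'id': 42},
)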
4. Slack Provider
Install: pip install apache-airflow-providers-slack
Example: Send Slack Notification
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

slack_task = SlackWebhookOperator(
    task_id='send_slack_message',
    http_conn_id='slack_conn',
    message="Workflow completed successfully!",
    channel="#alerts",
)
5. MySQL Provider
Install: pip install apache-airflow-providers-mysql
Example: Execute SQL in MySQL
from airflow.providers.mysql.operators.mysql import MySqlOperator
mysql_task = MySqlOperator(
    task_id='mysql_query',
    mysql_conn_id='my_mysql_conn',
    sql='INSERT INTO my_table (id, value) VALUES (1, "test");',
)
4. Python Cheatsheet for Apache Airflow
Component            Python Example
DAG Initialization   DAG(dag_id='my_dag', schedule_interval='@daily', ...)
PythonOperator       PythonOperator(task_id='task', python_callable=my_func)
Branching            BranchPythonOperator(task_id='branch', python_callable=my_func)
S3 Upload            S3CreateObjectOperator(..., s3_key='path/to/file.txt')
SQL Execution        PostgresOperator(sql='SELECT * FROM table;')
BigQuery             BigQueryExecuteQueryOperator(sql='SELECT * FROM table')
Slack Notification   SlackWebhookOperator(message="Job done!")
Virtualenv           PythonVirtualenvOperator(python_callable=my_func, ...)
5. Working with Apache Spark in Airflow
Apache Spark is a distributed data processing framework widely used for big data
tasks. In Airflow, we can manage and orchestrate Spark jobs using operators such as:
SparkSubmitOperator : Submits a Spark job directly to a cluster.
EmrAddStepsOperator : Submits a Spark job to an Amazon EMR cluster.
DataprocSubmitJobOperator : Submits a Spark job to Google Dataproc.
These operators allow us to control Spark jobs programmatically within Airflow
workflows.
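For comparison with the SparkSubmitOperator walkthrough below, here is a hedged sketch of submitting a Spark step to an existing EMR cluster with EmrAddStepsOperator (the cluster ID, bucket path, and step definition are placeholder values):

from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

spark_step = [
    {
        "Name": "my_spark_step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://my-bucket/spark_job.py"],
        },
    }
]

emr_task = EmrAddStepsOperator(
    task_id='add_emr_step',
    job_flow_id='j-XXXXXXXXXXXX',  # placeholder EMR cluster ID
    aws_conn_id='my_aws_conn',
    steps=spark_step,
)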
Complex Workflow: Conditional Spark Job Execution
Workflow Logic:
1. Execute spark_job_1 .
2. Execute spark_job_2 .
3. If spark_job_2 fails, run spark_job_3 .
4. If spark_job_2 succeeds, run spark_job_4 .
DAG Implementation
Step 1: Import Required Modules
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
Step 2: Define the DAG and Default Arguments
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'retries': 1,
}

dag = DAG(
    dag_id='spark_conditional_jobs',
    default_args=default_args,
    description='A DAG with conditional Spark job execution',
    schedule_interval=None,
    start_date=datetime(2023, 12, 1),
    catchup=False,
)
Step 3: Define the SparkSubmitOperator Jobs
# Spark Job 1
spark_job_1 = SparkSubmitOperator(
    task_id='spark_job_1',
    application='/path/to/spark_job_1.py',
    conn_id='spark_default',  # Connection to your Spark cluster
    application_args=['arg1', 'arg2'],
    dag=dag,
)

# Spark Job 2
spark_job_2 = SparkSubmitOperator(
    task_id='spark_job_2',
    application='/path/to/spark_job_2.py',
    conn_id='spark_default',
    application_args=['arg1', 'arg2'],
    dag=dag,
)

# Spark Job 3
spark_job_3 = SparkSubmitOperator(
    task_id='spark_job_3',
    application='/path/to/spark_job_3.py',
    conn_id='spark_default',
    application_args=['arg1', 'arg2'],
    dag=dag,
)

# Spark Job 4
spark_job_4 = SparkSubmitOperator(
    task_id='spark_job_4',
    application='/path/to/spark_job_4.py',
    conn_id='spark_default',
    application_args=['arg1', 'arg2'],
    dag=dag,
)
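SparkSubmitOperator also exposes common spark-submit tuning options such as executor memory, executor count, and arbitrary Spark configuration. A hedged sketch (the task name and values are illustrative):

spark_job_tuned = SparkSubmitOperator(
    task_id='spark_job_tuned',
    application='/path/to/spark_job_1.py',
    conn_id='spark_default',
    executor_memory='4g',
    num_executors=4,
    conf={'spark.sql.shuffle.partitions': '200'},
    dag=dag,
)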
Step 4: Branch Logic Using BranchPythonOperator
def choose_next_task(**kwargs):
    # Check the final state of spark_job_2 in the current DAG run
    dag_run = kwargs['dag_run']
    spark_job_2_state = dag_run.get_task_instance('spark_job_2').state
    # Return the task_id to execute next
    if spark_job_2_state == 'failed':
        return 'spark_job_3'
    return 'spark_job_4'
branch_task = BranchPythonOperator(
    task_id='branch_task',
    python_callable=choose_next_task,
    trigger_rule='all_done',  # run even if spark_job_2 fails
    dag=dag,
)
Step 5: Define the Task Dependencies
start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', trigger_rule='none_failed_min_one_success', dag=dag)  # run after whichever branch executes
# Define dependencies
start >> spark_job_1
spark_job_1 >> spark_job_2
spark_job_2 >> branch_task
branch_task >> spark_job_3 >> end
branch_task >> spark_job_4 >> end
Explanation of Components
1. SparkSubmitOperator :
Used to submit Spark jobs to a cluster.
Specify the application path, connection ID, and arguments for the Spark job.
2. BranchPythonOperator :
Dynamically determines the next task based on the state of a previous task.
In this case, it checks whether spark_job_2 succeeded or failed; setting trigger_rule='all_done' lets the branch task run even when spark_job_2 fails.
3. Dependencies:
The workflow ensures sequential execution from spark_job_1 to
spark_job_2 and conditional branching to spark_job_3 or spark_job_4 .
Python Cheatsheet for Spark in Airflow
Component              Python Example
SparkSubmitOperator    SparkSubmitOperator(application='/path/app.py', ...)
BranchPythonOperator   BranchPythonOperator(python_callable=my_func, ...)
Conditional Execution  branch_task >> task_1 >> end or branch_task >> task_2
Task Dependency        task_1 >> task_2 >> task_3