3.3 Airflow infrastructure
Before Data Scientists and Machine Learning Engineers can utilize the power of Airflow workflows, Airflow obviously needs to be set up and deployed. There are multiple ways to deploy Airflow: it can run either on a single machine or in a distributed setup on a cluster of machines. As stated in the prerequisites, a local Airflow setup on a single machine is sufficient for this tutorial and serves as an introduction to working with Airflow. In production, however, Airflow should be deployed as a distributed system to utilize its full power.
Airflow can be configured by modifying the airflow.cfg file or by using environment variables. The airflow.cfg file is a key component, as it contains the settings and parameters that control the behavior of the Airflow system. Commonly configured settings include those that define the core behavior of Airflow, as well as settings for the executor, logging, security, or the scheduler. All available options are listed in Airflow's configuration reference. The airflow.cfg file is usually located in the Airflow home directory ($AIRFLOW_HOME, which defaults to ~/airflow).
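As a minimal, illustrative sketch (the paths and values below are only examples, not recommendations), a setting such as load_examples in the [core] section can be changed directly in airflow.cfg:

[core]
# folder the scheduler scans for DAG files
dags_folder = /home/user/airflow/dags
# do not load the bundled example DAGs
load_examples = False

Environment variables follow the pattern AIRFLOW__{SECTION}__{KEY} and take precedence over the file, so the same change can also be made with:

export AIRFLOW__CORE__LOAD_EXAMPLES=False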
3.3.1 Airflow as a distributed system
Airflow consists of several separate architectural components. While this separation is only somewhat simulated on a local deployment, each component can be set up separately when deploying Airflow in a distributed manner. A distributed architecture brings benefits in availability, security, reliability, and scalability.
An Airflow deployment generally consists of five different components:
- Scheduler: The scheduler handles triggering scheduled workflows and submitting tasks to the executor to run.
- Executor: The executor handles running the tasks themselves. In a local installation of Airflow, the tasks are run by the executor itself. In a production-ready deployment of Airflow, the executor pushes task execution to separate worker instances, which then run the tasks.
- Webserver: The webserver provides the web user interface of Airflow, which allows users to inspect, trigger, and debug DAGs and tasks.
- DAG Directory: The DAG directory contains the DAG files, which are read by the scheduler and executor.
- Metadata Database: The metadata database stores the state of the scheduler, executor, and webserver, such as scheduling and runtime information, user settings, or XCom data.
The following graph shows how these components make up the Airflow architecture.
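On a local installation, this separation becomes tangible when the components are started as individual processes. The commands below are a rough sketch for Airflow 2.x (assuming Airflow is already installed and AIRFLOW_HOME is set); in a distributed deployment, each of them would typically run on its own machine or pod.

airflow db init                  # create the metadata database tables
airflow scheduler                # start the scheduler, which parses the DAG directory
airflow webserver --port 8080    # start the web user interface
# with the CeleryExecutor, workers are started separately, e.g.:
# airflow celery worker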
3.3.2 Scheduler
The scheduler is basically the brain and heart of Airflow. It handles the triggering and scheduling of workflows, as well as submitting tasks to the executor to run. To do this, the scheduler is responsible for parsing the DAG files from the DAG directory, managing the states in the metadata database, and communicating with the executor to schedule tasks. Since the release of Airflow 2.0, it is possible to run multiple schedulers at a time to ensure high availability and reliability of this centerpiece of Airflow.
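To make this more concrete, the following is a minimal, hypothetical DAG file (written for Airflow 2.x; the DAG and task names are made up) that the scheduler would pick up from the DAG directory and trigger once per day:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# The scheduler parses this file from the DAG directory and creates
# a new DAG run every day after the start_date.
with DAG(
    dag_id="example_daily_dag",      # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",      # newer Airflow versions use the schedule argument
    catchup=False,                   # do not backfill runs for past dates
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'triggered by the Airflow scheduler'",
    )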
3.3.3 Webserver
The webserver runs the web interface of Airflow and thus the user interface every Airflow user sees. It allows users to inspect, trigger, and debug DAGs and tasks (and much more!). Each user interaction and change is written to the DAG directory or the metadata database, from where the scheduler reads and acts on it.
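Since the web interface requires a login by default, a first admin account can be created via the command line; the user details below are placeholders.

airflow users create \
    --username admin \
    --firstname Jane \
    --lastname Doe \
    --role Admin \
    --email admin@example.com \
    --password admin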
3.3.4 Executor
The executor defines where and how the Airflow tasks are executed and run. This crucial component of Airflow can be configured by the user and should be chosen to fit the user's specific needs. There are several different executors, each handling the run of a task a bit differently. Choosing the right executor also depends on the underlying infrastructure Airflow is built upon.
In a production-ready deployment of Airflow, the executor pushes task execution to separate worker instances that run the tasks. This allows for different setups, such as Celery-based executors or an executor based on Kubernetes. The benefit of a distributed deployment is its reliability and availability, as it is possible to have many workers in different places (for example on separate virtual machines or in multiple Kubernetes pods). It is furthermore possible to run tasks on different instances based on their needs, for example to run the training step of a machine learning model on a GPU node (a sketch of such routing follows the list below).
- The SequentialExecutor is the default executor in Airflow. This executor only runs one task instance at a time and should therefore not be used in a production setting.
- The LocalExecutor executes each task in a separate process on a single machine. It is the only non-distributed executor that is production-ready and works well in relatively small deployments. If Airflow is installed locally as mentioned in the prerequisites, this executor will be used.
- In contrast, the CeleryExecutor uses the Celery queue system under the hood, which allows users to deploy multiple workers that read tasks from a broker queue (Redis or RabbitMQ) to which the scheduler sends the tasks. This enables Airflow to distribute tasks across many machines and lets users specify which task should be executed where. This is useful for routing compute-heavy tasks to more resourceful workers and makes it the most popular production executor.
- The KubernetesExecutor is another widely used production-ready executor and works similarly to the CeleryExecutor. As the name suggests, it requires an underlying Kubernetes cluster, which enables Airflow to spawn a new pod to run each task. Even though this is a robust way to handle machine or pod failures, the additional overhead of creating pods or even nodes can be problematic for short-running tasks.
- The CeleryKubernetesExecutor combines the CeleryExecutor and the KubernetesExecutor, as the name suggests. It allows users to decide whether a particular task should be executed on Kubernetes or routed to the Celery workers. This way, users can take full advantage of the horizontal autoscaling of worker pods and delegate computationally heavy tasks to Kubernetes.
- The additional DebugExecutor is an executor whose main purpose is to debug DAGs locally. It is the only executor that uses a single process to execute all tasks.
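As a sketch of the routing mentioned above, with the CeleryExecutor a task can be pinned to a named queue; the queue name, DAG, and worker setup below are purely illustrative assumptions, not a prescribed configuration.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_queue_routing",   # hypothetical DAG for illustration
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,           # only triggered manually
    catchup=False,
) as dag:
    # Runs on any worker listening to the default queue.
    preprocess = BashOperator(
        task_id="preprocess",
        bash_command="echo 'preprocessing on a regular worker'",
    )
    # Routed to workers started with `airflow celery worker --queues gpu`,
    # e.g. workers running on GPU nodes.
    train = BashOperator(
        task_id="train_model",
        bash_command="echo 'training on a GPU worker'",
        queue="gpu",
    )
    preprocess >> train

A Celery worker only picks up tasks from the queues it subscribes to, which is how compute-heavy work can be delegated to more powerful machines.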
Examining which executor is currently set can be done by running the following command.
airflow config get-value core executor
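To switch the executor, the corresponding value can be set in the [core] section of airflow.cfg or, equivalently, via an environment variable, for example:

export AIRFLOW__CORE__EXECUTOR=LocalExecutor
airflow config get-value core executor    # should now print LocalExecutor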
3.3.5 DAG Directory
The DAG directory contains the DAG files written in Python. The DAG directory is read and used by each Airflow component for a different purpose. The web interface lists all written DAGs from the directory as well as their content. The scheduler and executor run a DAG or a task based on the input read from the DAG directory.
The DAG directory can take different forms. It can be a local folder in the case of a local installation, or the DAG files can be stored in a separate repository such as Git and synchronized into the directory. The scheduler recurses through the DAG directory, so it is also possible to create subfolders, for example based on different projects.
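A purely illustrative layout of such a DAG directory with project-based subfolders could look as follows (folder and file names are made up); the location of this folder is controlled by the dags_folder setting in the [core] section of airflow.cfg.

$AIRFLOW_HOME/dags/
├── project_a/
│   ├── ingest_dag.py
│   └── training_dag.py
└── project_b/
    └── reporting_dag.py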
3.3.6 Metadata Database
The metadata database stores the state of the scheduler, executor, and webserver, such as scheduling and runtime information, user settings, or XCom data. It is beneficial to run the metadata database as a separate component to keep all data safe and secure in case other parts of the infrastructure fail. Such considerations drive the architectural decisions for an Airflow deployment and for the metadata database itself. For example, it is possible to run Airflow and all of its components on a Kubernetes cluster. However, this is not necessarily recommended, as the metadata then depends on the cluster and its accessibility. It is also possible to outsource Airflow components such as the metadata database to a cloud provider's managed database services, for example by storing all metadata in the Relational Database Service (RDS) on AWS.
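Connecting Airflow to such an external database is done via its database connection string; the host and credentials below are placeholders, and depending on the Airflow version the setting is found in the [database] or the [core] section of airflow.cfg.

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:<password>@my-rds-instance.abc123.eu-central-1.rds.amazonaws.com:5432/airflow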