8.3 Components

In this configuration, several custom modules are in place, including Airflow, MLflow, JupyterHub, and monitoring. Each module has a distinct role in deploying specific workflow tools.

Additionally, these module names correspond to their respective namespaces within the cluster.

8.3.1 User Profiles

The code within the "user-profiles" module configures IAM (Identity and Access Management) policies and roles for the ML (Machine Learning) platform. It creates AWS IAM user profiles, each linked to a specific user and their corresponding access keys. Users are granted access according to policies designated as either user or developer access levels. All pertinent information is stored securely in AWS Secrets Manager. The list of users is provided to the module through the var.profiles variable from the root module.
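
The exact shape of var.profiles is not shown in this excerpt. The sketch below illustrates, under assumptions, how the variable and an example value might look: a map of objects with username, email, and role attributes; the concrete users are hypothetical.

variable "profiles" {
  description = "User profiles to provision (illustrative shape)"
  type = map(object({
    username = string
    email    = string
    role     = string # either "Developer" or "User"
  }))
}

# Example value passed from the root module (hypothetical users):
# profiles = {
#   "jane.doe"   = { username = "jane.doe", email = "jane.doe@example.com", role = "Developer" }
#   "john.smith" = { username = "john.smith", email = "john.smith@example.com", role = "User" }
# }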

The code first defines data sources to retrieve the AWS caller's identity and the current AWS region. These data sources, aws_caller_identity and aws_region, supply account and region details that are used in subsequent configurations.

The code also references the AWS managed policy "AmazonSageMakerFullAccess" through the aws_iam_policy data source. The policy is identified by its ARN (Amazon Resource Name) and grants full access to Amazon SageMaker services.

data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

data "aws_iam_policy" "AmazonSageMakerFullAccess" {
  arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

An IAM policy named "mlplatform-developer-access-policy" is created for platform developers, granting them full access to EKS, EC2, S3, RDS, and VPC. The policy itself is defined in a JSON file at the specified file path.

Similarly, an IAM policy named "mlplatform-user-access-policy" is created for platform users, authorizing them to use Amazon SageMaker services. This policy is likewise defined in a JSON file at the specified path and provides users with the access privileges they need.

resource "aws_iam_policy" "mlplatform_developer_access_policy" {
  name        = "mlplatform-developer-access-policy"
  description = "Access for platform developers granting them full EKS, EC2, S3, RDS, VPC access"

  policy = file("${path.module}/access_policies/AccessPolicyDeveloper.json")
}

resource "aws_iam_policy" "mlplatform_user_access_policy" {
  name        = "mlplatform-user-access-policy"
  description = "Access for platform users granting them access to Sagemaker"

  policy = file("${path.module}/access_policies/AccessPolicyUser.json")
}

The "aws-profiles" module is instantiated for each profile in the var.profiles variable. The module creates the IAM users, IAM roles, and AWS Secrets Manager secrets, and links the previously defined developer and user access policies to the corresponding roles.

module "aws-profiles" {
  for_each = var.profiles
  source   = "./aws-profiles"
  profile  = each.value

  access_policy_developer = aws_iam_policy.mlplatform_developer_access_policy.arn
  access_policy_user      = aws_iam_policy.mlplatform_user_access_policy.arn
}

Finally, the "AmazonSageMakerFullAccess" policy is attached to the IAM roles used by platform users. The code iterates over local.user_user_access_auth_list, a list of the IAM role names designated for users, and attaches the SageMaker access policy to each of them.

# Add additional policies to ALL users
resource "aws_iam_role_policy_attachment" "sagemaker_access_user_role_policy" {
  for_each = toset(local.user_user_access_auth_list)

  role       = each.value
  policy_arn = data.aws_iam_policy.AmazonSageMakerFullAccess.arn

  depends_on = [module.aws-profiles]
}
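
The construction of local.user_user_access_auth_list is not shown in this excerpt. The sketch below illustrates, under assumptions, how such a list could be assembled from the aws-profiles module outputs; the referenced output names (user_access_role_name, role) are hypothetical and may differ in the actual module.

locals {
  # Collect the IAM role names of all profiles that were created with the "User" role.
  user_user_access_auth_list = [
    for profile in module.aws-profiles : profile.user_access_role_name
    if profile.role == "User"
  ]
}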

8.3.1.1 AWS Profiles

The aws-profiles module, invoked from within the user-profiles module, creates the IAM users, roles, access keys, and Secrets Manager secrets for each user profile of the ML platform. It manages IAM permissions and stores secrets, ensuring that platform users and developers can access AWS resources securely.

First, local variables are derived from var.profile: the username is split into firstName and lastName, the role is extracted, and a username for the IAM user is assembled. As in the preceding snippet, data sources fetch the AWS caller's identity and the current AWS region.

locals {
  firstName = split(".", var.profile.username)[0]
  lastName  = split(".", var.profile.username)[1]
  role      = var.profile.role
  username  = "${local.firstName}-${local.lastName}"
}

data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

Next, an IAM user is created with the previously constructed username and placed at the root path ("/") in IAM. An IAM access key is also created for this user, enabling programmatic access to AWS resources.

resource "aws_iam_user" "this" {
  name = local.username
  path = "/"
}

resource "aws_iam_access_key" "this" {
  user = aws_iam_user.this.name
}

Following this, an IAM role is generated and assigned the name "mlplatform-access-${local.firstName}-${local.lastName}". This role is configured with a maximum session duration of 28,800 seconds, equivalent to 8 hours. The assume_role_policy is defined to grant the IAM user the ability to assume this role, and it also authorizes Amazon S3 to assume this role. This is commonly employed to facilitate access to S3 buckets.

resource "aws_iam_role" "user_access_role" {
  name                 = "mlplatform-access-${local.firstName}-${local.lastName}"
  max_session_duration = 28800

  assume_role_policy = <<EOF
  {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::${data.aws_caller_identity.current.account_id}:user/${aws_iam_user.this.name}"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
              "Service": "s3.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
  }
  EOF
  #   # tags = {
  #   #   tag-key = "tag-value"
  #   # }
}

The code then conditionally attaches one of two policies based on the role specified in var.profile: if the role is "Developer", the access_policy_developer is attached to the IAM role; if the role is "User", the access_policy_user is attached instead. This ensures that each IAM role receives the level of access appropriate to the user's role designation.

resource "aws_iam_role_policy_attachment" "role_attachement_policy_developer" {
  count      = local.role == "Developer" ? 1 : 0
  role       = aws_iam_role.user_access_role.name
  policy_arn = var.access_policy_developer
}

resource "aws_iam_role_policy_attachment" "role_attachement_policy_user" {
  count      = local.role == "User" ? 1 : 0
  role       = aws_iam_role.user_access_role.name
  policy_arn = var.access_policy_user
}

Finally, an AWS Secrets Manager secret is created, named after the username in var.profile. A secret version is populated with the access key credentials, username, email, role, and IAM role ARN, providing secure storage and retrieval of the user-specific data needed for authentication and access control.

resource "aws_secretsmanager_secret" "this" {
  name                    = var.profile.username
  recovery_window_in_days = 0
}

resource "aws_secretsmanager_secret_version" "this" {
  secret_id = aws_secretsmanager_secret.this.id
  secret_string = jsonencode(
    {
      "ACCESS_KEY_ID" : aws_iam_access_key.this.id,
      "SECRET_ACCESS_KEY" : aws_iam_access_key.this.secret
      "username" : var.profile.username
      "email" : var.profile.email
      "role" : local.role
      "firstName" : local.firstName
      "lastName" : local.lastName
      "AWS_role" : aws_iam_role.user_access_role.arn
  })
}

8.3.2 Airflow

The Airflow module provisions all components related to the deployment of Airflow. As the central workflow orchestration tool of our ML platform, Airflow is tightly integrated with other components of the Terraform codebase and therefore receives multiple input variables and configurations. The module covers IAM roles, data storage, an RDS database, and the Helm release for Apache Airflow.

Airflow itself is deployed through a Helm chart. The code also integrates the deployment with AWS S3 for data storage and logging, uses an AWS RDS instance from the infrastructure section as the metadata store, and creates the Kubernetes secrets required for a secure deployment.
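
To illustrate how these inputs flow into the module, a hypothetical invocation from the root module might look as follows (not all inputs are shown). The variable names mirror those used inside the module, while the right-hand values are placeholders referring to assumed root-module resources such as module.vpc, module.eks, and module.mlflow.

module "airflow" {
  source = "./modules/airflow" # hypothetical path

  name          = "airflow"
  namespace     = "airflow"
  name_prefix   = var.name_prefix
  domain_name   = var.domain_name
  domain_suffix = "airflow"

  # Networking and metadata-store inputs, assumed to come from the
  # infrastructure modules of the root configuration.
  vpc_id                      = module.vpc.vpc_id
  private_subnets             = module.vpc.private_subnets
  private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks
  rds_port                    = 5432
  rds_name                    = "airflow"

  # Integration with the cluster's OIDC provider and the MLflow bucket policy.
  oidc_provider_arn           = module.eks.oidc_provider_arn
  s3_mlflow_bucket_policy_arn = module.mlflow.s3_mlflow_bucket_policy_arn

  # Git and authentication settings.
  git_repository_url = var.git_repository_url
  git_username       = var.git_username
  git_token          = var.git_token
  git_client_id      = var.git_client_id
  git_client_secret  = var.git_client_secret
  fernet_key         = var.fernet_key
}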

The code begins by declaring several local variables for prefixes, secret names, and variable lists, which keep naming and configuration consistent throughout the module. The data sources "aws_caller_identity" and "aws_region" are defined as well, and a dedicated Kubernetes namespace is created to isolate the Airflow resources within the cluster.

locals {
  prefix                       = "${var.name_prefix}-${var.namespace}"
  k8s_airflow_db_secret_name   = "${local.prefix}-db-auth"
  git_airflow_repo_secret_name = "${local.prefix}-https-git-secret"
  git_organization_secret_name = "${local.prefix}-organization-git-secret"
  s3_data_bucket_secret_name   = "${var.namespace}-${var.s3_data_bucket_secret_name}"
  s3_data_bucket_name          = "${local.prefix}-${var.s3_data_bucket_name}"

  airflow_variable_list_addition = [
    {
      key   = "s3_access_name"
      value = "${local.s3_data_bucket_secret_name}"
    }
  ]
  airflow_variable_list_full = concat(var.airflow_variable_list, local.airflow_variable_list_addition)
}

data "aws_caller_identity" "current" {}
data "aws_region" "current" {} #

resource "kubernetes_namespace" "airflow" {
  metadata {
    name = var.namespace
  }
}

Next, a custom Terraform module named "iam-service-account" configures the IAM role and policies for the Kubernetes service account used by Airflow. This includes enabling role assumption through the cluster's OIDC provider and attaching an IAM policy that grants access to the MLflow S3 bucket. These roles and policies form the basis for permissions and access control across the deployment.

module "iam-service-account" {
  source                      = "./iam-service-account"
  namespace                   = var.namespace
  oidc_provider_arn           = var.oidc_provider_arn
  s3_mlflow_bucket_policy_arn = var.s3_mlflow_bucket_policy_arn
}
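
The internals of the iam-service-account module are not shown here. The following is a minimal sketch of what it plausibly contains: an IAM role assumable through the cluster's OIDC provider, the MLflow bucket policy attached to it, and an output exposing the role ARN (the output name matches the reference used later in the Helm values). Resource names are illustrative.

resource "aws_iam_role" "airflow_service_account" {
  name = "${var.namespace}-service-account-role"

  # Trust policy allowing the EKS OIDC provider to issue credentials.
  # A production setup would additionally add a Condition restricting the
  # "sub" claim to system:serviceaccount:<namespace>:airflow-sa.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = var.oidc_provider_arn }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "airflow_mlflow_s3" {
  role       = aws_iam_role.airflow_service_account.name
  policy_arn = var.s3_mlflow_bucket_policy_arn
}

output "airflow_service_account_role_arn" {
  value = aws_iam_role.airflow_service_account.arn
}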

A second custom module, "s3-data-storage", configures the data storage. It defines an S3 bucket together with its name, secret names, and access policies, allowing Airflow users to store data such as training images securely.

module "s3-data-storage" {
  source                      = "./data-storage"
  namespace                   = var.namespace
  s3_data_bucket_name         = local.s3_data_bucket_name
  s3_data_bucket_secret_name  = local.s3_data_bucket_secret_name
  s3_mlflow_bucket_policy_arn = var.s3_mlflow_bucket_policy_arn
  s3_force_destroy            = true
}
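
The s3-data-storage module itself is likewise not listed. A minimal sketch under assumptions is shown below: it presumably creates the data bucket and a Kubernetes secret holding the bucket name so that DAGs can resolve it through the s3_access_name Airflow variable. Resource names and the secret layout are assumptions.

resource "aws_s3_bucket" "data" {
  bucket        = var.s3_data_bucket_name
  force_destroy = var.s3_force_destroy
}

resource "kubernetes_secret" "data_bucket" {
  metadata {
    name      = var.s3_data_bucket_secret_name
    namespace = var.namespace
  }
  data = {
    "S3_BUCKET_NAME" = aws_s3_bucket.data.bucket
  }
}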

Before Airflow can be deployed, the code creates several Kubernetes secrets. They protect sensitive information and provide secure access and configuration for Airflow, covering database credentials, Git authentication credentials, AWS account information, and SageMaker access data.

resource "kubernetes_secret" "airflow_db_credentials" {
  metadata {
    name      = local.k8s_airflow_db_secret_name
    namespace = helm_release.airflow.namespace
  }
  data = {
    "postgresql-password" = module.rds-airflow.rds_password
  }
}

resource "kubernetes_secret" "airflow_https_git_secret" {
  metadata {
    name      = local.git_airflow_repo_secret_name
    namespace = helm_release.airflow.namespace
  }
  data = {
    "username" = var.git_username
    "password" = var.git_token
  }
}

resource "kubernetes_secret" "airflow_organization_git_secret" {
  metadata {
    name      = local.git_organization_secret_name
    namespace = helm_release.airflow.namespace
  }
  data = {
    "GITHUB_CLIENT_ID"     = var.git_client_id
    "GITHUB_CLIENT_SECRET" = var.git_client_secret
  }
}

# secret with account information
resource "kubernetes_secret" "aws-account-information" {
  metadata {
    name      = "${var.namespace}-aws-account-information"
    namespace = var.namespace
  }
  data = {
    "AWS_REGION" = "${data.aws_region.current.name}"
    "AWS_ID"     = "${data.aws_caller_identity.current.account_id}"
  }
}

# secret for sagemaker
resource "kubernetes_secret" "sagemaker-access" {
  metadata {
    name      = "${var.namespace}-sagemaker-access"
    namespace = var.namespace
  }
  data = {
    "AWS_ROLE_NAME_SAGEMAKER" = var.sagemaker_access_role_name
  }
}

The code also deploys an RDS database through a dedicated module. This database serves as the metadata store for the Airflow deployment and is protected by a randomly generated password.

resource "random_password" "rds_password" {
  length  = 16
  special = false
}

module "rds-airflow" {
  source                      = "../../infrastructure/rds"
  vpc_id                      = var.vpc_id
  private_subnets             = var.private_subnets
  private_subnets_cidr_blocks = var.private_subnets_cidr_blocks
  rds_port                    = var.rds_port
  rds_name                    = var.rds_name
  rds_password                = coalesce(var.rds_password, random_password.rds_password.result)
  rds_engine                  = var.rds_engine
  rds_engine_version          = var.rds_engine_version
  rds_instance_class          = var.rds_instance_class
  storage_type                = var.rds_storage_type
  max_allocated_storage       = var.rds_max_allocated_storage
}

In the final phase, Apache Airflow is deployed using Helm. The release configures Git authentication, database connections, and Ingress settings for external access, and can be broken down into several key parts, each serving a specific purpose:

  • Helm Release Configuration: Defines the Helm release named "airflow" that deploys Apache Airflow, including the chart repository, chart name, chart version, and assorted configuration options.
  • Airflow Configuration: Provides the Airflow-specific settings, such as environment variables and additional parameters that tailor the deployment. Notably, this is where GitHub authentication is set up and the base URL of the Airflow webserver is defined.
  • Service Account Configuration: Specifies the service account used by Airflow. It creates a service account named "airflow-sa" and associates it with an IAM role through the "eks.amazonaws.com/role-arn" annotation.
  • Ingress Configuration: Configures the Ingress for Apache Airflow to enable external access, including annotations for the Ingress controller such as the hostname and health check path.
  • Web Configuration: Defines settings for the Airflow web component, such as readiness and liveness probes that verify the responsiveness of the web server, and allows configuration overrides through a custom Python file for tailoring the web server's behavior.
# HELM
resource "helm_release" "airflow" {
  name             = var.name
  namespace        = var.namespace
  create_namespace = var.create_namespace

  repository = "https://airflow-helm.github.io/charts"
  chart      = var.helm_chart_name
  version    = var.helm_chart_version
  wait       = false # deactivate post install hooks otherwise will fail

  values = [yamlencode({
    airflow = {
      extraEnv = [
        {
          name = "GITHUB_CLIENT_ID"
          valueFrom = {
            secretKeyRef = {
              name = local.git_organization_secret_name
              key  = "GITHUB_CLIENT_ID"
            }
          }
        },
        {
          name = "GITHUB_CLIENT_SECRET"
          valueFrom = {
            secretKeyRef = {
              name = local.git_organization_secret_name
              key  = "GITHUB_CLIENT_SECRET"
            }
          }
        }
      ],
      config = {
        AIRFLOW__WEBSERVER__EXPOSE_CONFIG = false
        AIRFLOW__WEBSERVER__BASE_URL      = "http://${var.domain_name}/${var.domain_suffix}"
        AIRFLOW__CORE__LOAD_EXAMPLES = false
        AIRFLOW__CORE__DEFAULT_TIMEZONE = "Europe/Amsterdam"
      },
      users = []
      image = {
        repository = "seblum/airflow"
        tag        = "2.6.3-python3.11-custom-light"
        pullPolicy = "IfNotPresent"
        pullSecret = ""
        uid        = 50000
        gid        = 0
      },
      executor           = "KubernetesExecutor"
      fernetKey          = var.fernet_key
      webserverSecretKey = "THIS IS UNSAFE!"
      variables = local.airflow_variable_list_full
    },
    serviceAccount = {
      create = true
      name   = "airflow-sa"
      annotations = {
        "eks.amazonaws.com/role-arn" = "${module.iam-service-account.airflow_service_account_role_arn}"
      }
    },
    scheduler = {
      logCleanup = {
        enabled = false
      }
    },
    workers = {
      enabled = false
      logCleanup = {
        enabled = true
      }
    },
    flower = {
      enabled = false
    },
    postgresql = {
      enabled = false
    },
    redis = {
      enabled = false
    },
    externalDatabase = {
      type              = "postgres"
      host              = module.rds-airflow.rds_host
      port              = var.rds_port
      database          = "airflow_db"
      user              = "airflow_admin"
      passwordSecret    = local.k8s_airflow_db_secret_name
      passwordSecretKey = "postgresql-password"
    },
    dags = {
      path = "/opt/airflow/dags"
      gitSync = {
        enabled  = true
        repo     = var.git_repository_url
        branch   = var.git_branch
        revision = "HEAD"
        # repoSubPath           = "workflows"
        httpSecret            = local.git_airflow_repo_secret_name
        httpSecretUsernameKey = "username"
        httpSecretPasswordKey = "password"
        syncWait              = 60
        syncTimeout           = 120
      }
    },
    logs = {
      path = "/opt/airflow/logs"
      persistence = {
        enabled = true
        storageClass : "efs"
        size : "5Gi"
        accessMode : "ReadWriteMany"
      }
    },
    ingress = {
      enabled    = true
      apiVersion = "networking.k8s.io/v1"
      web = {
        annotations = {
          "external-dns.alpha.kubernetes.io/hostname"  = "${var.domain_name}"
          "alb.ingress.kubernetes.io/scheme"           = "internet-facing"
          "alb.ingress.kubernetes.io/target-type"      = "ip"
          "kubernetes.io/ingress.class"                = "alb"
          "alb.ingress.kubernetes.io/group.name"       = "mlplatform"
          "alb.ingress.kubernetes.io/healthcheck-path" = "/${var.domain_suffix}/health"
        }
        path = "/${var.domain_suffix}"
        host = "${var.domain_name}"
        precedingPaths = [{
          path        = "/${var.domain_suffix}*"
          serviceName = "airflow-web"
          servicePort = "web"
        }]
      }
    },
    web = {
      readinessProbe = {
        enabled             = true
        initialDelaySeconds = 45
      },
      livenessProbe = {
        enabled             = true
        initialDelaySeconds = 45
      },
      webserverConfig = {
        stringOverride = file("${path.module}/WebServerConfig.py")
      }
    },
  })]
}

In summary, the Terraform code provisions the necessary infrastructure components, IAM roles and policies, data storage, RDS database, Kubernetes secrets, and deploys Apache Airflow using Helm. This setup forms the foundation for the ML platform’s dashboard, enabling workflow orchestration and data management capabilities with Airflow.

8.3.2.1 WebServerConfig

In a final step of the Helm chart, a custom WebServerConfig.py is specified to integrate our Airflow deployment with GitHub as an authentication provider. The Python script consists of two major parts: a custom AirflowSecurityManager class definition and the actual webserver_config configuration for Apache Airflow's web server.

The CustomSecurityManager class extends the default AirflowSecurityManager to retrieve user information from the GitHub OAuth provider. The webserver_config configuration sets up the web server component of Apache Airflow and declares OAuth as the user authentication method. SECURITY_MANAGER_CLASS is set to the previously defined CustomSecurityManager so that user information is retrieved through the custom logic. Finally, the GitHub provider is configured with its required parameters such as client_id, client_secret, and API endpoints.

#######################################
# Custom AirflowSecurityManager
#######################################
from airflow.www.security import AirflowSecurityManager
import os


class CustomSecurityManager(AirflowSecurityManager):
    def get_oauth_user_info(self, provider, resp):
        if provider == "github":
            user_data = self.appbuilder.sm.oauth_remotes[provider].get("user").json()
            emails_data = (
                self.appbuilder.sm.oauth_remotes[provider].get("user/emails").json()
            )
            teams_data = (
                self.appbuilder.sm.oauth_remotes[provider].get("user/teams").json()
            )

            # unpack the user's name
            first_name = ""
            last_name = ""
            name = user_data.get("name", "").split(maxsplit=1)
            if len(name) == 1:
                first_name = name[0]
            elif len(name) == 2:
                first_name = name[0]
                last_name = name[1]

            # unpack the user's email
            email = ""
            for email_data in emails_data:
                if email_data["primary"]:
                    email = email_data["email"]
                    break

            # unpack the user's teams as role_keys
            # NOTE: each role key will be "my-github-org/my-team-name"
            role_keys = []
            for team_data in teams_data:
                team_org = team_data["organization"]["login"]
                team_slug = team_data["slug"]
                team_ref = team_org + "/" + team_slug
                role_keys.append(team_ref)

            return {
                "username": "github_" + user_data.get("login", ""),
                "first_name": first_name,
                "last_name": last_name,
                "email": email,
                "role_keys": role_keys,
            }
        else:
            return {}

#######################################
# Actual `webserver_config.py`
#######################################
from flask_appbuilder.security.manager import AUTH_OAUTH

# only needed for airflow 1.10
# from airflow import configuration as conf
# SQLALCHEMY_DATABASE_URI = conf.get("core", "SQL_ALCHEMY_CONN")

AUTH_TYPE = AUTH_OAUTH
SECURITY_MANAGER_CLASS = CustomSecurityManager

# registration configs
AUTH_USER_REGISTRATION = True  # allow users who are not already in the FAB DB
AUTH_USER_REGISTRATION_ROLE = (
    "Public"  # this role will be given in addition to any AUTH_ROLES_MAPPING
)

# the list of providers which the user can choose from
OAUTH_PROVIDERS = [
    {
        "name": "github",
        "icon": "fa-github",
        "token_key": "access_token",
        "remote_app": {
            "client_id": os.getenv("GITHUB_CLIENT_ID"),
            "client_secret": os.getenv("GITHUB_CLIENT_SECRET"),
            "api_base_url": "https://api.github.com",
            "client_kwargs": {"scope": "read:org read:user user:email"},
            "access_token_url": "https://github.com/login/oauth/access_token",
            "authorize_url": "https://github.com/login/oauth/authorize",
        },
    },
]

# a mapping from the values of `userinfo["role_keys"]` to a list of FAB roles
AUTH_ROLES_MAPPING = {
    "github-organization/airflow-users-team": ["User"],
    "github-organization/airflow-admin-team": ["Admin"],
}

# if we should replace ALL the user's roles each login, or only on registration
AUTH_ROLES_SYNC_AT_LOGIN = True

# force users to re-auth after 30min of inactivity (to keep roles in sync)
PERMANENT_SESSION_LIFETIME = 1800

8.3.3 MLflow

To enable model tracking, the MLflow deployment requires a data store on AWS S3, a metadata store using PostgreSQL (RDS), and the MLflow server itself. The first two components are created with Terraform resources.

MLflow, however, has no native Kubernetes support and no official Helm chart. Despite it being a highly effective tool, we therefore need to create a basic custom Helm chart for deploying the MLflow server, together with a custom container image for running it. This entails YAML configurations for a deployment, a service, and a configmap, all of which are applied to our Kubernetes cluster.

The Terraform code orchestrates the MLflow deployment of the ML platform through several configurations and resources. It starts by defining local variables, including a unique S3 bucket name, and then creates the S3 bucket (Terraform resource name "mlflow") that stores the MLflow artifacts.

locals {
  s3_bucket_name        = "${var.name_prefix}-${var.namespace}-${var.s3_bucket_name}"
  s3_bucket_path_prefix = "users"
}

data "aws_caller_identity" "current" {}

# create s3 bucket for artifacts
resource "aws_s3_bucket" "mlflow" {
  bucket = local.s3_bucket_name
  # tags          = var.tags
  force_destroy = var.s3_force_destroy
}

resource "aws_s3_bucket_server_side_encryption_configuration" "bucket_state_encryption" {
  bucket = aws_s3_bucket.mlflow.bucket
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

Next, an IAM role named "mlflow_s3_role" is created to grant access to the S3 bucket. The role can be assumed through web identity federation using the OIDC provider ARN specified by var.oidc_provider_arn. Similarly, an IAM policy named "mlflow_s3_policy" defines the S3 permissions, covering actions such as creating, listing, and deleting objects, and is scoped to the S3 bucket and its objects. The policy is then attached to the "mlflow_s3_role", giving the IAM role the permissions defined in the policy.

# "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${var.eks_oidc_provider}"
resource "aws_iam_role" "mlflow_s3_role" {
  name = "${var.namespace}-s3-access-role"

  assume_role_policy = <<EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action" : "sts:AssumeRoleWithWebIdentity",
        "Effect": "Allow",
        "Principal" : {
          "Federated" : [
            "${var.oidc_provider_arn}"
          ]
        }
      }
    ]
  }
  EOF
  tags = {
    tag-key = "tag-value"
  }
}

resource "aws_iam_policy" "mlflow_s3_policy" {
  name = "${var.namespace}-s3-access-policy"
  path = "/"

  policy = jsonencode({
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : [
          "s3:*Object",
          "s3:GetObjectVersion",
          "s3:*"
        ],
        "Resource" : [
          "arn:aws:s3:::${local.s3_bucket_name}/*",
          "arn:aws:s3:::${local.s3_bucket_name}"
        ]
      },
      {
        "Effect" : "Allow",
        "Action" : [
          "s3:ListBucket",
          "s3:ListBucketVersions"
        ],
        "Resource" : [
          "arn:aws:s3:::${local.s3_bucket_name}/*",
          "arn:aws:s3:::${local.s3_bucket_name}"
        ],
        "Condition" : {
          "StringLike" : {
            "s3:prefix" : [
              "${local.s3_bucket_path_prefix}/*"
            ]
          }
        }
      }
  ] })
}

resource "aws_iam_role_policy_attachment" "mlflow_s3_policy" {
  role       = aws_iam_role.mlflow_s3_role.name
  policy_arn = aws_iam_policy.mlflow_s3_policy.arn
}

As in the Airflow deployment, an RDS database is deployed through the RDS module to serve as the metadata store. A random password is generated for the RDS database used by MLflow.


resource "random_password" "rds_password" {
  length = 16
  # MLFlow has troubles using special characters
  special = false
}

# create rds for s3
module "rds-mlflow" {
  source                      = "../../infrastructure/rds"
  vpc_id                      = var.vpc_id
  private_subnets             = var.private_subnets
  private_subnets_cidr_blocks = var.private_subnets_cidr_blocks
  rds_port                    = var.rds_port
  rds_name                    = var.rds_name
  rds_password                = coalesce(var.rds_password, random_password.rds_password.result)
  rds_engine                  = var.rds_engine
  rds_engine_version          = var.rds_engine_version
  rds_instance_class          = var.rds_instance_class
  storage_type                = var.rds_storage_type
  max_allocated_storage       = var.rds_max_allocated_storage
}

The final phase rolls out MLflow with Helm. The release specifies the Docker image for MLflow, the Ingress settings for external access, the S3 bucket configuration, and the connection details for the RDS database. This Helm release deploys MLflow, a central component of the ML platform.

Since there is no native Helm chart for MLflow, a custom chart has been created for this deployment. It combines several Kubernetes resources: the MLflow deployment itself, a service account, a secret, and an ingress configuration. For a detailed look at the chart and its components, refer to this implementation. As MLflow also does not provide a container image suited to our use case, a custom container image has been defined and can be explored here.


resource "helm_release" "mlflow" {
  name             = var.name
  namespace        = var.namespace
  create_namespace = var.create_namespace

  chart = "${path.module}/helm/"
  values = [yamlencode({
    deployment = {
      image     = "seblum/mlflow:v2.4.1"
      namespace = var.namespace
      name      = var.name
    },
    ingress = {
      host = var.domain_name
      path = var.domain_suffix
    },
    artifacts = {
      s3_role_arn   = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${aws_iam_role.mlflow_s3_role.name}",
      s3_key_prefix = local.s3_bucket_path_prefix,
      s3_bucket     = local.s3_bucket_name,
    },
    rds = {
      host     = module.rds-mlflow.rds_host
      port     = var.rds_port,
      username = module.rds-mlflow.rds_username,
      password = module.rds-mlflow.rds_password,
      db_name  = module.rds-mlflow.rds_dbname
    },
  })]
}

NOTE: The deployment exposes an open endpoint and therefore lacks sufficient security measures.

8.3.4 JupyterHub

In our setup, JupyterHub plays a crucial role by providing an Integrated Development Environment (IDE). The Terraform code presented here defines a helm_release responsible for deploying JupyterHub onto our EKS cluster. In contrast to other components of our ML platform, JupyterHub doesn’t require additional resources to operate.

The Helm configuration for this deployment covers a wide range of settings and customizations. Its primary aim is to provide a JupyterHub instance that integrates with single-user Jupyter notebook servers, covering user sessions, GitHub authentication, proxy settings, Ingress for external access, and various other JupyterHub-related options. This tailors JupyterHub to the requirements of our ML platform, allowing users to run interactive notebooks and access the MLflow service.

Within the Terraform code, a Helm release named "jupyterhub" is defined, orchestrating the deployment of JupyterHub into the designated Kubernetes namespace. The Helm chart is sourced from the JupyterHub Helm chart repository at a version specified by var.helm_chart_version. The values block within this configuration contains a YAML-encoded set of parameters for JupyterHub, including numerous settings related to single-user notebooks, Ingress, proxy, culling, and hub configuration.

  • Single-User Notebook Configuration: This segment of the configuration is dedicated to single-user notebook settings. It encompasses parameters like the default URL for notebooks, the Docker image to be employed, and lifecycle hooks. The Docker image is set to “seblum/jupyterhub-server:latest,” and a postStart lifecycle hook is defined to clone a Git repository specified by var.git_repository_url. Additionally, an environment variable MLFLOW_TRACKING_URI is configured to point to the URI of the MLflow service.
  • Ingress Configuration: The Ingress resource is configured to facilitate external access to JupyterHub. This entails the inclusion of annotations to tailor its behavior. Key settings include the specification of the hostname, scheme, healthcheck path, and ingress class. Hosts are configured to ${var.domain_name} and www.${var.domain_name}, facilitating access through the designated domain name.
  • Proxy Configuration: Within the proxy configuration, the service type for the JupyterHub proxy is set as “ClusterIP.” Additionally, the secretToken is configured with a value provided by var.proxy_secret_token.
  • Culling Configuration: Culling is enabled and finely tuned to manage user sessions. Users are subject to culling when their sessions become idle.
  • Hub Configuration: The hub configuration addresses settings pertaining to the JupyterHub’s base URL, GitHub OAuthenticator, and JupyterHub’s authenticator class. Similar to the Airflow deployment, the JupyterHub instance is configured to utilize GitHub OAuthenticator for user authentication. This OAuthenticator is then configured with the supplied GitHub credentials (var.git_client_id and var.git_client_secret), along with the oauth_callback_url parameter, which specifies a specific endpoint under the provided domain name.
resource "helm_release" "jupyterhub" {
  name             = var.name
  namespace        = var.name
  create_namespace = var.create_namespace

  repository = "https://jupyterhub.github.io/helm-chart/"
  chart      = var.helm_chart_name
  version    = var.helm_chart_version

  values = [yamlencode({
    singleuser = {
      defaultUrl = "/lab"
      image = {
        name = "seblum/jupyterhub-server"
        tag  = "latest"
      },
      lifecycleHooks = {
        postStart = {
          exec = {
            command = ["git", "clone", "${var.git_repository_url}"]
          }
        }
      },
      extraEnv = {
        "MLFLOW_TRACKING_URI" = "http://mlflow-service.mlflow.svc.cluster.local"
      }
    },
    ingress = {
      enabled : true
      annotations = {
        "external-dns.alpha.kubernetes.io/hostname" = "${var.domain_name}"
        "alb.ingress.kubernetes.io/scheme"          = "internet-facing"
        "alb.ingress.kubernetes.io/target-type"     = "ip"
        "kubernetes.io/ingress.class"               = "alb"
        "alb.ingress.kubernetes.io/group.name"      = "mlplatform"
      }
      hosts = ["${var.domain_name}", "www.${var.domain_name}"]
    },
    proxy = {
      service = {
        type = "ClusterIP"
      }
      secretToken = var.proxy_secret_token
    }
    cull = {
      enabled = true
      users   = true
    }
    hub = {
      baseUrl = "/${var.domain_suffix}"
      config = {
        GitHubOAuthenticator = {
          client_id          = var.git_client_id
          client_secret      = var.git_client_secret
          oauth_callback_url = "http://${var.domain_name}/${var.domain_suffix}/hub/oauth_callback"
        }
        JupyterHub = {
          authenticator_class = "github"
        }
      }
    }
  })]
}

8.3.5 Monitoring

The setup includes a monitoring system based on Prometheus and Grafana. While not a direct component of the ML pipeline, it serves as an instructive example of cluster monitoring and provides a basic monitoring configuration.

Prometheus and Grafana play distinct yet complementary roles in the domain of monitoring and observability. Prometheus is primarily responsible for collecting, storing, and alerting based on time-series metrics. It scrapes data from various sources, defines alerting rules, and supports service discovery, making it a robust monitoring and alerting tool. It also offers a query language for metric analysis and flexible data retention policies.

On the other hand, Grafana specializes in data visualization and interactive dashboard creation. It connects to data sources like Prometheus and transforms metric data into visually engaging charts and graphs. Grafana is instrumental in designing comprehensive monitoring dashboards, visualizing alerts, and facilitating data exploration. Together, Prometheus and Grafana form a powerful monitoring stack, enabling organizations to monitor, analyze, and visualize their systems effectively.

Both tools, Prometheus and Grafana, are deployed via Helm charts. The connection between Grafana and Prometheus is established within the Grafana Helm chart, yielding a cohesive and comprehensive monitoring solution.

8.3.5.1 Prometheus

The Terraform code deploys Prometheus and the Prometheus Operator Custom Resource Definitions (CRDs) for the ML platform. The deployment is managed through Helm, which allows customizable configuration and a streamlined setup of the monitoring and alerting system.

The process begins with the definition of a Helm release named "prometheus" responsible for deploying Prometheus, a comprehensive monitoring and alerting toolkit, to the specified Kubernetes namespace. The Helm chart utilized for this deployment is sourced from the Prometheus community Helm charts repository, adhering to a specified version.

Within the "values" block, you’ll find a YAML-encoded configuration for Prometheus. This configuration tailors specific aspects of the Prometheus installation, including the option to disable Alertmanager and Prometheus Pushgateway components. It also provides the flexibility to enable or disable persistent volumes for the Prometheus server.

resource "helm_release" "prometheus" {
  chart            = "prometheus"
  name             = "prometheus"
  namespace        = var.namespace
  create_namespace = var.create_namespace

  repository = "https://prometheus-community.github.io/helm-charts"
  version    = "19.7.2"

  values = [
    yamlencode({
      alertmanager = {
        enabled = false
      }
      prometheus-pushgateway = {
        enabled = false
      }
      server = {
        persistentVolume = {
          enabled = false
        }
      }
    })
  ]
}

In addition to the Prometheus release, another Helm release named "prometheus-operator-crds" is established. This release is focused on deploying the Custom Resource Definitions (CRDs) essential for the Prometheus Operator. Similarly, the Helm chart used for this deployment originates from the Prometheus community Helm charts repository but at a distinct version.

The Prometheus Operator CRDs are essential for defining and managing Prometheus instances and associated resources within the Kubernetes cluster, ensuring effective monitoring and alerting for the ML platform dashboard.

resource "helm_release" "prometheus-operator-crds" {
  chart            = "prometheus-operator-crds"
  name             = "prometheus-operator-crds"
  namespace        = var.namespace
  create_namespace = var.create_namespace

  repository = "https://prometheus-community.github.io/helm-charts"
  version    = "5.1.0"
}

8.3.5.2 Grafana

The provided Terraform code deploys Grafana, a dashboard and visualization platform, for an ML platform. The Grafana deployment is highly customized, with various settings and configurations.

A Helm release named "grafana" is defined, deploying Grafana to the specified Kubernetes namespace. It pulls the Grafana chart from the official Helm charts repository at version "6.57.4." The "values" block contains a YAML-encoded configuration for Grafana, including various settings related to service type, Ingress, data sources, dashboard providers, and more.

  • Service and Ingress Configuration: The Grafana service is configured to be of type "ClusterIP" and an Ingress resource is enabled for external access. Several annotations are added to the Ingress to customize how it interacts with the Kubernetes cluster, including specifying the hostname, scheme, healthcheck path, and other settings. The Ingress is set up to handle requests for the specified domain name and subdomain, allowing external access to Grafana.
  • Data Sources Configuration: The configuration includes data source settings, specifically for Prometheus. It defines a data source named “Prometheus” with details like the type, URL, access mode, and setting it as the default data source. This configuration allows Grafana to retrieve metrics and data from Prometheus for visualization and dashboard creation.
  • Dashboard Providers Configuration: Grafana’s dashboard providers are configured using a YAML block. It defines a default provider with options specifying the path to dashboards. This configuration enables Grafana to load dashboards from the specified path within the Grafana container.
  • Dashboards Configuration: The code defines a set of dashboards and their configurations. Each dashboard is associated with a data source (in this case, Prometheus) and has various settings such as gnetId (unique identifier), revision, and data source. These configurations determine which data is displayed on each dashboard and how it is accessed.
  • Grafana Configuration (grafana.ini): This section provides a set of configurations for Grafana itself, including security settings that allow embedding Grafana in iframes. It specifies the server’s domain and root URL, enabling Grafana to serve from a subpath. Additionally, GitHub authentication is enabled for user sign-up and authentication, using the provided GitHub OAuth client ID and secret.
resource "helm_release" "grafana" {
  chart            = "grafana"
  name             = "grafana"
  namespace        = var.namespace
  create_namespace = var.create_namespace

  repository = "https://grafana.github.io/helm-charts/"
  version    = "6.57.4"

  values = [
    yamlencode({
      service = {
        enabled = true
        type    = "ClusterIP"
      }
      ingress = {
        enabled = true
        annotations = {
          "external-dns.alpha.kubernetes.io/hostname"  = "${var.domain_name}",
          "alb.ingress.kubernetes.io/scheme"           = "internet-facing",
          "alb.ingress.kubernetes.io/target-type"      = "ip",
          "kubernetes.io/ingress.class"                = "alb",
          "alb.ingress.kubernetes.io/group.name"       = "mlplatform",
          "alb.ingress.kubernetes.io/healthcheck-path" = "/api/health"
        }
        labels   = {}
        path     = "${var.domain_suffix}"
        pathType = "Prefix"
        hosts = [
          "${var.domain_name}",
          "www.${var.domain_name}"
        ]
      },
      datasources = {
        "datasources.yaml" = {
          apiVersion = 1
          datasources = [
            {
              name      = "Prometheus"
              type      = "prometheus"
              url       = "http://prometheus-server.${var.namespace}.svc.cluster.local"
              access    = "proxy"
              isDefault = true
            }
          ]
        }
      },
      dashboardProviders = {
        "dashboardproviders.yaml" = {
          apiVersion = 1
          providers = [
            {
              name   = "'default'"
              orgId  = 1
              folder = "''"
              type : "file"
              disableDeletion : false
              editable : true
              options = {
                path = "/var/lib/grafana/dashboards/default"
              }
            }
          ]
        }
      }
      dashboards = {
        default = {
          prometheus-stats = {
            gnetId     = 2
            revision   = 2
            datasource = "Prometheus"
          }
          prometheus-stats-2 = {
            gnetId     = 315
            datasource = "Prometheus"
          }
          k8s-cluster = {
            gnetId     = 6417
            datasource = "Prometheus"
          }
        }
      }
      "grafana.ini" = {
        security = {
          allow_embedding = true # enables iframe loading
        },
        server = {
          domain : "${var.domain_name}"
          root_url : "%(protocol)s://%(domain)s/grafana/"
          serve_from_sub_path : true
          # https://grafana.com/docs/grafana/latest/auth/github/#enable-github-in-grafana
        },
        "auth.github" = {
          enabled       = true
          allow_sign_up = true
          scopes        = "user:email,read:org"
          auth_url      = "https://github.com/login/oauth/authorize"
          token_url     = "https://github.com/login/oauth/access_token"
          api_url       = "https://api.github.com/user"
          # team_ids: grafana-user-team
          # allowed_organizations:
          client_id     = var.git_client_id
          client_secret = var.git_client_secret
        }
      }
  })]
}

8.3.6 Sagemaker

The Sagemaker module provides the IAM roles and permissions required for the ML platform to interact with SageMaker. It also deploys a custom Helm chart running a simple Streamlit application that presents all deployed SageMaker endpoints.

The Terraform code begins by defining local variables that capture configuration details such as the Docker image, the ECR repository name, and IAM role names. Using these variables throughout the code improves consistency and reusability.

locals {
  docker_mlflow_sagemaker_base_image = var.docker_mlflow_sagemaker_base_image
  base_image_tag                     = split(":", var.docker_mlflow_sagemaker_base_image)[1]
  ecr_repository_name                = "mlflow-sagemaker-deployment"
  iam_name_sagemaker_access          = "sagemaker-access"

  sagemaker_dashboard_read_access_user_name = "sagemaker-dashboard-read-access-user"
  sagemaker_dashboard_read_access_role_name = "sagemaker-dashboard-read-access-role"
  sagemaker_dashboard_read_access_secret    = "sagemaker-dashboard-read-access-secret"
}

data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_iam_policy" "AmazonSageMakerFullAccess" {
  arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
data "aws_iam_policy" "AmazonSageMakerReadOnlyAccess" {
  arn = "arn:aws:iam::aws:policy/AmazonSageMakerReadOnly"
}

The code uses the "terraform-aws-modules/ecr/aws" module to create an Elastic Container Registry (ECR) repository named "mlflow-sagemaker-deployment". A "null_resource" block then packages and pushes a Docker image to that repository using a local-exec provisioner, which runs a series of Docker commands: pulling the base image, tagging it with the ECR repository URL, authenticating against the ECR registry, and pushing the image. This makes the MLflow base image for SageMaker deployments available in the ECR repository, ready for deployment when needed.

# Create Container Registry
module "ecr" {
  source          = "terraform-aws-modules/ecr/aws"
  repository_name = local.ecr_repository_name

  repository_lifecycle_policy = jsonencode({
    rules = [
      {
        rulePriority = 1,
        description  = "Keep last 30 images",
        selection = {
          tagStatus     = "tagged",
          tagPrefixList = ["v"],
          countType     = "imageCountMoreThan",
          countNumber   = 30
        },
        action = {
          type = "expire"
        }
      }
    ]
  })
  repository_force_delete = true
  # tags = {
  #   Terraform   = "true"
  #   Environment = "dev"
  # }
}

# mlflow sagemaker build-and-push-container --build --no-push -c mlflow-sagemaker-deployment
# https://mlflow.org/docs/latest/cli.html
resource "null_resource" "docker_packaging" {
  provisioner "local-exec" {
    command = <<EOF
      docker pull "${local.docker_mlflow_sagemaker_base_image}"
      docker tag "${local.docker_mlflow_sagemaker_base_image}" "${module.ecr.repository_url}:${local.base_image_tag}"
      aws ecr get-login-password --region ${data.aws_region.current.name} | docker login --username AWS --password-stdin ${data.aws_caller_identity.current.account_id}.dkr.ecr.${data.aws_region.current.name}.amazonaws.com
      docker push "${module.ecr.repository_url}:${local.base_image_tag}"
    EOF
  }

  # triggers = {
  #   "run_at" = timestamp()
  # }
  depends_on = [
    module.ecr,
  ]
}

The module also creates an IAM role named "sagemaker_access_role". Its trust policy allows SageMaker to assume the role, and the "AmazonSageMakerFullAccess" IAM policy is attached to it, granting full access to SageMaker resources. Together with the MLflow SageMaker base image in ECR, this role is required for MLflow's deployments to AWS SageMaker.

# Access role to allow access to Sagemaker
resource "aws_iam_role" "sagemaker_access_role" {
  name                 = "${local.iam_name_sagemaker_access}-role"
  max_session_duration = 28800

  assume_role_policy = <<EOF
  {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
  }
  EOF
  # tags = {
  #   tag-key = "tag-value"
  # }
}

resource "aws_iam_role_policy_attachment" "sagemaker_access_role_policy" {
  role       = aws_iam_role.sagemaker_access_role.name
  policy_arn = data.aws_iam_policy.AmazonSageMakerFullAccess.arn
}

Finally, the ML platform's SageMaker dashboard is deployed via Helm. The deployment specifies the Docker image, the deployment name and namespace, the Ingress configuration that routes external traffic, and the secrets required for authentication. It uses the Docker image "seblum/streamlit-sagemaker-app:v1.0.0", which runs the Streamlit application; its inner workings are documented here.

# Helm Deployment
resource "helm_release" "sagemaker-dashboard" {
  name             = var.name
  namespace        = var.namespace
  create_namespace = var.create_namespace

  chart = "${path.module}/helm/"
  values = [yamlencode({
    deployment = {
      image     = "seblum/streamlit-sagemaker-app:v1.0.0",
      name      = "sagemaker-streamlit",
      namespace = "${var.namespace}"
    },
    ingress = {
      host = "${var.domain_name}"
      path = "${var.domain_suffix}"
    },
    secret = {
      aws_region            = "${data.aws_region.current.name}"
      aws_access_key_id     = "${aws_iam_access_key.sagemaker_dashboard_read_access_user_credentials.id}"
      aws_secret_access_key = "${aws_iam_access_key.sagemaker_dashboard_read_access_user_credentials.secret}"
      aws_role_name         = "${aws_iam_role.sagemaker_dashboard_read_access_role.name}"
    }
  })]
}

Alongside this deployment, an additional IAM role named "sagemaker_dashboard_read_access_role" grants access to SageMaker resources. Its trust policy specifies which entities may assume the role, in this case the SageMaker dashboard read-access user. The "AmazonSageMakerReadOnlyAccess" IAM policy is attached to the role, providing read-only access to SageMaker resources. In addition, an IAM user named "sagemaker_dashboard_read_access_user" is created together with an access key; this user is used to access the SageMaker endpoints and present them within the Streamlit application.

# Access role to allow access to Sagemaker
resource "aws_iam_role" "sagemaker_dashboard_read_access_role" {
  name                 = local.sagemaker_dashboard_read_access_role_name
  max_session_duration = 28800

  assume_role_policy = <<EOF
  {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::${data.aws_caller_identity.current.account_id}:user/${aws_iam_user.sagemaker_dashboard_read_access_user.name}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
  }
  EOF
  # tags = {
  #   tag-key = "tag-value"
  # }
}

resource "aws_iam_role_policy_attachment" "sagemaker_dashboard_read__access_role_policy" {
  role       = aws_iam_role.sagemaker_dashboard_read_access_role.name
  policy_arn = data.aws_iam_policy.AmazonSageMakerReadOnlyAccess.arn
}

resource "aws_iam_user" "sagemaker_dashboard_read_access_user" {
  name = local.sagemaker_dashboard_read_access_user_name
  path = "/"
}

resource "aws_iam_access_key" "sagemaker_dashboard_read_access_user_credentials" {
  user = aws_iam_user.sagemaker_dashboard_read_access_user.name
}

8.3.7 Dashboard

The Dashboard module leverages Terraform and Helm to deploy a custom Vue.js-based dashboard for an ML platform.

The deployment is defined through a "helm_release" resource that uses a custom Helm chart, similar to the MLflow deployment. The chart is expected to reside in a directory within the Terraform module, "${path.module}/helm/". The configuration specifies the custom Docker image "seblum/vuejs-ml-dashboard:latest" built for this deployment, and the deployment name and namespace are parameterized, keeping the code adaptable to different environments and requirements.

A central part of the code is the Ingress configuration that routes external traffic to the dashboard. The host value is taken from "var.domain_name", the domain or subdomain under which the dashboard is reachable, while the "path" parameter is derived from "var.domain_suffix", defining the path under which users access the ML platform's dashboard. Overall, this Terraform code enables consistent and streamlined deployment and management of the ML dashboard within the Kubernetes environment.

resource "helm_release" "dashboard" {
  name             = var.name
  namespace        = var.namespace
  create_namespace = var.create_namespace

  chart = "${path.module}/helm/"
  values = [yamlencode({
    deployment = {
      image     = "seblum/vuejs-ml-dashboard:latest"
      name      = var.name
      namespace = var.namespace
    },
    ingress = {
      host = var.domain_name
      path = var.domain_suffix
    }
  })]
}

The Vue.js dashboard is based on a free Vue.js template from Creative Tim and has been customized to the specific requirements of the ML platform. An in-depth exploration of Vue.js is beyond the scope of this documentation, but the complete Vue.js dashboard application is accessible here for those interested.