User Guide

All commands are run from the mlops-infra-suite/ root directory.


Pre-Deployment Checklist

Before deploying, ensure you have:

  • Docker 20.10+ installed and running (docker --version)

  • AWS credentials configured (aws sts get-caller-identity works)

  • AWS Marketplace subscription active for ML Provisioner

  • IAM permissions verified (see IAM Permissions)

  • Working directories created: ml/{configs,policies,templates,reports,docs}

  • Configuration file copied from Docker image and adjusted (see README)

  • Reviewed generated CloudFormation template before deploying

  • Tested with test-deploy before production deploy

Professional Tier (additional)

  • SNS email subscription confirmation: the alerts_email address will receive a confirmation email after the first deploy

Enterprise Tier (additional)

  • VPC deployed and available before running deploy-product

  • For vpc_source: ssm, the SSM Parameter Store paths for the VPC ID and subnet IDs are populated (typically by vpc-provisioner)

  • For vpc_source: direct, the VPC ID and subnet IDs are set in the config file

  • For sgprov mode, SG Provisioner is deployed and the SG ID is available in SSM
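
Parts of the checklist above can be scripted. The following is a minimal sketch, not part of ML Provisioner: the function names make_workdirs and docker_version_ok are hypothetical, and the commented usage lines assume the docker and aws CLIs are on the PATH.

```shell
# Hypothetical preflight helpers (illustrative names, not shipped with the tool).

# Create the working directories from the checklist under a given root directory.
make_workdirs() {
  local root="${1:-.}"
  local d
  for d in configs policies templates reports docs; do
    mkdir -p "${root}/ml/${d}"
  done
}

# Return 0 if a version string such as "24.0.7" is at least 20.10.
docker_version_ok() {
  local major minor
  major="${1%%.*}"      # text before the first dot
  minor="${1#*.}"       # text after the first dot
  minor="${minor%%.*}"  # keep only the minor component
  if [ "$major" -gt 20 ]; then return 0; fi
  [ "$major" -eq 20 ] && [ "$minor" -ge 10 ]
}

# Example usage from the mlops-infra-suite/ root (left commented so that
# sourcing this file has no side effects):
#   make_workdirs .
#   docker_version_ok "$(docker version --format '{{.Client.Version}}')" || echo "Docker 20.10+ required"
#   aws sts get-caller-identity >/dev/null || echo "AWS credentials not configured"
```

The Marketplace subscription, IAM permissions, and tier-specific items still need to be verified by hand.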


Quick Reference: Actions by Safety Level

Local Actions (No AWS Calls besides subscription check)

Action                 | Description
validate-config        | Validate configuration YAML against schema
list-products          | List available tier templates
show-product           | Show resources for the selected tier
create-policy          | Generate least-privilege IAM policy
create-prov-template   | Generate CloudFormation provisioning template
validate-prov-template | Validate generated template locally
create-review-report   | Generate pre-deployment HTML review report

Read-Only AWS Actions

Action       | Description
show-changes | Preview what would change in the deployed stack
check-drift  | Detect infrastructure drift against deployed stack
test-deploy  | Deploy with test suffix for safe isolated testing (creates a separate test stack that you delete afterwards)

Mutating AWS Actions (--force required)

Action         | Description
deploy-product | Deploy ML product infrastructure via CloudFormation
delete-product | Delete CloudFormation stack and all associated resources


Prerequisites

See README for full prerequisites and installation instructions.


Configuration

For the configuration file structure and field reference, see Configuration Reference. For scenario-based guidance on selecting and populating the right config file, see Configuration Guide.


Business Workflow

The commands below represent the complete lifecycle of an ML product deployment. Run them in order:

  • Steps 1–7 are local and require no AWS calls besides the subscription check

  • Steps 8–12 make AWS calls and require credentials with the necessary permissions; for the enterprise tier, the VPC must also be deployed and available (see README for prerequisites)


Scenario Matrix

Dimension Values

Dimension       | Value      | Meaning
source_control  | codecommit | AWS CodeCommit repositories are created for model-build and model-deploy source code
source_control  | s3         | An existing S3 bucket is used as the pipeline source; no CodeCommit repositories are created
vpc_mode        | standalone | ML Provisioner creates and manages its own endpoint Security Group
vpc_mode        | sgprov     | Security Group is managed externally by SG Provisioner; ML Provisioner skips SG creation and reads the existing SG ID from SSM
vpc_source      | ssm        | VPC ID and subnet IDs are resolved at deploy time from SSM Parameter Store paths, typically populated by VPC Provisioner
vpc_source      | direct     | VPC ID and subnet IDs are hardcoded directly in the configuration file
workload        | empty      | No workload discriminator; ml_name follows the standard pattern
workload        | realtime   | Workload discriminator appended to ml_name, which allows multiple ML products in the same environment
route_table_ids | empty []   | No route table IDs; the networking team manages S3 Gateway endpoint route associations manually
route_table_ids | populated  | Route table IDs provided; S3 Gateway endpoint route associations are configured automatically at deploy time

Config File | Tier | Image | Source Control | VPC Mode | VPC Source | Notes
techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit.yaml | starter | ml-provisioner:starter | codecommit | - | - | Representative starter scenario
techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit-workload.yaml | starter | ml-provisioner:starter | codecommit | - | - | workload=realtime variant
techcorp-prod-a001-us-west-2-customer-churn-ml-s3.yaml | starter | ml-provisioner:starter | s3 | - | - | Starter + S3 source
techcorp-prod-a001-us-west-2-customer-churn-ml-s3-workload.yaml | starter | ml-provisioner:starter | s3 | - | - | S3 + workload=realtime variant
edge-prod-b001-us-west-2-fraud-detection-ml-codecommit.yaml | professional | ml-provisioner:professional | codecommit | - | - | Representative professional scenario
edge-prod-b001-us-west-2-fraud-detection-ml-codecommit-workload.yaml | professional | ml-provisioner:professional | codecommit | - | - | workload=realtime variant
edge-prod-b001-us-west-2-fraud-detection-ml-s3.yaml | professional | ml-provisioner:professional | s3 | - | - | Professional + S3 source
edge-prod-b001-us-west-2-fraud-detection-ml-s3-workload.yaml | professional | ml-provisioner:professional | s3 | - | - | S3 + workload=realtime variant
globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm.yaml | enterprise | ml-provisioner:enterprise | codecommit | standalone | ssm | Representative enterprise scenario
globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-direct.yaml | enterprise | ml-provisioner:enterprise | codecommit | standalone | direct |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-sgprov-ssm.yaml | enterprise | ml-provisioner:enterprise | codecommit | sgprov | ssm |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-sgprov-direct.yaml | enterprise | ml-provisioner:enterprise | codecommit | sgprov | direct |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-ssm.yaml | enterprise | ml-provisioner:enterprise | s3 | standalone | ssm |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-direct.yaml | enterprise | ml-provisioner:enterprise | s3 | standalone | direct |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-sgprov-ssm.yaml | enterprise | ml-provisioner:enterprise | s3 | sgprov | ssm |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-sgprov-direct.yaml | enterprise | ml-provisioner:enterprise | s3 | sgprov | direct |
globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm-workload.yaml | enterprise | ml-provisioner:enterprise | codecommit | standalone | ssm | workload=realtime variant
globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-direct-rtb.yaml | enterprise | ml-provisioner:enterprise | codecommit | standalone | direct | route_table_ids populated variant

Representative Scenarios

Set variables CONFIG and IMAGE for your tier, then run the commands below.

Tier         | IMAGE        | CONFIG
Starter      | starter      | techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit.yaml
Professional | professional | edge-prod-b001-us-west-2-fraud-detection-ml-codecommit.yaml
Enterprise   | enterprise   | globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm.yaml

Note (enterprise prerequisite):

  • VPC deployed

  • SSM paths populated:

    • /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId

    • /vpc/globalbank-prod-c001-us-west-2-vpc/PrivateSubnetIds

For example:

CONFIG=techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit.yaml
IMAGE=starter

# 1. List available products
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act list-products

# 2. Show product resources for this tier
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act show-product

# 3. Validate configuration
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act validate-config

# 4. Generate IAM policy
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/policies:/app/policies \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act create-policy

# 5. Generate CloudFormation template
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act create-prov-template

# 6. Validate generated template
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act validate-prov-template

# 7. Generate pre-deployment review report
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act create-review-report

# 8. Preview changes (requires deployed stack)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act show-changes

# 9. Check drift (requires deployed stack)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act check-drift

# 10. Test deploy: the stack name is printed upon completion, note it down
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act test-deploy

# Delete the test stack once verified (replace <test-stack-name> with the name printed above)
aws cloudformation delete-stack --stack-name <test-stack-name>
aws cloudformation wait stack-delete-complete --stack-name <test-stack-name>

# 11. Deploy product
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act deploy-product --force

# 12. Delete product
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/reports:/app/reports \
  ml-provisioner:${IMAGE} -con ${CONFIG} -act delete-product --force
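
Because the twelve commands above differ only in the action name and a few mounts, they can be wrapped in a small helper. The sketch below is an illustration rather than part of the tool: run_action, run_local_steps, and the DOCKER_CMD override are hypothetical names, and for simplicity it mounts every output directory for every action, which assumes the extra mounts are harmless.

```shell
# Hypothetical wrapper around the documented docker run invocation.
# Set DOCKER_CMD="echo docker" to print the commands instead of running them.
DOCKER_CMD="${DOCKER_CMD:-docker}"

# $1 = action name; any further arguments (for example --force) are passed through.
run_action() {
  local action="$1"
  shift
  $DOCKER_CMD run --rm \
    -v "${HOME}/.aws:/home/mluser/.aws:ro" \
    -v "$(pwd)/ml/configs:/app/configs:ro" \
    -v "$(pwd)/ml/policies:/app/policies" \
    -v "$(pwd)/ml/templates:/app/templates" \
    -v "$(pwd)/ml/reports:/app/reports" \
    "ml-provisioner:${IMAGE}" -con "${CONFIG}" -act "${action}" "$@"
}

# Run the local steps 1-7 in order and stop at the first failure.
run_local_steps() {
  local step
  for step in list-products show-product validate-config create-policy \
      create-prov-template validate-prov-template create-review-report; do
    run_action "$step" || { echo "step failed: $step" >&2; return 1; }
  done
}
```

With CONFIG and IMAGE set as in the example above, run_local_steps covers steps 1 to 7, and run_action deploy-product --force corresponds to step 11. Keeping the mutating actions out of the loop means a deploy or delete always has to be typed explicitly.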

Multi-Environment Deployment

The recommended pattern is to deploy a different tier per environment, each in its own AWS account, following the AWS Well-Architected best practice of account-per-environment isolation.

Environment | Tier         | Image
dev         | starter      | ml-provisioner:starter
staging     | professional | ml-provisioner:professional
prod        | enterprise   | ml-provisioner:enterprise

Each environment requires a separate config file. See Application Architecture for architecture details.

Switching AWS Accounts

Each Docker command mounts ~/.aws from the local machine. To target a different AWS account per environment, set the AWS_PROFILE environment variable in the Docker run command:

# Deploy dev (starter tier, dev AWS account)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  -e AWS_PROFILE=dev-profile \
  ml-provisioner:starter \
  -con techcorp-dev-a001-us-west-2-customer-churn-ml-codecommit.yaml \
  -act deploy-product --force

# Deploy staging (professional tier, staging AWS account)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  -e AWS_PROFILE=staging-profile \
  ml-provisioner:professional \
  -con edge-staging-b001-us-west-2-fraud-detection-ml-codecommit.yaml \
  -act deploy-product --force

# Deploy prod (enterprise tier, prod AWS account)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  -e AWS_PROFILE=prod-profile \
  ml-provisioner:enterprise \
  -con globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm.yaml \
  -act deploy-product --force
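
The environment-to-tier mapping and the per-environment profile can be captured in one place. This is a sketch under assumptions: tier_for_env and deploy_command are hypothetical helper names, and the "<env>-profile" naming simply follows the example profiles used above. The command is printed rather than executed so it can be reviewed before a --force deploy.

```shell
# Hypothetical helpers; the mapping mirrors the environment table above.
tier_for_env() {
  case "$1" in
    dev)     echo starter ;;
    staging) echo professional ;;
    prod)    echo enterprise ;;
    *)       echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# $1 = environment, $2 = config file name.
# Prints the deploy command for review instead of running it.
deploy_command() {
  local env="$1" config="$2" tier
  tier="$(tier_for_env "$env")" || return 1
  echo "docker run --rm" \
    "-v ${HOME}/.aws:/home/mluser/.aws:ro" \
    "-v $(pwd)/ml/configs:/app/configs:ro" \
    "-v $(pwd)/ml/templates:/app/templates" \
    "-v $(pwd)/ml/reports:/app/reports" \
    "-e AWS_PROFILE=${env}-profile" \
    "ml-provisioner:${tier}" \
    "-con ${config} -act deploy-product --force"
}
```

For example, deploy_command dev techcorp-dev-a001-us-west-2-customer-churn-ml-codecommit.yaml prints the first command in this section on a single line. Running aws sts get-caller-identity --profile dev-profile beforehand is one way to confirm which account the profile resolves to.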

Volume Mounts

Mount        | Container Path    | Purpose                         | Required For
~/.aws       | /home/mluser/.aws | AWS credentials                 | All actions
ml/configs   | /app/configs      | Input configuration files       | All actions
ml/policies  | /app/policies     | Generated IAM policies          | create-policy
ml/templates | /app/templates    | CloudFormation templates        | create-prov-template, validate-prov-template, create-review-report, deploy-product, show-changes
ml/reports   | /app/reports      | Execution logs and HTML reports | All actions

Notes:

  • Mount configs/ and ~/.aws/ as read-only (:ro); the tool never writes to these

  • Output directories (policies/, templates/, reports/) must be writable; do not use :ro
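
A quick check of the host directories before running the container can catch mount problems early. The following sketch uses a hypothetical function name, check_mounts, and only tests what the notes above require: that ml/configs exists and that the three output directories exist and are writable by the current user.

```shell
# Hypothetical helper: verify the host side of the volume mounts.
# $1 = project root (defaults to the current directory).
check_mounts() {
  local root="${1:-.}"
  local d
  local status=0
  # Input directory only needs to exist; it is mounted read-only.
  [ -d "${root}/ml/configs" ] || { echo "missing: ml/configs" >&2; status=1; }
  # Output directories must exist and be writable.
  for d in policies templates reports; do
    if [ ! -d "${root}/ml/${d}" ] || [ ! -w "${root}/ml/${d}" ]; then
      echo "not a writable directory: ml/${d}" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling check_mounts . from the mlops-infra-suite/ root returns non-zero and lists each problem on stderr. Creating the directories yourself is worthwhile because Docker typically creates a missing bind-mount source on its own, often owned by root.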


AWS Credentials

All actions require AWS credentials for subscription validation. Actions that interact with AWS infrastructure (show-changes, check-drift, test-deploy, deploy-product, delete-product) also require permissions for CloudFormation, SageMaker, CodePipeline, CodeBuild, IAM, S3, and SSM.

Option 1: AWS Profile (Recommended)

-v ~/.aws:/home/mluser/.aws:ro

Note: To target a specific named profile, add -e AWS_PROFILE=<profile-name> to the Docker command. Useful when targeting different AWS accounts per environment.

Option 2: Environment Variables

-e AWS_ACCESS_KEY_ID=<access_key> \
-e AWS_SECRET_ACCESS_KEY=<secret_key> \
-e AWS_DEFAULT_REGION=us-west-2

Option 3: IAM Role (when running on EC2/ECS)

# No credentials needed: the instance role is used

Note: Most commonly used in enterprise CI/CD pipelines where the ML Provisioner runs from an EC2 instance or ECS task with an attached IAM role. Ensure the instance role has the permissions generated by create-policy.
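
The three options can be selected automatically from the calling environment. This is a minimal sketch with a hypothetical function name, aws_cred_args; it relies on the standard docker run behaviour that a name-only -e VAR flag copies the value from the calling environment, so the variables must be exported in the shell that runs Docker. It assumes the home directory path contains no spaces, since the output is meant to be word-split.

```shell
# Hypothetical helper: print the docker flags for the applicable credential option.
aws_cred_args() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    # Option 2: name-only -e flags, so the secret values never appear
    # on the command line or in shell history.
    echo "-e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION"
  elif [ -d "${HOME}/.aws" ]; then
    # Option 1: mount the shared credentials directory read-only,
    # forwarding AWS_PROFILE when one is set.
    if [ -n "${AWS_PROFILE:-}" ]; then
      echo "-v ${HOME}/.aws:/home/mluser/.aws:ro -e AWS_PROFILE"
    else
      echo "-v ${HOME}/.aws:/home/mluser/.aws:ro"
    fi
  fi
  # Option 3: nothing is printed; the instance or task role is used.
}
```

A call would then look like docker run --rm $(aws_cred_args) followed by the remaining mounts, image, and action, with the substitution deliberately left unquoted so the flags are split into separate arguments.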


Best Practices

  1. Always validate first: run validate-config before any AWS operations

  2. Review the IAM policy: run create-policy and attach the generated policy before deploying

  3. Review the template: run create-prov-template and inspect the generated CloudFormation template before deploying

  4. Generate a review report: run create-review-report and share it with stakeholders before a production deploy

  5. Test before production: use test-deploy to validate the full stack in isolation first

  6. Preview changes: run show-changes before re-deploying to an existing stack

  7. Monitor drift: run check-drift periodically to detect manual changes made outside CloudFormation

  8. Use IAM roles over access keys in production CI/CD pipelines

  9. Version control configs: store configuration files in Git for change tracking and rollback

  10. Separate environments: use separate config files and AWS accounts per environment


FAQ

Q: Can I modify the generated CloudFormation template?
A: Yes, but changes will be overwritten on the next create-prov-template run. Use the YAML configuration to customise your setup instead.

Q: How do I upgrade to a new version?
A: Pull the latest Docker image for your tier. Existing deployed stacks are not affected unless you redeploy.

Q: What happens if deployment fails?
A: CloudFormation automatically rolls back all resources. Check the stack events and the log file in ml/reports/ for details. See Troubleshooting.

Q: Can I deploy to multiple regions?
A: Yes. Create a separate configuration file for each region and run the tool once per config.

Q: How do I delete everything?
A: Use delete-product --force to delete the CloudFormation stack and all associated resources.

Q: What is the workload field for?
A: It allows multiple ML products in the same AWS account and region by appending a discriminator to resource names, which avoids naming collisions.

Q: Can I use an existing VPC?
A: Yes. That is the enterprise tier use case. Set vpc_integration.mode to standalone or sgprov and provide your VPC ID either directly or via SSM Parameter Store.

Q: What SSM parameters does the tool publish after deployment?
A: See README under What Gets Created for the full list of SSM parameter names per tier.