Application Architecture

Table of Contents

  1. Overview

  2. Position in the MLOps Suite

  3. Design Decisions

  4. Product Tier System

  5. Configuration System

  6. CloudFormation Generation

  7. SSM Parameter Store Integration

  8. Actions Reference

  9. Source Tree

  10. Future Roadmap


Overview

ML Provisioner is a Docker-based, config-driven tool that scaffolds ML project infrastructure on AWS via CloudFormation. It generates tier-based ML project environments (CodeCommit repositories, CodeBuild projects, CodePipeline pipelines, SageMaker resources, and IAM roles) from a simple YAML configuration file.

What it is: Infrastructure scaffolding for ML projects.

What it is not: A complete ML solution. It does not provide data pipelines, trained models, notebooks, or solution-specific ML code. Those are delivered by Phase 3 ML Solution modules (future).


Position in the MLOps Suite

ML Provisioner is the fifth module in the Axon Tech Labs MLOps Infrastructure Suite. It sits between the security/network layer and the SageMaker environment layer:

Phase 1: Infrastructure (Complete)
  ├── VPC Provisioner      → Network foundation (subnets, gateways, routing)
  ├── SG Provisioner       → Security groups (tier-based, cross-tier references)
  ├── SEC Provisioner      → IAM groups, roles, policies
  ├── S3 Provisioner       → ML-optimized bucket structure
  └── LB Provisioner       → Load balancer provisioning (planned: ALB/NLB)

Phase 2: ML Platform (Current)
  ├── ML Provisioner       → ML project scaffolding  ← THIS MODULE
  └── SageMaker Provisioner → Studio environment + extensions (next)

Phase 3: ML Solutions (Future)
  ├── Customer Churn Solution
  ├── Fraud Detection Solution
  ├── Demand Forecasting Solution
  └── (additional solutions)

Dependency Chain via SSM Parameter Store

Each provisioner publishes its outputs to SSM Parameter Store. Downstream provisioners read those outputs automatically, with no manual wiring required.

S3 Provisioner  (independent, no VPC dependency)
  └── Provisions: ML data lake bucket structure
  └── Publishes to SSM: Bucket names, folder paths

─────────────────────────────────────────────────────────

VPC Provisioner
  └── Provisions: VPC, subnets, routing
  └── Publishes to SSM: VPC ID, subnet IDs
        ↓
SG Provisioner  (reads VPC ID from SSM)
  └── Provisions: Security groups scoped to the VPC
  └── Publishes to SSM: Security group IDs
        ↓
LB Provisioner  (planned; reads VPC ID, subnet IDs, SG IDs from SSM)
  └── Provisions: Application/Network load balancers
  └── Publishes to SSM: Load balancer ARNs, DNS names
        ↓
ML Provisioner  (reads VPC ID, subnet IDs, SG IDs, LB outputs and S3 bucket names from SSM)
  └── Provisions: ML pipelines, SageMaker registry, VPC endpoints
  └── Publishes to SSM: Model registry ARN, KMS key ARN, endpoint IDs
        ↓
SageMaker Provisioner  (next module; reads ML outputs and S3 bucket names from SSM)
  └── Provisions: Studio domain, lifecycle configs
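
As a rough illustration of the read side of this chain, the sketch below resolves a set of upstream SSM paths into values. The function name, the FakeSsm stand-in, and the sample values are hypothetical; the two /vpc/... paths are borrowed from the Enterprise config examples in this document, and the response shape follows boto3's SSM get_parameter call.

```python
from typing import Any, Dict

# Paths taken from the Enterprise config examples; treat them as illustrative.
UPSTREAM_PATHS = {
    "vpc_id": "/vpc/globalbank-prod-c001-us-west-2-vpc/VPCId",
    "subnet_ids": "/vpc/globalbank-prod-c001-us-west-2-vpc/PrivateSubnetIds",
}


def read_upstream_outputs(ssm_client: Any, paths: Dict[str, str]) -> Dict[str, str]:
    """Resolve each named SSM path to its stored value.

    ssm_client is any object exposing get_parameter(Name=...) with the same
    response shape as boto3's SSM client, for example boto3.client("ssm").
    """
    resolved = {}
    for key, path in paths.items():
        response = ssm_client.get_parameter(Name=path)
        resolved[key] = response["Parameter"]["Value"]
    return resolved


class FakeSsm:
    """Hypothetical in-memory stand-in so the sketch runs without AWS access."""

    def __init__(self, store: Dict[str, str]):
        self._store = store

    def get_parameter(self, Name: str) -> Dict[str, Dict[str, str]]:
        return {"Parameter": {"Name": Name, "Value": self._store[Name]}}
```

In a real run the fake would be swapped for a boto3 SSM client; how the tool itself performs these reads is not shown in this document.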

Design Decisions

Decision 1: Generic Templates, Parameterized Use Cases

The use case (e.g., fraud-detection, customer-churn) is a naming parameter, not a separate template. All tiers use generic CloudFormation templates. The use case name flows through to resource naming only.

Rationale: Avoids the false promise of delivering a complete ML solution. Keeps v1.0.0 scope manageable. Use-case-specific templates are deferred to Phase 3 ML Solutions.

Implementation: The tier YAML files in schemas/products/ are pure data definitions: valid YAML with no placeholders or substitution tokens. They define what resources exist and their structure. The cfn_generator.py module reads the tier definition and the client config separately, then constructs the CloudFormation template programmatically by building Python dict structures and dumping them to YAML.

This is the same battle-tested approach used in the SG Provisioner. It avoids the fragility of string substitution (invalid YAML before substitution, silent errors, hard to validate) and keeps the tier definitions clean and independently readable.

The client edits the config file directly, with no partial overrides or merging logic. Every field is explicit.
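
A minimal sketch of the dict-building approach, under stated assumptions: the tier definition shape (logical_id, type, name_property, suffix) and the build_template function are hypothetical and are not the actual schemas/products/ format or the cfn_generator.py API. The sketch serializes with the standard-library json module to stay dependency-free, whereas the tool dumps YAML.

```python
import json
from typing import Any, Dict


def build_template(tier_definition: Dict[str, Any], client_config: Dict[str, Any]) -> Dict[str, Any]:
    """Combine a tier definition and a client config into a CloudFormation dict."""
    client = client_config["client"]
    env = client_config["environment"]
    product = client_config["ml_product"]
    ml_name = "-".join([
        client["company_prefix"], env["env"], client["tenant_id"],
        env["region"], product["use_case"], "ml",
    ])

    resources = {}
    for spec in tier_definition["resources"]:
        # The tier data only says which resources exist; names come from config.
        resources[spec["logical_id"]] = {
            "Type": spec["type"],
            "Properties": {spec["name_property"]: f"{ml_name}-{spec['suffix']}"},
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


# Hypothetical tier definition: pure data, no placeholders or tokens.
STARTER = {"resources": [
    {"logical_id": "ModelBuildRepo", "type": "AWS::CodeCommit::Repository",
     "name_property": "RepositoryName", "suffix": "model-build"},
]}
CONFIG = {
    "client": {"company_prefix": "globalbank", "tenant_id": "c001"},
    "environment": {"env": "prod", "region": "us-west-2"},
    "ml_product": {"use_case": "fraud-detection"},
}

template = build_template(STARTER, CONFIG)
print(json.dumps(template, indent=2))
```

Because the two inputs only meet inside the function, the tier data stays valid on its own and can be validated before any client value is involved.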

Decision 2: Tier as Primary Template Dimension

Three templates: starter.yaml, professional.yaml, enterprise.yaml. Tier determines which AWS resources are provisioned.

Rationale: Infrastructure complexity scales with tier, not use case. A fraud-detection starter has the same infrastructure as a churn starter; only the names differ.

Decision 3: CodeCommit as Primary, S3 as Fallback

Source control is configurable via source_control: codecommit or source_control: s3.

Rationale: CodeCommit returned to GA in November 2025 with AWS investment (Git LFS in Q1 2026, regional expansion Q3 2026). An S3 fallback is provided for organizations with CodeCommit restrictions or compliance requirements.

Decision 4: 12 Actions, Same Pattern as SG Provisioner

Consistent CLI interface across all provisioners. Developers who know one provisioner know all of them.

Decision 5: SSM Parameter Store Outputs

All deployed resource IDs are stored in SSM Parameter Store for downstream consumers (SageMaker Provisioner, CI/CD pipelines, other automation).

Decision 6: Separate CI/CD Artifacts Bucket from ML Data Lake

The ML Provisioner (Professional/Enterprise tier) creates a dedicated CI/CD artifacts bucket separate from the S3 Provisioner data lake bucket.

S3 Provisioner bucket (ML data lake, scoped per tenant):

s3://{company_prefix}-{environment}-{tenant_id}-{region}/solutions/{use-case}/

Contains: ML data (raw/curated/processed), code, models, notebooks, configs. Read/written by data scientists and ML engineers. Long retention (years).

ML Provisioner bucket (CI/CD pipeline artifacts, scoped per project):

s3://{company_prefix}-{environment}-{tenant_id}-{use_case}-ml-artifacts/

Contains: CodeBuild outputs, CodePipeline stage artifacts, deployment packages. Read/written exclusively by CodeBuild and CodePipeline service roles. Short retention (30-90 days).

Rationale for separation:

  • Different access patterns: data scientists vs CI/CD service roles

  • Different lifecycle policies: years vs weeks/months

  • Different scope: one data lake shared across projects vs one artifacts bucket per project

  • Cleaner IAM: no overlap between human and pipeline permissions

Scope comparison:

            S3 Provisioner Bucket                ML Provisioner Bucket
----------  -----------------------------------  ------------------------
Purpose     ML data lake                         CI/CD pipeline artifacts
Scope       Per tenant (shared across projects)  Per ML project
Structure   130+ ML folders                      Flat, pipeline-managed
Who writes  Data engineers, scientists           CodeBuild, CodePipeline
Retention   Years                                30-90 days
Tier        Always created                       Always created

Decision 7: No Service Catalog Dependency

The initial design considered AWS Service Catalog as the distribution mechanism. It was rejected due to:

  • API/console permission inconsistency

  • Narrow scope: designed for internal IT governance, not ML developer workflows

  • Additional complexity without proportional value

ML product templates are distributed directly via S3 (consistent with the docs hosting pattern).

Enterprise self-service pattern: Enterprise clients who want to wrap generated CloudFormation templates in Service Catalog for IT admin governance and self-service vending to data science teams can do so independently. The pattern for wrapping ML Provisioner generated templates in a Service Catalog product will be documented in INTEGRATION_EXAMPLES.md. This gives enterprise clients the governance model without adding Service Catalog complexity to the tool itself.

Revisit during SageMaker Provisioner: Service Catalog integration may become relevant again when designing the SageMaker Provisioner, particularly for SageMaker Projects which have native Service Catalog integration. This decision should be reviewed at that point.

Decision 8: IAM Resource Naming (Region Omitted)

IAM is a global AWS service: role and policy names are unique per AWS account, not per region. Region therefore adds no differentiation value in IAM names and is omitted to stay within the 64-character AWS::IAM::Role name limit.

IAM naming pattern:

{company_prefix}-{env}-{tenant_id}-{use_case}-{suffix}

Example (standard resource vs IAM resource):

# Standard resource (region included)
globalbank-prod-c001-us-west-2-demand-forecasting-ml-build        ← CodeBuild project
globalbank-prod-c001-us-west-2-demand-forecasting-ml-dashboard    ← CloudWatch dashboard

# IAM resource (region omitted)
globalbank-prod-c001-demand-forecasting-ml-codebuild-role         ← IAM role
globalbank-prod-c001-demand-forecasting-ml-build-policy           ← IAM managed policy

All other resources retain the full standard pattern including region. This is the minimum deviation required to satisfy the AWS hard limit while preserving the naming convention's collision-free guarantees.

use_case maximum length: The validation schema enforces a 20-character maximum on ml_product.use_case. This is derived from AWS::IAM::Role being the tightest naming constraint: with typical config values, a use case longer than 20 characters would cause IAM role names to exceed the 64-character limit.
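
The length constraint can be checked mechanically. The helper below is a hypothetical sketch (iam_name is not a function from the tool) that builds a name from the IAM pattern above and rejects anything over 64 characters.

```python
IAM_ROLE_NAME_LIMIT = 64  # AWS::IAM::Role name limit cited above


def iam_name(company_prefix: str, env: str, tenant_id: str, use_case: str, suffix: str) -> str:
    """Build an IAM name with the region omitted and enforce the 64-character limit."""
    name = "-".join([company_prefix, env, tenant_id, use_case, suffix])
    if len(name) > IAM_ROLE_NAME_LIMIT:
        raise ValueError(
            f"{name!r} is {len(name)} characters; limit is {IAM_ROLE_NAME_LIMIT}"
        )
    return name


role = iam_name("globalbank", "prod", "c001", "demand-forecasting", "ml-codebuild-role")
print(role, len(role))
```

For the example role, the region-free name is 57 characters. Inserting us-west-2- would add 10 characters and bring it to 67, which is over the limit.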

Decision 9: EventBridge Rule → CodePipeline (Direct Invocation)

The model approval automation uses a direct two-stage event-driven chain:

EventBridge Event Bus
    → EventBridge Rule  (filters for SageMaker model approval events)
        → CodePipeline Deploy Pipeline  (triggered directly as Rule target)

Why this architecture:

  • Direct invocation: EventBridge Rules natively support CodePipeline as a target. No intermediate resource is needed.

  • Fewer resources: eliminates AWS::Pipes::Pipe and the pipe-execution-role IAM role, reducing the resource count by 2 per stack.

  • Simpler security: the existing codepipeline-role is reused as the Rule target role. No additional IAM role is required.

  • Lower latency: direct invocation removes an intermediate step.

  • No maintenance overhead: no Pipe resource to monitor, update, or troubleshoot.

CloudFormation implementation:

The Rule's Targets property references the deploy pipeline ARN constructed via Fn::Sub:

Targets:
  - Id: DeployPipelineTarget
    Arn: !Sub "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:{deploy-pipeline-name}"
    RoleArn: !GetAtt CodepipelineRole.Arn

Note: EventBridge Pipes (AWS::Pipes::Pipe) was evaluated and rejected. Despite architectural appeal, Pipes does not support CodePipeline as a target in its PipeTargetParameters schema. The direct Rule → CodePipeline pattern is both simpler and fully supported.
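
In the dict-based generation style, the Rule resource might be assembled roughly as follows. This is a sketch, not the generator's actual code: build_approval_rule is a hypothetical name, and the event pattern (source aws.sagemaker, detail-type "SageMaker Model Package State Change", ModelApprovalStatus of Approved) is an assumption about how approval events are filtered. A real rule would likely also scope the pattern to the project's model package group.

```python
from typing import Any, Dict


def build_approval_rule(deploy_pipeline_name: str,
                        role_logical_id: str = "CodepipelineRole") -> Dict[str, Any]:
    """Return an AWS::Events::Rule resource that starts the deploy pipeline."""
    return {
        "Type": "AWS::Events::Rule",
        "Properties": {
            "State": "ENABLED",
            # Assumed pattern for SageMaker model package approval events.
            "EventPattern": {
                "source": ["aws.sagemaker"],
                "detail-type": ["SageMaker Model Package State Change"],
                "detail": {"ModelApprovalStatus": ["Approved"]},
            },
            "Targets": [{
                "Id": "DeployPipelineTarget",
                # String concatenation keeps the ${...} tokens literal for Fn::Sub.
                "Arn": {"Fn::Sub": "arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:"
                                   + deploy_pipeline_name},
                "RoleArn": {"Fn::GetAtt": [role_logical_id, "Arn"]},
            }],
        },
    }
```

The long-form Fn::Sub and Fn::GetAtt keys are the dict equivalents of the !Sub and !GetAtt short forms shown in the YAML fragment.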

Event-driven automation flow:

Data Scientist
  └── Approves model in SageMaker Model Registry
        ↓
EventBridge Event Bus
  └── Receives: ModelPackageGroupStateChange event
        ↓
EventBridge Rule
  └── Filters: status = Approved
  └── Target: Deploy Pipeline (direct invocation)
        ↓
CodePipeline (Deploy Pipeline)
  └── Pulls approved model artifact from Model Registry
  └── Runs deployment stages
  └── Deploys model to SageMaker endpoint

Decision 10: License Per AWS Account, No Template Sharing Mechanism

On generated templates: Generated CloudFormation templates in templates/ are plain YAML files with no embedded license. Sharing them is the MLOps engineer's responsibility and cannot be blocked, which is consistent with all IaC tools (Terraform, CDK, etc.). What is licensed is the tool itself: the Docker image that generates templates, validates config, deploys, checks drift, and generates reports.

No template sharing mechanism will be built into the tool. Rationale:

  • Outside the tool's scope

  • Every company has different sharing mechanisms (S3, Git, Confluence, etc.)

  • Adds complexity without licensing value

  • INTEGRATION_EXAMPLES.md will document the pattern for sharing templates via S3 if needed

Decision 11: IAM Policy (CodeCommit Resource Scoping)

The CodeCommitManagement statement in the generated IAM policy restricts all CodeCommit actions to repositories whose names begin with {ml_name}- for a specific AWS account:

"Resource": "arn:aws:codecommit:{region}:{account}:{ml_name}-*"

Since ml_name encodes the full project identity ({company_prefix}-{env}-{tenant_id}-{region}-{use_case}[-{workload}]-ml), a user holding this policy can only manage repositories belonging to that one project. Two different projects produce two non-overlapping ml_name values and therefore two non-overlapping resource scopes, which is the Principle of Least Privilege in action.

Note: CodeCommit ARNs do not use a path separator before the repository name (unlike some other services). The pattern arn:aws:codecommit:{region}:{account}:{ml_name}-* is correct, with no leading slash before {ml_name}.
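
The non-overlap claim can be illustrated with a small check. The helper names are hypothetical, the repository suffix model-build is assumed from the tier descriptions, and shell-style matching from fnmatch is used only as an approximation of IAM's * wildcard, not as a reimplementation of IAM policy evaluation.

```python
from fnmatch import fnmatchcase


def codecommit_resource(region: str, account: str, ml_name: str) -> str:
    """Build the Resource pattern used by the CodeCommitManagement statement."""
    return f"arn:aws:codecommit:{region}:{account}:{ml_name}-*"


def in_scope(resource_pattern: str, repo_arn: str) -> bool:
    """Approximate IAM '*' wildcard matching with case-sensitive glob matching."""
    return fnmatchcase(repo_arn, resource_pattern)


fraud_scope = codecommit_resource(
    "us-west-2", "123456789012", "globalbank-prod-c001-us-west-2-fraud-detection-ml"
)
prefix = "arn:aws:codecommit:us-west-2:123456789012:"
print(in_scope(fraud_scope, prefix + "globalbank-prod-c001-us-west-2-fraud-detection-ml-model-build"))
print(in_scope(fraud_scope, prefix + "globalbank-prod-c001-us-west-2-customer-churn-ml-model-build"))
```

The first repository belongs to the fraud-detection project and matches; the second belongs to a different project and falls outside the scope.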


Product Tier System

Starter Tier

Foundation MLOps platform. Suitable for small teams and proof-of-concept projects.

  • 13 resources with CodeCommit source control:

    • AWS::SageMaker::ModelPackageGroup - Model Registry, approval gate, version management and traceability

    • AWS::CodeCommit::Repository (x2) - model-build and model-deploy repos

    • AWS::CodeBuild::Project (x2) - build and deploy projects

    • AWS::S3::Bucket - CodePipeline artifact store with S3 Versioning enabled

    • AWS::CodePipeline::Pipeline (x2) - build and deploy pipelines

    • AWS::IAM::Role (x3) - CodeBuild, CodePipeline, SageMaker execution roles

    • AWS::SSM::Parameter (x2) - ModelPackageGroupArn, RepositoryUrl

  • 10 resources with S3 source control (3 resources removed vs CodeCommit):

    • AWS::CodeCommit::Repository (x2) - not created

    • AWS::SSM::Parameter reduced to (x1) - RepositoryUrl not published

Use case: Small ML team, single use case, standard security.

Professional Tier

Starter plus enhanced monitoring, event-driven automation, and additional policies.

  • 19 resources with CodeCommit source control (all Starter resources for the CodeCommit scenario plus):

    • AWS::Events::Rule - EventBridge rule triggering Deploy pipeline on model approval

    • AWS::CloudWatch::Dashboard - ML pipeline monitoring dashboard

    • AWS::IAM::ManagedPolicy (x2) - custom policies for enhanced access control

    • AWS::SSM::Parameter (x2) - BucketName, DashboardName (total x4 with Starter)

  • 16 resources with S3 source control (3 resources removed vs CodeCommit):

    • AWS::CodeCommit::Repository (x2) - not created

    • AWS::SSM::Parameter reduced - RepositoryUrl not published (total x3 with Starter)

Use case: Growing ML team, multiple use cases, enhanced security and monitoring.

Enterprise Tier

Professional plus VPC integration, KMS encryption, compliance monitoring, and permission boundaries.

Scenario Counts:

Source Control  VPC Mode    CFN Resources  SSM Parameters
--------------  ----------  -------------  --------------
CodeCommit      standalone  41             11
CodeCommit      sgprov      39             10
S3              standalone  38             10
S3              sgprov      36             9

  • 41 resources with CodeCommit + standalone:

    • AWS::SageMaker::ModelPackageGroup - Model Registry

    • AWS::CodeCommit::Repository (x2) - model-build and model-deploy repos

    • AWS::CodeBuild::Project (x2) - build and deploy projects

    • AWS::S3::Bucket - CodePipeline artifact store

    • AWS::CodePipeline::Pipeline (x2) - build and deploy pipelines

    • AWS::IAM::Role (x3) - CodeBuild, CodePipeline, SageMaker execution roles

    • AWS::IAM::ManagedPolicy (x3) - build policy, deploy policy, permission boundary

    • AWS::Events::Rule - EventBridge rule triggering Deploy pipeline on model approval

    • AWS::CloudWatch::Dashboard - ML pipeline monitoring dashboard

    • AWS::CloudWatch::Alarm (x2) - root account usage, unauthorized API calls

    • AWS::Logs::LogGroup - security compliance log group

    • AWS::Logs::MetricFilter (x2) - security alarm filters

    • AWS::SNS::Topic - security alerts topic

    • AWS::SNS::Subscription - alert email subscription

    • AWS::KMS::Key - encryption key for ML artifacts

    • AWS::KMS::Alias - key alias

    • AWS::EC2::VPCEndpoint (x4) - SageMaker API, SageMaker Runtime, S3 (Gateway), STS

    • AWS::EC2::SecurityGroup - dedicated SG for VPC endpoint traffic

    • AWS::SSM::Parameter (x11) - ModelPackageGroupArn, RepositoryUrl, BucketName, DashboardName, KmsKeyArn, LogGroupName, VpcEndpointIdSagemakerApi, VpcEndpointIdSagemakerRuntime, VpcEndpointIdS3, VpcEndpointIdSts, SecurityGroupId

  • 39 resources with CodeCommit + sgprov (2 resources removed vs CodeCommit + standalone):

    • AWS::EC2::SecurityGroup - not created (managed by SG Provisioner)

    • AWS::SSM::Parameter reduced to (x10) - SecurityGroupId not published

  • 38 resources with S3 + standalone (3 resources removed vs CodeCommit + standalone):

    • AWS::CodeCommit::Repository (x2) - not created

    • AWS::SSM::Parameter reduced to (x10) - RepositoryUrl not published

  • 36 resources with S3 + sgprov (5 resources removed vs CodeCommit + standalone):

    • AWS::CodeCommit::Repository (x2) - not created

    • AWS::EC2::SecurityGroup - not created (managed by SG Provisioner)

    • AWS::SSM::Parameter reduced to (x9) - RepositoryUrl and SecurityGroupId not published

Use case: Enterprise ML organization, strict security and compliance requirements, VPC-integrated workloads.
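
The four Enterprise scenario counts follow from the CodeCommit + standalone baseline by subtraction. The sketch below restates that arithmetic; enterprise_counts is a hypothetical helper, and the mode labels match the Scenario Counts table (sgprov corresponds to mode: sg-provisioner in the config).

```python
from typing import Tuple

BASE_RESOURCES = 41   # Enterprise, CodeCommit + standalone
BASE_SSM_PARAMS = 11


def enterprise_counts(source_control: str, vpc_mode: str) -> Tuple[int, int]:
    """Return (CloudFormation resources, SSM parameters) for an Enterprise scenario."""
    resources, params = BASE_RESOURCES, BASE_SSM_PARAMS
    if source_control == "s3":
        resources -= 3   # two repositories plus the RepositoryUrl parameter
        params -= 1
    if vpc_mode == "sgprov":
        resources -= 2   # endpoint security group plus the SecurityGroupId parameter
        params -= 1
    return resources, params


for source in ("codecommit", "s3"):
    for mode in ("standalone", "sgprov"):
        print(source, mode, enterprise_counts(source, mode))
```

SSM parameters are themselves CloudFormation resources, which is why each removed parameter lowers both numbers.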

VPC Integration Modes (Enterprise Tier)

Enterprise tier supports two VPC integration modes configured via vpc_integration.mode in the YAML config:

Standalone mode (client has ML Provisioner only):

ml_product:
  tier: enterprise
  vpc_integration:
    mode: standalone
    vpc_source: parameter-store
    vpc_parameter_store_path: /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId
    subnet_parameter_store_path: /vpc/globalbank-prod-c001-us-west-2-vpc/PrivateSubnetIds

Creates a dedicated AWS::EC2::SecurityGroup for VPC endpoint traffic plus all 4 VPC endpoints.

SG Provisioner mode (client has both ML Provisioner and SG Provisioner, or a bundle):

ml_product:
  tier: enterprise
  vpc_integration:
    mode: sg-provisioner
    vpc_source: parameter-store
    vpc_parameter_store_path: /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId
    subnet_parameter_store_path: /vpc/globalbank-prod-c001-us-west-2-vpc/PrivateSubnetIds
    sg_parameter_store_path: /sg/globalbank-prod-c001-us-west-2-sg/app/SecurityGroupId

Reads the existing SG ID from SSM Parameter Store. Creates only the 4 VPC endpoints: no new security group is created, so there is no conflict with SG Provisioner.

Note: A future bundle combining ML Provisioner and SG Provisioner will be offered. The bundle discount reflects the tighter integration between the two provisioners in enterprise deployments.

Note on S3 Gateway endpoint route table associations: The route_table_parameter_store_path (parameter-store mode) and route_table_ids (direct mode) fields are optional. When left empty, the S3 Gateway VPC endpoint is created without explicit route table associations, and the networking team is responsible for associating the endpoint with the appropriate route tables. When populated, the generator includes RouteTableIds in the endpoint resource and associations are configured automatically at deploy time.
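
The optional route table behavior amounts to conditionally adding one property. A minimal sketch, assuming a hypothetical s3_gateway_endpoint helper and already-resolved IDs (the real generator may instead emit references to SSM parameters):

```python
from typing import Any, Dict, List, Optional


def s3_gateway_endpoint(region: str, vpc_id: str,
                        route_table_ids: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build the S3 Gateway endpoint resource, adding RouteTableIds only when given."""
    properties: Dict[str, Any] = {
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "VpcEndpointType": "Gateway",
    }
    if route_table_ids:
        # Populated: associations are created at deploy time.
        properties["RouteTableIds"] = list(route_table_ids)
    # Empty or omitted: the networking team associates route tables later.
    return {"Type": "AWS::EC2::VPCEndpoint", "Properties": properties}


without_rtb = s3_gateway_endpoint("us-west-2", "vpc-0abc")
with_rtb = s3_gateway_endpoint("us-west-2", "vpc-0abc", ["rtb-1", "rtb-2"])
print("RouteTableIds" in without_rtb["Properties"], with_rtb["Properties"]["RouteTableIds"])
```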


Configuration System

YAML Configuration File

The configuration file is a complete, self-contained YAML file. A client edits it directly to enforce their own settings, with no partial overrides or merging logic. Every field is explicit and visible.

client:
  company_name: Global Bank
  company_prefix: globalbank
  account_id: "123456789012"
  tenant_id: "c001"

environment:
  env: prod
  region: us-west-2

ml_product:
  use_case: fraud-detection        # naming parameter only, not a solution
  tier: professional               # starter | professional | enterprise
  source_control: codecommit       # codecommit | s3
  product_name_override: ""        # optional override for auto-generated name
  workload: ""                     # optional discriminator for multiple products

tags:
  cost_center: Fraud Operations
  project: Real-time Credit Card Fraud Detection System
  owner: fraud-ml-engineering-team

Product Naming Convention

Format            Pattern                                                    Example
----------------  ---------------------------------------------------------  ----------------------------------------------------
Without workload  {prefix}-{env}-{tenant}-{region}-{use_case}-ml             globalbank-prod-c001-us-west-2-fraud-detection-ml
With workload     {prefix}-{env}-{tenant}-{region}-{use_case}-{workload}-ml  globalbank-prod-c001-us-west-2-fraud-detection-v2-ml
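
The two patterns reduce to one join with an optional segment. The function below is an illustrative sketch, not the tool's code, and it ignores product_name_override.

```python
def product_name(prefix: str, env: str, tenant: str, region: str,
                 use_case: str, workload: str = "") -> str:
    """Join the naming dimensions, inserting the workload only when it is set."""
    parts = [prefix, env, tenant, region, use_case]
    if workload:
        parts.append(workload)
    parts.append("ml")
    return "-".join(parts)


print(product_name("globalbank", "prod", "c001", "us-west-2", "fraud-detection"))
print(product_name("globalbank", "prod", "c001", "us-west-2", "fraud-detection", "v2"))
```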

Multi-* System

The ML Provisioner is designed from the ground up as a multi-* system. Every dimension of variation is encoded in the configuration file and flows through to resource naming automatically. No special multi-* logic is needed in the tool.

The five dimensions:

Multi-Company (subsidiaries with own company prefix)
  globalbank-prod-c001-us-west-2-fraud-detection-ml
  globalbank-europe-prod-c001-eu-west-1-fraud-detection-ml
  globalbank-asia-prod-c001-ap-southeast-1-fraud-detection-ml

Multi-Tenant (multiple tenants within same AWS account)
  globalbank-prod-c001-us-west-2-fraud-detection-ml  ← tenant c001
  globalbank-prod-c002-us-west-2-fraud-detection-ml  ← tenant c002
  globalbank-prod-c003-us-west-2-fraud-detection-ml  ← tenant c003

Multi-Environment (dev, staging, prod)
  globalbank-dev-c001-us-west-2-fraud-detection-ml
  globalbank-staging-c001-us-west-2-fraud-detection-ml
  globalbank-prod-c001-us-west-2-fraud-detection-ml

Multi-Region
  globalbank-prod-c001-us-west-2-fraud-detection-ml
  globalbank-prod-c001-us-east-1-fraud-detection-ml
  globalbank-prod-c001-eu-west-1-fraud-detection-ml

Multi-Use-Case (within same tenant/env/region)
  globalbank-prod-c001-us-west-2-fraud-detection-ml
  globalbank-prod-c001-us-west-2-customer-churn-ml
  globalbank-prod-c001-us-west-2-demand-forecasting-ml

All five dimensions are handled by the same tool, same config pattern, same 12 commands. Each combination produces a completely isolated CloudFormation stack with its own resources and SSM Parameter Store paths.

The workload discriminator is the key differentiator that allows a client to deploy multiple distinct ML solutions within the exact same company/tenant/env/region combination without any naming collision. Without it, only one stack per use-case per environment is possible. With it, a client can create as many isolated variations as needed:

# Without workload: only one allowed per combination
globalbank-prod-c001-us-west-2-fraud-detection-ml

# With workload: unlimited isolated variations
globalbank-prod-c001-us-west-2-fraud-detection-realtime-ml
globalbank-prod-c001-us-west-2-fraud-detection-batch-ml
globalbank-prod-c001-us-west-2-fraud-detection-cards-ml
globalbank-prod-c001-us-west-2-fraud-detection-loans-ml

Each workload gets its own completely isolated CloudFormation stack, CodeCommit repos, pipelines, artifacts bucket, and SSM Parameter Store paths. The company, tenant, environment, region, and use case are all the same, yet the result is four independent ML scaffolding environments for different fraud detection workloads.

Config fields driving each dimension:

Dimension    Config Field
-----------  ---------------------
Company      client.company_prefix
Tenant       client.tenant_id
Environment  environment.env
Region       environment.region
Use Case     ml_product.use_case
Workload     ml_product.workload

This is one of the strongest differentiators of the Axon Tech Labs MLOps Suite: a single tool handles the full complexity of a large enterprise with subsidiaries, multiple teams, multiple environments, and multiple regions, all from simple YAML configuration files.

Multi-environment deployments are handled by creating separate configuration files per environment. No special multi-environment logic is needed in the tool, because isolation is automatic through resource naming.

Configuration files per environment:

configs/
├── globalbank-dev-c001-us-west-2-fraud-detection-ml.yaml
├── globalbank-staging-c001-us-west-2-fraud-detection-ml.yaml
└── globalbank-prod-c001-us-west-2-fraud-detection-ml.yaml

Each config sets environment.env to dev, staging, or prod respectively. The generator produces three completely isolated CloudFormation stacks:

globalbank-dev-c001-us-west-2-fraud-detection-ml-stack
globalbank-staging-c001-us-west-2-fraud-detection-ml-stack
globalbank-prod-c001-us-west-2-fraud-detection-ml-stack

Each stack has its own isolated resources (CodeCommit repos, CodeBuild projects, CodePipeline pipelines, S3 artifacts bucket) and its own SSM Parameter Store paths under /ml/globalbank-{env}-c001-.../.

Tiers can differ per environment, a common and recommended pattern:

Environment  Tier          Rationale
-----------  ------------  -------------------------------------------------------------------
dev          starter       Cheap, fast iteration, no compliance overhead
staging      professional  Mirrors prod, event-driven approval gate, monitoring
prod         enterprise    VPC-only, KMS encryption, compliance logging, permission boundaries

This pattern creates a natural upgrade path and keeps costs low in non-production environments while maintaining full enterprise controls in production.

Note on licensing: The ML Provisioner license is enforced per AWS account via AWS License Manager. Each AWS account requires its own Marketplace subscription. Clients running all environments in a single AWS account need only one subscription. Clients following the recommended account-per-environment isolation pattern will need one subscription per account, which is consistent with standard AWS Marketplace licensing across all IaC tools.

Each environment is deployed independently:

# Deploy dev (starter tier)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  -e AWS_PROFILE=dev-profile \
  ml-provisioner:starter \
  -con globalbank-dev-c001-us-west-2-fraud-detection-ml.yaml \
  -act deploy-product --force

# Deploy staging (professional tier)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  -e AWS_PROFILE=staging-profile \
  ml-provisioner:professional \
  -con globalbank-staging-c001-us-west-2-fraud-detection-ml.yaml \
  -act deploy-product --force

# Deploy prod (enterprise tier)
docker run --rm \
  -v ~/.aws:/home/mluser/.aws:ro \
  -v $(pwd)/ml/configs:/app/configs:ro \
  -v $(pwd)/ml/templates:/app/templates \
  -v $(pwd)/ml/reports:/app/reports \
  -e AWS_PROFILE=prod-profile \
  ml-provisioner:enterprise \
  -con globalbank-prod-c001-us-west-2-fraud-detection-ml.yaml \
  -act deploy-product --force

See VPC Integration Modes in the Enterprise Tier section above. The vpc_integration block in the config replaces the simple vpc_source field and supports both standalone and SG Provisioner integration modes.


CloudFormation Generation

The cfn_generator.py module generates CloudFormation templates from two inputs: the selected tier blueprint and the client YAML configuration. The blueprint defines structure. The client config provides identity. They meet only inside the generator, with no string substitution and no placeholders.

Generation Flow

YAML Config
    ↓
ConfigLoader (parses YAML, resolves paths)
    ↓
ConfigValidator (validates against tier JSON schema)
    ↓
ProductLoader (loads tier blueprint)
    ↓
ProductValidator (security + schema checks)
    ↓
CfnGenerator (constructs CFN as Python dicts, dumps to YAML)
    ↓
CloudFormation Template (saved to templates/)

Security Validation

The ProductValidator runs before template generation and blocks or warns on dangerous patterns. Checks are scoped strictly to resources provisioned by ML Provisioner.

Blocking Checks (generation halted)

IAM:

  • IAM roles with * resource and no condition: enforces the Principle of Least Privilege

  • Inline IAM policies: blocked in favor of managed policies for better versioning and reusability

  • Hardcoded credentials in generated template: scans for plaintext AWS access key patterns (AKIA...) before saving

Storage:

  • Public S3 buckets (enterprise tier): blocks templates where PublicAccessBlockConfiguration is disabled

  • Missing KMS encryption (enterprise tier): Customer Managed Keys are required for auditability and control

Compute:

  • CodeBuild projects with privileged mode enabled without justification: prevents root-level access to the host Docker daemon, a common privilege escalation vector

Networking:

  • SSH (port 22) or RDP (port 3389) open to 0.0.0.0/0 on the endpoint SecurityGroup (enterprise standalone mode)

Logging:

  • CloudWatch LogGroup retention below 90 days (enterprise tier): enforces minimum retention for audit and incident response
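
Three of the blocking checks above can be sketched as plain functions over the generated template. This is a hedged illustration, not the ProductValidator implementation: blocking_findings is a hypothetical name, the access key regex (AKIA followed by 16 uppercase letters or digits) is an assumption about what "AKIA..." patterns means, and tier and mode scoping is omitted.

```python
import re
from typing import Any, Dict, List

ACCESS_KEY_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")  # assumed key ID shape
MIN_LOG_RETENTION_DAYS = 90
ADMIN_PORTS = (22, 3389)  # SSH and RDP


def blocking_findings(template_text: str, template: Dict[str, Any]) -> List[str]:
    """Return blocking findings; an empty list means generation may proceed."""
    findings = []
    if ACCESS_KEY_PATTERN.search(template_text):
        findings.append("hardcoded AWS access key ID")
    for logical_id, resource in template.get("Resources", {}).items():
        props = resource.get("Properties", {})
        if resource.get("Type") == "AWS::Logs::LogGroup":
            retention = props.get("RetentionInDays")
            # An omitted RetentionInDays means logs never expire, so only an
            # explicit value below the minimum is flagged.
            if retention is not None and retention < MIN_LOG_RETENTION_DAYS:
                findings.append(f"{logical_id}: log retention below {MIN_LOG_RETENTION_DAYS} days")
        if resource.get("Type") == "AWS::EC2::SecurityGroup":
            for rule in props.get("SecurityGroupIngress", []):
                open_world = rule.get("CidrIp") == "0.0.0.0/0"
                all_ports = str(rule.get("IpProtocol")) == "-1"
                low, high = rule.get("FromPort", 0), rule.get("ToPort", 65535)
                exposed = all_ports or any(low <= port <= high for port in ADMIN_PORTS)
                if open_world and exposed:
                    findings.append(f"{logical_id}: SSH/RDP open to 0.0.0.0/0")
    return findings
```

Returning a list rather than raising on the first hit lets a caller report every blocking issue in one pass before halting generation.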

Warning Checks (generation proceeds with warning)

IAM:

  • Roles containing high-risk actions (iam:PassRole, iam:CreateAccessKey, s3:DeleteBucket): warned rather than blocked because iam:PassRole is legitimately required for SageMaker execution roles. The warning includes justification guidance

Tags:

  • Missing required tags on any taggable resource: essential for ABAC (Attribute-Based Access Control) and governance

Out of Scope

  • VPC Flow Logs: VPC Provisioner responsibility. ML Provisioner does not create VPCs

  • Security Groups for application tiers: SG Provisioner responsibility

  • RDS PubliclyAccessible: ML Provisioner does not provision RDS

  • Load Balancer / CloudFront HTTPS enforcement: ML Provisioner does not provision these resources

  • EC2 public IP assignment: ML Provisioner does not provision EC2 instances

  • S3 account-level public access blocks: ML Provisioner does not modify account-level settings

For the full technical reference including blueprint schema, generation algorithm, client data injection, naming conventions, conditional generation logic, and concrete examples see CFN_GENERATOR.md.


SSM Parameter Store Integration

All deployed resource identifiers are stored in SSM Parameter Store at deployment time under the path /ml/{product-name}/, where {product-name} is derived from the configuration as:

{company_prefix}-{env}-{tenant_id}-{region}-{use_case}-ml

When ml_product.workload is set, -{workload} is inserted before the -ml suffix, as described under Product Naming Convention.

Example (globalbank enterprise deployment):

/ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/ModelPackageGroupArn
/ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/RepositoryUrl
/ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/BucketName
...

Full parameter list by tier:

/ml/{product-name}/ModelPackageGroupArn           (all tiers)
/ml/{product-name}/RepositoryUrl                  (codecommit only)
/ml/{product-name}/BucketName                     (professional + enterprise)
/ml/{product-name}/DashboardName                  (professional + enterprise)
/ml/{product-name}/KmsKeyArn                      (enterprise only)
/ml/{product-name}/LogGroupName                   (enterprise only)
/ml/{product-name}/SecurityGroupId                (enterprise standalone mode only)
/ml/{product-name}/VpcEndpointIdSagemakerApi       (enterprise only)
/ml/{product-name}/VpcEndpointIdSagemakerRuntime   (enterprise only)
/ml/{product-name}/VpcEndpointIdS3                (enterprise only)
/ml/{product-name}/VpcEndpointIdSts               (enterprise only)

These paths are available for consumption by downstream tooling, such as a SageMaker Provisioner, to configure Studio domains and projects without manual cross-referencing.
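
The per-tier parameter list can be expressed as a small lookup, which a downstream consumer might use to know which paths to expect. The helper below is a hypothetical sketch built only from the list above; the vpc_mode values follow the config (standalone, sg-provisioner).

```python
from typing import List


def ssm_parameter_paths(product: str, tier: str, source_control: str,
                        vpc_mode: str = "standalone") -> List[str]:
    """List the SSM paths published for a given tier and scenario."""
    names = ["ModelPackageGroupArn"]
    if source_control == "codecommit":
        names.append("RepositoryUrl")
    if tier in ("professional", "enterprise"):
        names += ["BucketName", "DashboardName"]
    if tier == "enterprise":
        names += ["KmsKeyArn", "LogGroupName"]
        if vpc_mode == "standalone":
            names.append("SecurityGroupId")
        names += [
            "VpcEndpointIdSagemakerApi", "VpcEndpointIdSagemakerRuntime",
            "VpcEndpointIdS3", "VpcEndpointIdSts",
        ]
    return [f"/ml/{product}/{name}" for name in names]


paths = ssm_parameter_paths(
    "globalbank-prod-c001-us-west-2-demand-forecasting-ml", "enterprise", "codecommit"
)
print(len(paths))
```

The counts produced this way line up with the SSM parameter totals quoted in the tier descriptions: 2 and 4 for Starter and Professional with CodeCommit, and 11 down to 9 across the Enterprise scenarios.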


Actions ReferenceΒΆ

All actions require AWS credentials for subscription validation. Actions marked Mutating additionally require --force.

| Action                 | AWS Calls         | --force | Purpose                                  |
|------------------------|-------------------|---------|------------------------------------------|
| validate-config        | subscription only | ❌      | Validate YAML schema and field values    |
| list-products          | subscription only | ❌      | List available tier templates            |
| show-product           | subscription only | ❌      | Display tier resources and configuration |
| create-policy          | subscription only | ❌      | Generate least-privilege IAM policy      |
| create-prov-template   | subscription only | ❌      | Generate CloudFormation template         |
| validate-prov-template | subscription only | ❌      | Validate template locally                |
| create-review-report   | subscription only | ❌      | Generate pre-deployment HTML report      |
| show-changes           | read-only         | ❌      | Preview changes against deployed stack   |
| check-drift            | read-only         | ❌      | Detect infrastructure drift              |
| test-deploy            | read-only         | ❌      | Deploy with isolated suffix for testing  |
| deploy-product         | mutating          | ✅      | Deploy ML product infrastructure         |
| delete-product         | mutating          | ✅      | Delete stack and all resources           |
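The --force rule for mutating actions can be illustrated with a small guard. This is a hypothetical sketch and not the tool's CLI code: the names `ACTION_CALL_TYPES`, `requires_force`, and `check_invocation` are invented for illustration, and the mapping simply mirrors the actions reference.

```python
# Action name -> AWS call type, as listed in the actions reference.
ACTION_CALL_TYPES = {
    "validate-config": "subscription only",
    "list-products": "subscription only",
    "show-product": "subscription only",
    "create-policy": "subscription only",
    "create-prov-template": "subscription only",
    "validate-prov-template": "subscription only",
    "create-review-report": "subscription only",
    "show-changes": "read-only",
    "check-drift": "read-only",
    "test-deploy": "read-only",
    "deploy-product": "mutating",
    "delete-product": "mutating",
}


def requires_force(action):
    """Return True when the action is marked mutating."""
    return ACTION_CALL_TYPES[action] == "mutating"


def check_invocation(action, force=False):
    """Reject unknown actions and mutating actions run without --force."""
    if action not in ACTION_CALL_TYPES:
        raise ValueError(f"unknown action: {action}")
    if requires_force(action) and not force:
        raise PermissionError(f"{action} is mutating and requires --force")
    return True


check_invocation("check-drift")                 # allowed without --force
check_invocation("deploy-product", force=True)  # allowed only with --force
```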


Source Tree

packages/ml-provisioner-tool/
├── configs
│   ├── edge-prod-b001-us-west-2-fraud-detection-ml-codecommit-workload.yaml
│   ├── edge-prod-b001-us-west-2-fraud-detection-ml-codecommit.yaml
│   ├── edge-prod-b001-us-west-2-fraud-detection-ml-s3-workload.yaml
│   ├── edge-prod-b001-us-west-2-fraud-detection-ml-s3.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-sgprov-direct.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-sgprov-ssm.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-direct-rtb.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-direct.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm-workload.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-sgprov-direct.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-sgprov-ssm.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-direct-rtb.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-direct.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-ssm-workload.yaml
│   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-ssm.yaml
│   ├── techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit-workload.yaml
│   ├── techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit.yaml
│   ├── techcorp-prod-a001-us-west-2-customer-churn-ml-s3-workload.yaml
│   ├── techcorp-prod-a001-us-west-2-customer-churn-ml-s3.yaml
│   └── examples
│       ├── enterprise
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-sgprov-direct.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-sgprov-ssm.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-direct-rtb.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-direct.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm-workload.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-standalone-ssm.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-sgprov-direct.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-sgprov-ssm.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-direct-rtb.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-direct.yaml
│       │   ├── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-ssm-workload.yaml
│       │   └── globalbank-prod-c001-us-west-2-demand-forecasting-ml-s3-standalone-ssm.yaml
│       ├── professional
│       │   ├── edge-prod-b001-us-west-2-fraud-detection-ml-codecommit-workload.yaml
│       │   ├── edge-prod-b001-us-west-2-fraud-detection-ml-codecommit.yaml
│       │   ├── edge-prod-b001-us-west-2-fraud-detection-ml-s3-workload.yaml
│       │   └── edge-prod-b001-us-west-2-fraud-detection-ml-s3.yaml
│       └── starter
│           ├── techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit-workload.yaml
│           ├── techcorp-prod-a001-us-west-2-customer-churn-ml-codecommit.yaml
│           ├── techcorp-prod-a001-us-west-2-customer-churn-ml-s3-workload.yaml
│           └── techcorp-prod-a001-us-west-2-customer-churn-ml-s3.yaml
├── docker
│   ├── Dockerfile
│   └── entrypoint.sh
├── docs
│   ├── sphinx
│   │   └── source
│   │       ├── conf.py
│   │       ├── index.rst
│   │       └── onboarding
│   ├── APPLICATION_ARCHITECTURE.md
│   ├── CFN_GENERATOR.md                      # internal
│   ├── CFN_GENERATOR_IMPLEMENTATION_STEPS.md # internal
│   ├── CONFIGURATION.md
│   ├── CONFIGURATION_GUIDE.md
│   ├── FEEDBACK.md
│   ├── IAM_PERMISSIONS.md
│   ├── INTEGRATION_EXAMPLES.md
│   ├── MIGRATION_GUIDE.md
│   ├── NAMING_CONVENTIONS.md
│   ├── PREREQUISITES.md
│   ├── README.md
│   ├── RELEASE_NOTES.md
│   ├── RESOURCES_EXPLAINED.md
│   ├── ROADMAP.md
│   ├── SAMPLE_REPORTS.md
│   ├── SECURITY_GUIDELINES.md
│   ├── SUPPORT.md
│   ├── TROUBLESHOOTING.md
│   ├── UPDATE_PROCEDURES.md
│   └── USER_GUIDE.md
├── policies
├── reports
├── schemas
│   ├── products
│   │   ├── enterprise.yaml
│   │   ├── professional.yaml
│   │   └── starter.yaml
│   ├── validation-schema-enterprise.yaml
│   ├── validation-schema-professional.yaml
│   ├── validation-schema-starter.yaml
│   └── validation-schema.yaml
├── src
│   └── ml_provisioner
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── config
│       │   ├── __init__.py
│       │   ├── app_config.yaml
│       │   └── loader.py
│       ├── core
│       │   ├── __init__.py
│       │   └── ml_manager.py
│       ├── generators
│       │   ├── __init__.py
│       │   └── cfn_generator.py
│       ├── license
│       │   ├── __init__.py
│       │   └── validator.py
│       ├── models
│       │   ├── __init__.py
│       │   └── product.py
│       ├── products
│       │   ├── __init__.py
│       │   ├── loader.py
│       │   └── validator.py
│       └── utils
│           ├── __init__.py
│           ├── html_generator.py
│           └── review_report.py
├── templates
├── tests
├── LICENSE.txt
├── README.MD
├── Makefile
├── pyproject.toml
├── setup.py
└── uv.lock

Future Roadmap

See Roadmap for the full roadmap including planned features and deferred enhancements.