Troubleshooting

Common issues and solutions for the ML Provisioner.

Quick Diagnostics

Check AWS access

aws sts get-caller-identity

Example output:

{
    "UserId": "AIDACKCEVSQ6C2EXAMPLE",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/ml-deploy-user"
}

Check deployed stack status

ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws cloudformation describe-stacks \
  --stack-name ${ML_NAME}-stack \
  --region us-west-2 \
  --query 'Stacks[0].StackStatus' \
  --output text

Expected output when stack exists:

CREATE_COMPLETE

Expected output when stack does not exist:

An error occurred (ValidationError) when calling the DescribeStacks operation:
Stack with id globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack does not exist
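
For scripting, the two cases above can be folded into one check. The following is a minimal sketch, not part of the ML Provisioner: the function name stack_status and the NOT_FOUND marker are assumptions chosen for illustration.

```shell
# Hypothetical helper (not a provisioner command): print the stack status,
# or NOT_FOUND when describe-stacks fails. Note that any failure is reported
# as NOT_FOUND, including credential or network errors, not only the
# ValidationError shown above.
stack_status() {
  local ml_name="$1" region="$2" status
  if status=$(aws cloudformation describe-stacks \
      --stack-name "${ml_name}-stack" \
      --region "$region" \
      --query 'Stacks[0].StackStatus' \
      --output text 2>/dev/null); then
    echo "$status"
  else
    echo "NOT_FOUND"
  fi
}

# Usage:
#   stack_status globalbank-prod-c001-us-west-2-demand-forecasting-ml us-west-2
```

If the result is NOT_FOUND, run aws sts get-caller-identity first to rule out a credentials problem before concluding that the stack is missing.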

Check SSM parameters

ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws ssm get-parameters-by-path \
  --path /ml/${ML_NAME}/ \
  --region us-west-2 \
  --query 'Parameters[*].Name' \
  --output table

Example output for existing CFN stack:

--------------------------------------------------------------------------------------------
|                                    GetParametersByPath                                   |
+------------------------------------------------------------------------------------------+
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/BucketName                     |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/DashboardName                  |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/KmsKeyArn                      |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/LogGroupName                   |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/ModelPackageGroupArn           |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/RepositoryUrl                  |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/SecurityGroupId                |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdS3                |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdSagemakerApi      |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdSagemakerRuntime  |
|  /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdSts               |
+------------------------------------------------------------------------------------------+

Check recent logs

ls -lt ml/reports/*.log | head -5
grep -i "error\|failed" ml/reports/*.log | tail -20

Example output:

-rw-r--r-- 1 mluser mluser 8609 Jun  8 14:43 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-check-drift-20260608_214318_761.log
-rw-r--r-- 1 mluser mluser 8632 Jun  8 14:43 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-show-changes-20260608_214316_105.log
-rw-r--r-- 1 mluser mluser 9104 Jun  8 14:43 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-deploy-product-20260608_214143_015.log
-rw-r--r-- 1 mluser mluser 7782 Jun  8 14:41 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-create-review-report-20260608_214142_050.log
-rw-r--r-- 1 mluser mluser 7826 Jun  8 14:41 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-validate-prov-template-20260608_214141_020.log

Configuration Errors

Schema validation failed

Configuration validation failed: 'use_case' does not match pattern '^[a-z][a-z0-9-]*$'

Cause: A config field value doesn’t match the schema pattern.

Fix: Correct the field value. The use_case value must start with a lowercase letter and contain only lowercase letters, digits, and hyphens (no spaces or uppercase).

# ❌ Wrong
use_case: Customer Churn

# ✅ Correct
use_case: customer-churn

use_case too long

Configuration validation failed: 'use_case' is too long (maximum 20 characters)

Cause: use_case exceeds the 20-character maximum enforced by the schema.

Fix: Shorten the use case name. The 20-character limit exists because IAM role names (which omit region) must stay within AWS’s 64-character limit.

# ❌ Wrong: 25 characters
use_case: real-time-fraud-detection

# ✅ Correct
use_case: fraud-detection
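
Both use_case rules can be checked locally before running the provisioner. The following is a minimal sketch that assumes only the pattern and the 20-character maximum quoted in the error messages above. check_use_case is a hypothetical helper, not a provisioner command, and the schema may enforce further rules.

```shell
# Hypothetical helper: pre-check a use_case value against the pattern and
# the 20-character maximum quoted in the schema error messages.
# Returns 0 if both rules pass, 1 on a pattern mismatch, 2 if too long.
check_use_case() {
  local value="$1"
  if ! printf '%s' "$value" | grep -Eq '^[a-z][a-z0-9-]*$'; then
    echo "pattern mismatch: $value"
    return 1
  fi
  if [ "${#value}" -gt 20 ]; then
    echo "too long (${#value} characters): $value"
    return 2
  fi
  echo "ok: $value"
}

# Usage:
#   check_use_case customer-churn
#   check_use_case "Customer Churn"
```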

Tier mismatch

Tier mismatch: config tier 'enterprise' is not allowed by purchased tier 'starter'.

Cause: The tier in your config file does not match the Docker image you are running.

Fix: Use the correct image for your config tier:

Config tier     Image tag
starter         ml-provisioner:starter
professional    ml-provisioner:professional
enterprise      ml-provisioner:enterprise

Config file not found

FileNotFoundError: Configuration file not found: /app/configs/my-config.yaml

Cause: The config file name is wrong or the configs volume is not mounted correctly.

Fix: Verify the filename and the -v mount:

# Check the file exists locally
ls -la ml/configs/

# Ensure the mount is correct
-v $(pwd)/ml/configs:/app/configs:ro
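
The same check can be run before docker run so that a wrong filename fails early on the host. This is a sketch only: check_config_file is a hypothetical helper, and it verifies the local side of the mount, not what the container sees.

```shell
# Hypothetical pre-flight check: confirm the config file exists in the
# local directory that will be mounted at /app/configs.
check_config_file() {
  local config_dir="$1" config_file="$2"
  if [ ! -f "${config_dir}/${config_file}" ]; then
    echo "Not found: ${config_dir}/${config_file}"
    return 1
  fi
  echo "Found: ${config_dir}/${config_file}"
}

# Usage:
#   check_config_file "$(pwd)/ml/configs" my-config.yaml
```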

AWS Credential Issues

No credentials found

NoCredentialsError: Unable to locate credentials

Fix: Mount AWS credentials read-only:

-v ~/.aws:/home/mluser/.aws:ro

Or set environment variables:

-e AWS_ACCESS_KEY_ID=<access_key> \
-e AWS_SECRET_ACCESS_KEY=<secret_key> \
-e AWS_DEFAULT_REGION=us-west-2

Access denied

ClientError: An error occurred (AccessDenied) when calling the CreateStack operation

Cause: The IAM user or role lacks the required permissions.

Fix:

  1. Run create-policy to generate a scoped IAM policy for your config

  2. Attach the generated policy to your IAM user or role

  3. See IAM_PERMISSIONS.md for the full permissions reference

Expired credentials

ClientError: An error occurred (ExpiredTokenException)

Fix: Refresh your credentials:

# For AWS SSO
aws sso login --profile your-profile

# For assumed roles — regenerate temporary credentials
aws sts assume-role --role-arn <role-arn> --role-session-name session
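
The assume-role call only prints the temporary credentials. They still have to reach the container. One way to do that is sketched below, assuming credentials are passed as environment variables. assume_role_env is a hypothetical helper, and the session name ml-provisioner is an arbitrary illustrative value.

```shell
# Hypothetical helper: assume a role and export the temporary credentials
# into the current shell. Temporary credentials need AWS_SESSION_TOKEN in
# addition to the access key ID and secret access key.
assume_role_env() {
  local role_arn="$1" creds
  creds=$(aws sts assume-role \
    --role-arn "$role_arn" \
    --role-session-name ml-provisioner \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text) || return 1
  # Text output is three tab-separated fields; split them on whitespace.
  set -- $creds
  export AWS_ACCESS_KEY_ID="$1"
  export AWS_SECRET_ACCESS_KEY="$2"
  export AWS_SESSION_TOKEN="$3"
}

# Usage:
#   assume_role_env <role-arn>
```

After the function succeeds, passing -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN (names only, no values) to docker run forwards the exported values from the host shell.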

VPC and SSM Prerequisite Errors (Enterprise Tier)

VPC SSM parameters not found

ClientError: Parameter /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId not found

Cause: The VPC Provisioner has not been run, or the SSM path in your config is wrong.

Fix: Verify the SSM paths exist:

VPC_NAME=globalbank-prod-c001-us-west-2-vpc
aws ssm get-parameters-by-path \
  --path /vpc/${VPC_NAME}/ \
  --region us-west-2 \
  --query 'Parameters[*].Name' \
  --output table

Expected output must include /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId and /vpc/globalbank-prod-c001-us-west-2-vpc/PrivateSubnetIds. If missing, deploy the VPC Provisioner first or switch to vpc_source: direct.
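
Instead of reading the table by eye, the two required parameters can be requested by name. The sketch below relies on aws ssm get-parameters returning the names it could not find in InvalidParameters. check_vpc_params is a hypothetical helper, not a provisioner command.

```shell
# Hypothetical helper: verify that the two required VPC parameters exist.
# Returns 0 if both exist, 1 if any is missing, 2 if the AWS call fails.
check_vpc_params() {
  local vpc_name="$1" region="$2" missing
  missing=$(aws ssm get-parameters \
    --names "/vpc/${vpc_name}/VPCId" "/vpc/${vpc_name}/PrivateSubnetIds" \
    --region "$region" \
    --query 'InvalidParameters' \
    --output text) || return 2
  if [ -n "$missing" ] && [ "$missing" != "None" ]; then
    echo "Missing SSM parameters: $missing"
    return 1
  fi
  echo "All required VPC parameters found"
}

# Usage:
#   check_vpc_params globalbank-prod-c001-us-west-2-vpc us-west-2
```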

SG Provisioner SSM parameters not found

ClientError: Parameter /sg/globalbank-prod-c001-us-west-2-vpc/AppSecurityGroupOutput not found

Cause: The SG Provisioner has not been run, or the SSM path in your config is wrong.

Fix: Verify the SG SSM params exist:

SGPROV_NAME=globalbank-prod-c001-us-west-2-sg
aws ssm get-parameters-by-path \
  --path /sg/${SGPROV_NAME}/ \
  --region us-west-2 \
  --query 'Parameters[*].Name' \
  --output table

If missing, deploy the SG Provisioner first or switch to vpc_integration.mode: standalone.

VPC does not exist

ClientError: The vpc ID 'vpc-xxxxxxxx' does not exist

Cause: The VPC ID in your config (vpc_source: direct) points to a VPC that doesn’t exist.

Fix: Verify the VPC ID:

aws ec2 describe-vpcs \
  --vpc-ids vpc-xxxxxxxx \
  --region us-west-2 \
  --query 'Vpcs[0].State'

CloudFormation Errors

Stack already exists

ClientError: Stack with id globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack already exists

Cause: deploy-product was called on an already-deployed stack. The ML Provisioner does not support stack updates — deploy-product is for initial creation only.

Fix:

  • To preview changes against the existing stack: use show-changes

  • To replace the stack: run delete-product --force first, then deploy-product --force

  • See UPDATE_PROCEDURES.md for the full guidance on modifications

Stack creation failed

WaiterError: Waiter StackCreateComplete failed: Waiter encountered a terminal failure state

Cause: One or more resources failed to create. The CFN error is logged by the provisioner.

Fix: Check the stack events in the AWS Console or CLI:

ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws cloudformation describe-stack-events \
  --stack-name ${ML_NAME}-stack \
  --region us-west-2 \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
  --output table

IAM name too long

CREATE_FAILED: Resource handler returned message: Value ... for field Name is too long

Cause: The combination of company_prefix, tenant_id, and use_case produces an IAM role name longer than 64 characters.

Fix: Shorten use_case (maximum 20 characters enforced by schema) or company_prefix.


Template Generation Errors

Template file not found at deploy time

RuntimeError: Template file not found: /app/templates/globalbank-prod-c001-us-west-2-demand-forecasting-ml-template.yaml

Cause: create-prov-template was not run before deploy-product, or the templates volume is not mounted.

Fix:

# Run create-prov-template first
docker run --rm ... -act create-prov-template

# Ensure templates volume is mounted for deploy-product
-v $(pwd)/ml/templates:/app/templates

Unresolved !Ref or !GetAtt

Template validation FAILED: Unresolved !Ref 'SomeResource' at Resources.xxx

Cause: A resource reference in the generated template points to a resource that doesn’t exist. This indicates a bug in the template generator.

Fix: Report to support with the config file and the full validation output. See Getting Help.


Docker Errors

Permission denied on volume mount

PermissionError: [Errno 13] Permission denied: '/app/configs/my-config.yaml'

Fix: Ensure the local directory has read permissions:

chmod 755 ml/configs/
chmod 644 ml/configs/*.yaml
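
To see which files need the chmod, the sketch below lists config files that are not world-readable. It assumes the error comes from the container user having a different UID than the file owner on the host, which is a common cause but not confirmed for every setup. list_unreadable_configs is a hypothetical helper.

```shell
# Hypothetical helper: list YAML files under a directory whose mode lacks
# the "other" read bit, so a container user with a different UID than the
# file owner may be unable to read them through the volume mount.
list_unreadable_configs() {
  find "$1" -type f -name '*.yaml' ! -perm -o=r
}

# Usage:
#   list_unreadable_configs ml/configs
```

No output means every YAML file in the directory already has the read bit set for other users.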

Image not found

Unable to find image 'ml-provisioner:enterprise' locally

Cause: The Docker image has not been pulled from AWS Marketplace.

Fix: Pull the image from your AWS Marketplace ECR subscription. See UPDATE_PROCEDURES.md.

Wrong user in container

Error: cannot open /home/mluser/.aws/credentials: Permission denied

Cause: The AWS credentials are mounted to the wrong user path.

Fix: Ensure the mount uses /home/mluser/.aws:

-v ~/.aws:/home/mluser/.aws:ro

SSM Parameter Issues

Parameters not found after deployment

aws ssm get-parameters-by-path returns empty Parameters list

Cause: The stack deployed successfully but SSM parameters were not written, or the path in the query is wrong.

Fix: Verify the correct ml_name:

# Get the exact ml_name from the stack outputs
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws cloudformation describe-stacks \
  --stack-name ${ML_NAME}-stack \
  --region us-west-2 \
  --query 'Stacks[0].Outputs'

Stale parameters after failed deletion

If delete-product fails partway through, SSM parameters may be left behind after the CFN stack is gone.

Fix: Delete the stale parameters manually:

ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
# List the parameters under the path and delete each one
aws ssm get-parameters-by-path \
  --path /ml/${ML_NAME}/ \
  --region us-west-2 \
  --query 'Parameters[*].Name' \
  --output text | tr '\t' '\n' | while read -r name; do
    echo "Deleting $name"
    aws ssm delete-parameter --name "$name" --region us-west-2
done

Drift Detection Issues

No deployed stack found

⚠️  No deployed stack found for drift detection

Cause: The stack has not been deployed yet, or was deleted.

Fix: Deploy the stack first with deploy-product --force.

Drift detected

⚠️  DRIFT DETECTED in stack: globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack

Cause: Resources were modified outside of CloudFormation (console, CLI, or another tool).

Fix:

  1. Identify the change — check CloudTrail for who made the modification

  2. If the change is desired — update your config to reflect it and regenerate the template

  3. If the change is not desired — redeploy to restore the intended state
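
Before step 1, it helps to know which resources drifted. The sketch below uses the standard CloudFormation CLI to list them. It reads the results of the most recent drift detection run on the stack, so it assumes one has already completed. show_drifted_resources is a hypothetical helper, not a provisioner command.

```shell
# Hypothetical helper: list resources reported as MODIFIED or DELETED by
# the most recent CloudFormation drift detection run on the stack.
show_drifted_resources() {
  local ml_name="$1" region="$2"
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "${ml_name}-stack" \
    --region "$region" \
    --stack-resource-drift-status-filters MODIFIED DELETED \
    --query 'StackResourceDrifts[].[LogicalResourceId,ResourceType,StackResourceDriftStatus]' \
    --output table
}

# Usage:
#   show_drifted_resources globalbank-prod-c001-us-west-2-demand-forecasting-ml us-west-2
```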


Deletion Errors

Stack does not exist

RuntimeError: ML stack 'globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack' does not exist

Cause: The stack was already deleted or was never deployed.

Fix: Verify the stack name:

aws cloudformation list-stacks \
  --region us-west-2 \
  --query 'StackSummaries[?contains(StackName, `globalbank-prod-c001-us-west-2-demand-forecasting-ml`)].[StackName,StackStatus]' \
  --output table

Force flag required

⚠️  ML product deletion requires --force flag

Fix: Add --force to the delete command:

docker run --rm ... -act delete-product --force

Delete failed — S3 bucket not empty

DELETE_FAILED: Cannot delete entity, it has dependent objects

Cause: The S3 artifacts bucket contains objects — CloudFormation cannot delete a non-empty bucket.

Fix: Empty the bucket first, then retry deletion:

ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws s3 rm s3://${ML_NAME}-artifacts --recursive --region us-west-2

# Then retry deletion
docker run --rm ... -act delete-product --force
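
If the bucket name does not follow the ${ML_NAME}-artifacts convention, it may be safer to read it from SSM. The sketch below assumes the BucketName parameter listed under Quick Diagnostics holds the artifacts bucket name and has not yet been deleted. artifact_bucket is a hypothetical helper.

```shell
# Hypothetical helper: read the artifacts bucket name from the BucketName
# SSM parameter rather than constructing it from the ml_name.
artifact_bucket() {
  local ml_name="$1" region="$2"
  aws ssm get-parameter \
    --name "/ml/${ml_name}/BucketName" \
    --region "$region" \
    --query 'Parameter.Value' \
    --output text
}

# Usage:
#   BUCKET=$(artifact_bucket "$ML_NAME" us-west-2) \
#     && aws s3 rm "s3://${BUCKET}" --recursive --region us-west-2
```

If versioning is enabled on the bucket, aws s3 rm removes only the current objects. Earlier object versions and delete markers must be removed separately before the bucket can be deleted.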

Getting Help

If you cannot resolve an issue with this guide:

  1. Collect diagnostic information:

    # Stack events
    aws cloudformation describe-stack-events \
      --stack-name globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack \
      --region us-west-2 \
      --output json > stack-events.json
    
    # Recent provisioner logs
    mkdir -p ./diagnostic-logs
    cp ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-*.log ./diagnostic-logs/
    
  2. Check the AWS Console — CloudFormation → Stacks → Events tab shows detailed failure reasons

  3. Contact support — see SUPPORT.md for contact details. Include:

    • Config file (with account ID replaced by 123456789012)

    • Provisioner log file

    • Stack events output

    • Exact error message
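
The account ID replacement can be done with sed. The following is a rough sketch: redact_account_id is a hypothetical helper, and it replaces every run of 12 digits, not only account IDs, so review the output before sending it.

```shell
# Hypothetical helper: print a copy of a file with every 12-digit run
# replaced by the placeholder account ID 123456789012.
redact_account_id() {
  sed -E 's/[0-9]{12}/123456789012/g' "$1"
}

# Usage:
#   redact_account_id ml/configs/my-config.yaml > my-config.redacted.yaml
```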