Troubleshooting¶
Common issues and solutions for the ML Provisioner.
Quick Diagnostics¶
Check AWS access¶
aws sts get-caller-identity
Example output:
{
"UserId": "AIDACKCEVSQ6C2EXAMPLE",
"Account": "123456789012",
"Arn": "arn:aws:iam::123456789012:user/ml-deploy-user"
}
Check deployed stack status¶
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws cloudformation describe-stacks \
--stack-name ${ML_NAME}-stack \
--region us-west-2 \
--query 'Stacks[0].StackStatus' \
--output text
Expected output when stack exists:
CREATE_COMPLETE
Expected output when stack does not exist:
An error occurred (ValidationError) when calling the DescribeStacks operation:
Stack with id globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack does not exist
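When scripting this check, it can help to reduce the raw status string to a coarse category. The sketch below is one way to do that. `classify_stack_status` is a hypothetical helper, not part of the ML Provisioner, and the grouping of CloudFormation statuses into categories is an assumption you may want to adjust.

```shell
# Hypothetical helper (not part of the ML Provisioner): map a CloudFormation
# stack status string to a coarse category for use in scripts.
classify_stack_status() {
  case "$1" in
    "")                              echo "missing" ;;      # nothing returned by describe-stacks
    CREATE_COMPLETE|UPDATE_COMPLETE) echo "healthy" ;;
    *_IN_PROGRESS)                   echo "in-progress" ;;  # includes ROLLBACK_IN_PROGRESS
    *_FAILED|*ROLLBACK_COMPLETE)     echo "failed" ;;
    DELETE_COMPLETE)                 echo "deleted" ;;
    *)                               echo "unknown" ;;
  esac
}
```

For example, `classify_stack_status "$(aws cloudformation describe-stacks --stack-name "${ML_NAME}-stack" --region us-west-2 --query 'Stacks[0].StackStatus' --output text 2>/dev/null)"` prints `healthy` for the `CREATE_COMPLETE` case above. With stderr discarded, any CLI failure (including missing credentials) produces an empty string and is reported as `missing`, so treat that result as a prompt to rerun the command by hand.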
Check SSM parameters¶
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws ssm get-parameters-by-path \
--path /ml/${ML_NAME}/ \
--region us-west-2 \
--query 'Parameters[*].Name' \
--output table
Example output for existing CFN stack:
--------------------------------------------------------------------------------------------
| GetParametersByPath |
+------------------------------------------------------------------------------------------+
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/BucketName |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/DashboardName |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/KmsKeyArn |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/LogGroupName |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/ModelPackageGroupArn |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/RepositoryUrl |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/SecurityGroupId |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdS3 |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdSagemakerApi |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdSagemakerRuntime |
| /ml/globalbank-prod-c001-us-west-2-demand-forecasting-ml/VpcEndpointIdSts |
+------------------------------------------------------------------------------------------+
Check recent logs¶
ls -lt ml/reports/*.log | head -5
grep -i "error\|failed" ml/reports/*.log | tail -20
Example output:
-rw-r--r-- 1 mluser mluser 8609 Jun 8 14:43 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-check-drift-20260608_214318_761.log
-rw-r--r-- 1 mluser mluser 8632 Jun 8 14:43 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-show-changes-20260608_214316_105.log
-rw-r--r-- 1 mluser mluser 9104 Jun 8 14:43 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-deploy-product-20260608_214143_015.log
-rw-r--r-- 1 mluser mluser 7782 Jun 8 14:41 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-create-review-report-20260608_214142_050.log
-rw-r--r-- 1 mluser mluser 7826 Jun 8 14:41 ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-codecommit-ssm-validate-prov-template-20260608_214141_020.log
Configuration Errors¶
Schema validation failed¶
Configuration validation failed: 'use_case' does not match pattern '^[a-z][a-z0-9-]*$'
Cause: A config field value doesn’t match the schema pattern.
Fix: Check the field value — use_case must be lowercase with hyphens only, no spaces or uppercase.
# ❌ Wrong
use_case: Customer Churn
# ✅ Correct
use_case: customer-churn
use_case too long¶
Configuration validation failed: 'use_case' is too long (maximum 20 characters)
Cause: use_case exceeds the 20-character maximum enforced by the schema.
Fix: Shorten the use case name. The 20-character limit exists because IAM role names (which omit region) must stay within AWS’s 64-character limit.
# ❌ Wrong: 25 characters
use_case: real-time-fraud-detection
# ✅ Correct
use_case: fraud-detection
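Both use_case rules can be checked locally before running the provisioner. The following is a minimal sketch, assuming only the pattern and the 20-character maximum quoted in the error messages above. `check_use_case` is a hypothetical helper, and the real schema may enforce additional rules.

```shell
# Hypothetical pre-check for a use_case value. It mirrors only the two rules
# quoted in the error messages: the pattern ^[a-z][a-z0-9-]*$ and a maximum
# length of 20 characters.
check_use_case() {
  value="$1"
  if ! printf '%s' "$value" | grep -Eq '^[a-z][a-z0-9-]*$'; then
    echo "pattern"     # uppercase, spaces, leading digit, or empty value
    return 1
  fi
  if [ "${#value}" -gt 20 ]; then
    echo "too-long"
    return 1
  fi
  echo "ok"
}
```

`check_use_case customer-churn` prints `ok`, `check_use_case "Customer Churn"` prints `pattern`, and `check_use_case real-time-fraud-detection` prints `too-long`.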
Tier mismatch¶
Tier mismatch: config tier 'enterprise' is not allowed by purchased tier 'starter'.
Cause: The tier in your config file does not match the Docker image you are running.
Fix: Use the correct image for your config tier:
| Config | Image tag |
|---|---|
| | |
| | |
| | |
Config file not found¶
FileNotFoundError: Configuration file not found: /app/configs/my-config.yaml
Cause: The config file name is wrong or the configs volume is not mounted correctly.
Fix: Verify the filename and the -v mount:
# Check the file exists locally
ls -la ml/configs/
# Ensure the mount is correct
-v $(pwd)/ml/configs:/app/configs:ro
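A quick local check before `docker run` can separate a wrong filename from a wrong mount directory. This is a sketch only: `check_config_file` is a hypothetical helper, and it inspects the host directory you intend to mount at /app/configs, not the container itself.

```shell
# Hypothetical pre-flight check: confirm that the config file exists and is
# readable in the local directory that will be mounted at /app/configs.
check_config_file() {
  dir="$1"
  file="$2"
  if [ ! -d "$dir" ]; then
    echo "missing-dir"
    return 1
  fi
  if [ ! -f "$dir/$file" ]; then
    echo "missing-file"
    return 1
  fi
  if [ ! -r "$dir/$file" ]; then
    echo "unreadable"
    return 1
  fi
  echo "ok"
}
```

For the example above, `check_config_file "$(pwd)/ml/configs" my-config.yaml` should print `ok` before you start the container.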
AWS Credential Issues¶
No credentials found¶
NoCredentialsError: Unable to locate credentials
Fix: Mount AWS credentials read-only:
-v ~/.aws:/home/mluser/.aws:ro
Or set environment variables:
-e AWS_ACCESS_KEY_ID=<access_key> \
-e AWS_SECRET_ACCESS_KEY=<secret_key> \
-e AWS_DEFAULT_REGION=us-west-2
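To decide which of the two options applies on your machine, you can check what credentials are available on the host. The sketch below assumes the standard AWS CLI locations (`~/.aws/credentials` and `~/.aws/config`). `credential_source` is a hypothetical helper and does not verify that the credentials are valid.

```shell
# Hypothetical helper: report where AWS credentials would come from on the host.
# It checks only for presence, not validity.
credential_source() {
  aws_dir="${1:-$HOME/.aws}"
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "env"      # pass these through with -e flags
  elif [ -f "$aws_dir/credentials" ] || [ -f "$aws_dir/config" ]; then
    echo "files"    # mount the directory read-only with -v
  else
    echo "none"
    return 1
  fi
}
```

A result of `none` means neither option will work until credentials are configured. Follow up with `aws sts get-caller-identity` to confirm that the credentials found are accepted.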
Access denied¶
ClientError: An error occurred (AccessDenied) when calling the CreateStack operation
Cause: The IAM user or role lacks the required permissions.
Fix:
Run create-policy to generate a scoped IAM policy for your config
Attach the generated policy to your IAM user or role
See IAM_PERMISSIONS.md for the full permissions reference
Expired credentials¶
ClientError: An error occurred (ExpiredTokenException)
Fix: Refresh your credentials:
# For AWS SSO
aws sso login --profile your-profile
# For assumed roles — regenerate temporary credentials
aws sts assume-role --role-arn <role-arn> --role-session-name session
VPC and SSM Prerequisite Errors (Enterprise Tier)¶
VPC SSM parameters not found¶
ClientError: Parameter /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId not found
Cause: VPC Provisioner has not been run, or the SSM path in your config is wrong.
Fix: Verify the SSM paths exist:
VPC_NAME=globalbank-prod-c001-us-west-2-vpc
aws ssm get-parameters-by-path \
--path /vpc/${VPC_NAME}/ \
--region us-west-2 \
--query 'Parameters[*].Name' \
--output table
Expected output must include /vpc/globalbank-prod-c001-us-west-2-vpc/VPCId and /vpc/globalbank-prod-c001-us-west-2-vpc/PrivateSubnetIds.
If missing, deploy the VPC Provisioner first or switch to vpc_source: direct.
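Rather than reading the table by eye, the listing can be piped through a small filter that prints only the required parameters that are absent. This is a sketch under the assumption that VPCId and PrivateSubnetIds are the two required names, as stated above. `missing_vpc_params` is a hypothetical helper.

```shell
# Hypothetical helper: read SSM parameter names on stdin (one per line) and
# print each required /vpc/<name>/ parameter that is NOT present.
missing_vpc_params() {
  vpc_name="$1"
  names="$(cat)"
  for key in VPCId PrivateSubnetIds; do
    # -F: fixed string, -x: whole-line match, -q: quiet
    if ! printf '%s\n' "$names" | grep -Fqx "/vpc/$vpc_name/$key"; then
      printf '%s\n' "/vpc/$vpc_name/$key"
    fi
  done
}
```

One way to use it is to change the listing command to `--output text` and run `... | tr '\t' '\n' | missing_vpc_params "$VPC_NAME"`. Empty output means both parameters exist.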
SG Provisioner SSM parameters not found¶
ClientError: Parameter /sg/globalbank-prod-c001-us-west-2-vpc/AppSecurityGroupOutput not found
Cause: SG Provisioner has not been run or the SSM path in your config is wrong.
Fix: Verify the SG SSM params exist:
SGPROV_NAME=globalbank-prod-c001-us-west-2-sg
aws ssm get-parameters-by-path \
--path /sg/${SGPROV_NAME}/ \
--region us-west-2 \
--query 'Parameters[*].Name' \
--output table
If missing, deploy the SG Provisioner first or switch to vpc_integration.mode: standalone.
VPC does not exist¶
ClientError: The vpc ID 'vpc-xxxxxxxx' does not exist
Cause: The VPC ID in your config (vpc_source: direct) points to a VPC that doesn’t exist.
Fix: Verify the VPC ID:
aws ec2 describe-vpcs \
--vpc-ids vpc-xxxxxxxx \
--region us-west-2 \
--query 'Vpcs[0].State'
CloudFormation Errors¶
Stack already exists¶
ClientError: Stack with id globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack already exists
Cause: deploy-product was called on an already-deployed stack. The ML Provisioner
does not support stack updates — deploy-product is for initial creation only.
Fix:
To preview changes against the existing stack: use show-changes
To replace the stack: run delete-product --force first, then deploy-product --force
See UPDATE_PROCEDURES.md for the full guidance on modifications
Stack creation failed¶
WaiterError: Waiter StackCreateComplete failed: Waiter encountered a terminal failure state
Cause: One or more resources failed to create. The CFN error is logged by the provisioner.
Fix: Check the stack events in the AWS Console or CLI:
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws cloudformation describe-stack-events \
--stack-name ${ML_NAME}-stack \
--region us-west-2 \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
--output table
IAM name too long¶
CREATE_FAILED: Resource handler returned message: Value ... for field Name is too long
Cause: The combination of company_prefix, tenant_id, and use_case produces an IAM role name exceeding 64 characters.
Fix: Shorten use_case (maximum 20 characters enforced by schema) or company_prefix.
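If you know, or can read from the failed stack events, the role name the provisioner tried to create, a length check shows how many characters need to be trimmed. The sketch below assumes only the 64-character IAM role name limit. `iam_name_fits` is a hypothetical helper, and how the provisioner composes role names from your config is not modeled here.

```shell
# Hypothetical helper: check a candidate IAM role name against the
# 64-character limit and report its length.
iam_name_fits() {
  name="$1"
  max=64
  len=${#name}
  if [ "$len" -le "$max" ]; then
    echo "ok $len"
  else
    echo "too-long $len"
    return 1
  fi
}
```

Pass the full role name from the CREATE_FAILED event. A result such as `too-long 70` means at least six characters must come out of use_case or company_prefix.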
Template Generation Errors¶
Template file not found at deploy time¶
RuntimeError: Template file not found: /app/templates/globalbank-prod-c001-us-west-2-vpc-template.yaml
Cause: create-prov-template was not run before deploy-product, or the templates
volume is not mounted.
Fix:
# Run create-prov-template first
docker run --rm ... -act create-prov-template
# Ensure templates volume is mounted for deploy-product
-v $(pwd)/ml/templates:/app/templates
Unresolved !Ref or !GetAtt¶
Template validation FAILED: Unresolved !Ref 'SomeResource' at Resources.xxx
Cause: A resource reference in the generated template points to a resource that doesn’t exist. This indicates a bug in the template generator.
Fix: Report to support with the config file and the full validation output. See Getting Help.
Docker Errors¶
Permission denied on volume mount¶
PermissionError: [Errno 13] Permission denied: '/app/configs/my-config.yaml'
Fix: Ensure the local directory has read permissions:
chmod 755 ml/configs/
chmod 644 ml/configs/*.yaml
Image not found¶
Unable to find image 'ml-provisioner:enterprise' locally
Cause: The Docker image has not been pulled from AWS Marketplace.
Fix: Pull the image from your AWS Marketplace ECR subscription. See UPDATE_PROCEDURES.md.
Wrong user in container¶
Error: cannot open /home/mluser/.aws/credentials: Permission denied
Cause: The AWS credentials are mounted to the wrong user path.
Fix: Ensure the mount uses /home/mluser/.aws:
-v ~/.aws:/home/mluser/.aws:ro
SSM Parameter Issues¶
Parameters not found after deployment¶
aws ssm get-parameters-by-path returns empty Parameters list
Cause: The stack deployed successfully but SSM parameters were not written, or the path in the query is wrong.
Fix: Verify the correct ml_name:
# Get the exact ml_name from the stack outputs
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws cloudformation describe-stacks \
--stack-name ${ML_NAME}-stack \
--region us-west-2 \
--query 'Stacks[0].Outputs'
Stale parameters after failed deletion¶
If delete-product fails partway through, SSM parameters may be left behind after
the CFN stack is gone.
Fix: Delete the stale parameters manually:
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
# List parameters
aws ssm get-parameters-by-path \
--path /ml/${ML_NAME}/ \
--region us-west-2 \
--query 'Parameters[*].Name' \
--output text | tr '\t' '\n' | while read -r name; do
  echo "Deleting $name"
  aws ssm delete-parameter --name "$name" --region us-west-2
done
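Because the loop deletes whatever names it receives, a guard that passes through only names under the expected prefix adds a margin of safety. This is a sketch, not part of the provisioner, and `only_under_prefix` is a hypothetical helper.

```shell
# Hypothetical guard: read parameter names on stdin and print only those that
# start with the given prefix. Everything else is silently dropped.
only_under_prefix() {
  prefix="$1"
  while IFS= read -r name; do
    case "$name" in
      "$prefix"*) printf '%s\n' "$name" ;;
    esac
  done
}
```

Inserting `| only_under_prefix "/ml/${ML_NAME}/"` between `tr` and the `while` loop keeps the delete step limited to this stack's parameters. Include the trailing slash so that a stack whose name merely begins with the same string is not matched.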
Drift Detection Issues¶
No deployed stack found¶
⚠️ No deployed stack found for drift detection
Cause: The stack has not been deployed yet, or was deleted.
Fix: Deploy the stack first with deploy-product --force.
Drift detected¶
⚠️ DRIFT DETECTED in stack: globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack
Cause: Resources were modified outside of CloudFormation (console, CLI, or another tool).
Fix:
Identify the change — check CloudTrail for who made the modification
If the change is desired — update your config to reflect it and regenerate the template
If the change is not desired — redeploy to restore the intended state
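To see which resources drifted before searching CloudTrail, the results of CloudFormation's drift detection can be queried directly. The sketch below uses the AWS CLI `describe-stack-resource-drifts` command and assumes a drift detection run has already completed for the stack. `list_drifted_resources` is a hypothetical wrapper, not a provisioner action.

```shell
# Hypothetical wrapper: list resources that the most recent drift detection
# run reported as modified or deleted.
list_drifted_resources() {
  stack="$1"
  region="$2"
  aws cloudformation describe-stack-resource-drifts \
    --stack-name "$stack" \
    --region "$region" \
    --stack-resource-drift-status-filters MODIFIED DELETED \
    --query 'StackResourceDrifts[].[LogicalResourceId,ResourceType,StackResourceDriftStatus]' \
    --output table
}
```

For the stack above, run `list_drifted_resources globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack us-west-2`. The logical resource IDs it returns narrow down what to look for in CloudTrail.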
Deletion Errors¶
Stack does not exist¶
RuntimeError: ML stack 'globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack' does not exist
Cause: The stack was already deleted or was never deployed.
Fix: Verify the stack name:
aws cloudformation list-stacks \
--region us-west-2 \
--query 'StackSummaries[?contains(StackName, `globalbank-prod-c001-us-west-2-demand-forecasting-ml`)].[StackName,StackStatus]' \
--output table
Force flag required¶
⚠️ ML product deletion requires --force flag
Fix: Add --force to the delete command:
docker run --rm ... -act delete-product --force
Delete failed — S3 bucket not empty¶
DELETE_FAILED: Cannot delete entity, it has dependent objects
Cause: The S3 artifacts bucket contains objects — CloudFormation cannot delete a non-empty bucket.
Fix: Empty the bucket first, then retry deletion:
ML_NAME=globalbank-prod-c001-us-west-2-demand-forecasting-ml
aws s3 rm s3://${ML_NAME}-artifacts --recursive --region us-west-2
# Then retry deletion
docker run --rm ... -act delete-product --force
Getting Help¶
If you cannot resolve an issue with this guide:
Collect diagnostic information:
# Stack events
aws cloudformation describe-stack-events \
--stack-name globalbank-prod-c001-us-west-2-demand-forecasting-ml-stack \
--region us-west-2 \
--output json > stack-events.json
# Recent provisioner logs
mkdir -p diagnostic-logs
cp ml/reports/globalbank-prod-c001-us-west-2-demand-forecasting-ml-*-validate-config-*.log ./diagnostic-logs/
Check the AWS Console — CloudFormation → Stacks → Events tab shows detailed failure reasons
Contact support — see SUPPORT.md for contact details. Include:
Config file (with account ID replaced by 123456789012)
Provisioner log file
Stack events output
Exact error message