Troubleshooting¶
Quick Diagnostics¶
# Check AWS credentials
aws sts get-caller-identity
# Check Docker version
docker --version
# Check available SEC images
docker images sec-provisioner
# Test IAM access
aws iam list-groups --max-items 5
# Validate configuration
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action validate-config
Common Pitfalls¶
1. Not quoting account_id in YAML
❌ account_id: 123456789012 (parsed as an integer, so an ID with leading zeros loses them)
✅ account_id: "123456789012" (preserved as a string)
2. Skipping configuration validation
Always run validate-config before AWS operations
Catches most errors before deployment, which saves time and avoids failed stacks
3. Not reviewing generated resources
Export groups, roles, and policies before deploying
Use validate-prov-template to catch reference errors locally
Prevents unexpected IAM resource creation
4. Using access keys in production
❌ Environment variables with long-lived access keys
✅ IAM roles (EC2/ECS) or AWS profiles with MFA
Access keys are visible in process lists and Docker history
5. Tier mismatch
Config security_profile must match the Docker image tier
sec-provisioner:medium-10 requires security_profile: medium-10
Common Errors¶
AWS Credentials¶
Error: Unable to locate credentials¶
Error: Unable to locate credentials. You can configure credentials by running "aws configure".
Solution:
# Option 1: Mount AWS credentials (recommended)
-v ~/.aws:/home/secuser/.aws:ro
# Option 2: Environment variables
-e AWS_ACCESS_KEY_ID=<access_key>
-e AWS_SECRET_ACCESS_KEY=<secret_key>
-e AWS_DEFAULT_REGION=us-west-1
# Verify
aws sts get-caller-identity
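Before starting the container, it can help to check which of the two options is actually available on the host. The sketch below is only an illustration: has_aws_credentials is a hypothetical helper, not part of sec-provisioner, and it only looks at the two environment variables and the shared credentials file. It does not cover SSO or role-based profiles defined in ~/.aws/config.

```shell
# Hypothetical helper: report which credential source is present on the host.
# Prints "env", "file", or "none"; returns non-zero when nothing is found.
has_aws_credentials() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "env"
    return 0
  fi
  # AWS_SHARED_CREDENTIALS_FILE overrides the default location when set.
  creds_file="${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}"
  if [ -f "$creds_file" ]; then
    echo "file"
    return 0
  fi
  echo "none"
  return 1
}
```

If the helper prints "file", mounting ~/.aws as in Option 1 should be enough. If it prints "none", neither option will work until credentials are configured.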
Error: The security token included in the request is invalid¶
Error: The security token included in the request is invalid
Causes:
Expired temporary credentials
Invalid access key
Credentials from different account
Solution:
# Refresh credentials
aws sts get-session-token
# Verify current identity
aws sts get-caller-identity
Error: Access Denied¶
Error: An error occurred (AccessDenied) when calling the CreateStack operation
Solution:
# Check current user
aws sts get-caller-identity
# Generate required IAM policy
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/policies:/app/policies \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action export-iam-policy
# Review and attach generated policy
cat sec/policies/edge-prod-b001-us-west-1-medium-sec-iam-policy.json
Configuration Errors¶
Error: Configuration file not found¶
Error: Configuration file not found: /app/configs/my-config.yaml
Solution:
# Verify file exists on host
ls -la sec/configs/
# Verify volume mount path
# Correct:
-v $(pwd)/sec/configs:/app/configs:ro
# Wrong:
-v $(pwd)/configs:/app/configs:ro
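The mount check above can be scripted so that a wrong host path is caught before Docker starts. This is a minimal sketch under stated assumptions: check_config_mount is a hypothetical helper name, and the only facts it relies on are the host directory layout and the /app/configs mount target shown in this guide.

```shell
# Hypothetical helper: check_config_mount HOST_DIR CONFIG_NAME
# Prints the -v argument to use when the file exists; fails otherwise.
check_config_mount() {
  host_dir=$1
  config_name=$2
  if [ ! -d "$host_dir" ]; then
    echo "missing directory: $host_dir" >&2
    return 1
  fi
  if [ ! -f "$host_dir/$config_name" ]; then
    echo "missing file: $host_dir/$config_name" >&2
    return 1
  fi
  echo "-v $host_dir:/app/configs:ro"
}
```

For example, `check_config_mount "$(pwd)/sec/configs" edge-prod-b001-us-west-1-sec.yaml` fails with a message naming the missing path when run from the wrong directory.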
Error: Invalid YAML in config file¶
Error: Invalid YAML in config file: while parsing a block mapping...
Causes:
Incorrect indentation (YAML uses spaces, not tabs)
Missing quotes around account_id or tenant_id
Misaligned keys
Solution:
# Wrong (parsed as an integer, so leading zeros are lost)
client:
  account_id: 123456789012
# Correct (preserved as a string)
client:
  account_id: "123456789012"
  tenant_id: "b001"
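A quick text scan can flag unquoted IDs before the file reaches the validator. The following is a rough sketch, not the tool's own validation: check_ids_quoted is a hypothetical helper that greps for account_id or tenant_id keys whose value does not start with a quote. It does not parse YAML, so unusual layouts (for example flow mappings) are outside what it can detect.

```shell
# Hypothetical helper: check_ids_quoted CONFIG_FILE
# Prints offending lines (with line numbers) and fails if any ID is unquoted.
check_ids_quoted() {
  # Key, colon, optional spaces, then a first value character that is
  # neither a quote nor whitespace.
  pattern="^[[:space:]]*(account_id|tenant_id):[[:space:]]*[^\"'[:space:]]"
  if grep -nE "$pattern" "$1"; then
    echo "unquoted ID values found in $1" >&2
    return 1
  fi
  return 0
}
```

Run it as `check_ids_quoted sec/configs/edge-prod-b001-us-west-1-sec.yaml` before validate-config.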
Error: Configuration validation failed¶
Error: Configuration validation failed: 'client' is a required property
Solution: Ensure configuration has all required sections:
tier:
  name: medium
  version: 1.0.0
  description: ...
  security_profile: medium-10
client:
  company_name: Edge Corp
  company_prefix: edge
  account_id: "123456789012"
  tenant_id: "b001"
environment:
  env: prod
  region: us-west-1
deployment:
  template_bucket: edge-prod-b001-us-west-1-s3
  template_prefix: solutions/master-solution/templates
security:
  iam_groups: {}
  service_roles: {}
  assumable_roles: {}
  cross_account_roles: {}
  security_profiles: {}
tags:
  cost_center: Engineering
Tier and Profile Errors¶
Error: Tier mismatch¶
Error: Tier mismatch: Config security_profile 'enterprise-12' does not match purchased tier 'medium-10'
Cause: The security_profile in your config doesn’t match the Docker image tier.
Solution: Match the config profile to the image tag:
| Image Tag | Required security_profile |
|---|---|
| sec-provisioner:medium-10 | medium-10 |
# For medium-10 image:
tier:
  security_profile: medium-10
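The comparison between image tag and config profile can be done locally before any container is started. The sketch below assumes that the image tag (the text after the last colon) is the expected profile name, as in the medium-10 example; profile_matches_image is a hypothetical helper and reads the first security_profile key it finds with sed rather than parsing YAML.

```shell
# Hypothetical helper: profile_matches_image IMAGE CONFIG_FILE
profile_matches_image() {
  image_tag=${1##*:}   # text after the last colon, e.g. medium-10
  # First security_profile value in the file (an optional double quote is skipped).
  profile=$(sed -n 's/^[[:space:]]*security_profile:[[:space:]]*"\{0,1\}\([A-Za-z0-9_-]*\).*/\1/p' "$2" | head -n 1)
  if [ "$image_tag" = "$profile" ]; then
    return 0
  fi
  echo "image tag '$image_tag' != security_profile '$profile'" >&2
  return 1
}
```

For example, `profile_matches_image sec-provisioner:medium-10 sec/configs/edge-prod-b001-us-west-1-sec.yaml` succeeds only when the config sets security_profile: medium-10.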
Error: Cannot auto-detect schema for tier¶
Error: Cannot auto-detect schema for tier: custom
Cause: The tier.name field must be one of: startup, medium, enterprise.
Solution:
tier:
  name: medium # Must be: startup, medium, or enterprise
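The allowed values are a fixed set, so a wrapper script can reject a bad tier name early. This is a small illustrative sketch; is_valid_tier_name is a hypothetical helper, and the three accepted names are taken from the cause stated above.

```shell
# Hypothetical helper: succeed only for the three tier names listed above.
is_valid_tier_name() {
  case "$1" in
    startup|medium|enterprise) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper could call `is_valid_tier_name "$tier" || echo "unsupported tier: $tier" >&2` before invoking the provisioner.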
Error: Group not in security profile¶
Symptoms: A group defined in iam_groups is not created during deployment.
Cause: The group is not listed in the enabled_groups of the active security profile.
Solution: Add the group to the security profile:
security:
  security_profiles:
    medium-10:
      enabled_groups:
        - data_scientists
        - your_missing_group # Add here
Template Errors¶
Error: Template file not found¶
Error: Template file not found
Solution: Generate the template first, or let validate-prov-template auto-generate it:
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/templates:/app/templates \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action create-prov-template
Error: Invalid !Ref target¶
Error: !Ref target 'InvalidResource' not found in Resources or Parameters
Causes:
Template references a resource that doesn’t exist
Template was manually edited
Policy template has unresolved placeholder
Solution: Regenerate the template from configuration:
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/templates:/app/templates \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action create-prov-template
CloudFormation Errors¶
Error: Stack already exists¶
Error: Stack [edge-prod-b001-us-west-1-medium-sec-stack] already exists
Solution:
# Option 1: Delete existing stack first
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action delete-stack \
--force
# Option 2: Use a different environment or tenant_id
Error: CloudFormation rollback¶
Error: Stack creation failed and rolled back
Solution:
# Check CloudFormation events
aws cloudformation describe-stack-events \
--stack-name edge-prod-b001-us-west-1-medium-sec-stack \
--max-items 20 \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
--output table
# Common causes:
# 1. Insufficient IAM permissions
# 2. IAM entity limit exceeded
# 3. Policy size limit exceeded
# 4. Duplicate resource names
# Review logs
cat sec/reports/*.log
Error: IAM entity limit exceeded¶
Error: LimitExceeded: Cannot exceed quota for GroupsPerAccount
Solution:
# Check current IAM limits
aws iam get-account-summary --query 'SummaryMap.{Groups:Groups,GroupsQuota:GroupsQuota,Roles:Roles,RolesQuota:RolesQuota,Policies:Policies,PoliciesQuota:PoliciesQuota}'
# Request limit increase if needed (IAM quotas are managed in us-east-1)
aws service-quotas request-service-quota-increase \
--service-code iam \
--quota-code L-F55AF5E4 \
--desired-value 500 \
--region us-east-1
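The account summary values can be turned into a simple headroom check before deploying. The sketch below is an assumption-laden illustration: quota_headroom is a hypothetical helper, and how many new groups a deployment needs is something you supply yourself.

```shell
# Hypothetical helper: quota_headroom USED QUOTA [NEEDED]
# Prints how many entities can still be created; fails if fewer than NEEDED remain.
quota_headroom() {
  used=$1
  quota=$2
  needed=${3:-1}
  remaining=$((quota - used))
  echo "$remaining"
  [ "$remaining" -ge "$needed" ]
}

# Example wiring (requires AWS credentials, so it is left commented out):
# used=$(aws iam get-account-summary --query 'SummaryMap.Groups' --output text)
# quota=$(aws iam get-account-summary --query 'SummaryMap.GroupsQuota' --output text)
# quota_headroom "$used" "$quota" 10 || echo "not enough room for 10 new groups" >&2
```

The same function works for roles and policies by reading Roles/RolesQuota or Policies/PoliciesQuota from the summary.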
S3 Template Upload Errors¶
Error: NoSuchBucket¶
Error: The specified bucket does not exist
Cause: Medium and enterprise tiers upload templates to S3. The bucket specified in deployment.template_bucket must exist.
Solution:
Create the S3 bucket using the S3 Provisioner first
Verify the bucket name matches your config:
deployment:
  template_bucket: edge-prod-b001-us-west-1-s3 # Must exist
  template_prefix: solutions/master-solution/templates
# Verify bucket exists
aws s3 ls s3://edge-prod-b001-us-west-1-s3/
Note: Startup tier uses TemplateBody (inline) and does not require an S3 bucket.
Error: template_bucket not configured¶
Error: template_bucket not configured in deployment section
Solution: Add the deployment section to your config:
deployment:
  template_bucket: edge-prod-b001-us-west-1-s3
  template_prefix: solutions/master-solution/templates
Change Preview Errors¶
Error: Stack does not exist for show-changes¶
Error: Stack does not exist: edge-prod-b001-us-west-1-medium-sec-stack
Cause: show-changes requires a deployed stack to compare against.
Solution: Deploy the stack first, then preview changes:
# Deploy first
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action deploy \
--force
# Then preview changes
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/templates:/app/templates \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action show-changes
Error: No changes detected¶
Symptoms: show-changes reports no pending changes.
Cause: The deployed stack matches the current template.
Solution: This is expected when no configuration changes have been made. Modify your configuration, regenerate the template, then re-run show-changes.
Drift Detection Errors¶
Error: Stack does not exist for check-drift¶
Error: Stack does not exist: edge-prod-b001-us-west-1-medium-sec-stack
Cause: check-drift requires a deployed stack to detect drift against.
Solution: Deploy the stack first.
Error: Drift detection timeout¶
Symptoms: Drift detection takes too long or times out.
Causes:
Large number of IAM resources in the stack
AWS API throttling
Solution:
# Check drift detection status manually
aws cloudformation describe-stack-drift-detection-status \
--stack-drift-detection-id <detection-id>
# Retry
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action check-drift
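When throttling is the cause, retrying immediately tends to fail again, so a wrapper that waits longer between attempts can help. The sketch below shows one way to do that with exponential backoff. retry_with_backoff is a hypothetical helper, and it assumes the wrapped command exits non-zero on failure, which this guide does not confirm for the check-drift action.

```shell
# Hypothetical helper: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

To use it, prefix the retry command above, for example `retry_with_backoff 4 30 docker run --rm ...`, which would wait 30, 60, and then 120 seconds between attempts.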
Note: CloudFormation does not support drift detection for IAM Groups. Only IAM Roles and Policies are checked.
Deletion Errors¶
Error: Stack deletion failed¶
Error: Stack deletion failed or timed out
Solution:
# Check stack events
aws cloudformation describe-stack-events \
--stack-name edge-prod-b001-us-west-1-medium-sec-stack \
--query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
--output table
# Common causes:
# 1. IAM roles in use by other services (EC2 instances, Lambda functions)
# 2. Policies attached to users outside CloudFormation
# 3. Insufficient permissions to delete IAM resources
Error: Role is in use¶
Error: Cannot delete role, it is in use by other resources
Cause: An IAM role created by the stack is attached to an EC2 instance, Lambda function, or other AWS resource.
Solution:
Identify what’s using the role:
aws iam list-instance-profiles-for-role --role-name edge-prod-b001-role-sagemaker-execution
Detach the role from the resource
Retry delete-stack
Docker Errors¶
Error: Cannot connect to Docker daemon¶
Error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock
Solution:
# Start Docker
sudo systemctl start docker
# Verify
docker ps
Error: Permission denied accessing volume¶
Error: Permission denied: '/app/configs/your-config.yaml'
Solution:
# Check file permissions
ls -la sec/configs/
# Fix permissions
chmod 644 sec/configs/your-config.yaml
Error: Volume mount not found¶
Error: No such file or directory: '/app/configs/your-config.yaml'
Solution:
# Verify file exists on host
ls -la sec/configs/your-config.yaml
# Use $(pwd) for absolute path
-v $(pwd)/sec/configs:/app/configs:ro
License Validation Errors¶
Error: License validation failed¶
Error: AWS Marketplace subscription not found
Solution:
Verify AWS Marketplace subscription is active
Check IAM permissions for AWS Marketplace
Contact AWS Marketplace support
IAM-Specific Issues¶
Policy Size Limit¶
Symptoms: Stack creation fails with policy size error.
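Cause: IAM caps the combined size of the inline policies on an entity: 10,240 characters for a role and 5,120 characters for a group. A managed policy is limited to 6,144 characters. Groups with many policy assignments can exceed these limits.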
Solution: The tool uses combined policies (e.g., mlops-services-a, mlops-services-b, mlops-services-c) to split large permission sets across multiple standalone managed policies. If you hit this limit with custom configurations, split your policy assignments across multiple combined policies.
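To see how close a policy document is to a size limit, you can count its characters with whitespace removed, since IAM does not count whitespace toward policy size quotas. The sketch below uses hypothetical helper names (policy_size, policy_fits) and takes the limit as an argument. It counts bytes, which equals characters for ASCII-only policies.

```shell
# Hypothetical helper: policy_size FILE
# Prints the size of the policy document with all whitespace removed.
policy_size() {
  tr -d '[:space:]' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: policy_fits FILE LIMIT
# Succeeds when the whitespace-free size is within LIMIT.
policy_fits() {
  [ "$(policy_size "$1")" -le "$2" ]
}
```

For example, `policy_fits my-policy.json 6144` checks a document against the 6,144-character quota for a managed policy (my-policy.json is a placeholder file name).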
10-Policy Limit Per Group¶
Symptoms: Group creation fails with policy attachment limit error.
Cause: By default, AWS limits each IAM group to 10 attached managed policies.
Solution: The tool works within this limit by using:
Inline policies for service-level permissions
Combined policies to merge multiple services into one attachment
Managed policies only where necessary
Review your group’s managed_policies list — it should not exceed 10 entries.
Assumable Role Trust Policy¶
Symptoms: Users cannot assume a role despite being in the correct group.
Causes:
Trust policy doesn’t include the correct account ID
Group doesn’t have sts:AssumeRole permission for the role ARN
Role ARN in group’s assumable_roles doesn’t match the deployed role
Solution:
# Export and review the role definition
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/roles:/app/roles \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action export-roles
# Check the trust policy
cat sec/roles/edge-prod-b001-role-aml-engineer.json
# Verify the group has the role in assumable_roles
cat sec/groups/edge-prod-b001-group-ml-engineers.json
Cross-Account Role Access Denied¶
Symptoms: Cross-account assume role fails with AccessDenied.
Causes:
External ID mismatch
Trusted account ID incorrect
Caller’s account not in trusted_accounts list
Solution:
# Test assume role with external ID
aws sts assume-role \
--role-arn arn:aws:iam::123456789012:role/edge-prod-b001-xacct-deployment-role \
--role-session-name test \
--external-id deployment-external-id-12345
# Export and review the cross-account role
cat sec/roles/edge-prod-b001-xacct-deployment-role.json
Verify in your config:
cross_account_roles:
  deployment_role:
    trusted_accounts:
      - "999888777666" # Must match the calling account
    external_id: "deployment-external-id-12345" # Must match exactly
Performance Issues¶
Slow Stack Creation¶
Symptoms: Stack creation takes longer than expected (> 3 minutes).
Causes:
Large number of IAM resources (enterprise tier: 44+ resources)
AWS API throttling on IAM operations
S3 template upload latency (medium/enterprise)
Solution:
# Monitor stack events
aws cloudformation describe-stack-events \
--stack-name edge-prod-b001-us-west-1-medium-sec-stack
# Typical deployment times:
# Startup: 30-60 seconds
# Medium: 90-120 seconds
# Enterprise: 120-180 seconds
Advanced Troubleshooting¶
Enable Debug Logging¶
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action deploy \
--force \
--verbose debug
Inspect Container¶
# Run interactive shell
docker run --rm -it \
-v $(pwd)/sec/configs:/app/configs:ro \
--entrypoint /bin/bash \
sec-provisioner:medium-10
# Inside container
ls -la /app/
ls -la /app/schemas/
ls -la /app/policy-templates/
cat /app/configs/your-config.yaml
Review Generated Files¶
# Check generated template
cat sec/templates/edge-prod-b001-us-west-1-medium-sec-template.yaml
# Check generated IAM policy
cat sec/policies/edge-prod-b001-us-west-1-medium-sec-iam-policy.json
# Check exported groups
ls sec/groups/
cat sec/groups/edge-prod-b001-group-data-scientists.json
# Check exported roles
ls sec/roles/
cat sec/roles/edge-prod-b001-role-sagemaker-execution.json
# Check execution logs
ls -lt sec/reports/*.log | head -5
cat sec/reports/*.log | tail -50
Check AWS API Calls¶
# Check CloudTrail for IAM operations
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=CreateGroup \
--max-results 10
# Check for errors (the error code is inside the CloudTrailEvent JSON string)
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=CreateRole \
--query "Events[?contains(CloudTrailEvent, 'errorCode')].CloudTrailEvent"
Getting Help¶
Collect Diagnostic Information¶
# System info
docker --version
aws --version
uname -a
# AWS identity
aws sts get-caller-identity
# IAM limits
aws iam get-account-summary
# Error output
docker run --rm \
-v ~/.aws:/home/secuser/.aws:ro \
-v $(pwd)/sec/configs:/app/configs:ro \
-v $(pwd)/sec/reports:/app/reports \
sec-provisioner:medium-10 \
--config edge-prod-b001-us-west-1-sec.yaml \
--action validate-config 2>&1 | tee error.log
Contact Support¶
Include in support request:
Docker image version and tier
AWS region
Sanitized configuration file
Complete error message
Steps to reproduce
Expected vs actual behavior
Log files from reports/ directory
See Support for contact information.
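For the sanitized configuration file, account IDs are the most obvious values to mask. The sketch below is one possible approach, not an official tool: sanitize_config is a hypothetical helper that replaces standalone 12-digit numbers with a placeholder. It does not know about other sensitive values such as external IDs or bucket names, so review the output by hand before sending it.

```shell
# Hypothetical helper: sanitize_config FILE
# Prints FILE with every standalone 12-digit number replaced by XXXXXXXXXXXX.
sanitize_config() {
  # A 12-digit run bounded by non-digits or by the line edges.
  mask='s/(^|[^0-9])[0-9]{12}([^0-9]|$)/\1XXXXXXXXXXXX\2/g'
  # Applied twice so IDs separated by a single character are both masked.
  sed -E -e "$mask" -e "$mask" "$1"
}
```

For example, `sanitize_config sec/configs/edge-prod-b001-us-west-1-sec.yaml > sanitized-config.yaml` writes a copy that is safer to attach (sanitized-config.yaml is a placeholder name).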