Troubleshooting


Quick Diagnostics

# Check AWS credentials
aws sts get-caller-identity

# Check Docker version
docker --version

# Check available SEC images
docker images sec-provisioner

# Test IAM access
aws iam list-groups --max-items 5

# Validate configuration
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action validate-config
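
The same docker run invocation recurs throughout this guide with only the action changing. A minimal sketch of a wrapper is shown below, assuming a POSIX shell. The names sec_run, SEC_IMAGE, SEC_CONFIG, and SEC_DOCKER are hypothetical helpers for illustration and are not part of the tool.

```shell
# Hypothetical wrapper around the docker run command used in this guide.
# SEC_IMAGE, SEC_CONFIG and SEC_DOCKER are illustrative variables, not tool settings.
sec_run() (
  # $1 = action; any further arguments are passed through unchanged
  action="$1"; shift
  ${SEC_DOCKER:-docker} run --rm \
    -v "$HOME/.aws:/home/secuser/.aws:ro" \
    -v "$PWD/sec/configs:/app/configs:ro" \
    -v "$PWD/sec/reports:/app/reports" \
    "${SEC_IMAGE:-sec-provisioner:medium-10}" \
    --config "${SEC_CONFIG:-edge-prod-b001-us-west-1-sec.yaml}" \
    --action "$action" "$@"
)
```

Calling sec_run validate-config runs the validation shown above. Setting SEC_DOCKER=echo prints the arguments instead of running them. The sketch mounts only configs and reports, so actions that write templates, policies, or roles still need their extra -v mounts.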

Common Pitfalls

1. Not quoting account_id in YAML

  • account_id: 123456789012 (unquoted: parsed as an integer, so an ID with leading zeros is altered)

  • account_id: "123456789012" (quoted: kept as a string, leading zeros preserved)

2. Skipping configuration validation

  • Always run validate-config before AWS operations

  • Catches most errors before deployment — saves time and avoids failed stacks

3. Not reviewing generated resources

  • Export groups, roles, and policies before deploying

  • Use validate-prov-template to catch reference errors locally

  • Prevents unexpected IAM resource creation

4. Using access keys in production

  • ❌ Environment variables with long-lived access keys

  • ✅ IAM roles (EC2/ECS) or AWS profiles with MFA

  • Access keys passed as environment variables can leak through shell history, process listings, and docker inspect output

5. Tier mismatch

  • Config security_profile must match the Docker image tier

  • sec-provisioner:medium-10 requires security_profile: medium-10
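
Pitfall 1 can be caught before any container run. The following is a rough sketch, assuming the config keeps account_id on a single line as in the examples in this guide; account_id_quoted is a hypothetical helper name, and a grep-based check is only an approximation of real YAML parsing.

```shell
# Returns 0 when every account_id line in the file holds a quoted 12-digit value.
account_id_quoted() (
  file="$1"
  # Count lines that define account_id at any indentation
  total=$(grep -Ec '^[[:space:]]*account_id:' "$file" || true)
  # Count the subset whose value is 12 digits wrapped in matching quotes
  quoted=$(grep -Ec "^[[:space:]]*account_id:[[:space:]]*(\"[0-9]{12}\"|'[0-9]{12}')[[:space:]]*(#.*)?\$" "$file" || true)
  [ "$total" -gt 0 ] && [ "$total" -eq "$quoted" ]
)
```

Example: account_id_quoted sec/configs/edge-prod-b001-us-west-1-sec.yaml || echo "quote account_id".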


Common Errors

AWS Credentials

Error: Unable to locate credentials

Error: Unable to locate credentials. You can configure credentials by running "aws configure".

Solution:

# Option 1: Mount AWS credentials (recommended)
-v ~/.aws:/home/secuser/.aws:ro

# Option 2: Environment variables
-e AWS_ACCESS_KEY_ID=<access_key>
-e AWS_SECRET_ACCESS_KEY=<secret_key>
-e AWS_DEFAULT_REGION=us-west-1

# Verify
aws sts get-caller-identity

Error: The security token included in the request is invalid

Error: The security token included in the request is invalid

Causes:

  • Expired temporary credentials

  • Invalid access key

  • Credentials from different account

Solution:

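# Request new temporary credentials (requires valid long-term IAM user credentials;
# for IAM Identity Center profiles, run "aws sso login" instead)
aws sts get-session-token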

# Verify current identity
aws sts get-caller-identity

Error: Access Denied

Error: An error occurred (AccessDenied) when calling the CreateStack operation

Solution:

# Check current user
aws sts get-caller-identity

# Generate required IAM policy
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/policies:/app/policies \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action export-iam-policy

# Review and attach generated policy
cat sec/policies/edge-prod-b001-us-west-1-medium-sec-iam-policy.json

Configuration Errors

Error: Configuration file not found

Error: Configuration file not found: /app/configs/my-config.yaml

Solution:

# Verify file exists on host
ls -la sec/configs/

# Verify volume mount path
# Correct:
-v $(pwd)/sec/configs:/app/configs:ro

# Wrong:
-v $(pwd)/configs:/app/configs:ro

Error: Invalid YAML in config file

Error: Invalid YAML in config file: while parsing a block mapping...

Causes:

  • Incorrect indentation (YAML uses spaces, not tabs)

  • Missing quotes around account_id or tenant_id

  • Misaligned keys

Solution:

# Wrong (parsed as an integer; leading zeros are not preserved)
client:
  account_id: 123456789012

# Correct (preserves as string)
client:
  account_id: "123456789012"
  tenant_id: "b001"
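
Tab characters are hard to spot by eye. A small sketch for locating them is shown below; find_tabs is a hypothetical helper name, not a tool command.

```shell
# Prints "line-number:content" for every line containing a tab.
# Exits non-zero when the file has no tabs (grep's normal behavior).
find_tabs() (
  tab=$(printf '\t')
  grep -n "$tab" "$1"
)
```

Example: find_tabs sec/configs/my-config.yaml lists the offending lines; replace each tab with spaces and re-run validate-config.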

Error: Configuration validation failed

Error: Configuration validation failed: 'client' is a required property

Solution: Ensure configuration has all required sections:

tier:
  name: medium
  version: 1.0.0
  description: ...
  security_profile: medium-10

client:
  company_name: Edge Corp
  company_prefix: edge
  account_id: "123456789012"
  tenant_id: "b001"

environment:
  env: prod
  region: us-west-1

deployment:
  template_bucket: edge-prod-b001-us-west-1-s3
  template_prefix: solutions/master-solution/templates

security:
  iam_groups: {}
  service_roles: {}
  assumable_roles: {}
  cross_account_roles: {}
  security_profiles: {}

tags:
  cost_center: Engineering
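
A quick local check for the top-level sections can save a container run. This is a sketch that assumes the six sections shown above are the required ones and that each starts at column one; missing_sections is a hypothetical helper, and validate-config remains the authoritative check.

```shell
# Prints each expected top-level section that is absent from the file.
# Exit status is 1 if anything is missing, 0 otherwise.
missing_sections() (
  file="$1"
  status=0
  for key in tier client environment deployment security tags; do
    if ! grep -Eq "^${key}:" "$file"; then
      echo "$key"
      status=1
    fi
  done
  # The function body runs in a subshell, so exit sets the function's status
  exit "$status"
)
```

Example: missing_sections sec/configs/edge-prod-b001-us-west-1-sec.yaml prints nothing when all sections are present.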

Tier and Profile Errors

Error: Tier mismatch

Error: Tier mismatch: Config security_profile 'enterprise-12' does not match purchased tier 'medium-10'

Cause: The security_profile in your config doesn’t match the Docker image tier.

Solution: Match the config profile to the image tag:

Image Tag                        Required security_profile
sec-provisioner:startup-5        startup-5
sec-provisioner:medium-10        medium-10
sec-provisioner:enterprise-12    enterprise-12

# For medium-10 image:
tier:
  security_profile: medium-10
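
The tag-to-profile rule can be checked before running the container. A minimal sketch follows, assuming security_profile sits on a single line as in the example above; profile_matches_image is a hypothetical helper name.

```shell
# Returns 0 when the image tag (text after the last ':') equals
# the first security_profile value found in the config file.
profile_matches_image() (
  image="$1"
  file="$2"
  tag="${image##*:}"
  # Take the value, drop any trailing comment, then strip quotes and whitespace
  profile=$(sed -n 's/^[[:space:]]*security_profile:[[:space:]]*//p' "$file" |
    head -n 1 | sed 's/#.*$//' | tr -d "\"'[:space:]")
  [ -n "$profile" ] && [ "$tag" = "$profile" ]
)
```

Example: profile_matches_image sec-provisioner:medium-10 sec/configs/edge-prod-b001-us-west-1-sec.yaml || echo "tier mismatch". The pattern does not match the plural security_profiles key.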

Error: Cannot auto-detect schema for tier

Error: Cannot auto-detect schema for tier: custom

Cause: The tier.name field must be one of: startup, medium, enterprise.

Solution:

tier:
  name: medium  # Must be: startup, medium, or enterprise

Error: Group not in security profile

Symptoms: A group defined in iam_groups is not created during deployment.

Cause: The group is not listed in the enabled_groups of the active security profile.

Solution: Add the group to the security profile:

security:
  security_profiles:
    medium-10:
      enabled_groups:
        - data_scientists
        - your_missing_group  # Add here

Template Errors

Error: Template file not found

Error: Template file not found

Solution: Generate the template first, or let validate-prov-template auto-generate it:

docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/templates:/app/templates \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action create-prov-template

Error: Invalid !Ref target

Error: !Ref target 'InvalidResource' not found in Resources or Parameters

Causes:

  • Template references a resource that doesn’t exist

  • Template was manually edited

  • Policy template has unresolved placeholder

Solution: Regenerate the template from configuration:

docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/templates:/app/templates \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action create-prov-template

CloudFormation Errors

Error: Stack already exists

Error: Stack [edge-prod-b001-us-west-1-medium-sec-stack] already exists

Solution:

# Option 1: Delete existing stack first
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action delete-stack \
  --force

# Option 2: Use a different environment or tenant_id

Error: CloudFormation rollback

Error: Stack creation failed and rolled back

Solution:

# Check CloudFormation events
aws cloudformation describe-stack-events \
  --stack-name edge-prod-b001-us-west-1-medium-sec-stack \
  --max-items 20 \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
  --output table

# Common causes:
# 1. Insufficient IAM permissions
# 2. IAM entity limit exceeded
# 3. Policy size limit exceeded
# 4. Duplicate resource names

# Review logs
cat sec/reports/*.log
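
The stack names in this guide appear to follow one pattern built from config values. The sketch below encodes that pattern as an assumption inferred from the examples (company_prefix, env, tenant_id, region, tier name); stack_name is a hypothetical helper, so confirm the result against your actual stack before relying on it.

```shell
# Assumed pattern, inferred from the examples in this guide:
#   <company_prefix>-<env>-<tenant_id>-<region>-<tier>-sec-stack
stack_name() {
  printf '%s-%s-%s-%s-%s-sec-stack\n' "$1" "$2" "$3" "$4" "$5"
}
```

Example: aws cloudformation describe-stack-events --stack-name "$(stack_name edge prod b001 us-west-1 medium)".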

Error: IAM entity limit exceeded

Error: LimitExceeded: Cannot exceed quota for GroupsPerAccount

Solution:

# Check current IAM limits
aws iam get-account-summary --query 'SummaryMap.{Groups:Groups,GroupsQuota:GroupsQuota,Roles:Roles,RolesQuota:RolesQuota,Policies:Policies,PoliciesQuota:PoliciesQuota}'

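# Request limit increase if needed (IAM quotas are managed in us-east-1;
# confirm the quota code first with:
#   aws service-quotas list-service-quotas --service-code iam --region us-east-1)
aws service-quotas request-service-quota-increase \
  --service-code iam \
  --quota-code L-F55AF5E4 \
  --desired-value 500 \
  --region us-east-1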
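
Before deploying, it can help to compare remaining quota against the number of entities the stack will create. A small sketch with a hypothetical helper, quota_headroom, that takes the used count, the quota, and the number needed:

```shell
# Prints the remaining headroom (quota - used).
# Returns 0 only when at least `needed` more entities still fit.
quota_headroom() (
  used="$1"
  quota="$2"
  needed="$3"
  left=$((quota - used))
  echo "$left"
  [ "$left" -ge "$needed" ]
)
```

The first two values can be read from aws iam get-account-summary --query 'SummaryMap.[Groups,GroupsQuota]' --output text.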

S3 Template Upload Errors

Error: NoSuchBucket

Error: The specified bucket does not exist

Cause: Medium and enterprise tiers upload templates to S3. The bucket specified in deployment.template_bucket must exist.

Solution:

  1. Create the S3 bucket using the S3 Provisioner first

  2. Verify the bucket name matches your config:

deployment:
  template_bucket: edge-prod-b001-us-west-1-s3  # Must exist
  template_prefix: solutions/master-solution/templates

# Verify bucket exists
aws s3 ls s3://edge-prod-b001-us-west-1-s3/

Note: Startup tier uses TemplateBody (inline) and does not require an S3 bucket.


Error: template_bucket not configured

Error: template_bucket not configured in deployment section

Solution: Add the deployment section to your config:

deployment:
  template_bucket: edge-prod-b001-us-west-1-s3
  template_prefix: solutions/master-solution/templates

Change Preview Errors

Error: Stack does not exist for show-changes

Error: Stack does not exist: edge-prod-b001-us-west-1-medium-sec-stack

Cause: show-changes requires a deployed stack to compare against.

Solution: Deploy the stack first, then preview changes:

# Deploy first
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action deploy \
  --force

# Then preview changes
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/templates:/app/templates \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action show-changes

Error: No changes detected

Symptoms: show-changes reports no pending changes.

Cause: The deployed stack matches the current template.

Solution: This is expected when no configuration changes have been made. Modify your configuration, regenerate the template, then re-run show-changes.


Drift Detection Errors

Error: Stack does not exist for check-drift

Error: Stack does not exist: edge-prod-b001-us-west-1-medium-sec-stack

Cause: check-drift requires a deployed stack to detect drift against.

Solution: Deploy the stack first.


Error: Drift detection timeout

Symptoms: Drift detection takes too long or times out.

Causes:

  • Large number of IAM resources in the stack

  • AWS API throttling

Solution:

# Check drift detection status manually
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id <detection-id>

# Retry
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action check-drift

Note: CloudFormation does not support drift detection for IAM Groups. Only IAM Roles and Policies are checked.
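
Rather than re-running the status command by hand, the check can be polled. A minimal sketch, assuming you already have the detection ID; wait_for_drift and POLL_DELAY are hypothetical names, and the retry count and delay are arbitrary defaults.

```shell
# Polls drift detection status until it leaves DETECTION_IN_PROGRESS.
# Exit status: 0 = complete, 1 = failed or timed out, 2 = AWS CLI error.
wait_for_drift() (
  id="$1"
  tries="${2:-30}"            # maximum number of polls
  delay="${POLL_DELAY:-10}"   # seconds to sleep between polls
  while [ "$tries" -gt 0 ]; do
    status=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$id" \
      --query DetectionStatus --output text) || exit 2
    case "$status" in
      DETECTION_COMPLETE) exit 0 ;;
      DETECTION_FAILED)   echo "drift detection failed" >&2; exit 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "timed out waiting for drift detection" >&2
  exit 1
)
```

Example: wait_for_drift <detection-id> 60 polls up to 60 times. A fixed delay is used for simplicity; a growing delay may behave better under API throttling.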


Deletion Errors

Error: Stack deletion failed

Error: Stack deletion failed or timed out

Solution:

# Check stack events
aws cloudformation describe-stack-events \
  --stack-name edge-prod-b001-us-west-1-medium-sec-stack \
  --query 'StackEvents[?ResourceStatus==`DELETE_FAILED`].[LogicalResourceId,ResourceStatusReason]' \
  --output table

# Common causes:
# 1. IAM roles in use by other services (EC2 instances, Lambda functions)
# 2. Policies attached to users outside CloudFormation
# 3. Insufficient permissions to delete IAM resources

Error: Role is in use

Error: Cannot delete role, it is in use by other resources

Cause: An IAM role created by the stack is attached to an EC2 instance, Lambda function, or other AWS resource.

Solution:

  1. Identify what’s using the role:

aws iam list-instance-profiles-for-role --role-name edge-prod-b001-role-sagemaker-execution

  2. Detach the role from the resource

  3. Retry delete-stack


Docker Errors

Error: Cannot connect to Docker daemon

Error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock

Solution:

# Start Docker
sudo systemctl start docker

# Verify
docker ps

Error: Permission denied accessing volume

Error: Permission denied: '/app/configs/your-config.yaml'

Solution:

# Check file permissions
ls -la sec/configs/

# Fix permissions
chmod 644 sec/configs/your-config.yaml

Error: Volume mount not found

Error: No such file or directory: '/app/configs/your-config.yaml'

Solution:

# Verify file exists on host
ls -la sec/configs/your-config.yaml

# Use $(pwd) for absolute path
-v $(pwd)/sec/configs:/app/configs:ro

License Validation Errors

Error: License validation failed

Error: AWS Marketplace subscription not found

Solution:

  1. Verify AWS Marketplace subscription is active

  2. Check IAM permissions for AWS Marketplace

  3. Contact AWS Marketplace support


IAM-Specific Issues

Policy Size Limit

Symptoms: Stack creation fails with policy size error.

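Cause: IAM caps policy size, not counting whitespace. All inline policies on a group may total at most 5,120 characters (10,240 for a role), and each managed policy is limited to 6,144 characters. Groups with many policy assignments can exceed these limits.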

Solution: The tool uses combined policies (e.g., mlops-services-a, mlops-services-b, mlops-services-c) to split large permission sets across multiple standalone managed policies. If you hit this limit with custom configurations, split your policy assignments across multiple combined policies.
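
To see how close a policy document is to a size limit, you can estimate its length the way IAM measures it, which ignores whitespace. A rough sketch with a hypothetical helper, policy_size; it also removes spaces inside string values, so treat the result as a lower-bound estimate.

```shell
# Prints the byte count of a policy file with all whitespace removed.
# For ASCII policy documents this equals the character count.
policy_size() {
  tr -d '[:space:]' < "$1" | wc -c | tr -d '[:space:]'
}
```

Example: policy_size sec/policies/edge-prod-b001-us-west-1-medium-sec-iam-policy.json, then compare the number against the limit that applies to that policy type.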


10-Policy Limit Per Group

Symptoms: Group creation fails with policy attachment limit error.

Cause: By default, AWS allows at most 10 managed policies to be attached to each IAM group.

Solution: The tool works within this limit by using:

  • Inline policies for service-level permissions

  • Combined policies to merge multiple services into one attachment

  • Managed policies only where necessary

Review your group’s managed_policies list — it should not exceed 10 entries.


Assumable Role Trust Policy

Symptoms: Users cannot assume a role despite being in the correct group.

Causes:

  • Trust policy doesn’t include the correct account ID

  • Group doesn’t have sts:AssumeRole permission for the role ARN

  • Role ARN in group’s assumable_roles doesn’t match the deployed role

Solution:

# Export and review the role definition
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/roles:/app/roles \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action export-roles

# Check the trust policy
cat sec/roles/edge-prod-b001-role-aml-engineer.json

# Verify the group has the role in assumable_roles
cat sec/groups/edge-prod-b001-group-ml-engineers.json

Cross-Account Role Access Denied

Symptoms: Cross-account assume role fails with AccessDenied.

Causes:

  • External ID mismatch

  • Trusted account ID incorrect

  • Caller’s account not in trusted_accounts list

Solution:

# Test assume role with external ID
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/edge-prod-b001-xacct-deployment-role \
  --role-session-name test \
  --external-id deployment-external-id-12345

# Review the exported cross-account role (run export-roles first)
cat sec/roles/edge-prod-b001-xacct-deployment-role.json

Verify in your config:

cross_account_roles:
  deployment_role:
    trusted_accounts:
      - "999888777666"  # Must match the calling account
    external_id: "deployment-external-id-12345"  # Must match exactly

Performance Issues

Slow Stack Creation

Symptoms: Stack creation takes longer than expected (> 3 minutes).

Causes:

  • Large number of IAM resources (enterprise tier: 44+ resources)

  • AWS API throttling on IAM operations

  • S3 template upload latency (medium/enterprise)

Solution:

# Monitor stack events
aws cloudformation describe-stack-events \
  --stack-name edge-prod-b001-us-west-1-medium-sec-stack

# Typical deployment times:
# Startup: 30-60 seconds
# Medium: 90-120 seconds
# Enterprise: 120-180 seconds

Advanced Troubleshooting

Enable Debug Logging

docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action deploy \
  --force \
  --verbose debug

Inspect Container

# Run interactive shell
docker run --rm -it \
  -v $(pwd)/sec/configs:/app/configs:ro \
  --entrypoint /bin/bash \
  sec-provisioner:medium-10

# Inside container
ls -la /app/
ls -la /app/schemas/
ls -la /app/policy-templates/
cat /app/configs/your-config.yaml

Review Generated Files

# Check generated template
cat sec/templates/edge-prod-b001-us-west-1-medium-sec-template.yaml

# Check generated IAM policy
cat sec/policies/edge-prod-b001-us-west-1-medium-sec-iam-policy.json

# Check exported groups
ls sec/groups/
cat sec/groups/edge-prod-b001-group-data-scientists.json

# Check exported roles
ls sec/roles/
cat sec/roles/edge-prod-b001-role-sagemaker-execution.json

# Check execution logs
ls -lt sec/reports/*.log | head -5
cat sec/reports/*.log | tail -50

Check AWS API Calls

# Check CloudTrail for IAM operations
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateGroup \
  --max-results 10

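# Check for errors (errorCode is inside the CloudTrailEvent JSON string; requires jq)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateRole \
  --query 'Events[].CloudTrailEvent' \
  --output json |
  jq -r '.[] | fromjson | select(.errorCode != null) | [.eventTime, .errorCode, .errorMessage] | @tsv'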

Getting Help

Collect Diagnostic Information

# System info
docker --version
aws --version
uname -a

# AWS identity
aws sts get-caller-identity

# IAM limits
aws iam get-account-summary

# Error output
docker run --rm \
  -v ~/.aws:/home/secuser/.aws:ro \
  -v $(pwd)/sec/configs:/app/configs:ro \
  -v $(pwd)/sec/reports:/app/reports \
  sec-provisioner:medium-10 \
  --config edge-prod-b001-us-west-1-sec.yaml \
  --action validate-config 2>&1 | tee error.log

Contact Support

Include in support request:

  1. Docker image version and tier

  2. AWS region

  3. Sanitized configuration file

  4. Complete error message

  5. Steps to reproduce

  6. Expected vs actual behavior

  7. Log files from reports/ directory
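
For item 3, a sanitized copy of the config can be produced with a short filter. This is a sketch only: sanitize_config is a hypothetical helper that masks 12-digit account IDs and external_id values, and it does not know about any other sensitive fields your config may contain, so review the output before sending it.

```shell
# Writes a masked copy of the config to stdout:
#   - any run of 12 digits becomes <ACCOUNT_ID>
#   - everything after "external_id:" becomes <REDACTED>
sanitize_config() {
  sed -E \
    -e 's/[0-9]{12}/<ACCOUNT_ID>/g' \
    -e 's/(external_id:[[:space:]]*).*/\1<REDACTED>/' \
    "$1"
}
```

Example: sanitize_config sec/configs/edge-prod-b001-us-west-1-sec.yaml > sanitized-config.yaml.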

See Support for contact information.