Roles Architecture

Overview

This document captures the architectural decisions for adding an IAM Roles layer to the Security Provisioner Tool. The existing tool creates IAM Groups and inline policies. This new layer introduces standalone IAM Roles with managed policy attachments and an N:M (many-to-many) mapping between Groups and Roles.

Current state: Groups → inline policies (permissions directly on groups)

Target state: Groups → assume Roles → managed policies (permissions on roles, groups get AssumeRole access)

This separation decouples who (groups/people) from what (permissions/access), enabling flexible, auditable, and scalable security management.


Architectural Decisions

Decision Summary

| ID | Decision | Answer | Version |
|------|----------|--------|---------|
| AD-1 | Group-to-Role mapping | N:M (many-to-many) | v1 |
| AD-2 | Role definition | Standalone objects (not nested under groups) | v1 |
| AD-3 | Trust policy | Account root principal (AWS constraint) | v1 |
| AD-4 | N:M enforcement | Two-way handshake (group-side AssumeRole + role-side trust) | v1 |
| AD-5 | Policy composition | Managed policies attached to roles (not inline) | v1 |
| AD-6 | Security model flag | Config-driven toggle: groups-only or roles-based | v1 |
| AD-7 | MFA / session duration | Conditional assumption controls | v2 |
| AD-8 | Permission boundaries | Delegated role creation guardrails | v2 |


AD-1: Group-to-Role Mapping (N:M)

Decision: Many-to-many relationship between IAM Groups and IAM Roles.

Rationale:

A 1:1 mapping (one group → one role) is insufficient for real-world scenarios:

  • Lack of flexibility: A user in a 1:1 model can only take on one persona. In reality, a data scientist might need to act as an ML engineer temporarily when debugging a pipeline.

  • Role assumption limitations: Users often need to switch between different roles (e.g., viewing logs vs. changing infrastructure) without logging out.

  • Scalability challenges: Creating a new group for every unique role/permission combination leads to “group explosion” as organizations grow.

  • Cross-account complexity: A single, static 1:1 mapping doesn’t handle dynamic cross-account role assumption efficiently.

N:M enables:

  • A group can assume multiple roles (data-scientists can assume both a standard role and an experiment role)

  • A role can be assumed by multiple groups (a read-only role can be shared across data-scientists, auditors, and business-consumers)

  • Group membership changes automatically update access without touching roles
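The N:M relationship can be illustrated with a small lookup in both directions. This is a minimal sketch, not the tool's implementation: the `ASSIGNMENTS` dictionary and both helper functions are hypothetical, and only the group and role names come from this document.

```python
from collections import defaultdict

# Hypothetical in-memory form of the group -> roles mapping.
ASSIGNMENTS = {
    "data-scientists": ["ds-standard", "ds-experiment"],
    "ml-engineers": ["ds-experiment", "ml-deploy", "bedrock-manage"],
}


def roles_for_group(assignments, group):
    """Roles a group may assume (empty list if the group has no assignments)."""
    return list(assignments.get(group, []))


def groups_for_role(assignments, role):
    """Invert the mapping: every group allowed to assume the given role."""
    inverse = defaultdict(list)
    for group, roles in assignments.items():
        for assigned_role in roles:
            inverse[assigned_role].append(group)
    return sorted(inverse[role])
```

Calling `groups_for_role(ASSIGNMENTS, "ds-experiment")` returns both groups, which is the "one role, many groups" half of the relationship; `roles_for_group` covers the other half.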


AD-2: Roles as Standalone Objects

Decision: IAM Roles are defined as independent, top-level objects in the configuration, not nested under groups.

Rationale:

  • Reusability: A single role (e.g., arole-s3-readonly) can be mapped to multiple groups without duplicating the role definition.

  • Decoupling: Updating a role’s permissions reflects across all associated groups instantly.

  • N:M compatibility: Nesting would force either role duplication under every group or confusing ownership logic. Standalone objects use a clean mapping/assignment layer to bridge groups and roles.

  • Auditability: Standalone roles make it easy to audit exactly who has access to what without digging through nested structures.

Principle: Groups represent “Who” (people/teams). Roles represent “What” (permissions). Keeping them separate allows organizational structure to change without breaking security logic.


AD-3: Trust Policy - Account Root Principal

Decision: All assumable roles trust the AWS account root as the principal in their trust policy.

Rationale:

This is an AWS constraint: an IAM Group cannot be referenced directly as a principal in a role’s trust policy. The account root principal acts as the first layer of trust, and the group-side AssumeRole policy acts as the second filter.

AssumeRolePolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Principal:
        AWS: "arn:aws:iam::{account_id}:root"
      Action: "sts:AssumeRole"

Cross-account note: If groups and roles are in different accounts, the Principal must point to the originating account ID instead of the role’s own account root.
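A generator could build this trust policy from a single parameter, which also covers the cross-account case. The following is a minimal sketch; `build_trust_policy` is a hypothetical helper, not part of the existing tool.

```python
def build_trust_policy(trusted_account_id):
    """Trust policy that lets principals in `trusted_account_id` assume the role.

    Access is still subject to identity-based policies in that account
    (the group-side AssumeRole policy described in AD-4).
    """
    if not (trusted_account_id.isdigit() and len(trusted_account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit strings")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
```

For a same-account setup the caller passes the role's own account ID; for a cross-account setup it passes the ID of the account that holds the groups.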


AD-4: Two-Way Handshake for N:M Enforcement

Decision: The N:M relationship is enforced through a two-way handshake: identity-based policies on the group side and trust policies on the role side.

Rationale:

Since IAM Groups cannot be referenced as principals in trust policies, both sides must be configured:

Side 1 - Group (identity policy): Each group gets an inline policy allowing sts:AssumeRole on specific role ARNs.

# Group-side: which roles can this group assume?
Type: AWS::IAM::Group
Properties:
  GroupName: "{company_prefix}-{env}-{tenant_id}-group-data-scientists"
  Policies:
    - PolicyName: AllowAssumeRoles
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: "sts:AssumeRole"
            Resource:
              - "arn:aws:iam::{account_id}:role/{company_prefix}-{env}-{tenant_id}-arole-ds-standard"
              - "arn:aws:iam::{account_id}:role/{company_prefix}-{env}-{tenant_id}-arole-ds-experiment"

Side 2 - Role (trust policy): Each role trusts the account root.

# Role-side: who is trusted to assume this role?
Type: AWS::IAM::Role
Properties:
  RoleName: "{company_prefix}-{env}-{tenant_id}-arole-ds-standard"
  AssumeRolePolicyDocument:
    Version: "2012-10-17"
    Statement:
      - Effect: Allow
        Principal:
          AWS: "arn:aws:iam::{account_id}:root"
        Action: "sts:AssumeRole"

How the N:M mapping is managed:

  • To grant a group access to a role → add the role ARN to the group’s AssumeRole Resource list

  • To revoke access → remove the role ARN from the group’s policy

  • The template generator reads the assignments mapping and auto-generates both sides
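The group side of the handshake can be derived mechanically from one assignment entry. The sketch below assumes the naming pattern used in this document; `role_arn` and `group_assume_policy` are hypothetical helper names, not the generator's actual functions.

```python
def role_arn(account_id, prefix, env, tenant_id, role_name):
    """ARN of an assumable role, following the arole- naming pattern."""
    return f"arn:aws:iam::{account_id}:role/{prefix}-{env}-{tenant_id}-arole-{role_name}"


def group_assume_policy(account_id, prefix, env, tenant_id, role_names):
    """Inline policy for a group: allow sts:AssumeRole on exactly the assigned roles."""
    return {
        "PolicyName": "AllowAssumeRoles",
        "PolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": "sts:AssumeRole",
                    "Resource": [
                        role_arn(account_id, prefix, env, tenant_id, name)
                        for name in role_names
                    ],
                }
            ],
        },
    }
```

Granting or revoking access then reduces to editing the role list in the assignments mapping and regenerating the policy.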

The Hallway Analogy:

Imagine two IAM users. One has a sign on their back that says “DS” (Data Scientists group). The other has “MLEng” (ML Engineers group). Both belong to the same AWS account.

                    ┌─────────────────┐
                    │   DOOR 1        │
                    │   Trust Policy  │
                    │   (Account Root)│
                    └────────┬────────┘
                             │
                    Both DS and MLEng
                    pass through (same account)
                             │
                    ┌────────┴────────┐
                    │   LONG HALL     │
                    └────────┬────────┘
                             │
                 ┌───────────┴───────────┐
                 │                       │
        ┌────────┴────────┐     ┌────────┴────────┐
        │   DOOR 2a       │     │   DOOR 2b       │
        │   DS Group      │     │   MLEng Group   │
        │   AssumeRole    │     │   AssumeRole    │
        │   Policy        │     │   Policy        │
        └────────┬────────┘     └────────┬────────┘
                 │                       │
          The truth is              The truth is
          revealed:                 revealed:
          ├── arole-ds-standard ✅   ├── arole-ds-experiment ✅
          ├── arole-ds-experiment ✅  ├── arole-ml-deploy ✅
          └── arole-ml-deploy ❌     └── arole-bedrock-manage ✅

Door 1 (trust policy) lets everyone from the same account into the hall; this is the account root principal. The sign on your back (your group membership) determines which corridor you walk down. Door 2 (the group’s AssumeRole policy) reveals the truth: exactly which roles you can assume. No more, no less.


AD-5: Managed Policy Composition

Decision: Roles are composed by attaching multiple granular managed policies, not by writing inline permissions.

Rationale:

Three approaches were evaluated:

| Approach | Description | Verdict |
|----------|-------------|---------|
| Policy-to-Role (managed) | Attach modular managed policies to roles | ✅ Best practice |
| Role-to-Group mapping | Groups assume roles via AssumeRole | ✅ Used for N:M |
| Inline policy bundling | Write JSON permissions directly inside roles | ❌ Maintenance nightmare |

Why managed policies win:

  • DRY: Update the S3 read-only policy once, and every role using it is updated instantly

  • Auditable: You can see exactly which building blocks make up a role

  • Reusable: The same policy (e.g., ECR read-only) attaches to multiple roles

  • Testable: Each policy can be validated independently

Note: This requires converting our current inline policies to standalone managed policies. The NAMING_CONVENTIONS.md already has a roadmap note for this: {company_prefix}-{env}-{tenant_id}-policy-{policy_name}.
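Because roles only reference policies by name, the set of roles affected by a policy change can be computed directly from the config. A minimal sketch, assuming role definitions shaped like the roles section of the config; `roles_using_policy` is a hypothetical helper.

```python
# Role definitions as they might look after loading the config (names from this document).
ROLES = [
    {"name": "ds-standard", "policies": [
        "s3-project-buckets-only", "ecr-read-only", "pipeline-read-only",
        "sagemaker-dev-invoke", "bedrock-invoke-only"]},
    {"name": "ds-experiment", "policies": [
        "s3-project-buckets-full", "sagemaker-dev-invoke", "bedrock-invoke-only"]},
    {"name": "ml-deploy", "policies": [
        "ecr-dev-read-write", "pipeline-project-dev", "lambda-deploy-manage"]},
]


def roles_using_policy(roles, policy_name):
    """Names of all roles that attach the given managed policy."""
    return [role["name"] for role in roles if policy_name in role["policies"]]
```

Here `roles_using_policy(ROLES, "sagemaker-dev-invoke")` lists both data science roles, which is the audit question ("what does editing this policy touch?") that inline policies make hard to answer.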


AD-6: Security Model Flag

Decision: Support both groups-only and roles-based security models via a config flag (security_model: "groups-only" | "roles-based").

Rationale:

Three options were evaluated:

| Option | Approach | Pros | Cons |
|--------|----------|------|------|
| A | Keep group-level policies AND add Roles layer | Backward compatible, no breakage | Double maintenance, technical debt |
| B | Migrate all policies from groups to roles | Clean architecture, single model | Breaking change, high risk |
| C | Feature flag to support both modes | Incremental migration, instant rollback | Adds code complexity |

Option C was selected because:

  • Product feature, not just migration tool: Different clients need different maturity levels:

    • Startup clients (3 groups, small team) may prefer groups-only: simple, with no AssumeRole workflow to teach

    • Enterprise clients (11 groups, compliance) need roles-based for audit trails and session-based access

    • Medium clients may start groups-only and migrate to roles-based as they mature

  • Safety net: Allows migrating groups one by one or rolling back instantly if a workflow breaks

  • End goal is Option B: The flag provides the clean architecture of Option B with the safety of Option A during transition

Config usage:

security:
  security_model: "groups-only"      # Policies attached directly to groups (current behavior)
  # or:
  # security_model: "roles-based"    # Policies on roles, groups get AssumeRole only

Template generator behavior:

  • groups-only: Current behavior. Groups get managed_policies and custom_policies directly.

  • roles-based: Groups get only an AssumeRole inline policy. Assumable roles get the managed policies. Assignments mapping drives the wiring.
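The two generator modes amount to a branch on one config value. The sketch below shows that branch in isolation; `plan_resources` and its return shape are hypothetical, and defaulting to groups-only when the flag is absent is an assumption based on it being the current behavior.

```python
VALID_MODELS = ("groups-only", "roles-based")


def plan_resources(config):
    """Describe what a generator would emit for the configured security model."""
    # Assumption: a missing flag means the existing groups-only behavior.
    model = config.get("security", {}).get("security_model", "groups-only")
    if model not in VALID_MODELS:
        raise ValueError(f"unknown security_model: {model!r}")
    if model == "groups-only":
        return {
            "model": model,
            "groups_get": ["managed_policies", "custom_policies"],
            "emit_roles": False,
        }
    return {
        "model": model,
        "groups_get": ["assume_role_inline_policy"],
        "emit_roles": True,
    }
```

Rejecting unknown values early keeps a typo in the flag from silently falling back to one of the two models.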


Architecture Overview

Three-Layer Model

The architecture has three distinct layers:

┌─────────────────────────────────────────────────┐
│  LAYER 1: GROUPS (Who)                          │
│  IAM Groups representing job functions          │
│  e.g., data-scientists, ml-engineers, auditors  │
└──────────────────────┬──────────────────────────┘
                       │ sts:AssumeRole (N:M mapping)
                       ▼
┌─────────────────────────────────────────────────┐
│  LAYER 2: ROLES (What)                          │
│  Assumable roles as permission bundles          │
│  e.g., arole-ds-standard, arole-ml-deploy       │
└──────────────────────┬──────────────────────────┘
                       │ Policy attachments
                       ▼
┌─────────────────────────────────────────────────┐
│  LAYER 3: POLICIES (How)                        │
│  Granular managed policies per service/level    │
│  e.g., s3-read-only, ecr-dev-read-write         │
└─────────────────────────────────────────────────┘

Resource Relationship Diagram

IAM Group (data-scientists)
    ├── can assume → arole-ds-standard
    │                   ├── attached: policy-s3-project-buckets-only
    │                   ├── attached: policy-ecr-read-only
    │                   ├── attached: policy-pipeline-read-only
    │                   ├── attached: policy-sagemaker-dev-invoke
    │                   └── attached: policy-bedrock-invoke-only
    │
    └── can assume → arole-ds-experiment
                        ├── attached: policy-s3-project-buckets-full
                        ├── attached: policy-sagemaker-dev-invoke
                        └── attached: policy-bedrock-invoke-only

IAM Group (ml-engineers)
    ├── can assume → arole-ds-experiment (shared with data-scientists)
    ├── can assume → arole-ml-deploy
    │                   ├── attached: policy-ecr-dev-read-write
    │                   ├── attached: policy-pipeline-project-dev
    │                   └── attached: policy-lambda-deploy-manage
    │
    └── can assume → arole-bedrock-manage
                        ├── attached: policy-bedrock-model-manage
                        └── attached: policy-bedrock-observability

IAM Group (platform-administrators)
    └── can assume → arole-platform-full
                        ├── attached: policy-s3-full
                        ├── attached: policy-ecr-full
                        ├── attached: policy-pipeline-full
                        ├── attached: policy-sagemaker-full
                        ├── attached: policy-lambda-full
                        └── attached: policy-bedrock-full

Real-World Example

Scenario: Edge AI, medium tier (9 groups)

Sarah (data scientist) is in the data-scientists group. Her day:

  1. Morning: Assumes arole-ds-standard to read training data from S3 and invoke Bedrock for embeddings

  2. Afternoon: Assumes arole-ds-experiment to write experiment results to S3 and invoke dev SageMaker endpoints

  3. End of day: Session expires, no standing permissions

James (ML engineer) is in the ml-engineers group. His day:

  1. Morning: Assumes arole-ml-deploy to push Docker images and update a pipeline

  2. Afternoon: Assumes arole-bedrock-manage to configure a new guardrail for the production chatbot

  3. Debugging: Assumes arole-ds-experiment (shared role) to reproduce a data scientist’s issue


Naming Conventions

Assumable Roles

Pattern: {company_prefix}-{env}-{tenant_id}-arole-{role_name}

Examples:

  • edge-prod-b001-arole-ds-standard

  • edge-prod-b001-arole-ml-deploy

  • edge-prod-b001-arole-platform-full

  • globalbank-prod-c001-arole-bedrock-manage

Rationale: The arole- prefix distinguishes assumable roles from service roles (role-). This distinction matters because:

  • Service roles (role-sagemaker-execution) are assumed by AWS services to run jobs

  • Assumable roles (arole-data-scientists) are assumed by humans/groups to get permissions

  • Different trust relationships, different audit trails, different lifecycle management

Managed Policies

Pattern: {company_prefix}-{env}-{tenant_id}-policy-{service}-{level}

Examples:

  • edge-prod-b001-policy-s3-read-only

  • edge-prod-b001-policy-ecr-dev-read-write

  • edge-prod-b001-policy-pipeline-project-dev

  • edge-prod-b001-policy-sagemaker-dev-invoke

  • edge-prod-b001-policy-lambda-deploy-manage

  • edge-prod-b001-policy-bedrock-invoke-only

  • edge-prod-b001-policy-bedrock-model-manage

  • edge-prod-b001-policy-bedrock-full

  • edge-prod-b001-policy-kms-level1-read-only

  • edge-prod-b001-policy-trusted-advisor-level1-read-only

Test Resources

Pattern: {base_name}-test-{6_digit_random}

Examples:

  • edge-prod-b001-arole-ds-standard-test-a3f9c2

  • edge-prod-b001-policy-s3-read-only-test-a3f9c2
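The three naming patterns can be expressed as small formatting helpers. This is a sketch only: the function names are hypothetical, and generating the test suffix as six hexadecimal characters is an assumption inferred from the a3f9c2 example above.

```python
import secrets


def arole_name(prefix, env, tenant_id, role_name):
    """{company_prefix}-{env}-{tenant_id}-arole-{role_name}"""
    return f"{prefix}-{env}-{tenant_id}-arole-{role_name}"


def policy_name(prefix, env, tenant_id, service_level):
    """{company_prefix}-{env}-{tenant_id}-policy-{service}-{level}"""
    return f"{prefix}-{env}-{tenant_id}-policy-{service_level}"


def with_test_suffix(base_name, suffix=None):
    """{base_name}-test-{suffix}; a random 6-character hex suffix is used if none is given."""
    if suffix is None:
        suffix = secrets.token_hex(3)  # 3 random bytes -> 6 hex characters
    return f"{base_name}-test-{suffix}"
```

Passing the same suffix to every resource in one test deployment keeps related test resources grouped, as in the two examples above.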


Config Schema Design

Roles Section

Roles are standalone objects. Each role defines a name and a list of managed policy references.

roles:
  - name: "ds-standard"
    description: "Standard data science access - read data, invoke models"
    policies:
      - s3-project-buckets-only
      - ecr-read-only
      - pipeline-read-only
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ds-experiment"
    description: "Experiment access - write results, invoke dev endpoints"
    policies:
      - s3-project-buckets-full
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ml-deploy"
    description: "ML deployment access - push images, manage pipelines, deploy functions"
    policies:
      - ecr-dev-read-write
      - pipeline-project-dev
      - lambda-deploy-manage

  - name: "bedrock-manage"
    description: "Bedrock management - guardrails, model access, imports"
    policies:
      - bedrock-model-manage

  - name: "platform-full"
    description: "Full platform administration"
    policies:
      - s3-full
      - ecr-full
      - pipeline-full
      - sagemaker-full
      - lambda-full
      - bedrock-full

Assignments Section

The N:M mapping between groups and roles.

assignments:
  - group: "data-scientists"
    roles:
      - "ds-standard"
      - "ds-experiment"

  - group: "ml-engineers"
    roles:
      - "ds-experiment"
      - "ml-deploy"
      - "bedrock-manage"

  - group: "platform-administrators"
    roles:
      - "platform-full"

  - group: "operations-support"
    roles:
      - "ds-standard"
      - "ml-deploy"

# Groups using direct policy assignments (no assumable roles):
#   - business-consumers: Non-technical users (dashboards, reports). Minimal invoke-only
#     permissions (sagemaker level1-prod, bedrock level1). Adding sts:AssumeRole friction
#     provides no security benefit for users who only call inference endpoints.
#   - external-contractors: Minimal read-only access (s3 level1, sagemaker level1, bedrock level1).
#     Role assumption deferred to v2 (TODO #8) when MFA + IP + session constraints are
#     implemented; adding a role without those controls adds plumbing with no security value.

Complete Config Example

client: "edge"
environment: "prod"
tenant_id: "b001"
region: "us-west-2"
tier: "medium-9"

groups:
  - name: "data-scientists"
    description: "Data science team"
  - name: "ml-engineers"
    description: "ML engineering team"
  - name: "platform-administrators"
    description: "Platform admin team"
  - name: "business-consumers"
    description: "Business stakeholders"
  - name: "operations-support"
    description: "Operations team"

roles:
  - name: "ds-standard"
    description: "Standard data science access"
    policies:
      - s3-project-buckets-only
      - ecr-read-only
      - pipeline-read-only
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ds-experiment"
    description: "Experiment access with write permissions"
    policies:
      - s3-project-buckets-full
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ml-deploy"
    description: "ML deployment and CI/CD access"
    policies:
      - ecr-dev-read-write
      - pipeline-project-dev
      - lambda-deploy-manage

  - name: "bedrock-manage"
    description: "Bedrock model and guardrail management"
    policies:
      - bedrock-model-manage

  - name: "platform-full"
    description: "Full platform administration"
    policies:
      - s3-full
      - ecr-full
      - pipeline-full
      - sagemaker-full
      - lambda-full
      - bedrock-full

assignments:
  - group: "data-scientists"
    roles:
      - "ds-standard"
      - "ds-experiment"

  - group: "ml-engineers"
    roles:
      - "ds-experiment"
      - "ml-deploy"
      - "bedrock-manage"

  - group: "platform-administrators"
    roles:
      - "platform-full"

  - group: "operations-support"
    roles:
      - "ds-standard"
      - "ml-deploy"

# Groups using direct policy assignments (no assumable roles):
#   - business-consumers: Non-technical users (dashboards, reports). Minimal invoke-only
#     permissions (sagemaker level1-prod, bedrock level1). Adding sts:AssumeRole friction
#     provides no security benefit for users who only call inference endpoints.
#   - external-contractors: Minimal read-only access (s3 level1, sagemaker level1, bedrock level1).
#     Role assumption deferred to v2 (TODO #8) when MFA + IP + session constraints are
#     implemented; adding a role without those controls adds plumbing with no security value.
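Since assignments refer to groups and roles by name, a config like the one above can be checked for dangling references before any template is generated. The following is a minimal sketch of such a check; `find_dangling_references` is a hypothetical helper and does not reflect the tool's actual validation code.

```python
def find_dangling_references(config):
    """Return error messages for assignments that name undefined groups or roles."""
    group_names = {group["name"] for group in config.get("groups", [])}
    role_names = {role["name"] for role in config.get("roles", [])}
    errors = []
    for assignment in config.get("assignments", []):
        group = assignment["group"]
        if group not in group_names:
            errors.append(f"assignment references unknown group: {group}")
        for role in assignment.get("roles", []):
            if role not in role_names:
                errors.append(f"group {group} references unknown role: {role}")
    return errors
```

An empty list means every assignment resolves; a generator would typically refuse to continue otherwise.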

CloudFormation Resource Generation

The template generator reads the config and produces these CloudFormation resources:

Role Resource

For each role in the roles section:

# Generated for: arole-ds-standard
EdgeProdB001AroleDsStandard:
  Type: AWS::IAM::Role
  Properties:
    RoleName: "edge-prod-b001-arole-ds-standard"
    Description: "Standard data science access - read data, invoke models"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
          Action: "sts:AssumeRole"
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-s3-project-buckets-only"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-ecr-read-only"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-pipeline-read-only"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-sagemaker-dev-invoke"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-bedrock-invoke-only"
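The mapping from a role entry in the config to the resource above could look roughly like the sketch below. It builds the resource as a Python dictionary using the long-form `Fn::Sub` key, which CloudFormation treats as equivalent to the `!Sub` shorthand. `logical_id` and `role_resource` are hypothetical names, and the logical ID rule (capitalize each hyphen-separated part) is inferred from the example rather than taken from the generator.

```python
def logical_id(physical_name):
    """'edge-prod-b001-arole-ds-standard' -> 'EdgeProdB001AroleDsStandard'."""
    return "".join(part.capitalize() for part in physical_name.split("-"))


def role_resource(prefix, env, tenant_id, role):
    """Return (logical_id, resource) for one entry of the roles section."""
    base = f"{prefix}-{env}-{tenant_id}"
    name = f"{base}-arole-{role['name']}"
    resource = {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "RoleName": name,
            "Description": role.get("description", ""),
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {"AWS": {"Fn::Sub": "arn:aws:iam::${AWS::AccountId}:root"}},
                    "Action": "sts:AssumeRole",
                }],
            },
            # One ARN per policy reference, following the policy- naming pattern.
            "ManagedPolicyArns": [
                {"Fn::Sub": f"arn:aws:iam::${{AWS::AccountId}}:policy/{base}-policy-{policy}"}
                for policy in role["policies"]
            ],
        },
    }
    return logical_id(name), resource
```

The account ID is left as a CloudFormation pseudo parameter so the same template can be deployed to any account.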

Group AssumeRole Policy

For each group in the assignments section, an inline policy is added:

# Generated for: group-data-scientists
EdgeProdB001GroupDataScientists:
  Type: AWS::IAM::Group
  Properties:
    GroupName: "edge-prod-b001-group-data-scientists"
    Policies:
      - PolicyName: "AllowAssumeRoles"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: "sts:AssumeRole"
              Resource:
                - !Sub "arn:aws:iam::${AWS::AccountId}:role/edge-prod-b001-arole-ds-standard"
                - !Sub "arn:aws:iam::${AWS::AccountId}:role/edge-prod-b001-arole-ds-experiment"

Service Roles and Policy Assignments

Service roles are IAM Roles assumed by AWS services and CI/CD platforms, not by human users. Unlike assumable roles (Layer 2), which humans assume explicitly via sts:AssumeRole, service roles are assumed automatically by machines, as permitted by their trust policies.

Why service_account Was Removed from iam_groups

The original enterprise config had a service_account IAM group with PowerUserAccess + inline iam:PassRole + cloudformation:* on *. This was an anti-pattern for three reasons:

  1. IAM Groups are for humans: Groups are collections of IAM Users. Service accounts/machine identities should be IAM Roles assumed by the CI/CD platform, not IAM Users with long-lived access keys.

  2. PowerUserAccess is a sledgehammer: It grants access to every AWS service except IAM management. A CI/CD pipeline only needs access to the specific services it deploys to.

  3. iam:PassRole on * is privilege escalation: Unscoped PassRole allows passing any role to any service, effectively granting admin access through role chaining.

Decision: Remove service_account from iam_groups and replace with ci_cd_deployment_role under service_roles, using the policy_assignments system for scoped permissions.

Service Roles Use Policy Assignments

Service roles can reference the same policy level system as IAM groups. This ensures consistency: a sagemaker: level4-ci assignment on a service role uses the exact same policy definition as it would on a group.

service_roles:
  ci_cd_deployment:
    description: "CI/CD deployment role - build, test, deploy across the ML platform"
    trust_policy:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: codepipeline.amazonaws.com
          Action: sts:AssumeRole
    policy_assignments:
      s3: level2              # project-buckets-only
      ecr: level3             # ci-read-write
      pipeline: level3        # project-ci
      sagemaker: level4-ci    # deploy-only: no delete, no traffic shifting
      lambda: level2          # deploy-manage
The key difference from group policy_assignments:

  • Groups get policies attached directly (inline or customer-managed)

  • Service roles get the same policies attached to the role itself

  • Trust policy controls who can assume the role (AWS service, OIDC provider, or another account)

ci_cd_deployment_role Design

The CI/CD deployment role replaces the old service_account group. It covers the full deployment lifecycle:

| Service | Level | CI/CD Stage | What It Does |
|---------|-------|-------------|--------------|
| S3 | level2 (project-buckets-only) | Build + Deploy | Read/write model artifacts, deployment packages |
| ECR | level3 (ci-read-write) | Build | Build and push container images to registry |
| Pipeline | level3 (project-ci) | Orchestration | Create, execute, manage ML pipelines |
| SageMaker | level4-ci (deploy-only) | Deploy | Create endpoints, register models, configure autoscaling. No delete, no traffic shifting |
| Lambda | level2 (deploy-manage) | Deploy | Deploy and update inference functions |

What’s NOT included:

  • No Bedrock: model access and guardrail management is a human decision

  • No iam:PassRole on *: PassRole is scoped within SageMaker level4-ci to {company_prefix}-{env}-*-role-* conditioned to sagemaker.amazonaws.com

  • No cloudformation:*: CloudFormation access is outside the 6-service scope and should be handled separately if needed for IaC deployments

  • No managed policies: the level system covers everything

SageMaker level4-ci guardrails:

  • Explicit deny on DeleteEndpoint, DeleteEndpointConfig, DeleteModel, DeleteModelPackage, DeleteModelPackageGroup

  • Explicit deny on UpdateEndpointWeightsAndCapacities (traffic shifting)

  • Explicit deny on DeleteDomain, DeleteUserProfile

  • Pipelines deploy forward; teardown requires separate authorization
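The explicit denies listed above could be collected into a single policy statement. This is a sketch under assumptions: the statement ID and the `"Resource": "*"` scope are illustrative choices, not taken from the actual level4-ci policy definition; only the action names come from the list above.

```python
# Actions named in the guardrail list above.
DENIED_SAGEMAKER_ACTIONS = [
    "DeleteEndpoint",
    "DeleteEndpointConfig",
    "DeleteModel",
    "DeleteModelPackage",
    "DeleteModelPackageGroup",
    "UpdateEndpointWeightsAndCapacities",  # traffic shifting
    "DeleteDomain",
    "DeleteUserProfile",
]


def sagemaker_ci_deny_statement():
    """Explicit Deny statement; in IAM an explicit Deny overrides any Allow."""
    return {
        "Sid": "DenyTeardownAndTrafficShifting",  # hypothetical Sid
        "Effect": "Deny",
        "Action": [f"sagemaker:{action}" for action in DENIED_SAGEMAKER_ACTIONS],
        "Resource": "*",  # assumption: deny regardless of resource
    }
```

Using an explicit Deny rather than simply omitting the actions means the guardrail holds even if another attached policy later allows them.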

Trust Policy Patterns

The trust policy depends on the CI/CD platform. Common patterns:

AWS CodePipeline / CodeBuild:

trust_policy:
  Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Principal:
        Service:
          - codepipeline.amazonaws.com
          - codebuild.amazonaws.com
      Action: sts:AssumeRole

GitHub Actions (OIDC):

trust_policy:
  Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Principal:
        Federated: arn:aws:iam::{account_id}:oidc-provider/token.actions.githubusercontent.com
      Action: sts:AssumeRoleWithWebIdentity
      Condition:
        StringEquals:
          token.actions.githubusercontent.com:aud: sts.amazonaws.com
        StringLike:
          token.actions.githubusercontent.com:sub: repo:{org}/{repo}:*

GitLab CI (OIDC):

trust_policy:
  Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Principal:
        Federated: arn:aws:iam::{account_id}:oidc-provider/gitlab.com
      Action: sts:AssumeRoleWithWebIdentity
      Condition:
        StringEquals:
          gitlab.com:aud: https://gitlab.com
        StringLike:
          gitlab.com:sub: project_path:{group}/{project}:*

The trust policy is client-specific and configured per deployment. The policy_assignments remain the same regardless of which CI/CD platform assumes the role.
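One way to keep the trust policy client-specific while leaving everything else unchanged is to select a builder by platform name. The sketch below covers two of the patterns above; the function names, the platform keys, and the keyword parameters are all hypothetical.

```python
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def service_trust_policy(service_principals):
    """Trust policy for AWS service principals (e.g. CodePipeline, CodeBuild)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": list(service_principals)},
            "Action": "sts:AssumeRole",
        }],
    }


def github_oidc_trust_policy(account_id, org, repo):
    """Trust policy for GitHub Actions via OIDC, limited to one repository."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                "StringLike": {f"{GITHUB_OIDC_HOST}:sub": f"repo:{org}/{repo}:*"},
            },
        }],
    }


def trust_policy_for(platform, **params):
    """Pick the trust policy builder for a CI/CD platform key."""
    if platform == "codepipeline":
        return service_trust_policy(
            ["codepipeline.amazonaws.com", "codebuild.amazonaws.com"])
    if platform == "github":
        return github_oidc_trust_policy(
            params["account_id"], params["org"], params["repo"])
    raise ValueError(f"unsupported CI/CD platform: {platform!r}")
```

A GitLab builder would follow the same shape as the GitHub one, with the provider host and claim keys swapped for the values shown above.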


Implementation Plan

Phase 1: Architecture and Config Design

  1. ~~Design architectural decisions~~ ✅

  2. ~~Design config schema with roles and assignments sections~~ ✅

  3. ~~Update NAMING_CONVENTIONS.md with arole- and policy- patterns~~ ✅

  4. ~~Document architecture in POLICY_GUIDE.md~~ ✅

Phase 2: Config and Validation

  1. ~~Update validation schemas (startup, medium, enterprise) with roles and assignments~~ ✅

  2. ~~Update client configs with role definitions and assignments~~ ✅

  3. ~~Update config validation code to validate N:M mappings (no dangling references)~~ ✅

Phase 3: Template Generator

  1. ~~Extend template generator to emit IAM Role resources with trust policies~~ ✅

  2. ~~Emit managed policy resources from POLICY_GUIDE.md definitions~~ ✅

  3. ~~Emit AssumeRole inline policies on groups from assignments mapping~~ ✅

  4. ~~Convert existing inline policies to standalone managed policies~~ ✅

Phase 4: Deployment and Testing

  1. ~~Test with --template-only (validate generated CloudFormation)~~ ✅

  2. ~~Test with --test-deploy (real AWS, unique names)~~ ✅

  3. ~~Add troubleshooting scenarios for AssumeRole denied errors~~ ✅

  4. ~~Update POLICY_GUIDE.md Assignment Recommendations~~ ✅

Phase 5: Future Work

  1. Config-driven Bedrock model scoping (see ROADMAP.md)

  2. Lambda VPC governance (see ROADMAP.md)


v2 Roadmap

AD-7: MFA and Session Duration Controls

What: Support time-bound and conditional role assumption, requiring MFA for sensitive roles and setting session duration limits.

Why:

  • Reduced attack window: Limiting session duration (1 to 12 hours) restricts the usefulness of stolen session tokens

  • Session hijacking mitigation: Prevents persistent sessions from bypassing MFA

  • Contextual security: Require MFA only for high-risk roles (e.g., platform-full) without constant user friction

Implementation approach:

  • Add Condition block to trust policies requiring aws:MultiFactorAuthPresent

  • Add max_session_duration to role config (default 1 hour, configurable up to 12 hours)

  • Add mfa_required: true/false to assignments config per group-role pair

Best practices to follow:

  • Role-based durations: critical roles use shorter sessions (1 hour), standard roles use longer (4–8 hours)

  • Avoid MFA fatigue: don’t force MFA for low-risk, everyday roles

  • Use idle timeouts (15–60 minutes) alongside absolute session limits
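The implementation approach above could translate into role properties roughly as follows. This is a sketch of a planned v2 feature, so everything here is an assumption: `apply_session_controls` is a hypothetical helper that takes the `Properties` of an `AWS::IAM::Role` resource and returns a modified copy. `MaxSessionDuration` is expressed in seconds, and the MFA requirement is added as a `Bool` condition on `aws:MultiFactorAuthPresent`.

```python
import copy


def apply_session_controls(role_properties, mfa_required=False, max_session_hours=1):
    """Return a copy of the role properties with session limits and optional MFA."""
    if not 1 <= max_session_hours <= 12:
        raise ValueError("max_session_hours must be between 1 and 12")
    props = copy.deepcopy(role_properties)
    props["MaxSessionDuration"] = max_session_hours * 3600  # seconds
    if mfa_required:
        for statement in props["AssumeRolePolicyDocument"]["Statement"]:
            condition = statement.setdefault("Condition", {})
            condition.setdefault("Bool", {})["aws:MultiFactorAuthPresent"] = "true"
    return props
```

Returning a copy keeps the base role definition reusable, so the same role can be rendered with and without MFA during a staged rollout.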

AD-8: Permission Boundaries

What: Allow client team leads to create roles within guardrails without risking privilege escalation.

Implementation approach:

  • Define permission boundary policies per tier

  • Attach boundaries to delegated admin roles

  • Ensure no role can exceed its boundary regardless of attached policies
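A possible shape for this v2 feature is sketched below, with all names hypothetical. It uses two IAM mechanisms: the `PermissionsBoundary` property on a role, and the `iam:PermissionsBoundary` condition key, which allows `iam:CreateRole` only when the new role carries the expected boundary. The boundary policy naming (`policy-boundary-{tier}`) is an assumption that reuses the managed policy pattern.

```python
import copy


def boundary_arn(account_id, prefix, env, tenant_id, tier):
    """ARN of the per-tier boundary policy (hypothetical naming)."""
    return f"arn:aws:iam::{account_id}:policy/{prefix}-{env}-{tenant_id}-policy-boundary-{tier}"


def with_permissions_boundary(role_properties, arn):
    """Return a copy of the role properties with the boundary attached."""
    props = copy.deepcopy(role_properties)
    props["PermissionsBoundary"] = arn
    return props


def delegated_create_role_statement(arn, role_arn_pattern):
    """Allow creating roles only if they are created with the given boundary."""
    return {
        "Effect": "Allow",
        "Action": "iam:CreateRole",
        "Resource": role_arn_pattern,
        "Condition": {"StringEquals": {"iam:PermissionsBoundary": arn}},
    }
```

With the boundary in place, a role's effective permissions are the intersection of its attached policies and the boundary, which is what prevents escalation beyond the tier's limits.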


Motivating Scenarios: Why Groups-Only Is Insufficient

The following real-world scenarios demonstrate why an IAM Groups-only architecture (without assumable roles) would fail to meet enterprise client needs. These are the problems the Roles layer solves.

Scenario 1: Temporary Cross-Team Access

Situation: A data scientist needs to debug a failing ML pipeline. The pipeline logs and configuration are only accessible to the ml-engineers group.

Groups-only problem: The admin must add the data scientist to the ml-engineers group, giving them all ML engineer permissions (ECR write, pipeline management, Lambda deploy). After debugging, the admin must remember to remove them. This is error-prone, over-privileged, and leaves no audit trail of the temporary access.

Roles solution: The data scientist assumes arole-pipeline-viewer (read-only pipeline access) for a single session. No group membership changes, no over-provisioning, session expires automatically.

Scenario 2: Environment-Specific Access ControlΒΆ

Situation: A backend developer needs to invoke production SageMaker endpoints for the live application but should never touch dev/staging endpoints (to avoid accidentally routing production traffic to unstable models).

Groups-only problem: Group policies are static bundles. You’d need separate groups for “backend-dev-invoke” and “backend-prod-invoke”, leading to group explosion. With 5 environments × 10 job functions, you’d need 50 groups.

Roles solution: Define arole-prod-invoke and arole-dev-invoke as separate roles. The backend-developers group is assigned only arole-prod-invoke. Clean, no group explosion.

Scenario 3: Least-Privilege for Automation

Situation: A CI/CD pipeline needs different permissions at different stages: read ECR during build, write ECR during push, invoke SageMaker during integration tests, deploy Lambda during release.

Groups-only problem: The pipeline’s IAM entity must be in a single group with all permissions combined. It holds write access during the build stage, when it needs only read access. There is no stage-level isolation.

Roles solution: The pipeline assumes different roles at each stage: arole-ecr-read during build, arole-ecr-write during push, arole-sagemaker-invoke during tests, arole-lambda-deploy during release. Each stage has exactly the permissions it needs.
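
The per-stage assumption can be sketched as a lookup from stage to role that produces the parameters for an `sts:AssumeRole` call. The stage keys, the `assume_role_params` helper, the session-name format, and the account ID are assumptions for illustration; the role names are the ones listed above. `RoleArn`, `RoleSessionName`, and `DurationSeconds` are the standard STS AssumeRole request parameters.

```python
# Stage-to-role mapping taken from the scenario; the stage keys are hypothetical.
STAGE_ROLES = {
    "build": "arole-ecr-read",
    "push": "arole-ecr-write",
    "test": "arole-sagemaker-invoke",
    "release": "arole-lambda-deploy",
}


def assume_role_params(
    stage: str, account_id: str, run_id: str, duration_seconds: int = 3600
) -> dict:
    """Return AssumeRole request parameters for one pipeline stage."""
    if stage not in STAGE_ROLES:
        raise KeyError(f"unknown pipeline stage: {stage!r}")
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/{STAGE_ROLES[stage]}",
        # The session name appears in CloudTrail, so include the stage and run.
        "RoleSessionName": f"{stage}-{run_id}",
        "DurationSeconds": duration_seconds,
    }


if __name__ == "__main__":
    # "123456789012" and "run42" are placeholder values.
    for stage in STAGE_ROLES:
        print(assume_role_params(stage, "123456789012", "run42"))
```

With boto3 available, the returned dictionary could be passed to an STS client as `sts.assume_role(**params)`; each stage then works with credentials scoped to that stage's role only.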

Scenario 4: Compliance and Audit Requirements

Situation: An enterprise client (e.g., a bank) requires that every privileged action is traceable to a specific permission grant, with clear evidence of when access was assumed and when it expired.

Groups-only problem: Group membership is a persistent state. CloudTrail shows the user’s identity but not which specific permission bundle they were using. There’s no session boundary: the user has all group permissions all the time.

Roles solution: Every sts:AssumeRole call is logged in CloudTrail with the role ARN, session name, and timestamp. Auditors can see exactly which role was assumed, when, and for how long. Session expiry provides natural access boundaries.
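
As an illustration of the audit view, the sketch below filters already-parsed CloudTrail records for `AssumeRole` events and pulls out who assumed which role and when. It assumes the records were loaded from CloudTrail JSON into Python dictionaries; the `assume_role_sessions` helper and the sample record are made up for this example, while the field names (`eventSource`, `eventName`, `eventTime`, `userIdentity`, `requestParameters`) follow the CloudTrail record format.

```python
def assume_role_sessions(events: list[dict]) -> list[dict]:
    """Extract caller, role, session name, and time from AssumeRole records."""
    sessions = []
    for event in events:
        if event.get("eventSource") != "sts.amazonaws.com":
            continue
        if event.get("eventName") != "AssumeRole":
            continue
        params = event.get("requestParameters") or {}
        identity = event.get("userIdentity") or {}
        sessions.append(
            {
                "caller": identity.get("arn"),
                "role_arn": params.get("roleArn"),
                "session_name": params.get("roleSessionName"),
                "assumed_at": event.get("eventTime"),
            }
        )
    return sessions


if __name__ == "__main__":
    # Fabricated sample record, for illustration only.
    sample = [
        {
            "eventSource": "sts.amazonaws.com",
            "eventName": "AssumeRole",
            "eventTime": "2025-01-15T09:30:00Z",
            "userIdentity": {"arn": "arn:aws:iam::123456789012:user/example-user"},
            "requestParameters": {
                "roleArn": "arn:aws:iam::123456789012:role/arole-pipeline-viewer",
                "roleSessionName": "debug-session",
            },
        },
        {"eventSource": "s3.amazonaws.com", "eventName": "GetObject"},
    ]
    for session in assume_role_sessions(sample):
        print(session)
```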

Scenario 5: FinOps Cost Control

Situation: Only two people on the platform team should be able to create Bedrock provisioned throughput (which can cost thousands per month). Other platform admins should have full access to everything else.

Groups-only problem: All platform-administrators share the same group policy. Either everyone can provision throughput, or no one can. You can’t differentiate within a group.

Roles solution: Define arole-platform-full (everything except throughput) and arole-bedrock-throughput (adds provisioned throughput). Assign arole-bedrock-throughput only to the two approved FinOps engineers. Same group, different role access.

Scenario 6: Onboarding and Offboarding

Situation: A new ML engineer joins the team. They should start with read-only access for the first week, then graduate to full ML engineer access.

Groups-only problem: You either put them in the ml-engineers group immediately (over-privileged on day one) or create a temporary “ml-engineers-readonly” group (group explosion, manual cleanup).

Roles solution: Add them to the ml-engineers group on day one. The group has access to both arole-ml-readonly and arole-ml-deploy. During onboarding, they use only arole-ml-readonly. After the first week, they start assuming arole-ml-deploy. No group changes are needed: access is determined by which role they choose to assume.


Why 1:1 Mapping Was Rejected

For reference, a 1:1 mapping (one group → one role) was initially considered and rejected:

| Concern | 1:1 Limitation | N:M Solution |
| --- | --- | --- |
| Flexibility | User locked to one persona | User assumes different roles as needed |
| Role switching | Requires group membership changes | Switch roles via AssumeRole |
| Scalability | Group explosion (one group per permission combo) | Shared roles reduce total count |
| Cross-account | Static mapping, no dynamic assumption | Role ARNs can reference other accounts |
| Temporary access | Requires adding/removing from groups | Assume a role temporarily, session expires |

When 1:1 is acceptable:

  • Very small setups (2–3 users, personal accounts)

  • Strictly defined personas (e.g., a CI/CD pipeline with one fixed function)




Document Version: 1.0 | Last Updated: 2025 | Maintained By: MLOps Platform Team