Roles Architecture

Table of Contents

Overview
Architectural Decisions
Architecture Overview
Naming Conventions
Config Schema Design
CloudFormation Resource Generation
Service Roles and Policy Assignments
Implementation Plan
v2 Roadmap
Motivating Scenarios — Why Groups-Only Is Insufficient
Why 1:1 Mapping Was Rejected
References

Overview

This document captures the architectural decisions for adding an IAM Roles layer to the Security Provisioner Tool. The existing tool creates IAM Groups and inline policies. This new layer introduces standalone IAM Roles with managed policy attachments and an N:M (many-to-many) mapping between Groups and Roles.

Current state: Groups → inline policies (permissions directly on groups)

Target state: Groups → assume Roles → managed policies (permissions on roles, groups get AssumeRole access)

This separation decouples who (groups/people) from what (permissions/access), enabling flexible, auditable, and scalable security management.

Architectural Decisions

Decision Summary

ID	Decision	Answer	Version
AD-1	Group-to-Role mapping	N:M (many-to-many)	v1
AD-2	Role definition	Standalone objects (not nested under groups)	v1
AD-3	Trust policy	Account root principal (AWS constraint)	v1
AD-4	N:M enforcement	Two-way handshake (group-side AssumeRole + role-side trust)	v1
AD-5	Policy composition	Managed policies attached to roles (not inline)	v1
AD-6	Security model flag	Config-driven toggle: `groups-only` or `roles-based`	v1
AD-7	MFA / session duration	Conditional assumption controls	v2
AD-8	Permission boundaries	Delegated role creation guardrails	v2

AD-1: Group-to-Role Mapping (N:M)

Decision: Many-to-many relationship between IAM Groups and IAM Roles.

Rationale:

A 1:1 mapping (one group → one role) is insufficient for real-world scenarios:

Lack of flexibility — A user in a 1:1 model can only take on one persona. In reality, a data scientist might need to act as an ML engineer temporarily when debugging a pipeline.
Role assumption limitations — Users often need to switch between different roles (e.g., viewing logs vs. changing infrastructure) without logging out.
Scalability challenges — Creating a new group for every unique role/permission combination leads to “group explosion” as organizations grow.
Cross-account complexity — A single, static 1:1 mapping doesn’t handle dynamic cross-account role assumption efficiently.

N:M enables:

A group can assume multiple roles (data-scientists can assume both a standard role and an experiment role)
A role can be assumed by multiple groups (a read-only role can be shared across data-scientists, auditors, and business-consumers)
Group membership changes automatically update access without touching roles

AD-2: Roles as Standalone Objects

Decision: IAM Roles are defined as independent, top-level objects in the configuration — not nested under groups.

Rationale:

Reusability — A single role (e.g., arole-s3-readonly) can be mapped to multiple groups without duplicating the role definition.
Decoupling — Updating a role’s permissions is reflected immediately across all associated groups.
N:M compatibility — Nesting would force either role duplication under every group or confusing ownership logic. Standalone objects use a clean mapping/assignment layer to bridge groups and roles.
Auditability — Standalone roles make it easy to audit exactly who has access to what without digging through nested structures.

Principle: Groups represent “Who” (people/teams). Roles represent “What” (permissions). Keeping them separate allows organizational structure to change without breaking security logic.

AD-3: Trust Policy — Account Root Principal

Decision: All assumable roles trust the AWS account root as the principal in their trust policy.

Rationale:

This is an AWS constraint — you cannot directly reference an IAM Group as a principal in a role’s trust policy. The account root principal acts as the first layer of trust, and the group-side AssumeRole policy acts as the second filter.

AssumeRolePolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Principal:
        AWS: "arn:aws:iam::{account_id}:root"
      Action: "sts:AssumeRole"

Cross-account note: If groups and roles are in different accounts, the Principal must reference the root of the originating account (the account that holds the groups) instead of the role’s own account root.
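The trust policy above is simple enough to build programmatically. The following is a minimal Python sketch, not the tool's actual code: `build_trust_policy` and its `trusted_account_id` parameter are hypothetical names chosen for illustration.

```python
def build_trust_policy(trusted_account_id):
    """Return a trust policy that trusts the root principal of one account.

    Same-account setup: pass the role's own account ID.
    Cross-account setup: pass the ID of the account that holds the groups.
    (Hypothetical helper, for illustration only.)
    """
    valid = (
        trusted_account_id.isascii()
        and trusted_account_id.isdigit()
        and len(trusted_account_id) == 12
    )
    if not valid:
        raise ValueError("AWS account IDs are 12-digit strings")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
```

Passing the account ID as a string preserves leading zeros, which an integer would silently drop.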

AD-4: Two-Way Handshake for N:M Enforcement

Decision: The N:M relationship is enforced through a two-way handshake — identity-based policies on the group side and trust policies on the role side.

Rationale:

Since IAM Groups cannot be referenced as principals in trust policies, both sides must be configured:

Side 1 — Group (identity policy): Each group gets an inline policy allowing sts:AssumeRole on specific role ARNs.

# Group-side: which roles can this group assume?
Type: AWS::IAM::Group
Properties:
  GroupName: "{company_prefix}-{env}-{tenant_id}-group-data-scientists"
  Policies:
    - PolicyName: AllowAssumeRoles
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: "sts:AssumeRole"
            Resource:
              - "arn:aws:iam::{account_id}:role/{company_prefix}-{env}-{tenant_id}-arole-ds-standard"
              - "arn:aws:iam::{account_id}:role/{company_prefix}-{env}-{tenant_id}-arole-ds-experiment"

Side 2 — Role (trust policy): Each role trusts the account root.

# Role-side: who is trusted to assume this role?
Type: AWS::IAM::Role
Properties:
  RoleName: "{company_prefix}-{env}-{tenant_id}-arole-ds-standard"
  AssumeRolePolicyDocument:
    Version: "2012-10-17"
    Statement:
      - Effect: Allow
        Principal:
          AWS: "arn:aws:iam::{account_id}:root"
        Action: "sts:AssumeRole"

How the N:M mapping is managed:

To grant a group access to a role → add the role ARN to the group’s AssumeRole Resource list
To revoke access → remove the role ARN from the group’s policy
The template generator reads the assignments mapping and auto-generates both sides
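The grant and revoke steps above amount to list edits on the assignments mapping. A rough Python sketch follows, assuming the mapping is held in memory as a dict of group name to role names; the function names `grant`, `revoke`, and `groups_by_role` are hypothetical and not taken from the tool.

```python
from collections import defaultdict


def grant(assignments, group, role):
    """Add a role to a group's list (no-op if already present)."""
    roles = assignments.setdefault(group, [])
    if role not in roles:
        roles.append(role)


def revoke(assignments, group, role):
    """Remove a role from a group's list (no-op if absent)."""
    if role in assignments.get(group, []):
        assignments[group].remove(role)


def groups_by_role(assignments):
    """Invert group -> roles into role -> sorted groups, for audit views."""
    index = defaultdict(list)
    for group, roles in assignments.items():
        for role in roles:
            index[role].append(group)
    return {role: sorted(groups) for role, groups in index.items()}


# Example data mirroring the groups and roles used in this document.
assignments = {
    "data-scientists": ["ds-standard", "ds-experiment"],
    "ml-engineers": ["ds-experiment", "ml-deploy", "bedrock-manage"],
}
grant(assignments, "data-scientists", "ml-deploy")
revoke(assignments, "data-scientists", "ml-deploy")
print(groups_by_role(assignments)["ds-experiment"])
# prints ['data-scientists', 'ml-engineers']
```

The inverted index answers the audit question "which groups can assume this role?" without reading every group policy.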

The Hallway Analogy:

Imagine two IAM users. One has a sign on their back that says “DS” (Data Scientists group). The other has “MLEng” (ML Engineers group). Both belong to the same AWS account.

                    ┌─────────────────┐
                    │   DOOR 1        │
                    │   Trust Policy  │
                    │  (Account Root) │
                    └────────┬────────┘
                             │
                    Both DS and MLEng
                    pass through (same account)
                             │
                    ┌────────┴────────┐
                    │   LONG HALL     │
                    └────────┬────────┘
                             │
                 ┌───────────┴───────────┐
                 │                       │
        ┌────────┴────────┐     ┌────────┴────────┐
        │   DOOR 2a       │     │   DOOR 2b       │
        │   DS Group      │     │   MLEng Group   │
        │   AssumeRole    │     │   AssumeRole    │
        │   Policy        │     │   Policy        │
        └────────┬────────┘     └────────┬────────┘
                 │                       │
          The truth is              The truth is
          revealed:                 revealed:
          ├── arole-ds-standard ✅   ├── arole-ds-experiment ✅
          ├── arole-ds-experiment ✅  ├── arole-ml-deploy ✅
          └── arole-ml-deploy ❌     └── arole-bedrock-manage ✅

Door 1 (trust policy) lets everyone from the same account into the hall — this is the account root principal. The sign on your back (your group membership) determines which corridor you walk down. Door 2 (the group’s AssumeRole policy) reveals the truth — exactly which roles you can assume. No more, no less.

AD-5: Managed Policy Composition

Decision: Roles are composed by attaching multiple granular managed policies — not by writing inline permissions.

Rationale:

Three approaches were evaluated:

Approach	Description	Verdict
Policy-to-Role (managed)	Attach modular managed policies to roles	✅ Best practice
Role-to-Group mapping	Groups assume roles via AssumeRole	✅ Used for N:M
Inline policy bundling	Write JSON permissions directly inside roles	❌ Maintenance nightmare

Why managed policies win:

DRY — Update the S3 read-only policy once, and every role using it picks up the change immediately
Auditable — You can see exactly which building blocks make up a role
Reusable — The same policy (e.g., ECR read-only) attaches to multiple roles
Testable — Each policy can be validated independently

Note: This requires converting our current inline policies to standalone managed policies. The NAMING_CONVENTIONS.md already has a roadmap note for this: {company_prefix}-{env}-{tenant_id}-policy-{policy_name}.

AD-6: Security Model Flag

Decision: Support both groups-only and roles-based security models via a config flag (security_model: "groups-only" | "roles-based").

Rationale:

Three options were evaluated:

Option	Approach	Pros	Cons
A	Keep group-level policies AND add Roles layer	Backward compatible, no breakage	Double maintenance, technical debt
B	Migrate all policies from groups to roles	Clean architecture, single model	Breaking change, high risk
C	Feature flag to support both modes	Incremental migration, instant rollback	Adds code complexity

Option C was selected because:

Product feature, not just migration tool — Different clients need different maturity levels:
- Startup clients (3 groups, small team) may prefer groups-only — simple, no AssumeRole workflow to teach
- Enterprise clients (11 groups, compliance) need roles-based for audit trails and session-based access
- Medium clients may start groups-only and migrate to roles-based as they mature
Safety net — Allows migrating groups one by one or rolling back instantly if a workflow breaks
End goal is Option B — The flag provides the clean architecture of Option B with the safety of Option A during transition

Config usage:

security:
  # Set exactly one value; a YAML mapping cannot repeat the same key.
  security_model: "groups-only"      # Policies attached directly to groups (current behavior)
  # security_model: "roles-based"    # Alternative: policies on roles, groups get AssumeRole only

Template generator behavior:

groups-only — Current behavior. Groups get managed_policies and custom_policies directly.
roles-based — Groups get only an AssumeRole inline policy. Assumable roles get the managed policies. Assignments mapping drives the wiring.
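The two behaviors can be pictured as one branch on the flag. The sketch below is an assumption-laden illustration in Python, not the generator itself: `plan_resources` is a hypothetical function, the default of `groups-only` when the flag is absent is assumed, and groups with no assignment are assumed to keep their direct policies.

```python
VALID_MODELS = ("groups-only", "roles-based")


def direct_policies(group):
    """Policies attached straight to a group in the config."""
    return list(group.get("managed_policies", [])) + list(
        group.get("custom_policies", [])
    )


def plan_resources(config):
    """Describe which policies land on which resource for the chosen model."""
    # Assumption: a missing flag falls back to the current behavior.
    model = config.get("security", {}).get("security_model", "groups-only")
    if model not in VALID_MODELS:
        raise ValueError(f"unknown security_model: {model!r}")

    plan = {"groups": {}, "roles": {}}
    assigned = {}
    if model == "roles-based":
        assigned = {a["group"]: a["roles"] for a in config.get("assignments", [])}
        for role in config.get("roles", []):
            plan["roles"][role["name"]] = list(role.get("policies", []))

    for group in config.get("groups", []):
        name = group["name"]
        if assigned.get(name):
            # Groups wired to roles carry only the AssumeRole inline policy.
            plan["groups"][name] = ["AllowAssumeRoles"]
        else:
            # Assumption: unassigned groups keep their direct policies.
            plan["groups"][name] = direct_policies(group)
    return plan
```

Returning a plan rather than emitting resources directly keeps the flag logic testable without touching CloudFormation.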

Architecture Overview

Three-Layer Model

The architecture has three distinct layers:

┌─────────────────────────────────────────────────┐
│  LAYER 1: GROUPS (Who)                          │
│  IAM Groups representing job functions          │
│  e.g., data-scientists, ml-engineers, auditors  │
└──────────────────────┬──────────────────────────┘
                       │ sts:AssumeRole (N:M mapping)
                       ▼
┌─────────────────────────────────────────────────┐
│  LAYER 2: ROLES (What)                          │
│  Assumable roles as permission bundles          │
│  e.g., arole-ds-standard, arole-ml-deploy       │
└──────────────────────┬──────────────────────────┘
                       │ Policy attachments
                       ▼
┌─────────────────────────────────────────────────┐
│  LAYER 3: POLICIES (How)                        │
│  Granular managed policies per service/level    │
│  e.g., s3-read-only, ecr-dev-read-write         │
└─────────────────────────────────────────────────┘

Resource Relationship Diagram

IAM Group (data-scientists)
    ├── can assume → arole-ds-standard
    │                   ├── attached: policy-s3-project-buckets-only
    │                   ├── attached: policy-ecr-read-only
    │                   ├── attached: policy-pipeline-read-only
    │                   ├── attached: policy-sagemaker-dev-invoke
    │                   └── attached: policy-bedrock-invoke-only
    │
    └── can assume → arole-ds-experiment
                        ├── attached: policy-s3-project-buckets-full
                        ├── attached: policy-sagemaker-dev-invoke
                        └── attached: policy-bedrock-invoke-only

IAM Group (ml-engineers)
    ├── can assume → arole-ds-experiment (shared with data-scientists)
    ├── can assume → arole-ml-deploy
    │                   ├── attached: policy-ecr-dev-read-write
    │                   ├── attached: policy-pipeline-project-dev
    │                   └── attached: policy-lambda-deploy-manage
    │
    └── can assume → arole-bedrock-manage
                        ├── attached: policy-bedrock-model-manage
                        └── attached: policy-bedrock-observability

IAM Group (platform-administrators)
    └── can assume → arole-platform-full
                        ├── attached: policy-s3-full
                        ├── attached: policy-ecr-full
                        ├── attached: policy-pipeline-full
                        ├── attached: policy-sagemaker-full
                        ├── attached: policy-lambda-full
                        └── attached: policy-bedrock-full

Real-World Example

Scenario: Edge AI, medium tier (9 groups)

Sarah (data scientist) is in the data-scientists group. Her day:

Morning — Assumes arole-ds-standard to read training data from S3 and invoke Bedrock for embeddings
Afternoon — Assumes arole-ds-experiment to write experiment results to S3 and invoke dev SageMaker endpoints
End of day — Session expires, no standing permissions

James (ML engineer) is in the ml-engineers group. His day:

Morning — Assumes arole-ml-deploy to push Docker images and update a pipeline
Afternoon — Assumes arole-bedrock-manage to configure a new guardrail for the production chatbot
Debugging — Assumes arole-ds-experiment (shared role) to reproduce a data scientist’s issue

Naming Conventions

Assumable Roles

Pattern: {company_prefix}-{env}-{tenant_id}-arole-{role_name}

Examples:

edge-prod-b001-arole-ds-standard
edge-prod-b001-arole-ml-deploy
edge-prod-b001-arole-platform-full
globalbank-prod-c001-arole-bedrock-manage

Rationale: The arole- prefix distinguishes assumable roles from service roles (role-). This distinction matters because:

Service roles (role-sagemaker-execution) are assumed by AWS services to run jobs
Assumable roles (e.g., arole-ds-standard) are assumed by human users, through their group membership, to get permissions
Different trust relationships, different audit trails, different lifecycle management

Managed Policies

Pattern: {company_prefix}-{env}-{tenant_id}-policy-{service}-{level}

Examples:

edge-prod-b001-policy-s3-read-only
edge-prod-b001-policy-ecr-dev-read-write
edge-prod-b001-policy-pipeline-project-dev
edge-prod-b001-policy-sagemaker-dev-invoke
edge-prod-b001-policy-lambda-deploy-manage
edge-prod-b001-policy-bedrock-invoke-only
edge-prod-b001-policy-bedrock-model-manage
edge-prod-b001-policy-bedrock-full
edge-prod-b001-policy-kms-level1-read-only
edge-prod-b001-policy-trusted-advisor-level1-read-only

Test Resources

Pattern: {base_name}-test-{6_digit_random}

Examples:

edge-prod-b001-arole-ds-standard-test-a3f9c2
edge-prod-b001-policy-s3-read-only-test-a3f9c2
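The three naming patterns above can be expressed as small formatting helpers. This is a sketch under assumptions: the function names are hypothetical, and the random suffix is produced with `secrets.token_hex`, which may differ from how the tool generates it. The 64-character check reflects the IAM limit on role names.

```python
import secrets

MAX_ROLE_NAME_LENGTH = 64  # IAM limit on role name length


def arole_name(company_prefix, env, tenant_id, role_name):
    """Build {company_prefix}-{env}-{tenant_id}-arole-{role_name}."""
    name = f"{company_prefix}-{env}-{tenant_id}-arole-{role_name}"
    if len(name) > MAX_ROLE_NAME_LENGTH:
        raise ValueError(f"role name exceeds {MAX_ROLE_NAME_LENGTH} chars: {name}")
    return name


def policy_name(company_prefix, env, tenant_id, service, level):
    """Build {company_prefix}-{env}-{tenant_id}-policy-{service}-{level}."""
    return f"{company_prefix}-{env}-{tenant_id}-policy-{service}-{level}"


def with_test_suffix(base_name, suffix=None):
    """Append -test-{suffix}; by default the suffix is six random hex characters."""
    if suffix is None:
        suffix = secrets.token_hex(3)  # 3 random bytes -> 6 hex characters
    return f"{base_name}-test-{suffix}"


print(arole_name("edge", "prod", "b001", "ds-standard"))
# prints edge-prod-b001-arole-ds-standard
```

Accepting an explicit suffix lets a role and its policies share one test suffix, as in the examples above.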

Config Schema Design

Roles Section

Roles are standalone objects. Each role defines a name and a list of managed policy references.

roles:
  - name: "ds-standard"
    description: "Standard data science access — read data, invoke models"
    policies:
      - s3-project-buckets-only
      - ecr-read-only
      - pipeline-read-only
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ds-experiment"
    description: "Experiment access — write results, invoke dev endpoints"
    policies:
      - s3-project-buckets-full
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ml-deploy"
    description: "ML deployment access — push images, manage pipelines, deploy functions"
    policies:
      - ecr-dev-read-write
      - pipeline-project-dev
      - lambda-deploy-manage

  - name: "bedrock-manage"
    description: "Bedrock management — guardrails, model access, imports"
    policies:
      - bedrock-model-manage

  - name: "platform-full"
    description: "Full platform administration"
    policies:
      - s3-full
      - ecr-full
      - pipeline-full
      - sagemaker-full
      - lambda-full
      - bedrock-full

Assignments Section

The N:M mapping between groups and roles.

assignments:
  - group: "data-scientists"
    roles:
      - "ds-standard"
      - "ds-experiment"

  - group: "ml-engineers"
    roles:
      - "ds-experiment"
      - "ml-deploy"
      - "bedrock-manage"

  - group: "platform-administrators"
    roles:
      - "platform-full"

  - group: "operations-support"
    roles:
      - "ds-standard"
      - "ml-deploy"

# Groups using direct policy assignments (no assumable roles):
#   - business-consumers: Non-technical users (dashboards, reports). Minimal invoke-only
#     permissions (sagemaker level1-prod, bedrock level1). Adding sts:AssumeRole friction
#     provides no security benefit for users who only call inference endpoints.
#   - external-contractors: Minimal read-only access (s3 level1, sagemaker level1, bedrock level1).
#     Role assumption deferred to v2 (TODO #8) when MFA + IP + session constraints are
#     implemented — adding a role without those controls adds plumbing with no security value.
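Because assignments refer to groups and roles by name, a typo produces a dangling reference. The following Python sketch shows one way such a check could look, assuming the config has already been parsed into a dict with `groups`, `roles`, and `assignments` lists; `find_dangling_references` is a hypothetical name, not the tool's validator.

```python
def find_dangling_references(config):
    """Return human-readable errors for assignments that point at nothing."""
    group_names = {g["name"] for g in config.get("groups", [])}
    role_names = {r["name"] for r in config.get("roles", [])}

    errors = []
    for assignment in config.get("assignments", []):
        group = assignment["group"]
        if group not in group_names:
            errors.append(f"assignment references unknown group '{group}'")
        for role in assignment.get("roles", []):
            if role not in role_names:
                errors.append(f"group '{group}' references unknown role '{role}'")
    return errors


config = {
    "groups": [{"name": "data-scientists"}],
    "roles": [{"name": "ds-standard"}],
    "assignments": [
        {"group": "data-scientists", "roles": ["ds-standard", "ds-experimnt"]},
    ],
}
for error in find_dangling_references(config):
    print(error)
# prints: group 'data-scientists' references unknown role 'ds-experimnt'
```

Collecting every error before failing lets a config author fix all typos in one pass.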

Complete Config Example

client: "edge"
environment: "prod"
tenant_id: "b001"
region: "us-west-2"
tier: "medium-9"

groups:
  - name: "data-scientists"
    description: "Data science team"
  - name: "ml-engineers"
    description: "ML engineering team"
  - name: "platform-administrators"
    description: "Platform admin team"
  - name: "business-consumers"
    description: "Business stakeholders"
  - name: "operations-support"
    description: "Operations team"

roles:
  - name: "ds-standard"
    description: "Standard data science access"
    policies:
      - s3-project-buckets-only
      - ecr-read-only
      - pipeline-read-only
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ds-experiment"
    description: "Experiment access with write permissions"
    policies:
      - s3-project-buckets-full
      - sagemaker-dev-invoke
      - bedrock-invoke-only

  - name: "ml-deploy"
    description: "ML deployment and CI/CD access"
    policies:
      - ecr-dev-read-write
      - pipeline-project-dev
      - lambda-deploy-manage

  - name: "bedrock-manage"
    description: "Bedrock model and guardrail management"
    policies:
      - bedrock-model-manage

  - name: "platform-full"
    description: "Full platform administration"
    policies:
      - s3-full
      - ecr-full
      - pipeline-full
      - sagemaker-full
      - lambda-full
      - bedrock-full

assignments:
  - group: "data-scientists"
    roles:
      - "ds-standard"
      - "ds-experiment"

  - group: "ml-engineers"
    roles:
      - "ds-experiment"
      - "ml-deploy"
      - "bedrock-manage"

  - group: "platform-administrators"
    roles:
      - "platform-full"

  - group: "operations-support"
    roles:
      - "ds-standard"
      - "ml-deploy"

# Groups using direct policy assignments (no assumable roles):
#   - business-consumers: Non-technical users (dashboards, reports). Minimal invoke-only
#     permissions (sagemaker level1-prod, bedrock level1). Adding sts:AssumeRole friction
#     provides no security benefit for users who only call inference endpoints.
#   - external-contractors: Minimal read-only access (s3 level1, sagemaker level1, bedrock level1).
#     Role assumption deferred to v2 (TODO #8) when MFA + IP + session constraints are
#     implemented — adding a role without those controls adds plumbing with no security value.

CloudFormation Resource Generation

The template generator reads the config and produces these CloudFormation resources:

Role Resource

For each role in the roles section:

# Generated for: arole-ds-standard
EdgeProdB001AroleDsStandard:
  Type: AWS::IAM::Role
  Properties:
    RoleName: "edge-prod-b001-arole-ds-standard"
    Description: "Standard data science access — read data, invoke models"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
          Action: "sts:AssumeRole"
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-s3-project-buckets-only"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-ecr-read-only"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-pipeline-read-only"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-sagemaker-dev-invoke"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-bedrock-invoke-only"
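A role entry from the config can be turned into the resource above with a few lines of dictionary building. The Python sketch below is illustrative only: `logical_id` and `role_resource` are hypothetical names, and the logical ID rule (capitalize each hyphen-separated part) is inferred from the example rather than confirmed by the tool. The `Fn::Sub` key is the long form of `!Sub`.

```python
def logical_id(resource_name):
    """'edge-prod-b001-arole-ds-standard' -> 'EdgeProdB001AroleDsStandard'."""
    return "".join(part.capitalize() for part in resource_name.split("-"))


def role_resource(company_prefix, env, tenant_id, role):
    """Build a CloudFormation AWS::IAM::Role entry for one config role."""
    base = f"{company_prefix}-{env}-{tenant_id}"
    name = f"{base}-arole-{role['name']}"
    properties = {
        "RoleName": name,
        "Description": role["description"],
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "AWS": {"Fn::Sub": "arn:aws:iam::${AWS::AccountId}:root"}
                    },
                    "Action": "sts:AssumeRole",
                }
            ],
        },
        # Doubled braces emit a literal ${AWS::AccountId} for CloudFormation.
        "ManagedPolicyArns": [
            {"Fn::Sub": f"arn:aws:iam::${{AWS::AccountId}}:policy/{base}-policy-{p}"}
            for p in role["policies"]
        ],
    }
    return {logical_id(name): {"Type": "AWS::IAM::Role", "Properties": properties}}
```

Called with `("edge", "prod", "b001", role)` for the `ds-standard` role, this yields a single entry keyed `EdgeProdB001AroleDsStandard`.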

Group AssumeRole Policy

For each group in the assignments section, an inline policy is added:

# Generated for: group-data-scientists
EdgeProdB001GroupDataScientists:
  Type: AWS::IAM::Group
  Properties:
    GroupName: "edge-prod-b001-group-data-scientists"
    Policies:
      - PolicyName: "AllowAssumeRoles"
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: "sts:AssumeRole"
              Resource:
                - !Sub "arn:aws:iam::${AWS::AccountId}:role/edge-prod-b001-arole-ds-standard"
                - !Sub "arn:aws:iam::${AWS::AccountId}:role/edge-prod-b001-arole-ds-experiment"
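The group side follows the same pattern: each entry in the assignments section becomes one inline policy listing the assumable role ARNs. A minimal Python sketch, with the hypothetical helper name `group_resource` and the assumption that role ARNs follow the arole naming pattern:

```python
def group_resource(company_prefix, env, tenant_id, assignment):
    """Build an AWS::IAM::Group entry whose only policy allows sts:AssumeRole."""
    base = f"{company_prefix}-{env}-{tenant_id}"
    # Doubled braces emit a literal ${AWS::AccountId} for CloudFormation.
    role_arns = [
        {"Fn::Sub": f"arn:aws:iam::${{AWS::AccountId}}:role/{base}-arole-{role}"}
        for role in assignment["roles"]
    ]
    return {
        "Type": "AWS::IAM::Group",
        "Properties": {
            "GroupName": f"{base}-group-{assignment['group']}",
            "Policies": [
                {
                    "PolicyName": "AllowAssumeRoles",
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Effect": "Allow",
                                "Action": "sts:AssumeRole",
                                "Resource": role_arns,
                            }
                        ],
                    },
                }
            ],
        },
    }
```

Keeping the role order from the config makes the generated template diff cleanly when an assignment is added or removed.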

Service Roles and Policy Assignments

Service roles are IAM Roles assumed by AWS services and CI/CD platforms — not by human users. Unlike assumable roles (Layer 2), which humans assume explicitly via sts:AssumeRole, service roles are assumed automatically by the machine principals named in their trust policies.

Why service_account Was Removed from iam_groups

The original enterprise config had a service_account IAM group with PowerUserAccess + inline iam:PassRole + cloudformation:* on *. This was an anti-pattern for three reasons:

IAM Groups are for humans — Groups are collections of IAM Users. Service accounts/machine identities should be IAM Roles assumed by the CI/CD platform, not IAM Users with long-lived access keys.
PowerUserAccess is a sledgehammer — It grants access to nearly every AWS service, excluding only IAM, Organizations, and account management. A CI/CD pipeline only needs access to the specific services it deploys to.
iam:PassRole on * is privilege escalation — Unscoped PassRole allows passing any role, including highly privileged ones, to any service, which can effectively grant admin access.

Decision: Remove service_account from iam_groups and replace with ci_cd_deployment_role under service_roles, using the policy_assignments system for scoped permissions.

Service Roles Use Policy Assignments

Service roles can reference the same policy level system as IAM groups. This ensures consistency — a sagemaker: level4-ci assignment on a service role uses the exact same policy definition as it would on a group.

service_roles:
  ci_cd_deployment:
    description: "CI/CD deployment role — build, test, deploy across the ML platform"
    trust_policy:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: codepipeline.amazonaws.com
          Action: sts:AssumeRole
    policy_assignments:
      s3: level2              # project-buckets-only
      ecr: level3             # ci-read-write
      pipeline: level3        # project-ci
      sagemaker: level4-ci    # deploy-only — no delete, no traffic shifting
      lambda: level2          # deploy-manage

The key difference from group policy_assignments:

Groups get policies attached directly (inline or customer-managed)
Service roles get the same policies attached to the role itself
Trust policy controls who can assume the role (AWS service, OIDC provider, or another account)
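Resolving `policy_assignments` into concrete policies is essentially a table lookup. The sketch below is hypothetical: `LEVEL_NAMES` contains only the service and level pairs shown in the config comments above, and both the `resolve_policy_names` helper and the assumption that policy names follow `{base}-policy-{service}-{level_name}` are illustrative rather than taken from the tool.

```python
# Only the (service, level) pairs named in this document; the real
# definitions live in the Policy Guide.
LEVEL_NAMES = {
    ("s3", "level2"): "project-buckets-only",
    ("ecr", "level3"): "ci-read-write",
    ("pipeline", "level3"): "project-ci",
    ("sagemaker", "level4-ci"): "deploy-only",
    ("lambda", "level2"): "deploy-manage",
}


def resolve_policy_names(base, policy_assignments):
    """Map {service: level} onto full policy names, failing on unknown pairs."""
    names = []
    for service, level in policy_assignments.items():
        key = (service, level)
        if key not in LEVEL_NAMES:
            raise ValueError(f"no policy defined for {service}: {level}")
        # Assumed naming: {base}-policy-{service}-{level_name}
        names.append(f"{base}-policy-{service}-{LEVEL_NAMES[key]}")
    return names


print(resolve_policy_names("edge-prod-b001", {"s3": "level2", "lambda": "level2"}))
# prints ['edge-prod-b001-policy-s3-project-buckets-only', 'edge-prod-b001-policy-lambda-deploy-manage']
```

Because groups and service roles would share one lookup table, a level means the same thing wherever it is assigned.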

ci_cd_deployment_role Design

The CI/CD deployment role replaces the old service_account group. It covers the full deployment lifecycle:

Service	Level	CI/CD Stage	What It Does
S3	level2 (project-buckets-only)	Build + Deploy	Read/write model artifacts, deployment packages
ECR	level3 (ci-read-write)	Build	Build and push container images to registry
Pipeline	level3 (project-ci)	Orchestration	Create, execute, manage ML pipelines
SageMaker	level4-ci (deploy-only)	Deploy	Create endpoints, register models, configure autoscaling. No delete, no traffic shifting
Lambda	level2 (deploy-manage)	Deploy	Deploy and update inference functions

What’s NOT included:

No Bedrock — model access and guardrail management is a human decision
No iam:PassRole on * — PassRole is scoped within SageMaker level4-ci to {company_prefix}-{env}-*-role-* conditioned to sagemaker.amazonaws.com
No cloudformation:* — CloudFormation access is outside the 6-service scope and should be handled separately if needed for IaC deployments
No AWS managed policies (such as PowerUserAccess) — the level system covers everything

SageMaker level4-ci guardrails:

Explicit deny on DeleteEndpoint, DeleteEndpointConfig, DeleteModel, DeleteModelPackage, DeleteModelPackageGroup
Explicit deny on UpdateEndpointWeightsAndCapacities (traffic shifting)
Explicit deny on DeleteDomain, DeleteUserProfile
Pipelines deploy forward — teardown requires separate authorization

Trust Policy Patterns

The trust policy depends on the CI/CD platform. Common patterns:

AWS CodePipeline / CodeBuild:

trust_policy:
  Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Principal:
        Service:
          - codepipeline.amazonaws.com
          - codebuild.amazonaws.com
      Action: sts:AssumeRole

GitHub Actions (OIDC):

trust_policy:
  Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Principal:
        Federated: arn:aws:iam::{account_id}:oidc-provider/token.actions.githubusercontent.com
      Action: sts:AssumeRoleWithWebIdentity
      Condition:
        StringEquals:
          token.actions.githubusercontent.com:aud: sts.amazonaws.com
        StringLike:
          token.actions.githubusercontent.com:sub: repo:{org}/{repo}:*

GitLab CI (OIDC):

trust_policy:
  Version: '2012-10-17'
  Statement:
    - Effect: Allow
      Principal:
        Federated: arn:aws:iam::{account_id}:oidc-provider/gitlab.com
      Action: sts:AssumeRoleWithWebIdentity
      Condition:
        StringEquals:
          gitlab.com:aud: https://gitlab.com
        StringLike:
          gitlab.com:sub: project_path:{group}/{project}:*

The trust policy is client-specific and configured per deployment. The policy_assignments remain the same regardless of which CI/CD platform assumes the role.
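Since only the trust policy varies by platform, a generator could select it with a small dispatch. The Python sketch below mirrors the three patterns above; the function names and the `platform` keys (`codepipeline`, `github`, `gitlab`) are hypothetical and not part of the tool's config schema.

```python
def service_trust(services):
    """Trust policy for AWS service principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": list(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def oidc_trust(account_id, provider, audience, subject_pattern):
    """Trust policy for a federated OIDC provider registered in the account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {f"{provider}:aud": audience},
                    "StringLike": {f"{provider}:sub": subject_pattern},
                },
            }
        ],
    }


def ci_trust_policy(platform, account_id, **params):
    """Pick the trust policy pattern for a CI/CD platform (hypothetical keys)."""
    if platform == "codepipeline":
        return service_trust(["codepipeline.amazonaws.com", "codebuild.amazonaws.com"])
    if platform == "github":
        return oidc_trust(
            account_id, "token.actions.githubusercontent.com", "sts.amazonaws.com",
            f"repo:{params['org']}/{params['repo']}:*",
        )
    if platform == "gitlab":
        return oidc_trust(
            account_id, "gitlab.com", "https://gitlab.com",
            f"project_path:{params['group']}/{params['project']}:*",
        )
    raise ValueError(f"unsupported CI/CD platform: {platform!r}")
```

For example, `ci_trust_policy("github", "123456789012", org="acme", repo="ml")` restricts the role to workflows from that one repository.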

Implementation Plan

Phase 1: Architecture and Config Design

Design architectural decisions ✅
Design config schema with roles and assignments sections ✅
Update NAMING_CONVENTIONS.md with arole- and policy- patterns ✅
Document architecture in POLICY_GUIDE.md ✅

Phase 2: Config and Validation

Update validation schemas (startup, medium, enterprise) with roles and assignments ✅
Update client configs with role definitions and assignments ✅
Update config validation code to validate N:M mappings (no dangling references) ✅

Phase 3: Template Generator

Extend template generator to emit IAM Role resources with trust policies ✅
Emit managed policy resources from POLICY_GUIDE.md definitions ✅
Emit AssumeRole inline policies on groups from assignments mapping ✅
Convert existing inline policies to standalone managed policies ✅

Phase 4: Deployment and Testing

Test with --template-only (validate generated CloudFormation) ✅
Test with --test-deploy (real AWS, unique names) ✅
Add troubleshooting scenarios for AssumeRole denied errors ✅
Update POLICY_GUIDE.md Assignment Recommendations ✅

Phase 5: Future Work

Config-driven Bedrock model scoping — see Roadmap
Lambda VPC governance — see Roadmap

v2 Roadmap

AD-7: MFA and Session Duration Controls

What: Support time-bound and conditional role assumption — requiring MFA for sensitive roles and setting session duration limits.

Why:

Reduced attack window — Limiting session duration (1–12 hours) restricts the usefulness of stolen session tokens
Session hijacking mitigation — Prevents persistent sessions from bypassing MFA
Contextual security — Require MFA only for high-risk roles (e.g., platform-full) without constant user friction

Implementation approach:

Add Condition block to trust policies requiring aws:MultiFactorAuthPresent
Add max_session_duration to role config (default 1 hour, configurable up to 12 hours)
Add mfa_required: true/false to assignments config per group-role pair
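The first two items above translate into two changes on the generated role: a `Condition` on each trust statement and a `MaxSessionDuration` property (expressed in seconds on AWS::IAM::Role). A minimal Python sketch of that transformation follows; `harden_role` and its parameters are hypothetical, and since v2 is not yet designed in detail this is only one possible shape.

```python
import copy


def harden_role(properties, mfa_required, max_session_hours=1):
    """Return a copy of IAM Role properties with session and MFA controls."""
    if not 1 <= max_session_hours <= 12:
        raise ValueError("IAM allows role sessions of 1 to 12 hours")

    hardened = copy.deepcopy(properties)  # leave the caller's dict untouched
    hardened["MaxSessionDuration"] = max_session_hours * 3600

    if mfa_required:
        for statement in hardened["AssumeRolePolicyDocument"]["Statement"]:
            bool_block = statement.setdefault("Condition", {}).setdefault("Bool", {})
            # Only principals that authenticated with MFA may assume the role.
            bool_block["aws:MultiFactorAuthPresent"] = "true"
    return hardened
```

For a high-risk role such as platform-full, `harden_role(props, mfa_required=True)` would keep the one-hour default; a standard role might pass `mfa_required=False, max_session_hours=8`.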

Best practices to follow:

Role-based durations: critical roles use shorter sessions (1 hour), standard roles use longer (4–8 hours)
Avoid MFA fatigue: don’t force MFA for low-risk, everyday roles
Use idle timeouts (15–60 minutes) alongside absolute session limits

AD-8: Permission Boundaries

What: Allow client team leads to create roles within guardrails without risking privilege escalation.

Implementation approach:

Define permission boundary policies per tier
Attach boundaries to delegated admin roles
Ensure no role can exceed its boundary regardless of attached policies

Motivating Scenarios — Why Groups-Only Is Insufficient

The following real-world scenarios demonstrate why an IAM Groups-only architecture (without assumable roles) would fail to meet enterprise client needs. These are the problems the Roles layer solves.

Scenario 1: Temporary Cross-Team Access

Situation: A data scientist needs to debug a failing ML pipeline. The pipeline logs and configuration are only accessible to the ml-engineers group.

Groups-only problem: The admin must add the data scientist to the ml-engineers group, giving them all ML engineer permissions (ECR write, pipeline management, Lambda deploy). After debugging, the admin must remember to remove them. This is error-prone, over-privileged, and leaves no audit trail of the temporary access.

Roles solution: The data scientist assumes arole-pipeline-viewer (read-only pipeline access) for a single session. No group membership changes, no over-provisioning, session expires automatically.

Scenario 2: Environment-Specific Access Control

Situation: A backend developer needs to invoke production SageMaker endpoints for the live application but should never touch dev/staging endpoints (to avoid accidentally routing production traffic to unstable models).

Groups-only problem: Group policies are static bundles. You’d need separate groups for “backend-dev-invoke” and “backend-prod-invoke”, leading to group explosion. With 5 environments × 10 job functions, you’d need 50 groups.

Roles solution: Define arole-prod-invoke and arole-dev-invoke as separate roles. The backend-developers group is assigned only arole-prod-invoke. Clean, no group explosion.

Scenario 3: Least-Privilege for Automation

Situation: A CI/CD pipeline needs different permissions at different stages — read ECR during build, write ECR during push, invoke SageMaker during integration tests, deploy Lambda during release.

Groups-only problem: The pipeline’s IAM entity must be in a single group with all permissions combined. It has write access during the build stage when it only needs read access. No stage-level isolation.

Roles solution: The pipeline assumes different roles at each stage: arole-ecr-read during build, arole-ecr-write during push, arole-sagemaker-invoke during tests, arole-lambda-deploy during release. Each stage has exactly the permissions it needs.

Scenario 4: Compliance and Audit Requirements

Situation: An enterprise client (e.g., a bank) requires that every privileged action is traceable to a specific permission grant, with clear evidence of when access was assumed and when it expired.

Groups-only problem: Group membership is a persistent state. CloudTrail shows the user’s identity but not which specific permission bundle they were using. There’s no session boundary — the user has all group permissions all the time.

Roles solution: Every sts:AssumeRole call is logged in CloudTrail with the role ARN, session name, and timestamp. Auditors can see exactly which role was assumed, when, and for how long. Session expiry provides natural access boundaries.

Scenario 5: FinOps Cost Control

Situation: Only two people on the platform team should be able to create Bedrock provisioned throughput (which can cost thousands per month). Other platform admins should have full access to everything else.

Groups-only problem: All platform-administrators share the same group policy. Either everyone can provision throughput, or no one can. You can’t differentiate within a group.

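Roles solution: Define arole-platform-full (everything except throughput) and arole-bedrock-throughput (adds provisioned throughput). Grant sts:AssumeRole on arole-bedrock-throughput only to the two approved FinOps engineers. Because assignments are made per group, this means a small additional group containing just those two users; everyone stays in platform-administrators, and only the role access differs.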

Scenario 6: Onboarding and Offboarding

Situation: A new ML engineer joins the team. They should start with read-only access for the first week, then graduate to full ML engineer access.

Groups-only problem: You either put them in the ml-engineers group immediately (over-privileged on day one) or create a temporary “ml-engineers-readonly” group (group explosion, manual cleanup).

Roles solution: Add them to the ml-engineers group on day one. The group has access to both arole-ml-readonly and arole-ml-deploy. During onboarding, they only use arole-ml-readonly. After the first week, they start assuming arole-ml-deploy. No group changes needed — the access control is in which role they choose to assume.

Why 1:1 Mapping Was Rejected

For reference, a 1:1 mapping (one group → one role) was initially considered and rejected:

Concern	1:1 Limitation	N:M Solution
Flexibility	User locked to one persona	User assumes different roles as needed
Role switching	Requires group membership changes	Switch roles via AssumeRole
Scalability	Group explosion (one group per permission combo)	Shared roles reduce total count
Cross-account	Static mapping, no dynamic assumption	Role ARNs can reference other accounts
Temporary access	Requires adding/removing from groups	Assume a role temporarily, session expires

When 1:1 is acceptable:

Very small setups (2–3 users, personal accounts)
Strictly defined personas (e.g., a CI/CD pipeline with one fixed function)

References

Policy Guide — Source of truth for all policy definitions and levels
Naming Conventions — Naming patterns for all resources
AWS IAM Roles Documentation
AWS IAM Trust Policies
AWS STS AssumeRole

Document Version: 1.0
Last Updated: 2025
Maintained By: MLOps Platform Team