Roles Architecture
Overview
This document captures the architectural decisions for adding an IAM Roles layer to the Security Provisioner Tool. The existing tool creates IAM Groups and inline policies. This new layer introduces standalone IAM Roles with managed policy attachments and an N:M (many-to-many) mapping between Groups and Roles.
Current state: Groups → inline policies (permissions directly on groups)
Target state: Groups → assume Roles → managed policies (permissions on roles, groups get AssumeRole access)
This separation decouples who (groups/people) from what (permissions/access), enabling flexible, auditable, and scalable security management.
Architectural Decisions
Decision Summary
| ID | Decision | Answer | Version |
|---|---|---|---|
| AD-1 | Group-to-Role mapping | N:M (many-to-many) | v1 |
| AD-2 | Role definition | Standalone objects (not nested under groups) | v1 |
| AD-3 | Trust policy | Account root principal (AWS constraint) | v1 |
| AD-4 | N:M enforcement | Two-way handshake (group-side AssumeRole + role-side trust) | v1 |
| AD-5 | Policy composition | Managed policies attached to roles (not inline) | v1 |
| AD-6 | Security model flag | Config-driven toggle: security_model | v1 |
| AD-7 | MFA / session duration | Conditional assumption controls | v2 |
| AD-8 | Permission boundaries | Delegated role creation guardrails | v2 |
AD-1: Group-to-Role Mapping (N:M)
Decision: Many-to-many relationship between IAM Groups and IAM Roles.
Rationale:
A 1:1 mapping (one group → one role) is insufficient for real-world scenarios:
Lack of flexibility: A user in a 1:1 model can only take on one persona. In reality, a data scientist might need to act as an ML engineer temporarily when debugging a pipeline.
Role assumption limitations: Users often need to switch between different roles (e.g., viewing logs vs. changing infrastructure) without logging out.
Scalability challenges: Creating a new group for every unique role/permission combination leads to "group explosion" as organizations grow.
Cross-account complexity: A single, static 1:1 mapping doesn't handle dynamic cross-account role assumption efficiently.
N:M enables:
A group can assume multiple roles (data-scientists can assume both a standard role and an experiment role)
A role can be assumed by multiple groups (a read-only role can be shared across data-scientists, auditors, and business-consumers)
Group membership changes automatically update access without touching roles
AD-2: Roles as Standalone Objects
Decision: IAM Roles are defined as independent, top-level objects in the configuration, not nested under groups.
Rationale:
Reusability: A single role (e.g., arole-s3-readonly) can be mapped to multiple groups without duplicating the role definition.
Decoupling: Updating a role's permissions takes effect for all associated groups immediately.
N:M compatibility: Nesting would force either role duplication under every group or confusing ownership logic. Standalone objects use a clean mapping/assignment layer to bridge groups and roles.
Auditability: Standalone roles make it easy to audit exactly who has access to what without digging through nested structures.
Principle: Groups represent "Who" (people/teams). Roles represent "What" (permissions). Keeping them separate allows organizational structure to change without breaking security logic.
AD-3: Trust Policy - Account Root Principal
Decision: All assumable roles trust the AWS account root as the principal in their trust policy.
Rationale:
This is an AWS constraint: you cannot directly reference an IAM Group as a principal in a role's trust policy. The account root principal acts as the first layer of trust, and the group-side AssumeRole policy acts as the second filter.
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: "arn:aws:iam::{account_id}:root"
          Action: "sts:AssumeRole"
Cross-account note: If groups and roles live in different accounts, the Principal must reference the root of the originating account (the account that owns the groups and their users), not the role's own account root.
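The trust policy above can be produced programmatically. The following is a minimal Python sketch, assuming the generator works on plain dictionaries; the function name `build_trust_policy` and its parameter are hypothetical and not taken from the tool. Passing the originating account ID covers the cross-account case described in the note.

```python
import json
import re


def build_trust_policy(trusted_account_id: str) -> dict:
    """Return a trust policy that trusts the root principal of one account.

    Hypothetical helper: for same-account setups pass the role's own
    account ID, for cross-account setups pass the originating account ID.
    """
    # AWS account IDs are 12-digit strings (leading zeros are significant).
    if not re.fullmatch(r"[0-9]{12}", trusted_account_id):
        raise ValueError("account ID must be a 12-digit string")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_trust_policy("123456789012"), indent=2))
```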
AD-4: Two-Way Handshake for N:M Enforcement
Decision: The N:M relationship is enforced through a two-way handshake: identity-based policies on the group side and trust policies on the role side.
Rationale:
Since IAM Groups cannot be referenced as principals in trust policies, both sides must be configured:
Side 1, Group (identity policy): Each group gets an inline policy allowing sts:AssumeRole on specific role ARNs.
    # Group-side: which roles can this group assume?
    Type: AWS::IAM::Group
    Properties:
      GroupName: "{company_prefix}-{env}-{tenant_id}-group-data-scientists"
      Policies:
        - PolicyName: AllowAssumeRoles
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: "sts:AssumeRole"
                Resource:
                  - "arn:aws:iam::{account_id}:role/{company_prefix}-{env}-{tenant_id}-arole-ds-standard"
                  - "arn:aws:iam::{account_id}:role/{company_prefix}-{env}-{tenant_id}-arole-ds-experiment"
Side 2, Role (trust policy): Each role trusts the account root.
    # Role-side: who is trusted to assume this role?
    Type: AWS::IAM::Role
    Properties:
      RoleName: "{company_prefix}-{env}-{tenant_id}-arole-ds-standard"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: "arn:aws:iam::{account_id}:root"
            Action: "sts:AssumeRole"
How the N:M mapping is managed:
To grant a group access to a role: add the role ARN to the group's AssumeRole Resource list
To revoke access: remove the role ARN from the group's policy
The template generator reads the assignments mapping and auto-generates both sides
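As an illustration of how the group side could be derived from one assignments entry, here is a small Python sketch. The helper names (`role_arn`, `group_assume_role_policy`) and the dictionary shapes are assumptions for this example and are not the tool's actual API; only the ARN pattern and policy structure come from the snippets above.

```python
def role_arn(account_id: str, prefix: str, env: str, tenant_id: str, role: str) -> str:
    """Build the ARN of an assumable role following the arole- naming pattern."""
    return f"arn:aws:iam::{account_id}:role/{prefix}-{env}-{tenant_id}-arole-{role}"


def group_assume_role_policy(assignment: dict, *, account_id: str, prefix: str,
                             env: str, tenant_id: str) -> dict:
    """Return the inline AllowAssumeRoles policy for one assignments entry.

    `assignment` is assumed to look like {"group": "...", "roles": [...]}.
    """
    resources = [
        role_arn(account_id, prefix, env, tenant_id, role)
        for role in assignment["roles"]
    ]
    return {
        "PolicyName": "AllowAssumeRoles",
        "PolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Effect": "Allow", "Action": "sts:AssumeRole", "Resource": resources}
            ],
        },
    }


if __name__ == "__main__":
    entry = {"group": "data-scientists", "roles": ["ds-standard", "ds-experiment"]}
    policy = group_assume_role_policy(
        entry, account_id="123456789012", prefix="edge", env="prod", tenant_id="b001"
    )
    for arn in policy["PolicyDocument"]["Statement"][0]["Resource"]:
        print(arn)
```

Granting or revoking access then amounts to editing the `roles` list of the entry and regenerating the policy.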
The Hallway Analogy:
Imagine two IAM users. One has a sign on their back that says "DS" (Data Scientists group). The other has "MLEng" (ML Engineers group). Both belong to the same AWS account.
                 ┌─────────────────┐
                 │     DOOR 1      │
                 │  Trust Policy   │
                 │ (Account Root)  │
                 └────────┬────────┘
                          │
                  Both DS and MLEng
             pass through (same account)
                          │
                 ┌────────┴────────┐
                 │    LONG HALL    │
                 └────────┬────────┘
                          │
             ┌────────────┴────────────┐
             │                         │
    ┌────────┴────────┐       ┌────────┴────────┐
    │     DOOR 2a     │       │     DOOR 2b     │
    │    DS Group     │       │   MLEng Group   │
    │   AssumeRole    │       │   AssumeRole    │
    │     Policy      │       │     Policy      │
    └────────┬────────┘       └────────┬────────┘
             │                         │
       The truth is              The truth is
         revealed:                 revealed:
    ├── arole-ds-standard ✅       ├── arole-ds-experiment ✅
    ├── arole-ds-experiment ✅     ├── arole-ml-deploy ✅
    └── arole-ml-deploy ❌         └── arole-bedrock-manage ✅
Door 1 (trust policy) lets everyone from the same account into the hall; this is the account root principal. The sign on your back (your group membership) determines which corridor you walk down. Door 2 (the group's AssumeRole policy) reveals the truth: exactly which roles you can assume. No more, no less.
AD-5: Managed Policy Composition
Decision: Roles are composed by attaching multiple granular managed policies, not by writing inline permissions.
Rationale:
Three approaches were evaluated:
| Approach | Description | Verdict |
|---|---|---|
| Policy-to-Role (managed) | Attach modular managed policies to roles | ✅ Best practice |
| Role-to-Group mapping | Groups assume roles via AssumeRole | ✅ Used for N:M |
| Inline policy bundling | Write JSON permissions directly inside roles | ❌ Maintenance nightmare |
Why managed policies win:
DRY: Update the S3 read-only policy once, every role using it is updated instantly
Auditable: You can see exactly which building blocks make up a role
Reusable: The same policy (e.g., ECR read-only) attaches to multiple roles
Testable: Each policy can be validated independently
Note: This requires converting our current inline policies to standalone managed policies. The NAMING_CONVENTIONS.md already has a roadmap note for this: {company_prefix}-{env}-{tenant_id}-policy-{policy_name}.
AD-6: Security Model Flag
Decision: Support both groups-only and roles-based security models via a config flag (security_model: "groups-only" | "roles-based").
Rationale:
Three options were evaluated:
| Option | Approach | Pros | Cons |
|---|---|---|---|
| A | Keep group-level policies AND add Roles layer | Backward compatible, no breakage | Double maintenance, technical debt |
| B | Migrate all policies from groups to roles | Clean architecture, single model | Breaking change, high risk |
| C | Feature flag to support both modes | Incremental migration, instant rollback | Adds code complexity |
Option C was selected because:
Product feature, not just migration tool: Different clients need different maturity levels:
Startup clients (3 groups, small team) may prefer groups-only: simple, no AssumeRole workflow to teach
Enterprise clients (11 groups, compliance) need roles-based for audit trails and session-based access
Medium clients may start groups-only and migrate to roles-based as they mature
Safety net: Allows migrating groups one by one or rolling back instantly if a workflow breaks
End goal is Option B: The flag provides the clean architecture of Option B with the safety of Option A during transition
Config usage:
    security:
      security_model: "groups-only"  # Policies attached directly to groups (current behavior)
      # or
      security_model: "roles-based"  # Policies on roles, groups get AssumeRole only
Template generator behavior:
groups-only: Current behavior. Groups get managed_policies and custom_policies directly.
roles-based: Groups get only an AssumeRole inline policy. Assumable roles get the managed policies. Assignments mapping drives the wiring.
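The branching on the flag can be sketched in a few lines of Python. This is an illustrative outline only: `plan_group` and the returned dictionary shape are hypothetical, and the rule that groups without an assignments entry keep direct policies in roles-based mode is an assumption drawn from the config comments about business-consumers and external-contractors.

```python
VALID_MODELS = ("groups-only", "roles-based")


def plan_group(group_name: str, config: dict) -> dict:
    """Decide what the generator should emit for one group.

    Assumes the config layout shown above: config["security"]["security_model"]
    plus an optional top-level "assignments" list.
    """
    model = config["security"]["security_model"]
    if model not in VALID_MODELS:
        raise ValueError(f"unknown security_model: {model!r}")

    assumable = []
    if model == "roles-based":
        for assignment in config.get("assignments", []):
            if assignment["group"] == group_name:
                assumable.extend(assignment["roles"])

    return {
        "security_model": model,
        # Assumption: a group with no assumable roles keeps direct policies.
        "attach_policies_directly": not assumable,
        "assumable_roles": assumable,
    }


if __name__ == "__main__":
    cfg = {
        "security": {"security_model": "roles-based"},
        "assignments": [{"group": "data-scientists", "roles": ["ds-standard"]}],
    }
    print(plan_group("data-scientists", cfg))
    print(plan_group("business-consumers", cfg))
```

Switching the flag back to groups-only makes every group fall through to direct attachment, which is the rollback path described above.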
Architecture Overview
Three-Layer Model
The architecture has three distinct layers:
    ┌─────────────────────────────────────────────────┐
    │  LAYER 1: GROUPS (Who)                          │
    │  IAM Groups representing job functions          │
    │  e.g., data-scientists, ml-engineers, auditors  │
    └────────────────────────┬────────────────────────┘
                             │ sts:AssumeRole (N:M mapping)
                             ▼
    ┌─────────────────────────────────────────────────┐
    │  LAYER 2: ROLES (What)                          │
    │  Assumable roles as permission bundles          │
    │  e.g., arole-ds-standard, arole-ml-deploy       │
    └────────────────────────┬────────────────────────┘
                             │ Policy attachments
                             ▼
    ┌─────────────────────────────────────────────────┐
    │  LAYER 3: POLICIES (How)                        │
    │  Granular managed policies per service/level    │
    │  e.g., s3-read-only, ecr-dev-read-write         │
    └─────────────────────────────────────────────────┘
Resource Relationship Diagram
    IAM Group (data-scientists)
    ├── can assume → arole-ds-standard
    │   ├── attached: policy-s3-project-buckets-only
    │   ├── attached: policy-ecr-read-only
    │   ├── attached: policy-pipeline-read-only
    │   ├── attached: policy-sagemaker-dev-invoke
    │   └── attached: policy-bedrock-invoke-only
    │
    └── can assume → arole-ds-experiment
        ├── attached: policy-s3-project-buckets-full
        ├── attached: policy-sagemaker-dev-invoke
        └── attached: policy-bedrock-invoke-only
    IAM Group (ml-engineers)
    ├── can assume → arole-ds-experiment (shared with data-scientists)
    ├── can assume → arole-ml-deploy
    │   ├── attached: policy-ecr-dev-read-write
    │   ├── attached: policy-pipeline-project-dev
    │   └── attached: policy-lambda-deploy-manage
    │
    └── can assume → arole-bedrock-manage
        ├── attached: policy-bedrock-model-manage
        └── attached: policy-bedrock-observability
    IAM Group (platform-administrators)
    └── can assume → arole-platform-full
        ├── attached: policy-s3-full
        ├── attached: policy-ecr-full
        ├── attached: policy-pipeline-full
        ├── attached: policy-sagemaker-full
        ├── attached: policy-lambda-full
        └── attached: policy-bedrock-full
Real-World Example
Scenario: Edge AI, medium tier (9 groups)
Sarah (data scientist) is in the data-scientists group. Her day:
Morning: Assumes arole-ds-standard to read training data from S3 and invoke Bedrock for embeddings
Afternoon: Assumes arole-ds-experiment to write experiment results to S3 and invoke dev SageMaker endpoints
End of day: Session expires, no standing permissions
James (ML engineer) is in the ml-engineers group. His day:
Morning: Assumes arole-ml-deploy to push Docker images and update a pipeline
Afternoon: Assumes arole-bedrock-manage to configure a new guardrail for the production chatbot
Debugging: Assumes arole-ds-experiment (shared role) to reproduce a data scientist's issue
Naming Conventions
Assumable Roles
Pattern: {company_prefix}-{env}-{tenant_id}-arole-{role_name}
Examples:
edge-prod-b001-arole-ds-standard
edge-prod-b001-arole-ml-deploy
edge-prod-b001-arole-platform-full
globalbank-prod-c001-arole-bedrock-manage
Rationale: The arole- prefix distinguishes assumable roles from service roles (role-). This distinction matters because:
Service roles (role-sagemaker-execution) are assumed by AWS services to run jobs
Assumable roles (arole-data-scientists) are assumed by humans/groups to get permissions
Different trust relationships, different audit trails, different lifecycle management
Managed Policies
Pattern: {company_prefix}-{env}-{tenant_id}-policy-{service}-{level}
Examples:
edge-prod-b001-policy-s3-read-only
edge-prod-b001-policy-ecr-dev-read-write
edge-prod-b001-policy-pipeline-project-dev
edge-prod-b001-policy-sagemaker-dev-invoke
edge-prod-b001-policy-lambda-deploy-manage
edge-prod-b001-policy-bedrock-invoke-only
edge-prod-b001-policy-bedrock-model-manage
edge-prod-b001-policy-bedrock-full
edge-prod-b001-policy-kms-level1-read-only
edge-prod-b001-policy-trusted-advisor-level1-read-only
Test Resources
Pattern: {base_name}-test-{6_digit_random}
Examples:
edge-prod-b001-arole-ds-standard-test-a3f9c2
edge-prod-b001-policy-s3-read-only-test-a3f9c2
Config Schema Design
Roles Section
Roles are standalone objects. Each role defines a name and a list of managed policy references.
    roles:
      - name: "ds-standard"
        description: "Standard data science access: read data, invoke models"
        policies:
          - s3-project-buckets-only
          - ecr-read-only
          - pipeline-read-only
          - sagemaker-dev-invoke
          - bedrock-invoke-only
      - name: "ds-experiment"
        description: "Experiment access: write results, invoke dev endpoints"
        policies:
          - s3-project-buckets-full
          - sagemaker-dev-invoke
          - bedrock-invoke-only
      - name: "ml-deploy"
        description: "ML deployment access: push images, manage pipelines, deploy functions"
        policies:
          - ecr-dev-read-write
          - pipeline-project-dev
          - lambda-deploy-manage
      - name: "bedrock-manage"
        description: "Bedrock management: guardrails, model access, imports"
        policies:
          - bedrock-model-manage
      - name: "platform-full"
        description: "Full platform administration"
        policies:
          - s3-full
          - ecr-full
          - pipeline-full
          - sagemaker-full
          - lambda-full
          - bedrock-full
Assignments Section
The N:M mapping between groups and roles.
    assignments:
      - group: "data-scientists"
        roles:
          - "ds-standard"
          - "ds-experiment"
      - group: "ml-engineers"
        roles:
          - "ds-experiment"
          - "ml-deploy"
          - "bedrock-manage"
      - group: "platform-administrators"
        roles:
          - "platform-full"
      - group: "operations-support"
        roles:
          - "ds-standard"
          - "ml-deploy"
    # Groups using direct policy assignments (no assumable roles):
    # - business-consumers: Non-technical users (dashboards, reports). Minimal invoke-only
    #   permissions (sagemaker level1-prod, bedrock level1). Adding sts:AssumeRole friction
    #   provides no security benefit for users who only call inference endpoints.
    # - external-contractors: Minimal read-only access (s3 level1, sagemaker level1, bedrock level1).
    #   Role assumption deferred to v2 (TODO #8) when MFA + IP + session constraints are
    #   implemented; adding a role without those controls adds plumbing with no security value.
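Because assignments reference groups and roles by name, a validation pass can catch dangling references before any template is generated. The following Python sketch assumes the config has already been parsed into dictionaries with the groups, roles and assignments keys shown in this document; the function names are hypothetical.

```python
from collections import defaultdict


def find_dangling_references(config: dict) -> list[str]:
    """Return one error message per assignment that points at an unknown name."""
    group_names = {g["name"] for g in config.get("groups", [])}
    role_names = {r["name"] for r in config.get("roles", [])}
    errors = []
    for assignment in config.get("assignments", []):
        group = assignment["group"]
        if group not in group_names:
            errors.append(f"assignment references unknown group {group!r}")
        for role in assignment.get("roles", []):
            if role not in role_names:
                errors.append(f"group {group!r} references unknown role {role!r}")
    return errors


def groups_by_role(config: dict) -> dict[str, list[str]]:
    """Invert the mapping: for each role, list the groups allowed to assume it."""
    mapping = defaultdict(list)
    for assignment in config.get("assignments", []):
        for role in assignment.get("roles", []):
            mapping[role].append(assignment["group"])
    return dict(mapping)


if __name__ == "__main__":
    cfg = {
        "groups": [{"name": "data-scientists"}, {"name": "ml-engineers"}],
        "roles": [{"name": "ds-standard"}, {"name": "ds-experiment"}],
        "assignments": [
            {"group": "data-scientists", "roles": ["ds-standard", "ds-experiment"]},
            {"group": "ml-engineers", "roles": ["ds-experiment", "ml-deploy"]},
        ],
    }
    print(find_dangling_references(cfg))
    print(groups_by_role(cfg))
```

The inverted view is the role-side half of the N:M mapping and is convenient for audits of who can assume a given role.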
Complete Config Example
    client: "edge"
    environment: "prod"
    tenant_id: "b001"
    region: "us-west-2"
    tier: "medium-9"
    groups:
      - name: "data-scientists"
        description: "Data science team"
      - name: "ml-engineers"
        description: "ML engineering team"
      - name: "platform-administrators"
        description: "Platform admin team"
      - name: "business-consumers"
        description: "Business stakeholders"
      - name: "operations-support"
        description: "Operations team"
    roles:
      - name: "ds-standard"
        description: "Standard data science access"
        policies:
          - s3-project-buckets-only
          - ecr-read-only
          - pipeline-read-only
          - sagemaker-dev-invoke
          - bedrock-invoke-only
      - name: "ds-experiment"
        description: "Experiment access with write permissions"
        policies:
          - s3-project-buckets-full
          - sagemaker-dev-invoke
          - bedrock-invoke-only
      - name: "ml-deploy"
        description: "ML deployment and CI/CD access"
        policies:
          - ecr-dev-read-write
          - pipeline-project-dev
          - lambda-deploy-manage
      - name: "bedrock-manage"
        description: "Bedrock model and guardrail management"
        policies:
          - bedrock-model-manage
      - name: "platform-full"
        description: "Full platform administration"
        policies:
          - s3-full
          - ecr-full
          - pipeline-full
          - sagemaker-full
          - lambda-full
          - bedrock-full
    assignments:
      - group: "data-scientists"
        roles:
          - "ds-standard"
          - "ds-experiment"
      - group: "ml-engineers"
        roles:
          - "ds-experiment"
          - "ml-deploy"
          - "bedrock-manage"
      - group: "platform-administrators"
        roles:
          - "platform-full"
      - group: "operations-support"
        roles:
          - "ds-standard"
          - "ml-deploy"
    # Groups using direct policy assignments (no assumable roles):
    # - business-consumers: Non-technical users (dashboards, reports). Minimal invoke-only
    #   permissions (sagemaker level1-prod, bedrock level1). Adding sts:AssumeRole friction
    #   provides no security benefit for users who only call inference endpoints.
    # - external-contractors: Minimal read-only access (s3 level1, sagemaker level1, bedrock level1).
    #   Role assumption deferred to v2 (TODO #8) when MFA + IP + session constraints are
    #   implemented; adding a role without those controls adds plumbing with no security value.
CloudFormation Resource Generation
The template generator reads the config and produces these CloudFormation resources:
Role Resource
For each role in the roles section:
    # Generated for: arole-ds-standard
    EdgeProdB001AroleDsStandard:
      Type: AWS::IAM::Role
      Properties:
        RoleName: "edge-prod-b001-arole-ds-standard"
        Description: "Standard data science access: read data, invoke models"
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Principal:
                AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
              Action: "sts:AssumeRole"
        ManagedPolicyArns:
          - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-s3-project-buckets-only"
          - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-ecr-read-only"
          - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-pipeline-read-only"
          - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-sagemaker-dev-invoke"
          - !Sub "arn:aws:iam::${AWS::AccountId}:policy/edge-prod-b001-policy-bedrock-invoke-only"
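One way a generator could arrive at the resource above is sketched below in Python. It is an assumption-laden outline, not the tool's implementation: `logical_id` and `role_resource` are hypothetical names, the logical ID rule (capitalize each hyphen-separated part) is inferred from the EdgeProdB001AroleDsStandard example, and the output uses the long-form Fn::Sub dictionary that corresponds to the !Sub shorthand.

```python
def logical_id(resource_name: str) -> str:
    """Turn 'edge-prod-b001-arole-ds-standard' into 'EdgeProdB001AroleDsStandard'."""
    return "".join(part.capitalize() for part in resource_name.split("-"))


def role_resource(prefix: str, env: str, tenant_id: str, role: dict) -> tuple[str, dict]:
    """Return (logical ID, CloudFormation resource) for one entry of `roles`."""
    base = f"{prefix}-{env}-{tenant_id}"
    role_name = f"{base}-arole-{role['name']}"
    arn_prefix = "arn:aws:iam::${AWS::AccountId}"  # resolved by CloudFormation
    resource = {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "RoleName": role_name,
            "Description": role.get("description", ""),
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": {"AWS": {"Fn::Sub": arn_prefix + ":root"}},
                        "Action": "sts:AssumeRole",
                    }
                ],
            },
            "ManagedPolicyArns": [
                {"Fn::Sub": f"{arn_prefix}:policy/{base}-policy-{policy}"}
                for policy in role["policies"]
            ],
        },
    }
    return logical_id(role_name), resource


if __name__ == "__main__":
    rid, res = role_resource(
        "edge", "prod", "b001",
        {"name": "ds-standard", "policies": ["s3-project-buckets-only", "ecr-read-only"]},
    )
    print(rid)
    print(res["Properties"]["ManagedPolicyArns"])
```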
Group AssumeRole Policy
For each group in the assignments section, an inline policy is added:
    # Generated for: group-data-scientists
    EdgeProdB001GroupDataScientists:
      Type: AWS::IAM::Group
      Properties:
        GroupName: "edge-prod-b001-group-data-scientists"
        Policies:
          - PolicyName: "AllowAssumeRoles"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: Allow
                  Action: "sts:AssumeRole"
                  Resource:
                    - !Sub "arn:aws:iam::${AWS::AccountId}:role/edge-prod-b001-arole-ds-standard"
                    - !Sub "arn:aws:iam::${AWS::AccountId}:role/edge-prod-b001-arole-ds-experiment"
Service Roles and Policy Assignments
Service roles are IAM Roles assumed by AWS services and CI/CD platforms, not by human users. Assumable roles (Layer 2) are assumed by humans via sts:AssumeRole; service roles are assumed automatically by machines, and their trust policies name the service or identity provider allowed to do so.
Why service_account Was Removed from iam_groups
The original enterprise config had a service_account IAM group with PowerUserAccess + inline iam:PassRole + cloudformation:* on *. This was an anti-pattern for three reasons:
IAM Groups are for humans: Groups are collections of IAM Users. Service accounts and other machine identities should be IAM Roles assumed by the CI/CD platform, not IAM Users with long-lived access keys.
PowerUserAccess is a sledgehammer: It grants access to nearly every AWS service, excluding most IAM, Organizations, and account management actions. A CI/CD pipeline only needs access to the specific services it deploys to.
iam:PassRole on * is privilege escalation: Unscoped PassRole allows passing any role to any service, so the caller can hand a more privileged role (up to admin) to a service it controls and act through it.
Decision: Remove service_account from iam_groups and replace with ci_cd_deployment_role under service_roles, using the policy_assignments system for scoped permissions.
Service Roles Use Policy Assignments
Service roles can reference the same policy level system as IAM groups. This ensures consistency β a sagemaker: level4-ci assignment on a service role uses the exact same policy definition as it would on a group.
    service_roles:
      ci_cd_deployment:
        description: "CI/CD deployment role: build, test, deploy across the ML platform"
        trust_policy:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Principal:
                Service: codepipeline.amazonaws.com
              Action: sts:AssumeRole
        policy_assignments:
          s3: level2            # project-buckets-only
          ecr: level3           # ci-read-write
          pipeline: level3      # project-ci
          sagemaker: level4-ci  # deploy-only: no delete, no traffic shifting
          lambda: level2        # deploy-manage
The key difference from group policy_assignments:
Groups get policies attached directly (inline or customer-managed)
Service roles get the same policies attached to the role itself
Trust policy controls who can assume the role (AWS service, OIDC provider, or another account)
ci_cd_deployment_role Design
The CI/CD deployment role replaces the old service_account group. It covers the full deployment lifecycle:
| Service | Level | CI/CD Stage | What It Does |
|---|---|---|---|
| S3 | level2 (project-buckets-only) | Build + Deploy | Read/write model artifacts, deployment packages |
| ECR | level3 (ci-read-write) | Build | Build and push container images to registry |
| Pipeline | level3 (project-ci) | Orchestration | Create, execute, manage ML pipelines |
| SageMaker | level4-ci (deploy-only) | Deploy | Create endpoints, register models, configure autoscaling. No delete, no traffic shifting |
| Lambda | level2 (deploy-manage) | Deploy | Deploy and update inference functions |
What's NOT included:
No Bedrock: model access and guardrail management is a human decision
No iam:PassRole on *: PassRole is scoped within SageMaker level4-ci to {company_prefix}-{env}-*-role-* conditioned to sagemaker.amazonaws.com
No cloudformation:*: CloudFormation access is outside the 6-service scope and should be handled separately if needed for IaC deployments
No managed policies: the level system covers everything
SageMaker level4-ci guardrails:
Explicit deny on DeleteEndpoint, DeleteEndpointConfig, DeleteModel, DeleteModelPackage, DeleteModelPackageGroup
Explicit deny on UpdateEndpointWeightsAndCapacities (traffic shifting)
Explicit deny on DeleteDomain, DeleteUserProfile
Pipelines deploy forward: teardown requires separate authorization
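The guardrails and the scoped PassRole described above map onto two IAM policy statements. The Python sketch below shows one possible shape; the authoritative level4-ci definition lives in POLICY_GUIDE.md, so the function name, the Sid values and the wildcard account segment in the role ARN are assumptions made for illustration. The action names and the iam:PassedToService condition key are standard IAM identifiers.

```python
DENIED_SAGEMAKER_ACTIONS = [
    "sagemaker:DeleteEndpoint",
    "sagemaker:DeleteEndpointConfig",
    "sagemaker:DeleteModel",
    "sagemaker:DeleteModelPackage",
    "sagemaker:DeleteModelPackageGroup",
    "sagemaker:UpdateEndpointWeightsAndCapacities",  # traffic shifting
    "sagemaker:DeleteDomain",
    "sagemaker:DeleteUserProfile",
]


def level4_ci_guardrail_statements(prefix: str, env: str) -> list[dict]:
    """Return the explicit-deny and scoped-PassRole statements (illustrative)."""
    return [
        {
            "Sid": "DenyTeardownAndTrafficShifting",
            "Effect": "Deny",
            "Action": DENIED_SAGEMAKER_ACTIONS,
            "Resource": "*",
        },
        {
            "Sid": "ScopedPassRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            # Only service roles that follow the naming convention.
            "Resource": f"arn:aws:iam::*:role/{prefix}-{env}-*-role-*",
            # Only when the role is handed to SageMaker.
            "Condition": {
                "StringEquals": {"iam:PassedToService": "sagemaker.amazonaws.com"}
            },
        },
    ]


if __name__ == "__main__":
    for statement in level4_ci_guardrail_statements("edge", "prod"):
        print(statement["Sid"], statement["Effect"])
```

An explicit Deny wins over any Allow in IAM policy evaluation, which is why teardown stays blocked even if another attached policy grants it.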
Trust Policy Patterns
The trust policy depends on the CI/CD platform. Common patterns:
AWS CodePipeline / CodeBuild:
    trust_policy:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - codepipeline.amazonaws.com
              - codebuild.amazonaws.com
          Action: sts:AssumeRole
GitHub Actions (OIDC):
    trust_policy:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Federated: arn:aws:iam::{account_id}:oidc-provider/token.actions.githubusercontent.com
          Action: sts:AssumeRoleWithWebIdentity
          Condition:
            StringEquals:
              token.actions.githubusercontent.com:aud: sts.amazonaws.com
            StringLike:
              token.actions.githubusercontent.com:sub: repo:{org}/{repo}:*
GitLab CI (OIDC):
    trust_policy:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Federated: arn:aws:iam::{account_id}:oidc-provider/gitlab.com
          Action: sts:AssumeRoleWithWebIdentity
          Condition:
            StringEquals:
              gitlab.com:aud: https://gitlab.com
            StringLike:
              gitlab.com:sub: project_path:{group}/{project}:*
The trust policy is client-specific and configured per deployment. The policy_assignments remain the same regardless of which CI/CD platform assumes the role.
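Since only the trust policy varies per client, it can be generated from a few parameters. As one example, the GitHub Actions pattern might be built as in the Python sketch below; `github_oidc_trust_policy` is a hypothetical helper, and the policy content simply mirrors the YAML pattern shown earlier.

```python
GITHUB_OIDC_PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, org: str, repo: str) -> dict:
    """Trust policy allowing workflows of one GitHub repository to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": (
                        f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_PROVIDER}"
                    )
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Audience must be AWS STS.
                    "StringEquals": {f"{GITHUB_OIDC_PROVIDER}:aud": "sts.amazonaws.com"},
                    # Subject must belong to the given repository (any ref).
                    "StringLike": {f"{GITHUB_OIDC_PROVIDER}:sub": f"repo:{org}/{repo}:*"},
                },
            }
        ],
    }


if __name__ == "__main__":
    import json

    print(json.dumps(github_oidc_trust_policy("123456789012", "my-org", "my-repo"), indent=2))
```

Tightening the sub pattern (for example to a single branch) narrows which workflows may deploy without touching policy_assignments.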
Implementation Plan
Phase 1: Architecture and Config Design
~~Design architectural decisions~~ ✅
~~Design config schema with roles and assignments sections~~ ✅
~~Update NAMING_CONVENTIONS.md with arole- and policy- patterns~~ ✅
~~Document architecture in POLICY_GUIDE.md~~ ✅
Phase 2: Config and Validation
~~Update validation schemas (startup, medium, enterprise) with roles and assignments~~ ✅
~~Update client configs with role definitions and assignments~~ ✅
~~Update config validation code to validate N:M mappings (no dangling references)~~ ✅
Phase 3: Template Generator
~~Extend template generator to emit IAM Role resources with trust policies~~ ✅
~~Emit managed policy resources from POLICY_GUIDE.md definitions~~ ✅
~~Emit AssumeRole inline policies on groups from assignments mapping~~ ✅
~~Convert existing inline policies to standalone managed policies~~ ✅
Phase 4: Deployment and Testing
~~Test with --template-only (validate generated CloudFormation)~~ ✅
~~Test with --test-deploy (real AWS, unique names)~~ ✅
~~Add troubleshooting scenarios for AssumeRole denied errors~~ ✅
~~Update POLICY_GUIDE.md Assignment Recommendations~~ ✅
Phase 5: Future Work
Config-driven Bedrock model scoping: see ROADMAP.md
Lambda VPC governance: see ROADMAP.md
v2 Roadmap
AD-7: MFA and Session Duration Controls
What: Support time-bound and conditional role assumption by requiring MFA for sensitive roles and setting session duration limits.
Why:
Reduced attack window: Limiting session duration (1–12 hours) restricts the usefulness of stolen session tokens
Session hijacking mitigation: Prevents persistent sessions from bypassing MFA
Contextual security: Require MFA only for high-risk roles (e.g., platform-full) without constant user friction
Implementation approach:
Add a Condition block to trust policies requiring aws:MultiFactorAuthPresent
Add max_session_duration to role config (default 1 hour, configurable up to 12 hours)
Add mfa_required: true/false to assignments config per group-role pair
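A rough Python sketch of how these v2 controls might translate into role properties follows. It is speculative by nature, since AD-7 is not yet implemented: `role_access_controls` is a hypothetical function, and the MFA flag is applied per role here for simplicity, whereas the plan above places mfa_required on each group-role pair. MaxSessionDuration is the CloudFormation property for the session limit, expressed in seconds (3600 to 43200).

```python
def role_access_controls(account_id: str, *, mfa_required: bool = False,
                         max_session_hours: int = 1) -> dict:
    """Return the trust policy and session limit for an assumable role."""
    if not 1 <= max_session_hours <= 12:
        raise ValueError("max_session_hours must be between 1 and 12")

    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if mfa_required:
        # Only callers who authenticated with MFA may assume the role.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}

    return {
        "AssumeRolePolicyDocument": {"Version": "2012-10-17", "Statement": [statement]},
        "MaxSessionDuration": max_session_hours * 3600,  # seconds
    }


if __name__ == "__main__":
    # High-risk role: MFA required, shortest session.
    print(role_access_controls("123456789012", mfa_required=True, max_session_hours=1))
    # Everyday role: no MFA prompt, longer session.
    print(role_access_controls("123456789012", max_session_hours=8))
```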
Best practices to follow:
Role-based durations: critical roles use shorter sessions (1 hour), standard roles use longer (4–8 hours)
Avoid MFA fatigue: don't force MFA for low-risk, everyday roles
Use idle timeouts (15–60 minutes) alongside absolute session limits
AD-8: Permission Boundaries
What: Allow client team leads to create roles within guardrails without risking privilege escalation.
Implementation approach:
Define permission boundary policies per tier
Attach boundaries to delegated admin roles
Ensure no role can exceed its boundary regardless of attached policies
Motivating Scenarios: Why Groups-Only Is Insufficient
The following real-world scenarios demonstrate why an IAM Groups-only architecture (without assumable roles) would fail to meet enterprise client needs. These are the problems the Roles layer solves.
Scenario 1: Temporary Cross-Team Access
Situation: A data scientist needs to debug a failing ML pipeline. The pipeline logs and configuration are only accessible to the ml-engineers group.
Groups-only problem: The admin must add the data scientist to the ml-engineers group, giving them all ML engineer permissions (ECR write, pipeline management, Lambda deploy). After debugging, the admin must remember to remove them. This is error-prone, over-privileged, and leaves no audit trail of the temporary access.
Roles solution: The data scientist assumes arole-pipeline-viewer (read-only pipeline access) for a single session. No group membership changes, no over-provisioning, session expires automatically.
Scenario 2: Environment-Specific Access Control
Situation: A backend developer needs to invoke production SageMaker endpoints for the live application but should never touch dev/staging endpoints (to avoid accidentally routing production traffic to unstable models).
Groups-only problem: Group policies are static bundles. You'd need separate groups for "backend-dev-invoke" and "backend-prod-invoke", leading to group explosion. With 5 environments × 10 job functions, you'd need 50 groups.
Roles solution: Define arole-prod-invoke and arole-dev-invoke as separate roles. The backend-developers group is assigned only arole-prod-invoke. Clean, no group explosion.
Scenario 3: Least-Privilege for Automation
Situation: A CI/CD pipeline needs different permissions at different stages: read ECR during build, write ECR during push, invoke SageMaker during integration tests, deploy Lambda during release.
Groups-only problem: The pipeline's IAM entity must be in a single group with all permissions combined. It has write access during the build stage when it only needs read access. No stage-level isolation.
Roles solution: The pipeline assumes different roles at each stage: arole-ecr-read during build, arole-ecr-write during push, arole-sagemaker-invoke during tests, arole-lambda-deploy during release. Each stage has exactly the permissions it needs.
Scenario 4: Compliance and Audit Requirements
Situation: An enterprise client (e.g., a bank) requires that every privileged action is traceable to a specific permission grant, with clear evidence of when access was assumed and when it expired.
Groups-only problem: Group membership is a persistent state. CloudTrail shows the user's identity but not which specific permission bundle they were using. There's no session boundary: the user has all group permissions all the time.
Roles solution: Every sts:AssumeRole call is logged in CloudTrail with the role ARN, session name, and timestamp. Auditors can see exactly which role was assumed, when, and for how long. Session expiry provides natural access boundaries.
Scenario 5: FinOps Cost Control
Situation: Only two people on the platform team should be able to create Bedrock provisioned throughput (which can cost thousands per month). Other platform admins should have full access to everything else.
Groups-only problem: All platform-administrators share the same group policy. Either everyone can provision throughput, or no one can. You can't differentiate within a group.
Roles solution: Define arole-platform-full (everything except throughput) and arole-bedrock-throughput (adds provisioned throughput). Assign arole-bedrock-throughput only to the two approved FinOps engineers. Same group, different role access.
Scenario 6: Onboarding and Offboarding
Situation: A new ML engineer joins the team. They should start with read-only access for the first week, then graduate to full ML engineer access.
Groups-only problem: You either put them in the ml-engineers group immediately (over-privileged on day one) or create a temporary "ml-engineers-readonly" group (group explosion, manual cleanup).
Roles solution: Add them to the ml-engineers group on day one. The group has access to both arole-ml-readonly and arole-ml-deploy. During onboarding, they only use arole-ml-readonly. After the first week, they start assuming arole-ml-deploy. No group changes needed: the access control is in which role they choose to assume.
Why 1:1 Mapping Was Rejected
For reference, a 1:1 mapping (one group → one role) was initially considered and rejected:
| Concern | 1:1 Limitation | N:M Solution |
|---|---|---|
| Flexibility | User locked to one persona | User assumes different roles as needed |
| Role switching | Requires group membership changes | Switch roles via AssumeRole |
| Scalability | Group explosion (one group per permission combo) | Shared roles reduce total count |
| Cross-account | Static mapping, no dynamic assumption | Role ARNs can reference other accounts |
| Temporary access | Requires adding/removing from groups | Assume a role temporarily, session expires |
When 1:1 is acceptable:
Very small setups (2–3 users, personal accounts)
Strictly defined personas (e.g., a CI/CD pipeline with one fixed function)
References
POLICY_GUIDE.md: Source of truth for all policy definitions and levels
ROLES_GUIDE.md: Source of truth for role definitions (service roles, assumable roles, cross-account roles)
NAMING_CONVENTIONS.md: Naming patterns for all resources
Document Version: 1.0
Last Updated: 2025
Maintained By: MLOps Platform Team