Governance, Compliance, and Audit Capabilities

Overview

This document describes how the S3 folder structure provisioned by the s3-provisioner tool can support enterprise governance, compliance, and audit requirements.

Important: The s3-provisioner tool creates only the folder structure and the bucket configuration listed below. It does not implement governance, compliance, or audit capabilities itself. Those must be implemented by:

  • AWS services (CloudTrail, Config, S3 Access Logs, Macie, etc.)

  • Your MLOps applications that write to these folders

  • Third-party governance tools

This document serves as a reference architecture showing how to use the provisioned structure for enterprise requirements.

What the s3-provisioner Tool Provides

The s3-provisioner tool creates:

✅ Folder structure with designated locations for:

  • Audit logs (artifacts/metadata/governance/audit_logs/)

  • Compliance reports (artifacts/metadata/governance/compliance/)

  • Data lineage metadata (artifacts/metadata/governance/lineage/)

  • Data quality reports (artifacts/reports/data_quality/)

✅ S3 bucket configuration:

  • Versioning (optional)

  • Encryption at rest

  • Public access blocking

  • VPC endpoint support

  • Lifecycle policies

  • Tagging for governance

✅ CloudFormation templates for reproducible infrastructure

What You Need to Implement

❌ Data governance logic - Your applications must write metadata to governance folders

❌ Compliance monitoring - Configure AWS Config, CloudTrail, or third-party tools

❌ Audit trail generation - Your MLOps pipelines must log events to audit folders

❌ Access control enforcement - Configure IAM policies, bucket policies, SCPs

❌ Data lineage tracking - Your pipelines must write lineage metadata


Data Governance

Metadata Management

Location: artifacts/metadata/governance/

governance/
├── data_catalog/
│   ├── schema_registry.json
│   ├── data_dictionary.json
│   └── lineage_metadata.json
├── policies/
│   ├── data_classification.json
│   ├── access_policies.json
│   └── retention_policies.json
└── audit_logs/
    └── YYYY/MM/DD/
        └── governance_events.jsonl

Key capabilities this layout is designed to support (your applications and AWS services must implement them):

  • Centralized data catalog with schema versioning

  • Data classification (PII, sensitive, public)

  • Automated policy enforcement

  • Change tracking and approval workflows

Data Quality Standards

Location: artifacts/reports/data_quality/

{
  "report_id": "dq_2024_01_01_001",
  "timestamp": "2024-01-01T10:00:00Z",
  "dataset": "raw/customer_data",
  "checks": {
    "completeness": {"score": 0.98, "threshold": 0.95, "status": "PASS"},
    "accuracy": {"score": 0.96, "threshold": 0.90, "status": "PASS"},
    "consistency": {"score": 0.99, "threshold": 0.95, "status": "PASS"},
    "timeliness": {"score": 0.94, "threshold": 0.90, "status": "PASS"}
  },
  "violations": [],
  "approved_by": "<user_id>",
  "compliance_status": "COMPLIANT"
}
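
A report in this shape could be assembled by a small helper in the pipeline that runs the checks. The sketch below is only an illustration: the function name `build_quality_report`, the `"FAIL"` and `"NON_COMPLIANT"` values, and the choice to list failed check names under `violations` are assumptions, since the example above only shows the passing case.

```python
from datetime import datetime, timezone


def build_quality_report(report_id, dataset, scores, thresholds, approved_by):
    """Assemble a data quality report dict in the shape of the example above."""
    checks = {}
    violations = []
    for name, score in scores.items():
        threshold = thresholds[name]
        passed = score >= threshold
        checks[name] = {
            "score": score,
            "threshold": threshold,
            # "FAIL" is an assumed counterpart to the documented "PASS".
            "status": "PASS" if passed else "FAIL",
        }
        if not passed:
            violations.append(name)
    return {
        "report_id": report_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "dataset": dataset,
        "checks": checks,
        "violations": violations,
        "approved_by": approved_by,
        # "NON_COMPLIANT" is an assumed counterpart to "COMPLIANT".
        "compliance_status": "COMPLIANT" if not violations else "NON_COMPLIANT",
    }


if __name__ == "__main__":
    report = build_quality_report(
        "dq_2024_01_01_001",
        "raw/customer_data",
        scores={"completeness": 0.98, "accuracy": 0.85},
        thresholds={"completeness": 0.95, "accuracy": 0.90},
        approved_by="<user_id>",
    )
    print(report["compliance_status"], report["violations"])
```

The resulting dict can be serialized with `json.dumps` and written under artifacts/reports/data_quality/ by whatever upload mechanism your pipeline already uses.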

Compliance Framework

Regulatory Compliance

The folder structure can hold evidence and artifacts for multiple compliance standards:

Standard   Scope                            Implementation
---------  -------------------------------  ----------------------------------------------
GDPR       Data privacy, right to deletion  Encryption, access logs, data retention
HIPAA      Healthcare data protection       Encryption at rest/transit, audit trails
SOC 2      Security and availability        Access controls, monitoring, incident response
ISO 27001  Information security             Risk management, security policies
CCPA       California consumer privacy      Data inventory, consent management

Compliance Artifacts

Location: artifacts/metadata/governance/compliance/

compliance/
├── certifications/
│   ├── soc2_report.pdf
│   └── iso27001_certificate.pdf
├── assessments/
│   └── YYYY/MM/
│       └── compliance_assessment.json
├── evidence/
│   └── YYYY/MM/DD/
│       ├── access_reviews.json
│       ├── encryption_verification.json
│       └── backup_verification.json
└── violations/
    └── YYYY/MM/DD/
        └── violation_reports.json

Data Classification Tags

{
  "classification": {
    "level": "CONFIDENTIAL",
    "categories": ["PII", "FINANCIAL"],
    "retention_period": "7_YEARS",
    "encryption_required": true,
    "access_restrictions": ["ROLE_DATA_SCIENTIST", "ROLE_ADMIN"],
    "geographic_restrictions": ["US", "EU"],
    "compliance_frameworks": ["GDPR", "CCPA", "SOC2"]
  }
}
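
An application that reads these tags could use them as a pre-check before touching the data. The following is a minimal sketch under one possible reading of the tag: it assumes `access_restrictions` lists the roles that may access the data and `geographic_restrictions` lists the regions from which access is permitted. The function name `is_access_allowed` is hypothetical, and a check like this complements IAM and bucket policies rather than replacing them.

```python
EXAMPLE_TAG = {
    "classification": {
        "level": "CONFIDENTIAL",
        "categories": ["PII", "FINANCIAL"],
        "access_restrictions": ["ROLE_DATA_SCIENTIST", "ROLE_ADMIN"],
        "geographic_restrictions": ["US", "EU"],
    }
}


def is_access_allowed(tag, role, region):
    """Return True only if both the role and the region are listed in the tag.

    Assumption: a missing list is treated as "nobody allowed" (default deny).
    """
    rules = tag["classification"]
    role_ok = role in rules.get("access_restrictions", [])
    region_ok = region in rules.get("geographic_restrictions", [])
    return role_ok and region_ok


if __name__ == "__main__":
    print(is_access_allowed(EXAMPLE_TAG, "ROLE_DATA_SCIENTIST", "EU"))
    print(is_access_allowed(EXAMPLE_TAG, "ROLE_ML_ENGINEER", "EU"))
```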

Audit Capabilities

Comprehensive Audit Trail

For full traceability, your pipelines should log every operation as a structured event such as:

{
  "event_id": "evt_20240101_100000_001",
  "timestamp": "2024-01-01T10:00:00Z",
  "event_type": "DATA_ACCESS",
  "actor": {
    "user_id": "<user_id>",
    "role": "DATA_SCIENTIST",
    "session_id": "<session_id>"
  },
  "resource": {
    "type": "S3_OBJECT",
    "path": "s3://bucket/solutions/master-solution/data/raw/2024/01/01/data.csv",
    "classification": "CONFIDENTIAL"
  },
  "action": "READ",
  "result": "SUCCESS",
  "metadata": {
    "ip_address": "<ip>",
    "user_agent": "boto3/1.26.0",
    "request_id": "<request_id>"
  },
  "compliance_context": {
    "purpose": "MODEL_TRAINING",
    "approval_id": "appr_001",
    "data_usage_agreement": "dua_2024_001"
  }
}
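
Because the provisioner does not generate these events, it can help to validate them before they are written. The sketch below checks an event against the fields shown in the example. Which fields are treated as required is an assumption made for illustration, and `validate_audit_event` is a hypothetical helper name.

```python
# Top-level fields assumed to be mandatory, with their expected Python types.
REQUIRED_FIELDS = {
    "event_id": str,
    "timestamp": str,
    "event_type": str,
    "actor": dict,
    "resource": dict,
    "action": str,
    "result": str,
}

# Nested fields assumed to be mandatory inside "actor" and "resource".
REQUIRED_NESTED = {
    "actor": ("user_id", "role"),
    "resource": ("type", "path"),
}


def validate_audit_event(event):
    """Return a list of problems; an empty list means the event looks well formed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}")
    for parent, children in REQUIRED_NESTED.items():
        section = event.get(parent)
        if isinstance(section, dict):
            for child in children:
                if child not in section:
                    problems.append(f"missing field: {parent}.{child}")
    return problems


if __name__ == "__main__":
    incomplete = {"event_id": "evt_1", "actor": {"user_id": "<user_id>"}}
    for problem in validate_audit_event(incomplete):
        print(problem)
```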

Audit Log Locations

artifacts/metadata/governance/audit_logs/
├── access_logs/
│   └── YYYY/MM/DD/
│       └── access_events.jsonl
├── change_logs/
│   └── YYYY/MM/DD/
│       └── change_events.jsonl
├── deployment_logs/
│   └── YYYY/MM/DD/
│       └── deployment_events.jsonl
└── compliance_logs/
    └── YYYY/MM/DD/
        └── compliance_events.jsonl
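
The date-partitioned layout above can be derived from each event's timestamp. The sketch below shows one way to compute the object key and serialize a batch of events as JSON Lines. The category names (`"access"`, `"change"`, and so on) and the helper names `audit_log_key` and `to_jsonl` are assumptions introduced for illustration.

```python
import json
from datetime import datetime

AUDIT_PREFIX = "artifacts/metadata/governance/audit_logs"

# Maps an assumed category name to the (folder, file name) pair shown above.
LOG_FILES = {
    "access": ("access_logs", "access_events.jsonl"),
    "change": ("change_logs", "change_events.jsonl"),
    "deployment": ("deployment_logs", "deployment_events.jsonl"),
    "compliance": ("compliance_logs", "compliance_events.jsonl"),
}


def audit_log_key(category, timestamp):
    """Build the S3 key for an event category from an ISO 8601 UTC timestamp."""
    folder, filename = LOG_FILES[category]
    # Replace the trailing "Z" so datetime.fromisoformat accepts the value.
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return f"{AUDIT_PREFIX}/{folder}/{ts:%Y/%m/%d}/{filename}"


def to_jsonl(events):
    """Serialize events as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(event) + "\n" for event in events)


if __name__ == "__main__":
    print(audit_log_key("access", "2024-01-01T10:00:00Z"))
    print(to_jsonl([{"event_id": "evt_1"}, {"event_id": "evt_2"}]), end="")
```

Writing to a key in a general purpose S3 bucket replaces the whole object, so a writer would typically batch a day's events before uploading or add a unique suffix to each file name. Either choice is a design decision left to your pipeline.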

Audit Query Examples

Who accessed sensitive data?

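-- Assumes the audit events are queryable as a table named audit_logs
-- (for example through Amazon Athena), with a date partition column
-- derived from the YYYY/MM/DD folders.
SELECT actor.user_id, resource.path, timestamp
FROM audit_logs
WHERE resource.classification = 'CONFIDENTIAL'
  AND action = 'READ'
  AND date >= '2024-01-01'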
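
When no query engine is set up, the same question can be answered by scanning a downloaded JSONL file directly. This is a rough Python equivalent of the query above, assuming events follow the audit event shape shown earlier. The function name `confidential_reads` is hypothetical.

```python
import json


def confidential_reads(jsonl_text, since):
    """Return (user_id, path, timestamp) for READs of CONFIDENTIAL resources.

    `since` is an ISO 8601 date or timestamp string. ISO 8601 UTC strings in
    the same format sort lexicographically, so a plain string comparison works.
    """
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        if (
            event["resource"]["classification"] == "CONFIDENTIAL"
            and event["action"] == "READ"
            and event["timestamp"] >= since
        ):
            rows.append(
                (event["actor"]["user_id"], event["resource"]["path"], event["timestamp"])
            )
    return rows


if __name__ == "__main__":
    sample = json.dumps({
        "timestamp": "2024-01-01T10:00:00Z",
        "actor": {"user_id": "<user_id>"},
        "resource": {"path": "s3://bucket/data.csv", "classification": "CONFIDENTIAL"},
        "action": "READ",
    })
    print(confidential_reads(sample, "2024-01-01"))
```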

Track model lineage:

SELECT model_id, training_data_path, training_timestamp, deployed_by
FROM deployment_logs
WHERE model_id = 'model_v1.0.0'

Access Control

Role-Based Access Control (RBAC)

Location: config/environment_configs/rbac/

{
  "roles": {
    "DATA_SCIENTIST": {
      "permissions": [
        "s3:GetObject:data/processed/*",
        "s3:PutObject:notebooks/*",
        "s3:GetObject:models/experiments/*"
      ],
      "restrictions": {
        "no_access": ["data/raw/*", "config/*"],
        "read_only": ["models/registry/production/*"]
      }
    },
    "ML_ENGINEER": {
      "permissions": [
        "s3:*:code/*",
        "s3:*:models/training/*",
        "s3:PutObject:models/registry/staging/*"
      ]
    },
    "ADMIN": {
      "permissions": ["s3:*:*"]
    }
  }
}
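
The permission strings above are not IAM policy syntax, so an application (or a generator that emits IAM policies) has to interpret them. The sketch below evaluates a request against this structure under stated assumptions: each permission is read as `service:action:key-pattern`, `no_access` always denies, `read_only` blocks everything except `GetObject` without granting anything by itself, and unknown roles are denied. The function name `is_allowed` is hypothetical, and real enforcement still belongs in IAM and bucket policies.

```python
from fnmatch import fnmatchcase

RBAC = {
    "roles": {
        "DATA_SCIENTIST": {
            "permissions": [
                "s3:GetObject:data/processed/*",
                "s3:PutObject:notebooks/*",
                "s3:GetObject:models/experiments/*",
            ],
            "restrictions": {
                "no_access": ["data/raw/*", "config/*"],
                "read_only": ["models/registry/production/*"],
            },
        },
        "ML_ENGINEER": {
            "permissions": [
                "s3:*:code/*",
                "s3:*:models/training/*",
                "s3:PutObject:models/registry/staging/*",
            ]
        },
        "ADMIN": {"permissions": ["s3:*:*"]},
    }
}

READ_ACTIONS = {"GetObject"}  # assumed set of read-only actions


def is_allowed(rbac, role, action, key):
    """Default-deny check of (role, action, object key) against the RBAC config."""
    spec = rbac["roles"].get(role)
    if spec is None:
        return False  # unknown role
    restrictions = spec.get("restrictions", {})
    if any(fnmatchcase(key, p) for p in restrictions.get("no_access", [])):
        return False
    read_only = any(fnmatchcase(key, p) for p in restrictions.get("read_only", []))
    if read_only and action not in READ_ACTIONS:
        return False
    for permission in spec.get("permissions", []):
        _service, allowed_action, pattern = permission.split(":", 2)
        if allowed_action in ("*", action) and fnmatchcase(key, pattern):
            return True
    return False


if __name__ == "__main__":
    print(is_allowed(RBAC, "DATA_SCIENTIST", "GetObject", "data/processed/train/x.csv"))
    print(is_allowed(RBAC, "DATA_SCIENTIST", "GetObject", "data/raw/x.csv"))
```

Note that `fnmatchcase` lets `*` match across `/`, so `data/processed/*` covers nested prefixes. That is a choice made in this sketch and may differ from how your own tooling interprets the patterns.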

Least Privilege Principle

  • Default deny all access

  • Explicit grants per role

  • Time-bound access tokens

  • Just-in-time privilege elevation

Data Lineage

Lineage Tracking

Location: artifacts/metadata/governance/lineage/

{
  "lineage_id": "lin_20240101_001",
  "entity_type": "MODEL",
  "entity_id": "model_v1.0.0",
  "lineage_graph": {
    "nodes": [
      {
        "id": "raw_data_001",
        "type": "DATASET",
        "path": "s3://bucket/data/raw/2024/01/01/",
        "timestamp": "2024-01-01T08:00:00Z"
      },
      {
        "id": "processed_data_001",
        "type": "DATASET",
        "path": "s3://bucket/data/processed/train/",
        "timestamp": "2024-01-01T09:00:00Z"
      },
      {
        "id": "model_v1.0.0",
        "type": "MODEL",
        "path": "s3://bucket/models/registry/production/model_v1.0.0/",
        "timestamp": "2024-01-01T12:00:00Z"
      }
    ],
    "edges": [
      {"from": "raw_data_001", "to": "processed_data_001", "transform": "preprocessing_v1"},
      {"from": "processed_data_001", "to": "model_v1.0.0", "transform": "training_job_001"}
    ]
  },
  "compliance_metadata": {
    "data_sources_approved": true,
    "transformations_validated": true,
    "model_approved_for_production": true
  }
}
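
Once lineage is stored as nodes and edges, questions such as "which datasets fed this model?" reduce to a graph traversal. A minimal sketch, assuming the `lineage_graph.edges` structure shown above (the function name `upstream_nodes` is hypothetical):

```python
from collections import deque

LINEAGE = {
    "lineage_graph": {
        "edges": [
            {"from": "raw_data_001", "to": "processed_data_001", "transform": "preprocessing_v1"},
            {"from": "processed_data_001", "to": "model_v1.0.0", "transform": "training_job_001"},
        ]
    }
}


def upstream_nodes(lineage, entity_id):
    """Return the ids of every node that feeds into entity_id, nearest first."""
    # Index edges by destination so parents can be looked up directly.
    parents = {}
    for edge in lineage["lineage_graph"]["edges"]:
        parents.setdefault(edge["to"], []).append(edge["from"])

    ordered = []
    seen = set()
    queue = deque([entity_id])
    while queue:  # breadth-first walk towards the sources
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


if __name__ == "__main__":
    print(upstream_nodes(LINEAGE, "model_v1.0.0"))
```

Reversing the index (keying edges by `from`) gives the downstream direction, which is what impact analysis for a changed dataset would need.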

Lineage Visualization

  • End-to-end data flow tracking

  • Transformation history

  • Impact analysis for changes

  • Reproducibility guarantees

Retention Policies

Data Lifecycle Management

Location: config/environment_configs/retention_policies.json

{
  "policies": [
    {
      "name": "raw_data_retention",
      "path_pattern": "data/raw/**",
      "retention_days": 90,
      "transition_rules": [
        {"days": 30, "storage_class": "STANDARD_IA"},
        {"days": 60, "storage_class": "GLACIER"}
      ],
      "delete_after_days": 90
    },
    {
      "name": "model_artifacts_retention",
      "path_pattern": "models/experiments/**",
      "retention_days": 365,
      "transition_rules": [
        {"days": 90, "storage_class": "STANDARD_IA"}
      ]
    },
    {
      "name": "audit_logs_retention",
      "path_pattern": "artifacts/metadata/governance/audit_logs/**",
      "retention_days": 2555,
      "compliance_requirement": "SOC2_7_YEARS",
      "immutable": true
    },
    {
      "name": "production_models_retention",
      "path_pattern": "models/registry/production/**",
      "retention_days": 1825,
      "compliance_requirement": "REGULATORY_5_YEARS",
      "immutable": true
    }
  ]
}
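
This file is a project-level description and is not itself an S3 lifecycle configuration. One way to bridge the two is to translate each policy into a lifecycle rule. The sketch below builds a dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The function name `to_lifecycle_rules` and the prefix derivation from `path_pattern` are assumptions, and policies that define neither a transition nor a deletion are skipped because a lifecycle rule needs at least one action.

```python
def to_lifecycle_rules(policies):
    """Translate retention policies into an S3 lifecycle configuration dict."""
    rules = []
    for policy in policies:
        # Assumption: "data/raw/**" is meant as the key prefix "data/raw/".
        prefix = policy["path_pattern"].split("*", 1)[0]
        rule = {
            "ID": policy["name"],
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
        }
        transitions = [
            {"Days": t["days"], "StorageClass": t["storage_class"]}
            for t in policy.get("transition_rules", [])
        ]
        if transitions:
            rule["Transitions"] = transitions
        if "delete_after_days" in policy:
            rule["Expiration"] = {"Days": policy["delete_after_days"]}
        # Skip policies with no lifecycle action (for example, retention-only entries).
        if "Transitions" in rule or "Expiration" in rule:
            rules.append(rule)
    return {"Rules": rules}


if __name__ == "__main__":
    example = [
        {
            "name": "raw_data_retention",
            "path_pattern": "data/raw/**",
            "transition_rules": [
                {"days": 30, "storage_class": "STANDARD_IA"},
                {"days": 60, "storage_class": "GLACIER"},
            ],
            "delete_after_days": 90,
        }
    ]
    print(to_lifecycle_rules(example))
```

The `immutable` flag has no lifecycle equivalent. Preventing deletion or overwrite of audit logs and production models would need S3 Object Lock, which appears in the implementation checklist as something you configure yourself.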

Archive Strategy

data/raw/archive/
└── YYYY/
    └── MM/
        └── archived_data_YYYYMMDD.tar.gz

Monitoring and Alerting

Compliance Monitoring

Location: artifacts/reports/monitoring/compliance/

Real-time Alerts:

  • Unauthorized access attempts

  • Policy violations

  • Data quality threshold breaches

  • Encryption failures

  • Unusual data access patterns

Alert Configuration:

{
  "alerts": [
    {
      "name": "unauthorized_access",
      "condition": "event_type == 'DATA_ACCESS' AND result == 'DENIED'",
      "severity": "HIGH",
      "notification_channels": ["security_team", "compliance_officer"]
    },
    {
      "name": "pii_access_without_approval",
      "condition": "resource.classification == 'PII' AND compliance_context.approval_id IS NULL",
      "severity": "CRITICAL",
      "notification_channels": ["security_team", "legal_team"]
    },
    {
      "name": "data_quality_failure",
      "condition": "data_quality.score < threshold AND dataset.classification == 'PRODUCTION'",
      "severity": "MEDIUM",
      "notification_channels": ["data_engineering_team"]
    }
  ]
}
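
The `condition` strings above are written in an informal expression syntax that nothing in the provisioner evaluates. One simple way to act on them is to pair each alert name with a hand-written predicate, as in the sketch below. The `ALERT_PREDICATES` mapping and the `triggered_alerts` function are assumptions for illustration, and only the two event-based alerts are covered.

```python
# Hand-written equivalents of the condition strings, keyed by alert name.
ALERT_PREDICATES = {
    "unauthorized_access": lambda e: (
        e.get("event_type") == "DATA_ACCESS" and e.get("result") == "DENIED"
    ),
    "pii_access_without_approval": lambda e: (
        e.get("resource", {}).get("classification") == "PII"
        and e.get("compliance_context", {}).get("approval_id") is None
    ),
}


def triggered_alerts(alerts, event):
    """Return the alerts from the configuration whose predicate matches the event."""
    fired = []
    for alert in alerts:
        predicate = ALERT_PREDICATES.get(alert["name"])
        # Alerts without a registered predicate are ignored in this sketch.
        if predicate is not None and predicate(event):
            fired.append(
                {
                    "name": alert["name"],
                    "severity": alert["severity"],
                    "notify": alert["notification_channels"],
                }
            )
    return fired


if __name__ == "__main__":
    config = [
        {
            "name": "unauthorized_access",
            "severity": "HIGH",
            "notification_channels": ["security_team", "compliance_officer"],
        }
    ]
    denied = {"event_type": "DATA_ACCESS", "result": "DENIED"}
    print(triggered_alerts(config, denied))
```

Delivery to the listed channels is left out. In practice that step would go through whatever notification service your team already uses.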

Compliance Dashboard Metrics

  • Access control effectiveness

  • Policy compliance rate

  • Audit coverage percentage

  • Data quality scores

  • Incident response times

  • Certification status

Implementation Checklist

S3 Bucket Setup (Handled by s3-provisioner)

  • Enable S3 bucket versioning (configurable)

  • Configure encryption (SSE-S3 by default)

  • Set up lifecycle policies (configurable)

  • Block public access

  • Create folder structure

  • Apply resource tags

AWS Service Configuration (You Must Configure)

  • Enable AWS CloudTrail for API logging

  • Enable S3 server access logging

  • Set up AWS Config rules

  • Enable S3 Inventory reports

  • Configure S3 Object Lock for immutable data (if required)

  • Implement bucket policies and IAM roles

  • Configure cross-region replication (if required)

  • Set up Amazon Macie for PII detection (if required)

Application Integration (You Must Implement)

  • Regular access reviews (quarterly)

  • Compliance assessments (annual)

  • Audit log reviews (monthly)

  • Data quality monitoring (daily)

  • Policy updates and approvals

  • Incident response drills

  • Security training for team members

  • Vendor risk assessments

Best Practices

  1. Encryption Everywhere: Encrypt data at rest and in transit

  2. Least Privilege: Grant minimum necessary permissions

  3. Immutable Audit Logs: Use S3 Object Lock for compliance logs

  4. Automated Compliance: Use AWS Config and custom Lambda functions

  5. Regular Reviews: Conduct periodic access and compliance reviews

  6. Documentation: Maintain up-to-date policies and procedures

  7. Incident Response: Have a documented incident response plan

  8. Training: Regular security and compliance training for all users