ML Pipeline Lifecycle Policies¶

Table of Contents¶

Overview
Customer Responsibility
Recommended Lifecycle Patterns
ML Data Type Recommendations
Storage Class Characteristics
Implementation Steps
Folder Structure Reference
Best Practices
Cost Optimization Examples
Troubleshooting
Additional Resources

Overview¶

The S3 Provisioner creates a folder structure optimized for ML workflows with automated lifecycle policy profiles and manual customization options.

Automated Lifecycle Profiles (v1.1.0+)¶

The S3 Provisioner now supports 4 automated lifecycle profiles configured in your YAML file:

s3:
  lifecycle_policy: ml-optimized  # or compliance, development, none

Available Profiles:

Profile	Transitions	Expiration	Use Case
ml-optimized	30d→STANDARD_IA, 90d→GLACIER	Never	Production ML with cost optimization
compliance	90d→GLACIER	7 years (2555d)	HIPAA/PCI regulated industries
development	None	90 days	Dev/staging environments
none	None	Never	Manual lifecycle management

All profiles apply to the entire solutions/ prefix (all folders including data/, models/, notebooks/, code/, config/, artifacts/).
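
To see what a profile means for a single object, the table above can be encoded as plain data. This is a minimal Python sketch, not the provisioner's implementation: `PROFILES` and `storage_class_at` are hypothetical names, and the day counts are copied from the table. S3 applies lifecycle actions asynchronously, so real objects may change class somewhat later than the sketch suggests.

```python
# Illustrative only: encodes the profile table as plain data.
# PROFILES and storage_class_at are hypothetical names, not provisioner APIs.
PROFILES = {
    "ml-optimized": {"transitions": [(30, "STANDARD_IA"), (90, "GLACIER")], "expire_days": None},
    "compliance": {"transitions": [(90, "GLACIER")], "expire_days": 2555},
    "development": {"transitions": [], "expire_days": 90},
    "none": {"transitions": [], "expire_days": None},
}


def storage_class_at(profile: str, age_days: int) -> str:
    """Return the storage class an object would be in at a given age, or EXPIRED."""
    spec = PROFILES[profile]
    if spec["expire_days"] is not None and age_days >= spec["expire_days"]:
        return "EXPIRED"
    current = "STANDARD"
    for days, storage_class in spec["transitions"]:  # listed in ascending day order
        if age_days >= days:
            current = storage_class
    return current


if __name__ == "__main__":
    for age in (10, 45, 120):
        print(age, storage_class_at("ml-optimized", age))
```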

For detailed profile specifications and cost savings, see CONFIGURATION.md.

Manual Lifecycle Implementation¶

For custom lifecycle rules beyond the automated profiles, this guide provides recommended patterns for the four core ML data folders:

solutions/*/data/raw/ - Raw ingested data
solutions/*/data/curated/ - Cleaned and validated data
solutions/*/data/processed/ - Feature-engineered training data
solutions/*/data/inference/ - Prediction results

Customer Responsibility¶

For custom lifecycle rules beyond the 4 automated profiles (ml-optimized, compliance, development, none), lifecycle policies must be manually added to the generated CloudFormation template.

Recommended Lifecycle Patterns¶

Pattern 1: Cost-Optimized ML Pipeline¶

Balances cost with accessibility for active ML development.

LifecycleConfiguration:
  Rules:
    # S3 prefixes are literal strings (no wildcards). Replace <solution-name>
    # and repeat the prefix-based rules for each solution.

    # Raw data: Keep hot for preprocessing, then archive
    # Tag-based: applies to objects under solutions/ tagged DataType=raw
    - Id: "RawDataLifecycle"
      Status: Enabled
      Prefix: "solutions/"
      TagFilters:
        - Key: "DataType"
          Value: "raw"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
        - TransitionInDays: 90
          StorageClass: GLACIER
      # Never expire - keep for audit/compliance

    # Curated data: Moderate access during development
    - Id: "CuratedDataLifecycle"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/curated/"
      Transitions:
        - TransitionInDays: 60
          StorageClass: INTELLIGENT_TIERING
      ExpirationInDays: 365

    # Processed data: Transient, delete after training
    - Id: "ProcessedDataLifecycle"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/processed/"
      ExpirationInDays: 90

    # Inference results: Keep recent, delete old
    - Id: "InferenceDataLifecycle"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/inference/"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
      ExpirationInDays: 180
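
Because S3 lifecycle prefixes are literal and do not support wildcards, folder-scoped rules have to be written once per solution. The sketch below generates the Pattern 1 rules for a list of solution names in the shape CloudFormation expects (`Prefix`, `TransitionInDays`, `ExpirationInDays`). The function name, tuple layout, and rule Id scheme are illustrative assumptions; `master-solution` is the solution name used in the commands in this guide.

```python
import json

# Day counts follow Pattern 1; the layout below is an illustrative assumption.
# Each entry: (folder, rule Id stem, [(transition_days, storage_class)], expiration_days or None)
PATTERN_1 = [
    ("raw", "RawData", [(30, "STANDARD_IA"), (90, "GLACIER")], None),
    ("curated", "CuratedData", [(60, "INTELLIGENT_TIERING")], 365),
    ("processed", "ProcessedData", [], 90),
    ("inference", "InferenceData", [(30, "STANDARD_IA")], 180),
]


def build_rules(solution_names):
    """Build CloudFormation-style lifecycle rules, one set of four per solution."""
    rules = []
    for name in solution_names:
        for folder, rule_id, transitions, expire in PATTERN_1:
            rule = {
                "Id": f"{rule_id}-{name}",
                "Status": "Enabled",
                "Prefix": f"solutions/{name}/data/{folder}/",
            }
            if transitions:
                rule["Transitions"] = [
                    {"TransitionInDays": days, "StorageClass": sc}
                    for days, sc in transitions
                ]
            if expire is not None:
                rule["ExpirationInDays"] = expire
            rules.append(rule)
    return rules


if __name__ == "__main__":
    # CloudFormation accepts JSON as well as YAML, so the output can be pasted as-is.
    print(json.dumps({"Rules": build_rules(["master-solution"])}, indent=2))
```

A bucket's lifecycle configuration is limited to 1,000 rules, so with many solutions a tag-based rule may be the better fit.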

Pattern 2: Compliance-Focused (HIPAA/PCI)¶

Long-term retention with archival for regulated industries.

LifecycleConfiguration:
  Rules:
    # S3 prefixes are literal strings (no wildcards). Replace <solution-name>
    # and repeat these rules for each solution.

    # Raw data: Archive for 7 years (compliance requirement)
    - Id: "RawDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/raw/"
      Transitions:
        - TransitionInDays: 90
          StorageClass: GLACIER
      ExpirationInDays: 2555  # 7 years

    # Curated data: Retain for audit trail
    - Id: "CuratedDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/curated/"
      Transitions:
        - TransitionInDays: 90
          StorageClass: GLACIER
      ExpirationInDays: 2555  # 7 years

    # Processed data: Keep for model reproducibility
    - Id: "ProcessedDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/processed/"
      Transitions:
        - TransitionInDays: 180
          StorageClass: GLACIER
      ExpirationInDays: 1095  # 3 years

    # Inference results: Retain predictions for audit
    - Id: "InferenceDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/inference/"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
        - TransitionInDays: 90
          StorageClass: GLACIER
      ExpirationInDays: 1095  # 3 years

Pattern 3: Active Development¶

Aggressive cleanup for development/staging environments.

LifecycleConfiguration:
  Rules:
    # S3 prefixes are literal strings (no wildcards). Replace <solution-name>
    # and repeat these rules for each solution.

    # Raw data: Earliest allowed transition to save costs
    # (S3 requires at least 30 days before a STANDARD_IA transition)
    - Id: "RawDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/raw/"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
      ExpirationInDays: 90

    # Curated data: Short retention
    - Id: "CuratedDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/curated/"
      ExpirationInDays: 60

    # Processed data: Very transient
    - Id: "ProcessedDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/processed/"
      ExpirationInDays: 30

    # Inference results: Cleanup quickly
    - Id: "InferenceDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/inference/"
      ExpirationInDays: 14

ML Data Type Recommendations¶

Data Type	Access Pattern	Recommended Transition	Expiration
Raw Data	Intensive first 30 days for preprocessing	STANDARD → STANDARD_IA (30d) → GLACIER (90d)	Never (or 7+ years for compliance)
Curated Data	Moderate during development	STANDARD → INTELLIGENT_TIERING (60d)	365 days (or compliance requirement)
Processed Data	Frequent during training, then unused	STANDARD (no transition)	90 days after object creation (lifecycle rules cannot key off training completion)
Inference Results	Varies by use case	STANDARD → STANDARD_IA (30d)	90-180 days

Storage Class Characteristics¶

Storage Class	Use Case	Retrieval Time	Cost
STANDARD	Active ML workloads	Immediate	Highest storage, lowest retrieval
STANDARD_IA	Infrequent access (>30 days old)	Immediate	Lower storage, retrieval fee
INTELLIGENT_TIERING	Unpredictable access patterns	Immediate	Auto-optimizes based on access
GLACIER	Long-term archive	3-5 hours (standard retrieval)	Very low storage, retrieval fee
DEEP_ARCHIVE	Compliance archive (7+ years)	Within 12 hours (standard retrieval)	Lowest storage, highest retrieval

Implementation Steps¶

1. Generate CloudFormation Template¶

docker run --rm \
  -v "$(pwd)/s3/configs:/app/configs" \
  -v "$(pwd)/s3/templates:/app/templates" \
  s3-provisioner:latest \
  --config my-config.yaml \
  --action create-prov-template \
  --solution master-solution

2. Locate Generated Template¶

Template will be in: templates/<config-name>_<solution>_s3_template.yaml

3. Add LifecycleConfiguration¶

Edit the MLSolutionsBucket resource:

MLSolutionsBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: !Ref BucketName
    VersioningConfiguration:
      Status: Enabled
    LifecycleConfiguration:
      Rules:
        # Add your lifecycle rules here
        # Prefixes are literal (no wildcards); replace <solution-name>
        - Id: "RawDataLifecycle"
          Status: Enabled
          Prefix: "solutions/<solution-name>/data/raw/"
          Transitions:
            - TransitionInDays: 30
              StorageClass: STANDARD_IA
            - TransitionInDays: 90
              StorageClass: GLACIER
    Tags: [...]
    PublicAccessBlockConfiguration: [...]

4. Deploy Modified Template¶

docker run --rm \
  -e AWS_PROFILE=default \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v "$(pwd)/s3/configs:/app/configs" \
  s3-provisioner:latest \
  --config my-config.yaml \
  --action create-bucket \
  --solution master-solution \
  --force

Folder Structure Reference¶

Note: The inline comments below are recommendations for manual lifecycle implementation when using lifecycle_policy: none. When using automated profiles (ml-optimized, compliance, development), those profiles apply uniformly to ALL folders under solutions/, not just data/. Many clients prefer to implement custom lifecycle policies at the folder level for fine-grained control.

The S3 Provisioner creates this structure:

solutions/
  <solution-name>/
    data/
      raw/                    # Raw ingested data
        2024/01/01/          # Date-partitioned
        archive/
      curated/               # Cleaned data
        2024/01/01/
        consolidated/
          weekly/
          monthly/
      processed/             # Feature-engineered data
        train/
          feature_store/
        validation/
          feature_store/
        test/
          feature_store/
        feature_engineering/
          encoders/
          feature_definitions/
          statistics/
      inference/             # Prediction results
        batch/
          input/
          output/
        realtime/
          requests/2024/01/01/
          responses/2024/01/01/
    models/                  # Keep in STANDARD or INTELLIGENT_TIERING
    notebooks/               # Keep in STANDARD
    artifacts/               # Consider separate lifecycle rules
    code/                    # Keep in STANDARD
    config/                  # Keep in STANDARD

Best Practices¶

1. Align with ML Workflow¶

Match transitions to actual data access patterns
Consider training frequency and model retraining schedules
Account for experimentation vs. production workloads

2. Consider Retrieval Times¶

GLACIER: 3-5 hours - acceptable for historical analysis
DEEP_ARCHIVE: 12 hours - only for compliance archives
Keep active training data in STANDARD or INTELLIGENT_TIERING

3. Use Prefix Filters¶

Apply different rules to raw/, curated/, processed/, inference/
Prefixes are literal strings, not wildcard patterns: write one rule per solution, e.g. solutions/<solution-name>/data/raw/
Combine with tags for fine-grained control
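
S3 compares a rule's prefix with the start of each object key, character by character, so a `*` in a prefix only matches a literal asterisk. The sketch below models that comparison with `str.startswith`. The rule dictionaries and the object key (including the file name `events.parquet`) are made up for illustration.

```python
def matching_rules(key, rules):
    """Return the Ids of rules whose Prefix is a literal prefix of the object key."""
    return [rule["Id"] for rule in rules if key.startswith(rule["Prefix"])]


# Hypothetical rules: one written with a wildcard, one with the real solution name.
rules = [
    {"Id": "Wildcard", "Prefix": "solutions/*/data/raw/"},
    {"Id": "Literal", "Prefix": "solutions/master-solution/data/raw/"},
]
key = "solutions/master-solution/data/raw/2024/01/01/events.parquet"

print(matching_rules(key, rules))  # prints ['Literal']
```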

4. Enable Versioning¶

Protect against accidental deletions during pipeline runs
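Use NoncurrentVersionTransitions to move old versions to cheaper storage
Set NoncurrentVersionExpirationInDays to clean up old versions; on a versioned bucket, ExpirationInDays only adds a delete marker, and noncurrent versions keep accruing storage charges until they expire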

5. Monitor Costs¶

Use AWS Cost Explorer to track storage class costs
Review S3 Storage Lens for access patterns
Adjust lifecycle rules based on actual usage

6. Test in Non-Production¶

Validate lifecycle rules don’t break ML pipelines
Ensure retrieval times meet SLA requirements
Test restore procedures from GLACIER

7. Document Retention Policies¶

Maintain clear policies for compliance
Document business justification for retention periods
Review and update policies annually

Cost Optimization Examples¶

⚠️ IMPORTANT DISCLAIMER
The following cost estimates are illustrative examples only and should not be used for budgeting or financial planning.
Actual costs will vary significantly based on:

Your specific usage patterns and data access frequency

AWS region (pricing varies by region)

Current AWS pricing (subject to change)

Data transfer costs and request volumes

Storage class transition timing

Always use the AWS Pricing Calculator for accurate cost projections specific to your use case.

Example 1: 100 TB ML Pipeline (Annual)¶

Without Lifecycle Policies:

100 TB in STANDARD: ~$2,300/month = $27,600/year

With Cost-Optimized Pattern:

10 TB STANDARD (active): $230/month
30 TB STANDARD_IA (recent): $375/month
60 TB GLACIER (archive): $240/month
Total: $845/month = $10,140/year
Savings: $17,460/year (63%)

Example 2: Compliance Workload (7-year retention)¶

Without Lifecycle Policies:

500 TB in STANDARD: $11,500/month = $138,000/year

With Compliance Pattern:

50 TB STANDARD (active): $1,150/month
450 TB GLACIER (archive): $1,800/month
Total: $2,950/month = $35,400/year
Savings: $102,600/year (74%)
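
The arithmetic behind both examples can be reproduced in a few lines. The per-TB monthly rates below are the ones implied by the figures above (for instance $2,300 / 100 TB = $23 per TB), not current AWS pricing, and the function names are illustrative. Request, retrieval, and transition fees are left out, as they are in the examples.

```python
# Per-TB monthly rates implied by the examples; illustrative, not current AWS pricing.
RATE_PER_TB_MONTH = {"STANDARD": 23.0, "STANDARD_IA": 12.5, "GLACIER": 4.0}


def annual_cost(tb_by_class):
    """Annual storage cost in dollars for a {storage_class: terabytes} mapping."""
    monthly = sum(RATE_PER_TB_MONTH[sc] * tb for sc, tb in tb_by_class.items())
    return monthly * 12


def savings(baseline_tb, tiered):
    """Return (annual dollars saved, percent saved) versus keeping everything in STANDARD."""
    before = annual_cost({"STANDARD": baseline_tb})
    after = annual_cost(tiered)
    return before - after, round(100 * (before - after) / before)


# Example 1: 100 TB ML pipeline
print(savings(100, {"STANDARD": 10, "STANDARD_IA": 30, "GLACIER": 60}))  # (17460.0, 63)
# Example 2: 500 TB compliance workload
print(savings(500, {"STANDARD": 50, "GLACIER": 450}))  # (102600.0, 74)
```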

Troubleshooting¶

Issue: Data Not Transitioning¶

Check:

Lifecycle rule status is Enabled
Prefix matches actual folder structure
Minimum object size requirements met (128 KB for IA)
Sufficient time has passed (S3 evaluates lifecycle rules about once a day, so a transition can lag the configured age by a day or more)
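
Some of these checks can be run against a rule before it is deployed. The following is a rough sketch that assumes rules are held as CloudFormation-style dictionaries; `check_rule` is a hypothetical helper, and it covers only rule status, wildcard prefixes, and the 30-day minimum that S3 enforces before a transition to STANDARD_IA or ONEZONE_IA.

```python
def check_rule(rule):
    """Return a list of likely reasons a rule's transitions never take effect."""
    problems = []
    if rule.get("Status") != "Enabled":
        problems.append("rule is not Enabled")
    if "*" in rule.get("Prefix", ""):
        problems.append("prefix contains '*', which S3 treats as a literal character")
    for transition in rule.get("Transitions", []):
        storage_class = transition["StorageClass"]
        if storage_class in ("STANDARD_IA", "ONEZONE_IA") and transition["TransitionInDays"] < 30:
            problems.append(f"{storage_class} transition before day 30 is not allowed by S3")
    return problems


if __name__ == "__main__":
    suspect = {
        "Id": "RawDataDev",
        "Status": "Disabled",
        "Prefix": "solutions/*/data/raw/",
        "Transitions": [{"TransitionInDays": 7, "StorageClass": "STANDARD_IA"}],
    }
    for problem in check_rule(suspect):
        print(problem)
```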

Issue: Unexpected Deletions¶

Check:

ExpirationInDays is set correctly
No conflicting lifecycle rules
Versioning enabled to protect against accidental deletes
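Review S3 server access logs for lifecycle actions (CloudTrail records changes to the lifecycle configuration, not the expirations and transitions S3 performs)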
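
Conflicts usually come from rules whose prefixes overlap, for example a broad rule on solutions/ alongside a folder-level rule. The sketch below flags such pairs. It assumes CloudFormation-style rule dictionaries, `overlapping_rules` is a hypothetical helper, and it ignores tag filters, so a flagged pair is a candidate for review rather than a confirmed conflict.

```python
from itertools import combinations


def overlapping_rules(rules):
    """Return pairs of rule Ids where one Prefix is a prefix of the other."""
    pairs = []
    for a, b in combinations(rules, 2):
        prefix_a, prefix_b = a.get("Prefix", ""), b.get("Prefix", "")
        if prefix_a.startswith(prefix_b) or prefix_b.startswith(prefix_a):
            pairs.append((a["Id"], b["Id"]))
    return pairs


if __name__ == "__main__":
    rules = [
        {"Id": "AllSolutions", "Prefix": "solutions/", "ExpirationInDays": 90},
        {"Id": "RawKeep", "Prefix": "solutions/master-solution/data/raw/"},
        {"Id": "Inference", "Prefix": "solutions/master-solution/data/inference/"},
    ]
    print(overlapping_rules(rules))
```

In this example both folder-level rules overlap with `AllSolutions`, so its 90-day expiration also applies to raw data that the narrower rule was meant to keep.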

Issue: High Retrieval Costs¶

Solution:

Use INTELLIGENT_TIERING for unpredictable access
Keep frequently accessed data in STANDARD
Batch retrieval requests to minimize API calls
Consider S3 Select for partial object retrieval

Additional Resources¶

User Guide - Complete command reference
Configuration Reference - Configuration reference and automated lifecycle profiles
S3 Folder Structure Reference - Complete folder hierarchy reference
Governance, Compliance, and Audit Capabilities - Enterprise governance implementation guide
AWS S3 Lifecycle Documentation
S3 Storage Classes
S3 Pricing Calculator
S3 Storage Lens

See LICENSE.txt for terms and conditions.