ML Pipeline Lifecycle Policies

Overview

The S3 Provisioner creates a folder structure optimized for ML workflows. Lifecycle policies can be applied through automated profiles or customized manually.

Automated Lifecycle Profiles (v1.1.0+)

Since v1.1.0, the S3 Provisioner supports four automated lifecycle profiles, selected in your YAML configuration file:

s3:
  lifecycle_policy: ml-optimized  # or compliance, development, none

Available Profiles:

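| Profile      | Transitions                     | Expiration      | Use Case                             |
|--------------|---------------------------------|-----------------|--------------------------------------|
| ml-optimized | 30d → STANDARD_IA, 90d → GLACIER | Never           | Production ML with cost optimization |
| compliance   | 90d → GLACIER                   | 7 years (2555d) | HIPAA/PCI regulated industries       |
| development  | None                            | 90 days         | Dev/staging environments             |
| none         | None                            | Never           | Manual lifecycle management          |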

All profiles apply to the entire solutions/ prefix (all folders including data/, models/, notebooks/, code/, config/, artifacts/).

For detailed profile specifications and cost savings, see CONFIGURATION.md.
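
To make the profile table concrete, the fragment below sketches a CloudFormation lifecycle rule with the same effect as the compliance profile (GLACIER after 90 days, expiration after 2555 days, applied to solutions/). It is an illustration only: the rule Id is hypothetical, and the rule the provisioner actually generates may be named or laid out differently.

```yaml
# Sketch only: a rule equivalent in effect to the "compliance" profile.
# The Id below is hypothetical; the provisioner's generated rule may differ.
LifecycleConfiguration:
  Rules:
    - Id: "ComplianceProfileSketch"
      Status: Enabled
      Prefix: "solutions/"          # profiles apply to the whole solutions/ prefix
      Transitions:
        - StorageClass: GLACIER
          TransitionInDays: 90
      ExpirationInDays: 2555        # 7 years x 365 days
```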

Manual Lifecycle Implementation

For custom lifecycle rules beyond the automated profiles, this guide provides recommended patterns for the four core ML data folders (here * stands for a solution name):

  • solutions/*/data/raw/ - Raw ingested data

  • solutions/*/data/curated/ - Cleaned and validated data

  • solutions/*/data/processed/ - Feature-engineered training data

  • solutions/*/data/inference/ - Prediction results

Customer Responsibility

Custom lifecycle rules beyond the four automated profiles (ml-optimized, compliance, development, none) are your responsibility: add them manually to the generated CloudFormation template.

ML Data Type Recommendations

| Data Type         | Access Pattern                            | Recommended Transition                       | Expiration                           |
|-------------------|-------------------------------------------|----------------------------------------------|--------------------------------------|
| Raw Data          | Intensive first 30 days for preprocessing | STANDARD → STANDARD_IA (30d) → GLACIER (90d) | Never (or 7+ years for compliance)   |
| Curated Data      | Moderate during development               | STANDARD → INTELLIGENT_TIERING (60d)         | 365 days (or compliance requirement) |
| Processed Data    | Frequent during training, then unused     | STANDARD (no transition)                     | 90 days after training completion    |
| Inference Results | Varies by use case                        | STANDARD → STANDARD_IA (30d)                 | 90-180 days                          |
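
The recommendations above could be expressed as one rule per data folder. The fragment below is a minimal sketch for a single solution, assuming the solution name master-solution used in the commands in this guide. The rule Ids are hypothetical, and the 180-day inference expiration is just one choice within the recommended range. In CloudFormation, each transition is written as a StorageClass with a TransitionInDays value.

```yaml
# Sketch only: per-folder rules for one solution, following the table above.
# "master-solution" and the rule Ids are illustrative; repeat the rules for
# each solution, since lifecycle prefixes are matched literally.
LifecycleConfiguration:
  Rules:
    - Id: "RawData"
      Status: Enabled
      Prefix: "solutions/master-solution/data/raw/"
      Transitions:
        - StorageClass: STANDARD_IA
          TransitionInDays: 30
        - StorageClass: GLACIER
          TransitionInDays: 90
      # No expiration: raw data is kept indefinitely

    - Id: "CuratedData"
      Status: Enabled
      Prefix: "solutions/master-solution/data/curated/"
      Transitions:
        - StorageClass: INTELLIGENT_TIERING
          TransitionInDays: 60
      ExpirationInDays: 365

    - Id: "ProcessedData"
      Status: Enabled
      Prefix: "solutions/master-solution/data/processed/"
      # Counted from object creation, not from training completion
      ExpirationInDays: 90

    - Id: "InferenceResults"
      Status: Enabled
      Prefix: "solutions/master-solution/data/inference/"
      Transitions:
        - StorageClass: STANDARD_IA
          TransitionInDays: 30
      ExpirationInDays: 180   # upper end of the 90-180 day range
```

Because S3 measures expiration from object creation, "90 days after training completion" can only be approximated by a lifecycle rule. If training schedules vary, a longer expiration or tag-based rules may be a safer fit.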

Storage Class Characteristics

| Storage Class       | Use Case                         | Retrieval Time | Cost                              |
|---------------------|----------------------------------|----------------|-----------------------------------|
| STANDARD            | Active ML workloads              | Immediate      | Highest storage, lowest retrieval |
| STANDARD_IA         | Infrequent access (>30 days old) | Immediate      | Lower storage, retrieval fee      |
| INTELLIGENT_TIERING | Unpredictable access patterns    | Immediate      | Auto-optimizes based on access    |
| GLACIER             | Long-term archive                | 3-5 hours      | Very low storage, retrieval fee   |
| DEEP_ARCHIVE        | Compliance archive (7+ years)    | 12 hours       | Lowest storage, highest retrieval |

Implementation Steps

1. Generate CloudFormation Template

docker run --rm \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/templates:/app/templates \
  s3-provisioner:latest \
  --config my-config.yaml \
  --action create-prov-template \
  --solution master-solution

2. Locate Generated Template

The generated template is written to: templates/<config-name>_<solution>_s3_template.yaml

3. Add LifecycleConfiguration

Edit the MLSolutionsBucket resource:

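MLSolutionsBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: !Ref BucketName
    VersioningConfiguration:
      Status: Enabled
    LifecycleConfiguration:
      Rules:
        # Add your lifecycle rules here
        # This rule covers everything under solutions/; narrow the Prefix
        # (e.g. solutions/<solution-name>/data/raw/) to target raw data only
        - Id: "RawDataLifecycle"
          Status: Enabled
          Prefix: "solutions/"
          Transitions:
            # Each transition takes a StorageClass and a TransitionInDays value
            - StorageClass: STANDARD_IA
              TransitionInDays: 30
            - StorageClass: GLACIER
              TransitionInDays: 90
    Tags: [...]
    PublicAccessBlockConfiguration: [...]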

4. Deploy Modified Template

docker run --rm \
  -e AWS_PROFILE=default \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  s3-provisioner:latest \
  --config my-config.yaml \
  --action create-bucket \
  --solution master-solution \
  --force

Folder Structure Reference

Note: The inline comments below are recommendations for manual lifecycle implementation when using lifecycle_policy: none. When using automated profiles (ml-optimized, compliance, development), those profiles apply uniformly to ALL folders under solutions/, not just data/. Many clients prefer to implement custom lifecycle policies at the folder level for fine-grained control.

The S3 Provisioner creates this structure:

solutions/
  <solution-name>/
    data/
      raw/                    # Raw ingested data
        2024/01/01/          # Date-partitioned
        archive/
      curated/               # Cleaned data
        2024/01/01/
        consolidated/
          weekly/
          monthly/
      processed/             # Feature-engineered data
        train/
          feature_store/
        validation/
          feature_store/
        test/
          feature_store/
        feature_engineering/
          encoders/
          feature_definitions/
          statistics/
      inference/             # Prediction results
        batch/
          input/
          output/
        realtime/
          requests/2024/01/01/
          responses/2024/01/01/
    models/                  # Keep in STANDARD or INTELLIGENT_TIERING
    notebooks/               # Keep in STANDARD
    artifacts/               # Consider separate lifecycle rules
    code/                    # Keep in STANDARD
    config/                  # Keep in STANDARD

Best Practices

1. Align with ML Workflow

  • Match transitions to actual data access patterns

  • Consider training frequency and model retraining schedules

  • Account for experimentation vs. production workloads

2. Consider Retrieval Times

  • GLACIER: 3-5 hours - acceptable for historical analysis

  • DEEP_ARCHIVE: 12 hours - only for compliance archives

  • Keep active training data in STANDARD or INTELLIGENT_TIERING

3. Use Prefix Filters

  • Apply different rules to raw/, curated/, processed/, inference/

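  • Lifecycle prefixes are matched literally and wildcards are not supported, so write one rule per solution path, e.g. solutions/<solution-name>/data/raw/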

  • Combine with tags for fine-grained control
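
As a sketch of combining a prefix with tags, the rule below expires only objects that sit under one solution's artifacts/ folder and also carry a specific tag. The solution name master-solution is taken from the commands in this guide. The rule Id and the retention=short tag are hypothetical; use whatever tagging scheme your pipelines already apply.

```yaml
# Sketch only: a rule scoped by both a literal prefix and an object tag.
# The Id and the tag key/value are hypothetical examples.
LifecycleConfiguration:
  Rules:
    - Id: "ShortLivedArtifacts"
      Status: Enabled
      Prefix: "solutions/master-solution/artifacts/"
      TagFilters:
        - Key: "retention"
          Value: "short"
      ExpirationInDays: 30
```

An object must match the prefix and every listed tag for the rule to apply, so untagged artifacts under the same folder are left alone.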

4. Enable Versioning

  • Protect against accidental deletions during pipeline runs

  • Use NoncurrentVersionTransitions for old versions

  • Set NoncurrentVersionExpirationInDays to clean up old versions
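
The versioning settings above might be combined in a single rule like the following. This is a sketch with hypothetical values: the rule Id and the 30-day and 90-day thresholds are placeholders, not recommendations from the provisioner.

```yaml
# Sketch only: manage superseded (noncurrent) object versions.
# The Id and day counts are illustrative placeholders.
LifecycleConfiguration:
  Rules:
    - Id: "NoncurrentVersionCleanup"
      Status: Enabled
      Prefix: "solutions/"
      NoncurrentVersionTransitions:
        - StorageClass: STANDARD_IA
          TransitionInDays: 30        # days after a version becomes noncurrent
      NoncurrentVersionExpirationInDays: 90   # must be later than the transition
```

Current versions are unaffected by this rule; only versions that have been overwritten or deleted are moved and later removed.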

5. Monitor Costs

  • Use AWS Cost Explorer to track storage class costs

  • Review S3 Storage Lens for access patterns

  • Adjust lifecycle rules based on actual usage

6. Test in Non-Production

  • Validate lifecycle rules don’t break ML pipelines

  • Ensure retrieval times meet SLA requirements

  • Test restore procedures from GLACIER

7. Document Retention Policies

  • Maintain clear policies for compliance

  • Document business justification for retention periods

  • Review and update policies annually

Cost Optimization Examples

⚠️ IMPORTANT DISCLAIMER
The following cost estimates are illustrative examples only and should not be used for budgeting or financial planning.
Actual costs will vary significantly based on:

  • Your specific usage patterns and data access frequency

  • AWS region (pricing varies by region)

  • Current AWS pricing (subject to change)

  • Data transfer costs and request volumes

  • Storage class transition timing

Always use the AWS Pricing Calculator for accurate cost projections specific to your use case.

Example 1: 100 TB ML Pipeline (Annual)

Without Lifecycle Policies:

  • 100 TB in STANDARD: ~$2,300/month = $27,600/year

With Cost-Optimized Pattern:

  • 10 TB STANDARD (active): $230/month

  • 30 TB STANDARD_IA (recent): $375/month

  • 60 TB GLACIER (archive): $240/month

  • Total: $845/month = $10,140/year

  • Savings: $17,460/year (63%)

Example 2: Compliance Workload (7-year retention)

Without Lifecycle Policies:

  • 500 TB in STANDARD: $11,500/month = $138,000/year

With Compliance Pattern:

  • 50 TB STANDARD (active): $1,150/month

  • 450 TB GLACIER (archive): $1,800/month

  • Total: $2,950/month = $35,400/year

  • Savings: $102,600/year (74%)

Troubleshooting

Issue: Data Not Transitioning

Check:

  • Lifecycle rule status is Enabled

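  • Prefix matches the actual folder structure (prefixes are literal; wildcards are not supported)

  • Minimum object size requirements met (128 KB for IA)

  • Sufficient time has passed (lifecycle rules are evaluated once a day, so a transition can lag behind the configured day count)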

Issue: Unexpected Deletions

Check:

  • ExpirationInDays is set correctly

  • No conflicting lifecycle rules

  • Versioning enabled to protect against accidental deletes

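  • Review S3 server access logs for lifecycle actions; CloudTrail records changes to the lifecycle configuration rather than the expirations themselves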

Issue: High Retrieval Costs

Solution:

  • Use INTELLIGENT_TIERING for unpredictable access

  • Keep frequently accessed data in STANDARD

  • Batch retrieval requests to minimize API calls

  • Consider S3 Select for partial object retrieval

Copyright © 2025 Axon Tech Labs. All rights reserved.

See LICENSE.txt for terms and conditions.