ML Pipeline Lifecycle Policies
Overview
The S3 Provisioner creates a folder structure optimized for ML workflows with automated lifecycle policy profiles and manual customization options.
Automated Lifecycle Profiles (v1.1.0+)
The S3 Provisioner supports four automated lifecycle profiles, selected with the lifecycle_policy key in your YAML configuration file:
s3:
  lifecycle_policy: ml-optimized  # or compliance, development, none
Available Profiles:
| Profile | Transitions | Expiration | Use Case |
|---|---|---|---|
| ml-optimized | 30d → STANDARD_IA, 90d → GLACIER | Never | Production ML with cost optimization |
| compliance | 90d → GLACIER | 7 years (2555d) | HIPAA/PCI regulated industries |
| development | None | 90 days | Dev/staging environments |
| none | None | Never | Manual lifecycle management |
All profiles apply to the entire solutions/ prefix (all folders including data/, models/, notebooks/, code/, config/, artifacts/).
For detailed profile specifications and cost savings, see CONFIGURATION.md.
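To make the table concrete, the following Python sketch shows what each profile would look like as a single CloudFormation-style lifecycle rule on the solutions/ prefix. The dictionary and function names are hypothetical, and this document does not show how the S3 Provisioner renders its profiles internally, so treat it as an illustration of the table rather than the tool's implementation.

```python
# Profile values copied from the "Available Profiles" table. The names
# PROFILES and profile_to_rule are hypothetical, not part of the S3 Provisioner.
PROFILES = {
    "ml-optimized": {"transitions": [(30, "STANDARD_IA"), (90, "GLACIER")], "expiration_days": None},
    "compliance": {"transitions": [(90, "GLACIER")], "expiration_days": 2555},
    "development": {"transitions": [], "expiration_days": 90},
    "none": {"transitions": [], "expiration_days": None},
}


def profile_to_rule(profile, prefix="solutions/"):
    """Return a CloudFormation-style rule dict, or None if the profile has no actions."""
    spec = PROFILES[profile]
    if not spec["transitions"] and spec["expiration_days"] is None:
        return None  # "none": lifecycle is managed manually
    rule = {"Id": f"{profile}-lifecycle", "Status": "Enabled", "Prefix": prefix}
    if spec["transitions"]:
        rule["Transitions"] = [
            {"TransitionInDays": days, "StorageClass": storage_class}
            for days, storage_class in spec["transitions"]
        ]
    if spec["expiration_days"] is not None:
        rule["ExpirationInDays"] = spec["expiration_days"]
    return rule


if __name__ == "__main__":
    import json

    for name in PROFILES:
        print(name, json.dumps(profile_to_rule(name)))
```

Because every profile targets the whole solutions/ prefix, one rule per profile is enough; folder-level control requires the manual approach described next.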
Manual Lifecycle Implementation
For custom lifecycle rules beyond the automated profiles, this guide provides recommended patterns for the four core ML data folders:
solutions/*/data/raw/ - Raw ingested data
solutions/*/data/curated/ - Cleaned and validated data
solutions/*/data/processed/ - Feature-engineered training data
solutions/*/data/inference/ - Prediction results
Customer Responsibility
For custom lifecycle rules beyond the 4 automated profiles (ml-optimized, compliance, development, none), lifecycle policies must be manually added to the generated CloudFormation template.
Recommended Lifecycle Patterns
Pattern 1: Cost-Optimized ML Pipeline
Balances cost with accessibility for active ML development.
LifecycleConfiguration:
  Rules:
    # CloudFormation property names are used here: TransitionInDays and
    # TagFilters (the S3 API equivalents are Days and Filter).
    # S3 matches prefixes literally (no wildcards): replace <solution-name>
    # and repeat the prefix-based rules for each solution.

    # Raw data: Keep hot for preprocessing, then archive.
    # Selected by tag, so it covers tagged objects in every solution.
    - Id: "RawDataLifecycle"
      Status: Enabled
      Prefix: "solutions/"
      TagFilters:
        - Key: "DataType"
          Value: "raw"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
        - TransitionInDays: 90
          StorageClass: GLACIER
      # Never expire - keep for audit/compliance

    # Curated data: Moderate access during development
    - Id: "CuratedDataLifecycle"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/curated/"
      Transitions:
        - TransitionInDays: 60
          StorageClass: INTELLIGENT_TIERING
      ExpirationInDays: 365

    # Processed data: Transient, delete after training
    - Id: "ProcessedDataLifecycle"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/processed/"
      ExpirationInDays: 90

    # Inference results: Keep recent, delete old
    - Id: "InferenceDataLifecycle"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/inference/"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
      ExpirationInDays: 180
Pattern 2: Compliance-Focused (HIPAA/PCI)
Long-term retention with archival for regulated industries.
LifecycleConfiguration:
  Rules:
    # S3 matches prefixes literally (no wildcards): replace <solution-name>
    # and repeat each rule for every solution.

    # Raw data: Archive for 7 years (compliance requirement)
    - Id: "RawDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/raw/"
      Transitions:
        - TransitionInDays: 90
          StorageClass: GLACIER
      ExpirationInDays: 2555  # 7 years

    # Curated data: Retain for audit trail
    - Id: "CuratedDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/curated/"
      Transitions:
        - TransitionInDays: 90
          StorageClass: GLACIER
      ExpirationInDays: 2555  # 7 years

    # Processed data: Keep for model reproducibility
    - Id: "ProcessedDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/processed/"
      Transitions:
        - TransitionInDays: 180
          StorageClass: GLACIER
      ExpirationInDays: 1095  # 3 years

    # Inference results: Retain predictions for audit
    - Id: "InferenceDataCompliance"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/inference/"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
        - TransitionInDays: 90
          StorageClass: GLACIER
      ExpirationInDays: 1095  # 3 years
Pattern 3: Active Development
Aggressive cleanup for development/staging environments.
LifecycleConfiguration:
  Rules:
    # S3 matches prefixes literally (no wildcards): replace <solution-name>
    # and repeat each rule for every solution.

    # Raw data: Transition as early as S3 allows to save costs.
    # Objects must be at least 30 days old before moving to STANDARD_IA.
    - Id: "RawDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/raw/"
      Transitions:
        - TransitionInDays: 30
          StorageClass: STANDARD_IA
      ExpirationInDays: 90

    # Curated data: Short retention
    - Id: "CuratedDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/curated/"
      ExpirationInDays: 60

    # Processed data: Very transient
    - Id: "ProcessedDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/processed/"
      ExpirationInDays: 30

    # Inference results: Cleanup quickly
    - Id: "InferenceDataDev"
      Status: Enabled
      Prefix: "solutions/<solution-name>/data/inference/"
      ExpirationInDays: 14
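Before adding rules like these to a template, it can help to check them against two constraints that S3 enforces: objects must be at least 30 days old before a transition to STANDARD_IA or ONEZONE_IA, and an expiration must come later than every transition in the same rule. The Python sketch below assumes rules are written as dictionaries using the CloudFormation property names (TransitionInDays, ExpirationInDays); the function name is hypothetical and the checks are not exhaustive.

```python
# Storage classes that S3 accepts as a transition target only once an
# object is at least 30 days old.
MIN_30_DAY_CLASSES = {"STANDARD_IA", "ONEZONE_IA"}


def validate_rule(rule):
    """Return a list of human-readable problems found in one lifecycle rule dict."""
    problems = []
    transitions = rule.get("Transitions", [])

    for transition in transitions:
        days = transition["TransitionInDays"]
        storage_class = transition["StorageClass"]
        if storage_class in MIN_30_DAY_CLASSES and days < 30:
            problems.append(f"{storage_class} transition at {days}d is below the 30-day minimum")

    expiration = rule.get("ExpirationInDays")
    if expiration is not None:
        for transition in transitions:
            if expiration <= transition["TransitionInDays"]:
                problems.append(
                    f"expiration at {expiration}d is not later than the "
                    f"{transition['StorageClass']} transition at {transition['TransitionInDays']}d"
                )
    return problems


if __name__ == "__main__":
    example = {
        "Id": "ExampleRule",  # hypothetical rule for illustration
        "Transitions": [{"TransitionInDays": 7, "StorageClass": "STANDARD_IA"}],
        "ExpirationInDays": 90,
    }
    for problem in validate_rule(example):
        print(problem)
```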
ML Data Type Recommendations
| Data Type | Access Pattern | Recommended Transition | Expiration |
|---|---|---|---|
| Raw Data | Intensive first 30 days for preprocessing | STANDARD → STANDARD_IA (30d) → GLACIER (90d) | Never (or 7+ years for compliance) |
| Curated Data | Moderate during development | STANDARD → INTELLIGENT_TIERING (60d) | 365 days (or compliance requirement) |
| Processed Data | Frequent during training, then unused | STANDARD (no transition) | 90 days after training completion |
| Inference Results | Varies by use case | STANDARD → STANDARD_IA (30d) | 90-180 days |
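The recommendations can be read as a timeline: for a given data type and object age, which storage class should the object be in? The Python sketch below encodes the table that way. The names are hypothetical, inference expiration is assumed to be 180 days (the upper end of the stated range), and processed data is assumed to expire 90 days after object creation, because lifecycle rules count from creation rather than from the end of training.

```python
# Transition schedule per data type, taken from the recommendations table.
# Each entry is (age in days at which the class takes effect, storage class).
SCHEDULES = {
    "raw": [(0, "STANDARD"), (30, "STANDARD_IA"), (90, "GLACIER")],
    "curated": [(0, "STANDARD"), (60, "INTELLIGENT_TIERING")],
    "processed": [(0, "STANDARD")],
    "inference": [(0, "STANDARD"), (30, "STANDARD_IA")],
}

# Expiration age in days; None means the data never expires.
# 180 for inference is an assumption (the table gives 90-180 days).
EXPIRATION_DAYS = {"raw": None, "curated": 365, "processed": 90, "inference": 180}


def storage_class_at(data_type, age_days):
    """Return the recommended storage class at a given age, or "EXPIRED"."""
    expiration = EXPIRATION_DAYS[data_type]
    if expiration is not None and age_days >= expiration:
        return "EXPIRED"
    current = "STANDARD"
    for start_day, storage_class in SCHEDULES[data_type]:
        if age_days >= start_day:
            current = storage_class
    return current


if __name__ == "__main__":
    for data_type in SCHEDULES:
        print(data_type, [storage_class_at(data_type, d) for d in (0, 45, 100, 400)])
```

In practice S3 applies transitions asynchronously after an object becomes eligible, so the actual class may lag this idealized timeline.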
Storage Class Characteristics
| Storage Class | Use Case | Retrieval Time | Cost |
|---|---|---|---|
| STANDARD | Active ML workloads | Immediate | Highest storage, lowest retrieval |
| STANDARD_IA | Infrequent access (>30 days old) | Immediate | Lower storage, retrieval fee |
| INTELLIGENT_TIERING | Unpredictable access patterns | Immediate | Auto-optimizes based on access |
| GLACIER | Long-term archive | 3-5 hours | Very low storage, retrieval fee |
| DEEP_ARCHIVE | Compliance archive (7+ years) | 12 hours | Lowest storage, highest retrieval |
Implementation Steps
1. Generate CloudFormation Template
docker run --rm \
  -v "$(pwd)/s3/configs:/app/configs" \
  -v "$(pwd)/s3/templates:/app/templates" \
  s3-provisioner:latest \
  --config my-config.yaml \
  --action create-prov-template \
  --solution master-solution
2. Locate Generated Template
The generated template is written to: templates/<config-name>_<solution>_s3_template.yaml
3. Add LifecycleConfiguration
Edit the MLSolutionsBucket resource:
MLSolutionsBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: !Ref BucketName
    VersioningConfiguration:
      Status: Enabled
    LifecycleConfiguration:
      Rules:
        # Add your lifecycle rules here
        - Id: "RawDataLifecycle"
          Status: Enabled
          # Literal prefix (S3 does not expand wildcards); replace <solution-name>
          Prefix: "solutions/<solution-name>/data/raw/"
          Transitions:
            - TransitionInDays: 30
              StorageClass: STANDARD_IA
            - TransitionInDays: 90
              StorageClass: GLACIER
    Tags: [...]
    PublicAccessBlockConfiguration: [...]
4. Deploy Modified Template
docker run --rm \
  -e AWS_PROFILE=default \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v "$(pwd)/s3/configs:/app/configs" \
  s3-provisioner:latest \
  --config my-config.yaml \
  --action create-bucket \
  --solution master-solution \
  --force
Folder Structure Reference
Note: The inline comments below are recommendations for manual lifecycle implementation when using lifecycle_policy: none. When using automated profiles (ml-optimized, compliance, development), those profiles apply uniformly to ALL folders under solutions/, not just data/. Many clients prefer to implement custom lifecycle policies at the folder level for fine-grained control.
The S3 Provisioner creates this structure:
solutions/
  <solution-name>/
    data/
      raw/                          # Raw ingested data
        2024/01/01/                 # Date-partitioned
        archive/
      curated/                      # Cleaned data
        2024/01/01/
        consolidated/
          weekly/
          monthly/
      processed/                    # Feature-engineered data
        train/
          feature_store/
        validation/
          feature_store/
        test/
          feature_store/
        feature_engineering/
          encoders/
          feature_definitions/
          statistics/
      inference/                    # Prediction results
        batch/
          input/
          output/
        realtime/
          requests/2024/01/01/
          responses/2024/01/01/
    models/                         # Keep in STANDARD or INTELLIGENT_TIERING
    notebooks/                      # Keep in STANDARD
    artifacts/                      # Consider separate lifecycle rules
    code/                           # Keep in STANDARD
    config/                         # Keep in STANDARD
Best Practices
1. Align with ML Workflow
Match transitions to actual data access patterns
Consider training frequency and model retraining schedules
Account for experimentation vs. production workloads
2. Consider Retrieval Times
GLACIER: 3-5 hours - acceptable for historical analysis
DEEP_ARCHIVE: 12 hours - only for compliance archives
Keep active training data in STANDARD or INTELLIGENT_TIERING
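3. Use Prefix Filters
Apply different rules to raw/, curated/, processed/, inference/
Use full literal prefixes such as solutions/<solution-name>/data/raw/ (S3 lifecycle prefixes do not support wildcards, so solutions/*/data/raw/ will not match your folders)
Combine with tags for fine-grained control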
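S3 compares a lifecycle prefix with each object key as a literal string, so a pattern such as solutions/*/data/raw/ only matches keys that contain an actual * character. One way to cover several solutions is to generate one prefix per solution and data folder. The Python sketch below illustrates this; the function names and the solution names in the demo are hypothetical.

```python
DATA_FOLDERS = ("raw", "curated", "processed", "inference")


def data_prefixes(solution_names):
    """Return one literal lifecycle prefix per solution and data folder."""
    return [
        f"solutions/{name}/data/{folder}/"
        for name in solution_names
        for folder in DATA_FOLDERS
    ]


def rule_matches(prefix, key):
    """Mimic S3 prefix matching: a plain starts-with test, no wildcard expansion."""
    return key.startswith(prefix)


if __name__ == "__main__":
    key = "solutions/fraud-detection/data/raw/2024/01/01/part-0.csv"
    print(rule_matches("solutions/*/data/raw/", key))                # wildcard is not expanded
    print(rule_matches("solutions/fraud-detection/data/raw/", key))  # literal prefix matches
    for prefix in data_prefixes(["fraud-detection", "churn"]):
        print(prefix)
```

When the number of solutions is large, tag-based filters may be simpler than repeating prefix rules, since a single rule can select tagged objects across all solutions.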
4. Enable Versioning
Protect against accidental deletions during pipeline runs
Use NoncurrentVersionTransitions for old versions
Set NoncurrentVersionExpirationInDays to clean up old versions
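As an illustration of the two properties named above, the Python sketch below builds a CloudFormation-style rule that moves noncurrent versions to a cheaper class and later removes them. The function name, the rule Id, and the default values (30 and 90 days, STANDARD_IA) are assumptions chosen for the example, not recommendations from this guide.

```python
import json


def noncurrent_version_rule(prefix, transition_days=30, storage_class="STANDARD_IA", expiration_days=90):
    """Build a rule dict that transitions, then expires, noncurrent object versions."""
    # Sanity check: expiring a version before it transitions makes the transition pointless.
    if expiration_days <= transition_days:
        raise ValueError("expiration_days must be greater than transition_days")
    return {
        "Id": "NoncurrentVersionCleanup",  # hypothetical rule name
        "Status": "Enabled",
        "Prefix": prefix,
        "NoncurrentVersionTransitions": [
            {"TransitionInDays": transition_days, "StorageClass": storage_class}
        ],
        "NoncurrentVersionExpirationInDays": expiration_days,
    }


if __name__ == "__main__":
    print(json.dumps(noncurrent_version_rule("solutions/"), indent=2))
```

On a versioned bucket, ExpirationInDays on current versions only adds a delete marker, so a rule of this kind is what actually reclaims the storage used by older versions.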
5. Monitor Costs
Use AWS Cost Explorer to track storage class costs
Review S3 Storage Lens for access patterns
Adjust lifecycle rules based on actual usage
6. Test in Non-Production
Validate lifecycle rules don't break ML pipelines
Ensure retrieval times meet SLA requirements
Test restore procedures from GLACIER
7. Document Retention Policies
Maintain clear policies for compliance
Document business justification for retention periods
Review and update policies annually
Cost Optimization Examples
⚠️ IMPORTANT DISCLAIMER
The following cost estimates are illustrative examples only and should not be used for budgeting or financial planning.
Actual costs will vary significantly based on:
Your specific usage patterns and data access frequency
AWS region (pricing varies by region)
Current AWS pricing (subject to change)
Data transfer costs and request volumes
Storage class transition timing
Always use the AWS Pricing Calculator for accurate cost projections specific to your use case.
Example 1: 100 TB ML Pipeline (Annual)
Without Lifecycle Policies:
100 TB in STANDARD: ~$2,300/month = $27,600/year
With Cost-Optimized Pattern:
10 TB STANDARD (active): $230/month
30 TB STANDARD_IA (recent): $375/month
60 TB GLACIER (archive): $240/month
Total: $845/month = $10,140/year
Savings: $17,460/year (63%)
Example 2: Compliance Workload (7-year retention)
Without Lifecycle Policies:
500 TB in STANDARD: $11,500/month = $138,000/year
With Compliance Pattern:
50 TB STANDARD (active): $1,150/month
450 TB GLACIER (archive): $1,800/month
Total: $2,950/month = $35,400/year
Savings: $102,600/year (74%)
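The arithmetic behind both examples can be reproduced with a few lines of Python. The per-TB prices below are back-derived from the figures above (for example, $2,300 / 100 TB = $23 per TB-month) and are illustrative only, not current AWS prices; the function names are hypothetical. Storage is the only cost modeled, so request, retrieval, and transition charges are ignored.

```python
# Illustrative monthly prices in USD per TB, implied by the example figures.
PRICE_PER_TB_MONTH = {"STANDARD": 23.0, "STANDARD_IA": 12.5, "GLACIER": 4.0}


def monthly_cost(tb_by_class):
    """Sum the monthly storage cost for a {storage class: terabytes} mapping."""
    return sum(PRICE_PER_TB_MONTH[storage_class] * tb for storage_class, tb in tb_by_class.items())


def annual_savings(baseline, optimized):
    """Return (annual savings in USD, savings as a whole-number percentage)."""
    baseline_annual = monthly_cost(baseline) * 12
    optimized_annual = monthly_cost(optimized) * 12
    saved = baseline_annual - optimized_annual
    return saved, round(100 * saved / baseline_annual)


if __name__ == "__main__":
    example_1 = annual_savings(
        {"STANDARD": 100},
        {"STANDARD": 10, "STANDARD_IA": 30, "GLACIER": 60},
    )
    example_2 = annual_savings(
        {"STANDARD": 500},
        {"STANDARD": 50, "GLACIER": 450},
    )
    print("Example 1:", example_1)
    print("Example 2:", example_2)
```

Swapping in prices for your region and adding request and retrieval fees would give a closer estimate, though the AWS Pricing Calculator remains the better tool for real projections.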
Troubleshooting
Issue: Data Not Transitioning
Check:
Lifecycle rule status is Enabled
Prefix matches actual folder structure
Minimum object size requirements met (128 KB for IA)
Sufficient time has passed (lifecycle rules run about once a day, so transitions can lag by a day or more)
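The checklist above can be expressed as a small diagnostic function. The Python sketch below assumes a rule is a dictionary with CloudFormation-style keys (Status, Prefix, Transitions with TransitionInDays) and that you already know the object's key, size, and age; the function name is hypothetical. For simplicity it applies the 128 KB threshold to every transition, whereas the checklist cites it for IA.

```python
MIN_TRANSITION_SIZE_BYTES = 128 * 1024  # 128 KB, from the checklist


def transition_blockers(rule, key, size_bytes, age_days):
    """Return the reasons, if any, why an object would not yet transition under a rule."""
    blockers = []
    if rule.get("Status") != "Enabled":
        blockers.append("rule status is not Enabled")
    if not key.startswith(rule.get("Prefix", "")):
        blockers.append("object key does not start with the rule prefix")
    if size_bytes < MIN_TRANSITION_SIZE_BYTES:
        blockers.append("object is smaller than 128 KB")
    transitions = rule.get("Transitions", [])
    if transitions:
        first_transition = min(t["TransitionInDays"] for t in transitions)
        if age_days < first_transition:
            blockers.append(f"object is younger than the first transition ({first_transition}d)")
    return blockers


if __name__ == "__main__":
    rule = {
        "Status": "Enabled",
        "Prefix": "solutions/demo/data/raw/",  # hypothetical solution name
        "Transitions": [{"TransitionInDays": 30, "StorageClass": "STANDARD_IA"}],
    }
    print(transition_blockers(rule, "solutions/demo/data/raw/a.csv", 4096, 10))
```

An empty result means none of the listed causes apply, in which case waiting another day for the next lifecycle run is a reasonable next step.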
Issue: Unexpected Deletions
Check:
ExpirationInDays is set correctly
No conflicting lifecycle rules
Versioning enabled to protect against accidental deletes
Review CloudTrail logs for lifecycle actions
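A common source of unexpected deletions is two rules whose prefixes overlap, for example a broad rule on solutions/ with a short expiration alongside a narrower rule on a data folder. The Python sketch below flags such pairs. It assumes rules are dictionaries with Id and Prefix keys and ignores tag filters, so a flagged pair is a prompt for review rather than proof of a conflict; the function name and example rules are hypothetical.

```python
from itertools import combinations


def overlapping_rules(rules):
    """Return (id_a, id_b) pairs where one rule's prefix contains the other's."""
    pairs = []
    for rule_a, rule_b in combinations(rules, 2):
        prefix_a = rule_a.get("Prefix", "")
        prefix_b = rule_b.get("Prefix", "")
        # Any key under the longer prefix is also under the shorter one,
        # so both rules apply to those objects.
        if prefix_a.startswith(prefix_b) or prefix_b.startswith(prefix_a):
            pairs.append((rule_a["Id"], rule_b["Id"]))
    return pairs


if __name__ == "__main__":
    rules = [
        {"Id": "Everything", "Prefix": "solutions/", "ExpirationInDays": 14},
        {"Id": "RawData", "Prefix": "solutions/demo/data/raw/"},
        {"Id": "Inference", "Prefix": "solutions/demo/data/inference/"},
    ]
    for pair in overlapping_rules(rules):
        print("overlap:", pair)
```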
Issue: High Retrieval Costs
Solution:
Use INTELLIGENT_TIERING for unpredictable access
Keep frequently accessed data in STANDARD
Batch retrieval requests to minimize API calls
Consider S3 Select for partial object retrieval
Additional Resources
USER_GUIDE.md - Complete command reference
CONFIGURATION.md - Configuration reference and automated lifecycle profiles
S3_FOLDERS.md - Complete folder hierarchy reference
GOVERNANCE_COMPLIANCE.md - Enterprise governance implementation guide
Copyright © 2025 Axon Tech Labs. All rights reserved.
See LICENSE.txt for terms and conditions.