README¶

Enterprise-grade S3 infrastructure provisioning tool purpose-built for machine learning workloads.

Table of Contents¶

Why This Tool Exists
Enterprise-Scale Versatility
What You Get
Quick Start
Common Workflows
Key Features
AWS Credentials
Documentation
System Requirements
Support
License

Why This Tool Exists¶

Most S3 provisioning tools create empty buckets. This tool creates production-ready ML pipeline infrastructure with a complete folder structure for:

Data ingestion (raw → curated → processed → inference)
Model training, evaluation, and registry
Feature engineering and feature stores
Notebooks, artifacts, logs, and reports
Code, configurations, and monitoring

One command deploys 130+ folders organized for enterprise ML operations.

Enterprise-Scale Versatility¶

Functionality	Description
Multi-Tenant	Deploy isolated ML infrastructure for multiple teams or clients
Multi-Region	Replicate infrastructure across AWS regions (us-west-1, eu-west-1, ap-southeast-1, etc.)
Multi-Environment	Separate dev, staging, and production environments with identical structure
Multi-Company	Support multiple companies with distinct configurations and branding
Flexible Bucket Naming	Auto-generate standardized names (`{prefix}-{env}-{alias}-{region}`) or use custom client-specified names
Multi-Solution Support	Deploy multiple ML solutions (customer-churn, fraud-detection, demand-forecasting) in a single shared bucket or dedicated buckets per solution
VPC Integration	Seamlessly integrates with VPC Provisioner to enable private S3 access via VPC endpoints - all traffic stays within your private cloud with no internet exposure
Configuration-Driven Structure	Control the entire S3 folder hierarchy through simple YAML configuration - clients define bucket names, lifecycle policies, versioning, tags, and VPC settings without touching code

Example:

Deploy the customer-churn solution for 3 companies × 2 regions × 3 environments = 18 isolated buckets from one configuration template.
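
The arithmetic in that example can be sketched as a nested loop. This is only an illustration: the company prefixes and aliases are made-up placeholders, and the names follow the `{prefix}-{env}-{alias}-{region}` pattern from the table above (the Quick Start result additionally shows a `-s3` suffix, which is not modeled here).

```shell
# Enumerate company x region x environment and print one bucket name each.
# The prefix:alias pairs are hypothetical placeholders, not shipped values.
list_buckets() {
  local companies="acme:a001 globex:g001 initech:i001"
  local regions="us-west-1 eu-west-1"
  local envs="dev staging prod"
  local company region env prefix tenant

  for company in $companies; do
    prefix="${company%%:*}"   # text before the colon
    tenant="${company##*:}"   # text after the colon
    for region in $regions; do
      for env in $envs; do
        printf '%s-%s-%s-%s\n' "$prefix" "$env" "$tenant" "$region"
      done
    done
  done
}

list_buckets
```

Piping the output through `wc -l` gives the bucket count, which is a quick way to check a planned rollout before writing the per-combination configuration files.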

What You Get¶

Battle-Tested ML Folder Structure¶

Saves weeks of infrastructure design time. Instead of designing folder hierarchies from scratch, you get a comprehensive, production-ready structure that covers:

Complete ML Pipeline: Raw → Curated → Processed → Inference data flow
Universal Applicability: Works for any ML domain (computer vision, NLP, time series, recommendation systems, fraud detection, etc.)
130+ Organized Folders: Data, models, notebooks, artifacts, code, configs, and monitoring
Enterprise-Ready: Built-in support for governance, compliance, audit trails, and data lineage
Fully Customizable: Use as-is or adapt to your specific needs - remove unused folders or add custom ones

See the S3 Folder Structure Reference for the complete folder layout.

Bonus: Enterprise Governance Blueprint¶

Goes beyond infrastructure provisioning. You also get a complete reference architecture for implementing governance, compliance, and audit capabilities:

Ready-to-use JSON schemas for audit logs, data lineage, and compliance metadata
Multi-framework compliance support (GDPR, HIPAA, SOC 2, ISO 27001, CCPA)
RBAC examples with role-based access patterns
Query templates for audit trail analysis
Implementation checklist with AWS service recommendations

See Governance, Compliance, and Audit Capabilities for the complete governance framework.

ML-Optimized Folder Structure¶

solutions/
  customer-churn/
    data/
      raw/              # Ingested data with date partitioning
      curated/          # Cleaned and validated data
      processed/        # Feature-engineered training data
        train/
        validation/
        test/
        feature_engineering/
      inference/        # Batch and realtime predictions
    models/
      experiments/      # Experiment tracking
      training/         # Trained models by algorithm
      evaluation/       # Model comparison and monitoring
      registry/         # Production/staging/dev model versions
    notebooks/          # Jupyter notebooks by phase
    artifacts/          # Logs, checkpoints, visualizations, reports
    code/               # Pipeline code and tests
    config/             # Environment and model configurations

Automated Lifecycle Policies¶

4 pre-configured profiles for cost optimization:

ml-optimized: 30d→IA, 90d→GLACIER (60-70% cost savings)
compliance: 90d→GLACIER, 7-year retention (70-80% savings)
development: objects expire after 90 days (no storage cost once expired)
none: Manual management
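
To make the profile thresholds concrete, the following sketch maps an object's age to the storage class it would be in under each profile. It is a reading of the summary lines above, not the tool's implementation: `STANDARD_IA` is assumed for "IA", `EXPIRED` is a made-up marker for deleted objects, and the 7-year retention of the compliance profile is not modeled.

```shell
# storage_class_for_age PROFILE AGE_DAYS
# Prints the storage class implied by the profile summaries above.
storage_class_for_age() {
  local profile="$1" age="$2"
  case "$profile" in
    ml-optimized)
      if [ "$age" -ge 90 ]; then echo GLACIER
      elif [ "$age" -ge 30 ]; then echo STANDARD_IA   # assumed meaning of "IA"
      else echo STANDARD
      fi ;;
    compliance)
      # Retention period is intentionally not modeled in this sketch.
      if [ "$age" -ge 90 ]; then echo GLACIER; else echo STANDARD; fi ;;
    development)
      # EXPIRED is a placeholder label for "object has been deleted".
      if [ "$age" -ge 90 ]; then echo EXPIRED; else echo STANDARD; fi ;;
    none)
      echo STANDARD ;;
    *)
      echo "unknown profile: $profile" >&2
      return 2 ;;
  esac
}
```

For example, `storage_class_for_age ml-optimized 45` prints `STANDARD_IA`. The ML Pipeline Lifecycle Policies guide remains the authoritative description of each profile.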

Infrastructure as Code¶

CloudFormation-based deployment with stack outputs for multi-stack orchestration
One-command cleanup via CloudFormation stack deletion
67 leaf folders created via CloudFormation templates
Lambda-based folder creation for remaining structure
IAM policies auto-generated
VPC endpoint support
Automated tagging (7 system + custom tags)
Local template validation (YAML syntax, structure, reference integrity)
Infrastructure drift detection against deployed stacks
Change preview via CloudFormation ChangeSets
Safe test deployments with isolated resource names
Built-in cost estimation with region-specific pricing

Quick Start¶

1. Prerequisites & Installation¶

Requirements¶

Requirement	Version	Notes
Docker	20.10+	Required to run the S3 Provisioner CLI
AWS CLI	2.x	Required for credential configuration and AWS resource verification
AWS Account	—	With permissions to create S3, CloudFormation, Lambda, and IAM resources
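
The version requirements in this table can be checked with a small preflight script. This is a sketch, not part of the product: `version_ge` and `preflight` are hypothetical helper names, and the comparison relies on `sort -V`, which is available in GNU coreutils and recent BSD/macOS `sort`.

```shell
# version_ge A B: succeeds when version A is greater than or equal to B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# preflight: verify Docker 20.10+, AWS CLI 2.x, and usable AWS credentials.
preflight() {
  local docker_version aws_version
  command -v docker >/dev/null 2>&1 || { echo "docker not found" >&2; return 1; }
  command -v aws >/dev/null 2>&1 || { echo "aws CLI not found" >&2; return 1; }

  # "Docker version 24.0.7, build ..." -> "24.0.7"
  docker_version="$(docker --version | sed -E 's/^Docker version ([0-9.]+).*/\1/')"
  # "aws-cli/2.15.0 Python/..." -> "2.15.0"
  aws_version="$(aws --version 2>&1 | sed -E 's|^aws-cli/([0-9.]+).*|\1|')"

  version_ge "$docker_version" 20.10 || { echo "Docker $docker_version is older than 20.10" >&2; return 1; }
  version_ge "$aws_version" 2.0 || { echo "AWS CLI $aws_version is older than 2.x" >&2; return 1; }
  aws sts get-caller-identity >/dev/null || { echo "AWS credentials are not usable" >&2; return 1; }
  echo "preflight OK: docker $docker_version, aws-cli $aws_version"
}
```

Run `preflight` once before the first deployment. It does not check the account's S3, CloudFormation, Lambda, or IAM permissions; see the IAM Permissions guide for those.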

Installation¶

No installation is required. S3 Provisioner is distributed as a Docker image via AWS Marketplace.

docker pull s3-provisioner:latest

AWS Credentials Setup¶

# Configure AWS CLI
aws configure

# Verify credentials
aws sts get-caller-identity

Working Directory Setup¶

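Create the required directory structure. The commands in this guide assume you run them from the directory that contains s3/ (if that path contains spaces, quote each -v argument):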

mkdir -p s3/{configs,policies,templates,reports,docs}

Copy documentation from the Docker image:

docker run --rm \
  -v $(pwd)/s3/docs:/output \
  --entrypoint cp \
  s3-provisioner:latest \
  -r /app/docs/. /output/

Copy example configuration files from the Docker image:

docker run --rm \
  -v $(pwd)/s3/configs:/app/configs \
  --entrypoint cp \
  s3-provisioner:latest \
  -r /app/examples/configs/. /app/configs/

Open s3/docs/index.html in your browser to view the full documentation offline.

2. Create Configuration¶

s3/configs/my-ml-project.yaml:

client:
  company_name: "Acme Corp"
  company_prefix: "acme"
  account_id: "123456789012"
  tenant_id: "a001"

environment:
  env: "prod"
  region: "us-west-1"

s3:
  bucket_name_override: ""
  versioning: true
  lifecycle_policy: "ml-optimized"
  vpc_id: ""
  route_table_ids: ""
  tags:
    Project: "Customer Churn ML"
    Owner: "data-science-team"

3. Deploy Master Solution¶

docker run --rm \
  -e AWS_PROFILE=default \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action prep-master \
  --solution master-solution \
  --force

Result: Bucket acme-prod-a001-us-west-1-s3 created with complete ML folder structure.

4. Deploy Additional Solutions¶

# Deploy customer churn solution
docker run --rm \
  -e AWS_PROFILE=default \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action deploy-solution \
  --solution customer-churn

Available Solutions: customer-churn, demand-forecasting, fraud-detection
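
Because every action repeats the same `docker run` preamble, a small wrapper can cut the repetition when deploying several solutions. This is a sketch under stated assumptions: `s3prov`, `deploy_all`, and the `S3PROV_RUNNER` variable are hypothetical names, not part of the product; the mounts and flags are copied from the commands above.

```shell
# s3prov CONFIG ACTION [extra flags...]: run one S3 Provisioner action.
# S3PROV_RUNNER is a hypothetical override; set it to "echo" to print the
# command instead of executing it.
s3prov() {
  local config="$1" action="$2"
  shift 2
  "${S3PROV_RUNNER:-docker}" run --rm \
    -e AWS_PROFILE="${AWS_PROFILE:-default}" \
    -v "$HOME/.aws:/home/s3user/.aws:ro" \
    -v "$(pwd)/s3/configs:/app/configs" \
    -v "$(pwd)/s3/reports:/app/reports" \
    s3-provisioner:latest \
    --config "$config" \
    --action "$action" \
    "$@"
}

# deploy_all: deploy each listed solution, stopping at the first failure.
deploy_all() {
  local solution
  for solution in customer-churn demand-forecasting fraud-detection; do
    s3prov my-ml-project.yaml deploy-solution --solution "$solution" || return 1
  done
}
```

Run `S3PROV_RUNNER=echo deploy_all` first to review the three commands, then `deploy_all` to execute them. The master structure from step 3 is assumed to exist already.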

Common Workflows¶

Pattern A: Shared Bucket (Multiple Solutions)¶

# The flags below are appended to the docker run command shown in Quick Start.

# 1. Create master structure
--action prep-master --solution master-solution --force

# 2. Deploy solutions
--action deploy-solution --solution customer-churn
--action deploy-solution --solution fraud-detection

Result: acme-prod-a001-us-west-1-s3/solutions/{customer-churn,fraud-detection}/

Pattern B: Dedicated Buckets (One Solution Per Bucket)¶

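# customer-churn-config.yaml
s3:
  bucket_name_override: "acme-prod-a001-us-west-1-customer-churn"

# Flags appended to the docker run command shown in Quick Start
--config customer-churn-config.yaml --action prep-master --solution customer-churn --force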

Result: acme-prod-a001-us-west-1-customer-churn/solutions/customer-churn/
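
When supplying a custom `bucket_name_override`, it helps to check the name against the general-purpose S3 bucket naming rules before deploying. The sketch below covers only a subset of those rules (length, allowed characters, no adjacent periods, not IP-formatted); `is_valid_bucket_name` is a hypothetical helper, and reserved prefixes and suffixes are not checked.

```shell
# is_valid_bucket_name NAME: succeeds when NAME passes a subset of the
# S3 general-purpose bucket naming rules.
is_valid_bucket_name() {
  local name="$1"
  local len=${#name}

  # 3 to 63 characters long
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1

  # lowercase letters, digits, periods, hyphens; starts and ends alphanumeric
  printf '%s\n' "$name" | grep -Eq '^[a-z0-9][a-z0-9.-]*[a-z0-9]$' || return 1

  # no two adjacent periods
  case "$name" in
    *..*) return 1 ;;
  esac

  # must not be formatted like an IP address
  if printf '%s\n' "$name" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'; then
    return 1
  fi
  return 0
}
```

For example, `is_valid_bucket_name acme-prod-a001-us-west-1-customer-churn && echo ok` prints `ok`, while a name containing uppercase letters or underscores fails. Bucket names must also be globally unique, which only a deployment attempt can confirm.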

Key Features¶

Configuration Validation¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action validate-config

IAM Policy Generation¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/policies:/app/policies \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action create-policy

Template Validation¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs:ro \
  -v $(pwd)/s3/templates:/app/templates \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action validate-prov-template \
  --solution master-solution

Change Preview¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs:ro \
  -v $(pwd)/s3/templates:/app/templates \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action show-changes \
  --solution master-solution

Drift Detection¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs:ro \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action check-drift

Infrastructure Cleanup¶

docker run --rm \
  -e AWS_PROFILE=default \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action tear-down \
  --force

Generate Usage Assumptions¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/templates:/app/templates \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action cost-traffic \
  --solution master-solution

Estimate Monthly Costs¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/templates:/app/templates \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action cost-estimate \
  --solution master-solution

Refresh Resource Pricing¶

docker run --rm \
  -v ~/.aws:/home/s3user/.aws:ro \
  -v $(pwd)/s3/configs:/app/configs \
  -v $(pwd)/s3/templates:/app/templates \
  -v $(pwd)/s3/reports:/app/reports \
  s3-provisioner:latest \
  --config my-ml-project.yaml \
  --action cost-refresh-prices \
  --solution master-solution

AWS Credentials¶

Option 1: AWS Profile (Recommended)

-e AWS_PROFILE=default \
-v ~/.aws:/home/s3user/.aws:ro

Option 2: Environment Variables

-e AWS_ACCESS_KEY_ID=<key> \
-e AWS_SECRET_ACCESS_KEY=<secret> \
-e AWS_DEFAULT_REGION=us-west-1

Option 3: IAM Role (when running on EC2/ECS)

# No credentials needed - uses instance role
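
The three options can be folded into one helper that picks the credential flags from the current environment. This is a sketch, not part of the product: `run_with_creds` and `S3PROV_RUNNER` are hypothetical names. For Option 2 it uses Docker's `-e NAME` form (no value), which copies an exported variable from the calling shell, so the secret never appears on the command line.

```shell
# run_with_creds IMAGE [args...]: docker run with credential flags chosen
# from the environment. S3PROV_RUNNER is a hypothetical override; set it
# to "echo" to print the command instead of running it.
run_with_creds() {
  local runner="${S3PROV_RUNNER:-docker}"
  if [ -n "${AWS_PROFILE:-}" ]; then
    # Option 1: named profile plus read-only mount of ~/.aws
    "$runner" run --rm \
      -e "AWS_PROFILE=$AWS_PROFILE" \
      -v "$HOME/.aws:/home/s3user/.aws:ro" \
      "$@"
  elif [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    # Option 2: pass variable names only; the values must be exported
    "$runner" run --rm \
      -e AWS_ACCESS_KEY_ID \
      -e AWS_SECRET_ACCESS_KEY \
      -e "AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-us-west-1}" \
      "$@"
  else
    # Option 3: no flags; rely on the EC2/ECS role
    "$runner" run --rm "$@"
  fi
}
```

Usage: `run_with_creds -v "$(pwd)/s3/configs:/app/configs" s3-provisioner:latest --config my-ml-project.yaml --action validate-config`. The profile is checked first because it is the recommended option above.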

Documentation¶

All documentation is embedded in the Docker image:

# Copy all docs to local directory
docker run --rm \
  -v $(pwd)/s3/docs:/output \
  --entrypoint cp \
  s3-provisioner:latest \
  -r /app/docs/. /output/

Available Guides:

S3 Folder Structure Reference - Complete folder structure reference
Governance, Compliance, and Audit Capabilities - Enterprise governance reference architecture
User Guide - Complete command reference with 19 actions
Configuration Reference - Configuration parameters and examples
IAM Permissions - Required AWS permissions
ML Pipeline Lifecycle Policies - Lifecycle policy details
Cost Optimization - Cost optimization and estimation
Troubleshooting - Common issues and solutions
Release Notes - Version history and features

System Requirements¶

Docker 20.10+
AWS CLI 2.x
AWS account with permissions to create S3, CloudFormation, Lambda, and IAM resources
512 MB RAM minimum
1 GB disk space

Support¶

See Support for assistance.

License¶

Commercial license via AWS Marketplace subscription.

See LICENSE.txt for terms and conditions.