S3 Folder Structure Reference

Complete technical reference for the S3 folder structure provisioned by this tool.

Table of Contents

Key Design Principles
Bucket Naming Convention
Folder Structure (solutions only)
Folder Structure (master-solution one level down)
Folder Structure (master-solution two levels down)
Compacted Folder Structure (master-solution all nodes, folders only)
Complete Folder Structure (folders and example files)

Key Design Principles

Principle	Description
Date Partitioning	Uses `YYYY/MM/DD` format for proper S3 partitioning
Naming Convention	Underscores for code folders, hyphens for non-code
Feature Store	Evolving model with feature stores in each data split
Environment Strategy	3-environment model (dev → staging → prod)
Multi-Tenant	Company-agnostic design with client configuration support
Complete Pipeline	Raw → Curated → Processed → Inference data flow
Comprehensive Logging	Daily partitioned logs with structured JSON format
Enterprise Standards	Folder structure designed to support governance, compliance, and audit workflows

Customization: This structure provides a comprehensive starting point. Clients can remove unused folders or add custom folders to match their specific ML workflows and organizational requirements.
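
The Date Partitioning principle above can be illustrated with a minimal sketch. The helper names `date_partition` and `raw_data_key` are hypothetical and not part of the tool; the key layout follows the `data/raw/` folder shown later in this reference.

```python
from datetime import date


def date_partition(day: date) -> str:
    """Return the zero-padded YYYY/MM/DD partition path for a given day."""
    return day.strftime("%Y/%m/%d")


def raw_data_key(solution: str, day: date, filename: str) -> str:
    """Build an object key under data/raw/ for one day's file (hypothetical helper)."""
    return f"solutions/{solution}/data/raw/{date_partition(day)}/{filename}"


key = raw_data_key("master-solution", date(2024, 1, 1), "customers_20240101.csv")
print(key)
```

Zero-padding the month and day keeps the keys in chronological order when they are listed lexicographically, which is how S3 returns them.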

Bucket Naming Convention

Amazon S3 bucket names must be globally unique across all AWS accounts and regions within a partition. Including identifying information such as the company name, the environment, and the purpose of the bucket helps avoid collisions.

The bucket name is either auto-generated by the s3-provisioner application or provided by the user (e.g. examplecompany-stage-uswest2-54321-prod).

An auto-generated name is constructed from four parts:

{company_prefix}-{environment}-{tenant_id}-{region}

The parts, and the optional override, are described below:

Field	Description	Example
`company_prefix`	Short company identifier	“edge”
`environment`	Environment (prod/dev/test)	“prod”
`tenant_id`	Tenant ID	“a001”
`region`	AWS region	“us-west-2”
`bucket_name_override`	Custom bucket name, or ‘’ (empty) to use the auto-generated name	“edge-overridden-bucket”

Examples:

- edge-dev-a001-us-west-1
- edge-dev-a001-us-west-2        -- different AWS region
- edge-prod-a001-us-west-2       -- different environment
- edge-prod-b001-us-west-2       -- different tenant
- techcorp-prod-b001-us-west-2   -- different company
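
As a minimal sketch of the convention above, the function below assembles a name from the four parts and honors a non-empty override. The function name `build_bucket_name` and the validation step are assumptions for illustration, not the s3-provisioner implementation. The regular expression encodes a subset of the S3 naming rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit).

```python
import re

# Subset of the S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits, and hyphens, beginning and ending with a letter or digit.
# (S3 also permits dots; they are deliberately left out of this sketch.)
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def build_bucket_name(company_prefix: str, environment: str, tenant_id: str,
                      region: str, bucket_name_override: str = "") -> str:
    """Return the override if given, else the four-part auto-generated name."""
    name = bucket_name_override or f"{company_prefix}-{environment}-{tenant_id}-{region}"
    if not _BUCKET_NAME_RE.match(name):
        raise ValueError(f"invalid S3 bucket name: {name!r}")
    return name


print(build_bucket_name("edge", "prod", "a001", "us-west-2"))
```

Validating locally catches malformed names before any AWS call is made; it cannot detect a name that is already taken by another account.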

Folder Structure (solutions only)

s3://{company_prefix}-{environment}-{tenant_id}-{region}/
└── solutions/
    ├── customer-churn/
    ├── demand-forecasting/
    ├── fraud-detection/
    ├── master-solution/
    ├── recommendation-engine/
    └── sentiment-analysis/

Folder Structure (master-solution one level down)

s3://{company_prefix}-{environment}-{tenant_id}-{region}/
└── solutions/
    └── master-solution/
        ├── artifacts/
        ├── code/
        ├── config/
        ├── data/
        ├── models/
        ├── notebooks/
        └── templates/

Folder Structure (master-solution two levels down)

s3://{company_prefix}-{environment}-{tenant_id}-{region}/
└── solutions/
    └── master-solution/
        ├── artifacts/
        │   ├── checkpoints/
        │   ├── logs/
        │   ├── metadata/
        │   ├── reports/
        │   ├── sagemaker-extensions/
        │   └── visualizations/
        ├── code/
        │   ├── feature_engineering/
        │   ├── inference/
        │   ├── monitoring/
        │   ├── pipelines/
        │   ├── preprocessing/
        │   ├── training/
        │   └── utils/
        ├── config/
        │   ├── environment_configs/
        │   └── model_configs/
        ├── data/
        │   ├── curated/
        │   ├── inference/
        │   ├── processed/
        │   └── raw/
        ├── models/
        │   ├── evaluation/
        │   ├── experiments/
        │   ├── registry/
        │   ├── training/
        │   └── tuning/
        ├── notebooks/
        │   ├── evaluation/
        │   ├── exploration/
        │   ├── inference/
        │   ├── preprocessing/
        │   └── training/
        └── templates/
            └── service-catalog/
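
S3 has no real directories; a "folder" is conventionally a zero-byte object whose key ends in `/`. The sketch below expands the two-level layout above into such marker keys. The `STRUCTURE` mapping and the `folder_keys` function are hypothetical names used for illustration and do not reflect how the tool itself stores the layout.

```python
# Two-level layout of a solution, transcribed from the tree above.
STRUCTURE = {
    "artifacts": ["checkpoints", "logs", "metadata", "reports",
                  "sagemaker-extensions", "visualizations"],
    "code": ["feature_engineering", "inference", "monitoring", "pipelines",
             "preprocessing", "training", "utils"],
    "config": ["environment_configs", "model_configs"],
    "data": ["curated", "inference", "processed", "raw"],
    "models": ["evaluation", "experiments", "registry", "training", "tuning"],
    "notebooks": ["evaluation", "exploration", "inference", "preprocessing",
                  "training"],
    "templates": ["service-catalog"],
}


def folder_keys(solution: str, structure: dict = STRUCTURE) -> list:
    """Return the folder-marker keys (each ending in '/') for one solution."""
    root = f"solutions/{solution}/"
    keys = [root]
    for top, children in structure.items():
        keys.append(f"{root}{top}/")
        keys.extend(f"{root}{top}/{child}/" for child in children)
    return keys


keys = folder_keys("master-solution")
print(len(keys), keys[:3])
```

Each returned key could then be written as an empty object with an S3 client of your choice; that upload step is left out so the sketch stays free of external dependencies.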

Compacted Folder Structure (master-solution all nodes, folders only)

s3://{company_prefix}-{environment}-{tenant_id}-{region}/
└── solutions/
    └── master-solution/
        ├── artifacts/
        │   ├── checkpoints/
        │   │   ├── preprocessing_checkpoints/
        │   │   └── training_checkpoints/
        │   │       ├── xgboost/
        │   │       └── random_forest/
        │   ├── logs/
        │   │   ├── feature_engineering/
        │   │   │   └── daily_logs/
        │   │   ├── inference/
        │   │   │   ├── batch_inference/
        │   │   │   └── realtime_inference/
        │   │   ├── preprocessing/
        │   │   │   └── daily_logs/
        │   │   │       └── 2024/
        │   │   │           └── 01/
        │   │   │               └── 01/
        │   │   └── training/
        │   │       ├── hyperparameter_tuning_logs/
        │   │       └── training_job_logs/
        │   ├── metadata/
        │   │   ├── deployment_metadata/
        │   │   ├── governance/
        │   │   ├── preprocessing_metadata/
        │   │   └── training_metadata/
        │   ├── reports/
        │   │   ├── data_quality/
        │   │   │   └── daily_quality_reports/
        │   │   │       └── 2024/
        │   │   │           └── 01/
        │   │   │               └── 01/
        │   │   ├── feature_engineering/
        │   │   ├── model_evaluation/
        │   │   ├── model_training/
        │   │   ├── monitoring/
        │   │   └── validation/
        │   ├── sagemaker-extensions/
        │   └── visualizations/
        │       ├── data_exploration/
        │       ├── feature_analysis/
        │       ├── model_performance/
        │       └── monitoring/
        ├── code/
        │   ├── feature_engineering/
        │   │   └── tests/
        │   ├── inference/
        │   │   └── tests/
        │   ├── monitoring/
        │   │   └── tests/
        │   ├── pipelines/
        │   │   └── tests/
        │   ├── preprocessing/
        │   │   └── tests/
        │   ├── training/
        │   │   └── tests/
        │   └── utils/
        │       └── tests/
        ├── config/
        │   ├── environment_configs/
        │   └── model_configs/
        ├── data/
        │   ├── curated/
        │   │   ├── 2024/
        │   │   │   └── 01/
        │   │   │       └── 01/
        │   │   └── consolidated/
        │   │       ├── weekly/
        │   │       └── monthly/
        │   ├── inference/
        │   │   ├── batch/
        │   │   │   ├── input/
        │   │   │   └── output/
        │   │   └── realtime/
        │   │       ├── requests/
        │   │       │   └── 2024/
        │   │       │       └── 01/
        │   │       │           └── 01/
        │   │       └── responses/
        │   │           └── 2024/
        │   │               └── 01/
        │   │                   └── 01/
        │   ├── processed/
        │   │   ├── train/
        │   │   │   └── feature_store/
        │   │   ├── validation/
        │   │   │   └── feature_store/
        │   │   ├── test/
        │   │   │   └── feature_store/
        │   │   └── feature_engineering/
        │   │       ├── encoders/
        │   │       ├── feature_definitions/
        │   │       └── statistics/
        │   └── raw/
        │       ├── 2024/
        │       │   └── 01/
        │       │       └── 01/
        │       └── archive/
        ├── models/
        │   ├── evaluation/
        │   │   ├── model_comparison/
        │   │   │   └── performance_charts/
        │   │   ├── validation_results/
        │   │   └── monitoring/
        │   ├── experiments/
        │   │   ├── experiment_001/
        │   │   │   └── artifacts/
        │   │   └── experiment_002/
        │   ├── registry/
        │   │   ├── production/
        │   │   │   └── model_v1.0.0/
        │   │   ├── staging/
        │   │   └── development/
        │   ├── training/
        │   │   ├── xgboost/
        │   │   ├── random_forest/
        │   │   └── neural_network/
        │   └── tuning/
        │       └── tuning_job_001/
        │           ├── best_training_job/
        │           └── all_training_jobs/
        ├── notebooks/
        │   ├── evaluation/
        │   ├── exploration/
        │   ├── inference/
        │   ├── preprocessing/
        │   └── training/
        └── templates/
            └── service-catalog/

Complete Folder Structure (folders and example files)

This section shows the complete folder structure with example files for any ML solution. The example uses customer-churn-prediction as a representative use case, but this structure applies to:

- Computer vision (image classification, object detection, segmentation)
- Natural language processing (sentiment analysis, text classification, NER)
- Time series forecasting (demand prediction, anomaly detection)
- Recommendation systems
- Fraud detection
- Any supervised/unsupervised ML workflow

The two top-level folders at the bottom of the tree (shared/, client_config/) are optional organizational folders and are not provisioned by default.

s3://{company_prefix}-{environment}-{tenant_id}-{region}/
├── solutions/
│   └── customer-churn-prediction/
│       ├── artifacts/
│       │   ├── checkpoints/
│       │   │   ├── preprocessing_checkpoints/
│       │   │   └── training_checkpoints/
│       │   │       ├── xgboost/
│       │   │       │   ├── checkpoint_epoch_10.pkl
│       │   │       │   ├── checkpoint_epoch_20.pkl
│       │   │       │   └── checkpoint_final.pkl
│       │   │       └── random_forest/
│       │   ├── logs/
│       │   │   ├── feature_engineering/
│       │   │   │   ├── feature_engineering_pipeline.log
│       │   │   │   ├── categorical_features.log
│       │   │   │   └── daily_logs/
│       │   │   ├── inference/
│       │   │   │   ├── batch_inference/
│       │   │   │   └── realtime_inference/
│       │   │   ├── preprocessing/
│       │   │   │   ├── preprocessing_pipeline.log
│       │   │   │   ├── data_ingestion.log
│       │   │   │   ├── data_validation.log
│       │   │   │   ├── data_cleaning.log
│       │   │   │   └── daily_logs/
│       │   │   │       └── 2024/
│       │   │   │           └── 01/
│       │   │   │               └── 01/
│       │   │   │                   └── preprocessing_001.json
│       │   │   └── training/
│       │   │       ├── hyperparameter_tuning_logs/
│       │   │       └── training_job_logs/
│       │   │           ├── xgboost_training.log
│       │   │           └── random_forest_training.log
│       │   ├── metadata/
│       │   │   ├── deployment_metadata/
│       │   │   │   ├── endpoint_configurations.json
│       │   │   │   └── model_deployment_history.json
│       │   │   ├── governance/
│       │   │   │   ├── data_governance_policies.json
│       │   │   │   └── audit_trail.json
│       │   │   ├── preprocessing_metadata/
│       │   │   │   ├── cleaning_summary.json
│       │   │   │   ├── transformation_summary.json
│       │   │   │   └── data_lineage.json
│       │   │   └── training_metadata/
│       │   │       ├── experiment_tracking.json
│       │   │       ├── model_versioning.json
│       │   │       └── hyperparameter_history.json
│       │   ├── reports/
│       │   │   ├── data_quality/
│       │   │   │   ├── raw_data_quality_report.html
│       │   │   │   ├── curated_data_quality_report.html
│       │   │   │   └── daily_quality_reports/
│       │   │   │       └── 2024/
│       │   │   │           └── 01/
│       │   │   │               └── 01/
│       │   │   │                   ├── customers_quality.html
│       │   │   │                   ├── transactions_quality.html
│       │   │   │                   └── usage_metrics_quality.html
│       │   │   ├── feature_engineering/
│       │   │   │   ├── feature_correlation_matrix.html
│       │   │   │   ├── feature_importance_report.html
│       │   │   │   └── feature_engineering_summary.html
│       │   │   ├── model_evaluation/
│       │   │   │   ├── performance_evaluation_report.html
│       │   │   │   ├── bias_fairness_report.html
│       │   │   │   └── model_interpretability_report.html
│       │   │   ├── model_training/
│       │   │   │   ├── training_summary_report.html
│       │   │   │   ├── hyperparameter_tuning_report.html
│       │   │   │   └── model_comparison_report.html
│       │   │   ├── monitoring/
│       │   │   │   ├── model_monitoring_dashboard.html
│       │   │   │   └── data_drift_report.html
│       │   │   └── validation/
│       │   │       ├── validation_summary.json
│       │   │       ├── schema_validation_report.html
│       │   │       └── data_quality_validation.html
│       │   ├── sagemaker-extensions/
│       │   └── visualizations/
│       │       ├── data_exploration/
│       │       │   ├── customer_demographics.png
│       │       │   ├── transaction_distributions.png
│       │       │   └── usage_patterns.png
│       │       ├── feature_analysis/
│       │       │   ├── feature_importance_plots.png
│       │       │   ├── correlation_heatmaps.png
│       │       │   └── shap_analysis.png
│       │       ├── model_performance/
│       │       │   ├── roc_curves.png
│       │       │   ├── precision_recall_curves.png
│       │       │   └── confusion_matrices.png
│       │       └── monitoring/
│       ├── code/
│       │   ├── feature_engineering/
│       │   │   ├── feature_engineering_pipeline.py
│       │   │   ├── categorical_features.py
│       │   │   ├── numerical_features.py
│       │   │   ├── feature_selection.py
│       │   │   ├── feature_validation.py
│       │   │   └── tests/
│       │   ├── inference/
│       │   │   ├── batch_inference.py
│       │   │   ├── realtime_inference.py
│       │   │   ├── model_serving.py
│       │   │   └── tests/
│       │   ├── monitoring/
│       │   │   ├── model_drift_detection.py
│       │   │   ├── data_quality_monitoring.py
│       │   │   ├── performance_monitoring.py
│       │   │   └── tests/
│       │   ├── pipelines/
│       │   │   ├── training_pipeline.py
│       │   │   ├── inference_pipeline.py
│       │   │   ├── monitoring_pipeline.py
│       │   │   └── tests/
│       │   ├── preprocessing/
│       │   │   ├── s3_event_handler.py
│       │   │   ├── preprocessing_pipeline.py
│       │   │   ├── data_ingestion.py
│       │   │   ├── data_validation.py
│       │   │   ├── data_cleaning.py
│       │   │   ├── data_transformation.py
│       │   │   ├── data_profiler.py
│       │   │   └── tests/
│       │   │       ├── test_data_ingestion.py
│       │   │       ├── test_data_validation.py
│       │   │       └── test_preprocessing_pipeline.py
│       │   ├── training/
│       │   │   ├── train_xgboost.py
│       │   │   ├── train_random_forest.py
│       │   │   ├── hyperparameter_tuning.py
│       │   │   ├── model_evaluation.py
│       │   │   └── tests/
│       │   └── utils/
│       │       ├── common_utils.py
│       │       ├── aws_utils.py
│       │       ├── data_utils.py
│       │       └── tests/
│       ├── config/
│       │   ├── environment_configs/
│       │   │   ├── development.yaml
│       │   │   ├── staging.yaml
│       │   │   └── production.yaml
│       │   ├── model_configs/
│       │   │   ├── xgboost_config.yaml
│       │   │   ├── random_forest_config.yaml
│       │   │   └── neural_network_config.yaml
│       │   ├── preprocessing_config.yaml
│       │   ├── feature_engineering_config.yaml
│       │   ├── training_config.yaml
│       │   ├── inference_config.yaml
│       │   └── monitoring_config.yaml
│       ├── data/
│       │   ├── curated/
│       │   │   ├── 2024/
│       │   │   │   └── 01/
│       │   │   │       └── 01/
│       │   │   │           ├── customers_cleaned_20240101.parquet
│       │   │   │           ├── transactions_cleaned_20240101.parquet
│       │   │   │           ├── support_tickets_cleaned_20240101.parquet
│       │   │   │           └── usage_metrics_cleaned_20240101.parquet
│       │   │   └── consolidated/
│       │   │       ├── weekly/
│       │   │       │   └── customers_week_01_2024.parquet
│       │   │       └── monthly/
│       │   │           └── customers_jan_2024.parquet
│       │   ├── inference/
│       │   │   ├── batch/
│       │   │   │   ├── input/
│       │   │   │   │   ├── batch_20240101.parquet
│       │   │   │   │   └── batch_20240102.parquet
│       │   │   │   └── output/
│       │   │   │       ├── predictions_20240101.parquet
│       │   │   │       └── predictions_20240102.parquet
│       │   │   └── realtime/
│       │   │       ├── requests/
│       │   │       │   └── 2024/
│       │   │       │       └── 01/
│       │   │       │           ├── 01/
│       │   │       │           └── 02/
│       │   │       └── responses/
│       │   │           └── 2024/
│       │   │               └── 01/
│       │   │                   ├── 01/
│       │   │                   └── 02/
│       │   ├── processed/
│       │   │   ├── train/
│       │   │   │   ├── features_train.parquet
│       │   │   │   ├── labels_train.parquet
│       │   │   │   ├── metadata_train.json
│       │   │   │   └── feature_store/
│       │   │   │       ├── customer_features.parquet
│       │   │   │       ├── transaction_features.parquet
│       │   │   │       ├── support_features.parquet
│       │   │   │       └── usage_features.parquet
│       │   │   ├── validation/
│       │   │   │   ├── features_validation.parquet
│       │   │   │   ├── labels_validation.parquet
│       │   │   │   ├── metadata_validation.json
│       │   │   │   └── feature_store/
│       │   │   ├── test/
│       │   │   │   ├── features_test.parquet
│       │   │   │   ├── labels_test.parquet
│       │   │   │   ├── metadata_test.json
│       │   │   │   └── feature_store/
│       │   │   └── feature_engineering/
│       │   │       ├── encoders/
│       │   │       │   ├── categorical_encoder.pkl
│       │   │       │   ├── numerical_scaler.pkl
│       │   │       │   └── feature_selector.pkl
│       │   │       ├── feature_definitions/
│       │   │       │   ├── feature_schema.json
│       │   │       │   ├── feature_catalog.json
│       │   │       │   └── feature_lineage.json
│       │   │       └── statistics/
│       │   │           ├── feature_stats.json
│       │   │           ├── correlation_matrix.json
│       │   │           └── importance_scores.json
│       │   └── raw/
│       │       ├── 2024/
│       │       │   └── 01/
│       │       │       └── 01/
│       │       │           ├── customers_20240101.csv
│       │       │           ├── transactions_20240101.csv
│       │       │           ├── support_tickets_20240101.json
│       │       │           └── usage_metrics_20240101.parquet
│       │       └── archive/
│       ├── models/
│       │   ├── evaluation/
│       │   │   ├── model_comparison/
│       │   │   │   ├── comparison_report.html
│       │   │   │   ├── metrics_comparison.json
│       │   │   │   └── performance_charts/
│       │   │   │       ├── roc_curves.png
│       │   │   │       ├── precision_recall.png
│       │   │   │       └── feature_importance.png
│       │   │   ├── validation_results/
│       │   │   └── monitoring/
│       │   ├── experiments/
│       │   │   ├── experiment_001/
│       │   │   │   ├── config.json
│       │   │   │   ├── metrics.json
│       │   │   │   ├── parameters.json
│       │   │   │   └── artifacts/
│       │   │   │       ├── model.pkl
│       │   │   │       ├── feature_importance.json
│       │   │   │       └── confusion_matrix.png
│       │   │   └── experiment_002/
│       │   ├── registry/
│       │   │   ├── production/
│       │   │   │   └── model_v1.0.0/
│       │   │   │       ├── model_package.json
│       │   │   │       ├── approval_status.json
│       │   │   │       └── deployment_config.json
│       │   │   ├── staging/
│       │   │   └── development/
│       │   ├── training/
│       │   │   ├── xgboost/
│       │   │   │   ├── model.tar.gz
│       │   │   │   ├── model_metadata.json
│       │   │   │   └── training_job_config.json
│       │   │   ├── random_forest/
│       │   │   └── neural_network/
│       │   └── tuning/
│       │       └── tuning_job_001/
│       │           ├── best_training_job/
│       │           │   ├── model.tar.gz
│       │           │   └── hyperparameters.json
│       │           ├── all_training_jobs/
│       │           └── tuning_results.json
│       ├── notebooks/
│       │   ├── evaluation/
│       │   │   ├── model_performance_analysis.ipynb
│       │   │   ├── bias_fairness_evaluation.ipynb
│       │   │   └── model_interpretability.ipynb
│       │   ├── exploration/
│       │   │   ├── customer_analysis.ipynb
│       │   │   ├── transaction_patterns.ipynb
│       │   │   ├── support_ticket_analysis.ipynb
│       │   │   └── churn_pattern_discovery.ipynb
│       │   ├── inference/
│       │   │   ├── batch_inference_testing.ipynb
│       │   │   └── realtime_inference_testing.ipynb
│       │   ├── preprocessing/
│       │   │   ├── data_quality_assessment.ipynb
│       │   │   ├── data_cleaning_validation.ipynb
│       │   │   └── preprocessing_pipeline_validation.ipynb
│       │   └── training/
│       │       ├── baseline_model_training.ipynb
│       │       ├── hyperparameter_tuning.ipynb
│       │       └── ensemble_model_training.ipynb
│       └── templates/
│           └── service-catalog/
├── shared/
│   ├── infrastructure/
│   ├── monitoring/
│   └── utilities/
└── client_config/
    ├── environments/
    ├── branding/
    └── policies/
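
Because the layout is regular, an object key can be decoded back into its solution, top-level area, and date partition. The following is a minimal sketch under that assumption; `KeyInfo` and `parse_key` are hypothetical names, not part of the tool.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class KeyInfo(NamedTuple):
    solution: str                    # e.g. "customer-churn-prediction"
    area: str                        # top-level folder, e.g. "data" or "artifacts"
    partition_date: Optional[date]   # YYYY/MM/DD partition, if the key has one


# Matches the first /YYYY/MM/DD/ segment anywhere in the key.
_DATE_RE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def parse_key(key: str) -> KeyInfo:
    """Split a key of the form solutions/<solution>/<area>/... into its parts."""
    parts = key.split("/")
    if len(parts) < 3 or parts[0] != "solutions":
        raise ValueError(f"key is not under solutions/: {key!r}")
    match = _DATE_RE.search(key)
    partition = date(*map(int, match.groups())) if match else None
    return KeyInfo(parts[1], parts[2], partition)


info = parse_key(
    "solutions/customer-churn-prediction/data/raw/2024/01/01/customers_20240101.csv"
)
print(info)
```

Keys outside `solutions/`, such as those under the optional `shared/` or `client_config/` folders, are rejected by this sketch rather than guessed at.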

See LICENSE.txt for terms and conditions.