Monitoring and Health Checks¶

Monitoring strategies and health check procedures for s3-provisioner-tool.

Table of Contents¶

Health Check Endpoints
CloudWatch Metrics
Log Monitoring
Alerting
Performance Monitoring
Availability Monitoring
Reporting
Automation
Monitoring Checklist

Health Check Endpoints¶

Basic Health Check¶

#!/bin/bash
# health-check.sh

set -e

echo "=== S3 Provisioner Health Check ==="

# 1. Check Python environment
if python -m s3_provisioner.cli --version > /dev/null 2>&1; then
  echo "✓ Python environment OK"
else
  echo "✗ Python environment FAILED"
  exit 1
fi

# 2. Check AWS credentials
if aws sts get-caller-identity > /dev/null 2>&1; then
  echo "✓ AWS credentials OK"
else
  echo "✗ AWS credentials FAILED"
  exit 1
fi

# 3. Check EC2/S3 access
if aws ec2 list-buckets --max-results 1 > /dev/null 2>&1; then
  echo "✓ S3 access OK"
else
  echo "✗ S3 access FAILED"
  exit 1
fi

# 4. Check CloudFormation access
if aws cloudformation list-stacks --max-items 1 > /dev/null 2>&1; then
  echo "✓ CloudFormation access OK"
else
  echo "✗ CloudFormation access FAILED"
  exit 1
fi

echo "=== Health Check PASSED ==="
exit 0

Detailed Health Check¶

#!/bin/bash
# health-check-detailed.sh

HEALTH_LOG="/var/log/s3-provisioner-health.log"

{
  echo "=== Health Check: $(date) ==="
  
  # System resources
  echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
  echo "Memory Usage: $(free -m | awk 'NR==2{printf "%.2f%%", $3*100/$2 }')"
  echo "Disk Usage: $(df -h /home/user | awk 'NR==2{print $5}')"
  
  # Python environment
  echo "Python Version: $(python --version)"
  echo "Package Version: $(pip show s3-provisioner | grep Version)"
  
  # AWS connectivity
  echo "AWS Account: $(aws sts get-caller-identity --query Account --output text)"
  echo "AWS Region: $(aws configure get region)"
  
  # S3 resources
  echo "VPC Count: $(aws ec2 list-buckets --query 'length(Vpcs)' --output text)"
  
  # Recent errors
  ERROR_COUNT=$(grep -c ERROR reports/*.log 2>/dev/null || echo 0)
  echo "Recent Errors: $ERROR_COUNT"
  
  echo "=== Health Check Complete ==="
} | tee -a "$HEALTH_LOG"

CloudWatch Metrics¶

Custom Metrics¶

Send metrics to CloudWatch:

# metrics.py
import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def send_metric(metric_name, value, unit='Count'):
    """Send custom metric to CloudWatch"""
    cloudwatch.put_metric_data(
        Namespace='S3Provisioner',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                'Timestamp': datetime.utcnow()
            }
        ]
    )

# Usage
send_metric('BucketCreationSuccess', 1)
send_metric('BucketCreationTime', 125.8, 'Seconds')
send_metric('TemplateGenerationSuccess', 1)

Key Metrics to Monitor¶

Operation Success Rate
- Bucket creation success/failure
- Template generation success/failure
- Deployment success/failure
Performance Metrics
- Bucket creation time
- Template generation time
- Deployment time
Resource Metrics
- Number of Buckets created
- Number of subnets created
- Number of route tables created
Error Metrics
- API errors
- Validation errors
- Permission errors

CloudWatch Dashboard¶

# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
  --dashboard-name S3ProvisionerDashboard \
  --dashboard-body file://dashboard.json

dashboard.json:

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["S3Provisioner", "BucketCreationSuccess"],
          [".", "BucketCreationFailure"]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "us-west-1",
        "title": "VPC Creation Status"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["S3Provisioner", "BucketCreationTime"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-west-1",
        "title": "Average Creation Time"
      }
    }
  ]
}

Log Monitoring¶

Log Analysis¶

#!/bin/bash
# analyze-logs.sh

REPORTS_DIR="reports"
DATE=$(date +%Y%m%d)

echo "=== Log Analysis for $DATE ==="

# Count operations
echo "Operations today:"
grep -h "Action.*completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
  awk '{print $6}' | sort | uniq -c

# Count errors
echo ""
echo "Errors today:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | wc -l

# Recent errors
echo ""
echo "Recent error messages:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | tail -5

# Performance stats
echo ""
echo "Average execution time:"
grep -h "completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
  awk '{print $NF}' | awk '{sum+=$1; count++} END {print sum/count " seconds"}'

Real-time Log Monitoring¶

# Monitor logs in real-time
tail -f reports/*.log | grep --line-buffered -E "ERROR|WARNING|SUCCESS"

Alerting¶

CloudWatch Alarms¶

Create alarms for critical events:

# Alarm for high error rate
aws cloudwatch put-metric-alarm \
  --alarm-name s3-provisioner-high-error-rate \
  --alarm-description "Alert when error rate exceeds threshold" \
  --metric-name BucketCreationFailure \
  --namespace S3Provisioner \
  --statistic Sum \
  --period 300 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-1:123456789012:alerts

# Alarm for slow performance
aws cloudwatch put-metric-alarm \
  --alarm-name s3-provisioner-slow-performance \
  --alarm-description "Alert when creation time is too slow" \
  --metric-name BucketCreationTime \
  --namespace S3Provisioner \
  --statistic Average \
  --period 300 \
  --threshold 180 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-west-1:123456789012:alerts

Email Alerts¶

#!/bin/bash
# alert-on-error.sh

LOG_FILE="reports/latest.log"
ERROR_COUNT=$(grep -c ERROR "$LOG_FILE" 2>/dev/null || echo 0)

if [ "$ERROR_COUNT" -gt 5 ]; then
  # Send email alert
  echo "High error count detected: $ERROR_COUNT errors" | \
    mail -s "S3 Provisioner Alert" ops-team@company.com
fi

Performance Monitoring¶

Execution Time Tracking¶

# performance.py
import time
import functools

def track_time(func):
    """Decorator to track function execution time"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        duration = time.time() - start
        
        # Log performance
        print(f"{func.__name__} took {duration:.2f} seconds")
        
        # Send to CloudWatch
        send_metric(f"{func.__name__}Time", duration, 'Seconds')
        
        return result
    return wrapper

# Usage
@track_time
def create_vpc(config):
    # ... Bucket creation logic ...
    pass

Resource Usage Monitoring¶

#!/bin/bash
# monitor-resources.sh

while true; do
  # CPU usage
  CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
  
  # Memory usage
  MEM=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')
  
  # Disk usage
  DISK=$(df -h /home/user | awk 'NR==2{print $5}' | sed 's/%//')
  
  # Log to file
  echo "$(date),CPU:$CPU,MEM:$MEM%,DISK:$DISK%" >> /var/log/s3-provisioner-resources.log
  
  # Send to CloudWatch
  aws cloudwatch put-metric-data \
    --namespace S3Provisioner \
    --metric-name CPUUsage \
    --value "$CPU" \
    --unit Percent
  
  sleep 60
done

Availability Monitoring¶

Uptime Check¶

#!/bin/bash
# uptime-check.sh

ENDPOINT="http://localhost:8080/health"
MAX_RETRIES=3
RETRY_COUNT=0

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
  if curl -f -s "$ENDPOINT" > /dev/null; then
    echo "✓ Service is UP"
    exit 0
  else
    RETRY_COUNT=$((RETRY_COUNT + 1))
    echo "Retry $RETRY_COUNT/$MAX_RETRIES"
    sleep 5
  fi
done

echo "✗ Service is DOWN"
exit 1

Service Status Page¶

# status.py
from flask import Flask, jsonify
import subprocess
from datetime import datetime

app = Flask(__name__)

@app.route('/health')
def health():
    """Health check endpoint"""
    try:
        # Check Python environment
        subprocess.run(['python', '-m', 's3_provisioner.cli', '--version'], 
                      check=True, capture_output=True)
        
        # Check AWS credentials
        subprocess.run(['aws', 'sts', 'get-caller-identity'], 
                      check=True, capture_output=True)
        
        return jsonify({
            'status': 'healthy',
            'timestamp': datetime.utcnow().isoformat()
        }), 200
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e),
            'timestamp': datetime.utcnow().isoformat()
        }), 503

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Reporting¶

Daily Report¶

#!/bin/bash
# daily-report.sh

DATE=$(date +%Y-%m-%d)
REPORT_FILE="reports/daily-report-$DATE.txt"

{
  echo "=== S3 Provisioner Daily Report: $DATE ==="
  echo ""
  
  echo "Operations Summary:"
  grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
    awk '{print $6}' | sort | uniq -c
  
  echo ""
  echo "Error Summary:"
  grep -h "ERROR" reports/*$(date +%Y%m%d)*.log | wc -l
  
  echo ""
  echo "Performance:"
  echo "Average execution time: $(grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
    awk '{print $NF}' | awk '{sum+=$1; count++} END {print sum/count " seconds"}')"
  
  echo ""
  echo "Resource Usage:"
  echo "Buckets created: $(grep -h "VPC created" reports/*$(date +%Y%m%d)*.log | wc -l)"
  echo "Templates generated: $(grep -h "Template generated" reports/*$(date +%Y%m%d)*.log | wc -l)"
  
} | tee "$REPORT_FILE"

# Email report
mail -s "S3 Provisioner Daily Report" ops-team@company.com < "$REPORT_FILE"

Weekly Report¶

#!/bin/bash
# weekly-report.sh

WEEK_START=$(date -d "7 days ago" +%Y%m%d)
WEEK_END=$(date +%Y%m%d)

echo "=== S3 Provisioner Weekly Report: $WEEK_START to $WEEK_END ==="

# Total operations
echo "Total operations: $(grep -h "completed successfully" reports/*.log | wc -l)"

# Success rate
TOTAL=$(grep -h "Action" reports/*.log | wc -l)
SUCCESS=$(grep -h "completed successfully" reports/*.log | wc -l)
echo "Success rate: $(echo "scale=2; $SUCCESS*100/$TOTAL" | bc)%"

# Most common operations
echo ""
echo "Most common operations:"
grep -h "completed successfully" reports/*.log | \
  awk '{print $6}' | sort | uniq -c | sort -rn | head -5

Automation¶

Cron Schedule¶

# Edit crontab
crontab -e

# Health check every 5 minutes
*/5 * * * * /home/user/scripts/health-check.sh

# Detailed health check every hour
0 * * * * /home/user/scripts/health-check-detailed.sh

# Log analysis every 6 hours
0 */6 * * * /home/user/scripts/analyze-logs.sh

# Daily report at 8 AM
0 8 * * * /home/user/scripts/daily-report.sh

# Weekly report on Monday at 9 AM
0 9 * * 1 /home/user/scripts/weekly-report.sh

Monitoring and Health Checks¶

Table of Contents¶

Health Check Endpoints¶

Basic Health Check¶

Detailed Health Check¶

CloudWatch Metrics¶

Custom Metrics¶

Key Metrics to Monitor¶

CloudWatch Dashboard¶

Log Monitoring¶

Log Analysis¶

Real-time Log Monitoring¶

Alerting¶

CloudWatch Alarms¶

Email Alerts¶

Performance Monitoring¶

Execution Time Tracking¶

Resource Usage Monitoring¶

Availability Monitoring¶

Uptime Check¶

Service Status Page¶

Reporting¶

Daily Report¶

Weekly Report¶

Automation¶

Cron Schedule¶

Monitoring Checklist¶