Monitoring and Health Checks

Monitoring strategies and health check procedures for s3-provisioner-tool.

Health Check Endpoints

Basic Health Check

#!/bin/bash
# health-check.sh

set -e

echo "=== S3 Provisioner Health Check ==="

# 1. Check Python environment
if python -m s3_provisioner.cli --version > /dev/null 2>&1; then
  echo "✓ Python environment OK"
else
  echo "✗ Python environment FAILED"
  exit 1
fi

# 2. Check AWS credentials
if aws sts get-caller-identity > /dev/null 2>&1; then
  echo "✓ AWS credentials OK"
else
  echo "✗ AWS credentials FAILED"
  exit 1
fi

# 3. Check S3 access
if aws s3api list-buckets > /dev/null 2>&1; then
  echo "✓ S3 access OK"
else
  echo "✗ S3 access FAILED"
  exit 1
fi

# 4. Check CloudFormation access
if aws cloudformation list-stacks --max-items 1 > /dev/null 2>&1; then
  echo "✓ CloudFormation access OK"
else
  echo "✗ CloudFormation access FAILED"
  exit 1
fi

echo "=== Health Check PASSED ==="
exit 0

Detailed Health Check

#!/bin/bash
# health-check-detailed.sh

HEALTH_LOG="/var/log/s3-provisioner-health.log"

{
  echo "=== Health Check: $(date) ==="
  
  # System resources
  echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
  echo "Memory Usage: $(free -m | awk 'NR==2{printf "%.2f%%", $3*100/$2 }')"
  echo "Disk Usage: $(df -h /home/user | awk 'NR==2{print $5}')"
  
  # Python environment
  echo "Python Version: $(python --version)"
  echo "Package Version: $(pip show s3-provisioner | awk '/^Version:/{print $2}')"
  
  # AWS connectivity
  echo "AWS Account: $(aws sts get-caller-identity --query Account --output text)"
  echo "AWS Region: $(aws configure get region)"
  
  # S3 resources
  echo "Bucket Count: $(aws s3api list-buckets --query 'length(Buckets)' --output text)"
  
  # Recent errors
  # Concatenate first so grep -c prints one total, not one count per file
  ERROR_COUNT=$(cat reports/*.log 2>/dev/null | grep -c ERROR || true)
  echo "Recent Errors: $ERROR_COUNT"
  
  echo "=== Health Check Complete ==="
} | tee -a "$HEALTH_LOG"

CloudWatch Metrics

Custom Metrics

Send metrics to CloudWatch:

# metrics.py
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

def send_metric(metric_name, value, unit='Count'):
    """Send custom metric to CloudWatch"""
    cloudwatch.put_metric_data(
        Namespace='S3Provisioner',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                'Timestamp': datetime.now(timezone.utc)
            }
        ]
    )

# Usage
send_metric('BucketCreationSuccess', 1)
send_metric('BucketCreationTime', 125.8, 'Seconds')
send_metric('TemplateGenerationSuccess', 1)

Key Metrics to Monitor

  1. Operation Success Rate

    • Bucket creation success/failure

    • Template generation success/failure

    • Deployment success/failure

  2. Performance Metrics

    • Bucket creation time

    • Template generation time

    • Deployment time

  3. Resource Metrics

    • Number of buckets created

    • Number of templates generated

    • Number of CloudFormation stacks deployed

  4. Error Metrics

    • API errors

    • Validation errors

    • Permission errors
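
The categories above could be recorded by one wrapper around each operation. The following is a minimal sketch, assuming a `send_metric(name, value, unit)` callable like the one in metrics.py. The `record_operation` helper, the `<Operation>Success` / `<Operation>Failure` / `<Operation>Time` naming, and the mapping of exception types to error metrics are illustrative assumptions, not part of the tool.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_operation(operation, send_metric):
    """Emit success/failure, duration, and error-type metrics for one operation.

    `operation` is a metric prefix such as 'BucketCreation' (hypothetical naming).
    `send_metric` is any callable taking (name, value, unit).
    """
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        send_metric(f"{operation}Failure", 1, "Count")
        # Classify the error so the error metrics can be charted separately.
        # This mapping is an assumption; adapt it to the exceptions the tool raises.
        if isinstance(exc, PermissionError):
            kind = "PermissionErrors"
        elif isinstance(exc, ValueError):
            kind = "ValidationErrors"
        else:
            kind = "ApiErrors"
        send_metric(kind, 1, "Count")
        raise
    else:
        send_metric(f"{operation}Success", 1, "Count")
    finally:
        # Duration is reported for both successful and failed operations.
        send_metric(f"{operation}Time", time.perf_counter() - start, "Seconds")
```

Wrapping a call as `with record_operation('BucketCreation', send_metric): ...` would then produce the metric names that the dashboard and alarms in this document read. The exception is re-raised, so the wrapper does not change error handling.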

CloudWatch Dashboard

# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
  --dashboard-name S3ProvisionerDashboard \
  --dashboard-body file://dashboard.json

dashboard.json:

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["S3Provisioner", "BucketCreationSuccess"],
          [".", "BucketCreationFailure"]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "us-west-1",
        "title": "Bucket Creation Status"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["S3Provisioner", "BucketCreationTime"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-west-1",
        "title": "Average Creation Time"
      }
    }
  ]
}
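
If the namespace or region changes, the dashboard body can be generated instead of edited by hand. The sketch below builds the same two widgets with the standard `json` module. The `build_dashboard` function and its parameters are hypothetical names introduced for illustration.

```python
import json


def build_dashboard(namespace, region, period=300):
    """Return a CloudWatch dashboard body (JSON string) with two metric widgets."""

    def widget(title, metrics, stat):
        # One "metric" widget; period and region are shared by all widgets.
        return {
            "type": "metric",
            "properties": {
                "metrics": metrics,
                "period": period,
                "stat": stat,
                "region": region,
                "title": title,
            },
        }

    body = {
        "widgets": [
            widget(
                "Bucket Creation Status",
                # "." repeats the namespace of the previous metric row.
                [[namespace, "BucketCreationSuccess"], [".", "BucketCreationFailure"]],
                "Sum",
            ),
            widget(
                "Average Creation Time",
                [[namespace, "BucketCreationTime"]],
                "Average",
            ),
        ]
    }
    return json.dumps(body, indent=2)
```

Writing the result of `build_dashboard('S3Provisioner', 'us-west-1')` to dashboard.json yields a file that can be passed to the `put-dashboard` command above.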

Log Monitoring

Log Analysis

#!/bin/bash
# analyze-logs.sh

REPORTS_DIR="reports"
DATE=$(date +%Y%m%d)

echo "=== Log Analysis for $DATE ==="

# Count operations
echo "Operations today:"
grep -h "Action.*completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
  awk '{print $6}' | sort | uniq -c

# Count errors
echo ""
echo "Errors today:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | wc -l

# Recent errors
echo ""
echo "Recent error messages:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | tail -5

# Performance stats
echo ""
echo "Average execution time:"
grep -h "completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
  awk '{sum+=$NF; count++} END {if (count) print sum/count " seconds"; else print "n/a"}'
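
The same summary can be computed in Python, which makes the empty-log case and non-numeric durations easier to handle. This is a sketch that mirrors the field positions used by analyze-logs.sh (sixth whitespace-separated field as the operation, last field as the duration). The actual log format is not documented here, so those positions and the `summarize` function name are assumptions.

```python
from collections import Counter


def summarize(lines):
    """Summarize log lines using the same field positions as analyze-logs.sh."""
    operations = Counter()
    durations = []
    errors = 0
    for line in lines:
        if "ERROR" in line:
            errors += 1
        if "completed successfully" in line:
            fields = line.split()
            if len(fields) >= 6:
                operations[fields[5]] += 1  # awk's $6
            try:
                durations.append(float(fields[-1]))  # awk's $NF
            except ValueError:
                pass  # last field was not a number; skip it
    # None signals "no data" instead of dividing by zero.
    average = sum(durations) / len(durations) if durations else None
    return {
        "operations": dict(operations),
        "errors": errors,
        "average_seconds": average,
    }
```

Unlike the awk pipeline, this version skips lines whose last field is not numeric rather than counting them as zero.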

Real-time Log Monitoring

# Monitor logs in real-time
tail -f reports/*.log | grep --line-buffered -E "ERROR|WARNING|SUCCESS"

Alerting

CloudWatch Alarms

Create alarms for critical events:

# Alarm for high error rate
aws cloudwatch put-metric-alarm \
  --alarm-name s3-provisioner-high-error-rate \
  --alarm-description "Alert when error rate exceeds threshold" \
  --metric-name BucketCreationFailure \
  --namespace S3Provisioner \
  --statistic Sum \
  --period 300 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-1:123456789012:alerts

# Alarm for slow performance
aws cloudwatch put-metric-alarm \
  --alarm-name s3-provisioner-slow-performance \
  --alarm-description "Alert when creation time is too slow" \
  --metric-name BucketCreationTime \
  --namespace S3Provisioner \
  --statistic Average \
  --period 300 \
  --threshold 180 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-west-1:123456789012:alerts
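
To reason about when these alarms would fire, it can help to model the evaluation locally. The function below is a deliberately simplified sketch of `GreaterThanThreshold` with `--evaluation-periods N`: it reports ALARM only when the N most recent datapoints all exceed the threshold. It ignores CloudWatch features such as `--datapoints-to-alarm` and `--treat-missing-data`, and the `alarm_state` name is hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Simplified model of a GreaterThanThreshold alarm.

    `datapoints` holds one aggregated value per period, oldest first.
    `evaluation_periods` is assumed to be at least 1.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Strictly greater than: a value equal to the threshold does not breach.
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

With the first alarm's settings (threshold 5, two periods), failure sums of 6 and 7 in consecutive five-minute periods would breach, while 6 followed by 5 would not.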

Email Alerts

#!/bin/bash
# alert-on-error.sh

LOG_FILE="reports/latest.log"
# grep -c already prints 0 when nothing matches; "|| true" only absorbs its exit code
ERROR_COUNT=$(grep -c ERROR "$LOG_FILE" 2>/dev/null || true)

if [ "${ERROR_COUNT:-0}" -gt 5 ]; then
  # Send email alert
  echo "High error count detected: $ERROR_COUNT errors" | \
    mail -s "S3 Provisioner Alert" ops-team@company.com
fi

Performance Monitoring

Execution Time Tracking

# performance.py
import functools
import time

from metrics import send_metric  # send_metric is defined in metrics.py

def track_time(func):
    """Decorator to track function execution time"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        duration = time.time() - start
        
        # Log performance
        print(f"{func.__name__} took {duration:.2f} seconds")
        
        # Send to CloudWatch
        send_metric(f"{func.__name__}Time", duration, 'Seconds')
        
        return result
    return wrapper

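# Usage
# Note: this emits the metric "create_bucketTime"; the dashboard and alarms
# in this document read "BucketCreationTime".
@track_time
def create_bucket(config):
    # ... Bucket creation logic ...
    pass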

Resource Usage Monitoring

#!/bin/bash
# monitor-resources.sh

while true; do
  # CPU usage
  CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
  
  # Memory usage
  MEM=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')
  
  # Disk usage
  DISK=$(df -h /home/user | awk 'NR==2{print $5}' | sed 's/%//')
  
  # Log to file
  echo "$(date),CPU:$CPU,MEM:$MEM%,DISK:$DISK%" >> /var/log/s3-provisioner-resources.log
  
  # Send to CloudWatch
  aws cloudwatch put-metric-data \
    --namespace S3Provisioner \
    --metric-name CPUUsage \
    --value "$CPU" \
    --unit Percent
  
  sleep 60
done

Availability Monitoring

Uptime Check

#!/bin/bash
# uptime-check.sh

ENDPOINT="http://localhost:8080/health"
MAX_RETRIES=3
RETRY_COUNT=0

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
  if curl -f -s "$ENDPOINT" > /dev/null; then
    echo "✓ Service is UP"
    exit 0
  else
    RETRY_COUNT=$((RETRY_COUNT + 1))
    echo "Retry $RETRY_COUNT/$MAX_RETRIES"
    sleep 5
  fi
done

echo "✗ Service is DOWN"
exit 1
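
The retry loop can also be written in Python, with the probe passed in as a function so the retry logic can be tested without a running service. This is a sketch under assumptions: `http_probe` and `is_up` are hypothetical names, and only the standard library is used.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5):
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False


def is_up(probe, max_retries=3, delay=5, sleep=time.sleep):
    """Call `probe()` up to `max_retries` times, pausing `delay` seconds between failures."""
    for attempt in range(1, max_retries + 1):
        if probe():
            return True
        if attempt < max_retries:
            sleep(delay)  # no pause after the final failed attempt
    return False
```

A call such as `is_up(lambda: http_probe('http://localhost:8080/health'))` corresponds to the shell script above, except that it does not sleep after the last failed attempt.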

Service Status Page

# status.py
from flask import Flask, jsonify
import subprocess
from datetime import datetime, timezone

app = Flask(__name__)

@app.route('/health')
def health():
    """Health check endpoint"""
    try:
        # Check Python environment (timeout keeps the endpoint from hanging)
        subprocess.run(['python', '-m', 's3_provisioner.cli', '--version'],
                      check=True, capture_output=True, timeout=10)
        
        # Check AWS credentials
        subprocess.run(['aws', 'sts', 'get-caller-identity'],
                      check=True, capture_output=True, timeout=10)
        
        return jsonify({
            'status': 'healthy',
            'timestamp': datetime.now(timezone.utc).isoformat()
        }), 200
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e),
            'timestamp': datetime.now(timezone.utc).isoformat()
        }), 503

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Reporting

Daily Report

#!/bin/bash
# daily-report.sh

DATE=$(date +%Y-%m-%d)
REPORT_FILE="reports/daily-report-$DATE.txt"

{
  echo "=== S3 Provisioner Daily Report: $DATE ==="
  echo ""
  
  echo "Operations Summary:"
  grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
    awk '{print $6}' | sort | uniq -c
  
  echo ""
  echo "Error Summary:"
  grep -h "ERROR" reports/*$(date +%Y%m%d)*.log | wc -l
  
  echo ""
  echo "Performance:"
  echo "Average execution time: $(grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
    awk '{sum+=$NF; count++} END {if (count) print sum/count " seconds"; else print "n/a"}')"
  
  echo ""
  echo "Resource Usage:"
  echo "Buckets created: $(grep -h "Bucket created" reports/*$(date +%Y%m%d)*.log | wc -l)"
  echo "Templates generated: $(grep -h "Template generated" reports/*$(date +%Y%m%d)*.log | wc -l)"
  
} | tee "$REPORT_FILE"

# Email report
mail -s "S3 Provisioner Daily Report" ops-team@company.com < "$REPORT_FILE"

Weekly Report

#!/bin/bash
# weekly-report.sh

WEEK_START=$(date -d "7 days ago" +%Y%m%d)
WEEK_END=$(date +%Y%m%d)

# Only read logs modified within the last 7 days (selected by modification time)
mapfile -t LOGS < <(find reports -name '*.log' -mtime -7)
if [ "${#LOGS[@]}" -eq 0 ]; then
  echo "No logs found for $WEEK_START to $WEEK_END"
  exit 0
fi

echo "=== S3 Provisioner Weekly Report: $WEEK_START to $WEEK_END ==="

# Total operations
SUCCESS=$(cat "${LOGS[@]}" | grep -c "completed successfully" || true)
echo "Total operations: $SUCCESS"

# Success rate (guard against division by zero)
TOTAL=$(cat "${LOGS[@]}" | grep -c "Action" || true)
if [ "$TOTAL" -gt 0 ]; then
  echo "Success rate: $(echo "scale=2; $SUCCESS*100/$TOTAL" | bc)%"
else
  echo "Success rate: n/a"
fi

# Most common operations
echo ""
echo "Most common operations:"
grep -h "completed successfully" "${LOGS[@]}" | \
  awk '{print $6}' | sort | uniq -c | sort -rn | head -5

Automation

Cron Schedule

# Edit crontab
crontab -e

# Health check every 5 minutes
*/5 * * * * /home/user/scripts/health-check.sh

# Detailed health check every hour
0 * * * * /home/user/scripts/health-check-detailed.sh

# Log analysis every 6 hours
0 */6 * * * /home/user/scripts/analyze-logs.sh

# Daily report at 8 AM
0 8 * * * /home/user/scripts/daily-report.sh

# Weekly report on Monday at 9 AM
0 9 * * 1 /home/user/scripts/weekly-report.sh

Monitoring Checklist

  • CloudWatch metrics configured

  • CloudWatch alarms set up

  • Log monitoring in place

  • Performance tracking enabled

  • Resource monitoring active

  • Health checks scheduled

  • Alerting configured

  • Daily reports automated

  • Weekly reports automated

  • Dashboard created