Monitoring and Health Checks¶
Monitoring strategies and health check procedures for s3-provisioner-tool.
Health Check Endpoints¶
Basic Health Check¶
#!/bin/bash
# health-check.sh
set -e
echo "=== S3 Provisioner Health Check ==="
# 1. Check Python environment
if python -m s3_provisioner.cli --version > /dev/null 2>&1; then
echo "✓ Python environment OK"
else
echo "✗ Python environment FAILED"
exit 1
fi
# 2. Check AWS credentials
if aws sts get-caller-identity > /dev/null 2>&1; then
echo "✓ AWS credentials OK"
else
echo "✗ AWS credentials FAILED"
exit 1
fi
# 3. Check S3 access
if aws s3api list-buckets > /dev/null 2>&1; then
echo "✓ S3 access OK"
else
echo "✗ S3 access FAILED"
exit 1
fi
# 4. Check CloudFormation access
if aws cloudformation list-stacks --max-items 1 > /dev/null 2>&1; then
echo "✓ CloudFormation access OK"
else
echo "✗ CloudFormation access FAILED"
exit 1
fi
echo "=== Health Check PASSED ==="
exit 0
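The same sequence of checks can be sketched in Python, which may be convenient when the result feeds other Python tooling. This is a minimal sketch, not part of s3-provisioner-tool: `run_checks` and `CHECKS` are hypothetical names, the 30-second timeout is an assumed value, and the commands assume the AWS CLI is on `PATH`.

```python
import subprocess
import sys


def run_checks(checks):
    """Run (name, argv) checks in order and stop at the first failure."""
    for name, argv in checks:
        try:
            completed = subprocess.run(argv, capture_output=True, timeout=30)
            ok = completed.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Command missing from PATH, not executable, or hung.
            ok = False
        print(f"{name}: {'OK' if ok else 'FAILED'}")
        if not ok:
            return False
    return True


# Hypothetical check list mirroring the shell script above.
CHECKS = [
    ("Python environment", [sys.executable, "-m", "s3_provisioner.cli", "--version"]),
    ("AWS credentials", ["aws", "sts", "get-caller-identity"]),
    ("S3 access", ["aws", "s3api", "list-buckets"]),
    ("CloudFormation access", ["aws", "cloudformation", "list-stacks", "--max-items", "1"]),
]
```

Calling `sys.exit(0 if run_checks(CHECKS) else 1)` gives the same exit-code contract as the shell version, so the script can be scheduled the same way.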
Detailed Health Check¶
#!/bin/bash
# health-check-detailed.sh
HEALTH_LOG="/var/log/s3-provisioner-health.log"
{
echo "=== Health Check: $(date) ==="
# System resources
echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
echo "Memory Usage: $(free -m | awk 'NR==2{printf "%.2f%%", $3*100/$2 }')"
echo "Disk Usage: $(df -h /home/user | awk 'NR==2{print $5}')"
# Python environment
echo "Python Version: $(python --version)"
echo "Package Version: $(pip show s3-provisioner | grep Version)"
# AWS connectivity
echo "AWS Account: $(aws sts get-caller-identity --query Account --output text)"
echo "AWS Region: $(aws configure get region)"
# S3 resources
echo "Bucket Count: $(aws s3api list-buckets --query 'length(Buckets)' --output text)"
# Recent errors
# Concatenate first so grep -c prints one total rather than one count per file
ERROR_COUNT=$(cat reports/*.log 2>/dev/null | grep -c ERROR || true)
echo "Recent Errors: $ERROR_COUNT"
echo "=== Health Check Complete ==="
} | tee -a "$HEALTH_LOG"
CloudWatch Metrics¶
Custom Metrics¶
Send metrics to CloudWatch:
# metrics.py
from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')


def send_metric(metric_name, value, unit='Count'):
    """Send a custom metric to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace='S3Provisioner',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
                'Timestamp': datetime.now(timezone.utc)
            }
        ]
    )


# Usage (guarded so that importing this module does not send metrics)
if __name__ == '__main__':
    send_metric('BucketCreationSuccess', 1)
    send_metric('BucketCreationTime', 125.8, 'Seconds')
    send_metric('TemplateGenerationSuccess', 1)
Key Metrics to Monitor¶
Operation Success Rate
Bucket creation success/failure
Template generation success/failure
Deployment success/failure
Performance Metrics
Bucket creation time
Template generation time
Deployment time
Resource Metrics
Number of buckets created
Number of templates generated
Number of stacks deployed
Error Metrics
API errors
Validation errors
Permission errors
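One way to emit the success, failure, and timing metrics listed above from a single place is a context manager wrapped around each operation. The sketch below is illustrative only: `record_operation` is a hypothetical helper, and `put_metric` stands for any callable with the `(name, value, unit)` shape of `send_metric` from `metrics.py`. Metric names other than the `BucketCreation*` and `TemplateGenerationSuccess` names used in this document are assumptions.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_operation(put_metric, operation):
    """Emit <operation>Success or <operation>Failure, then <operation>Time."""
    start = time.monotonic()
    try:
        yield
    except Exception:
        put_metric(f"{operation}Failure", 1, "Count")
        raise  # the caller still sees the original error
    else:
        put_metric(f"{operation}Success", 1, "Count")
    finally:
        put_metric(f"{operation}Time", time.monotonic() - start, "Seconds")


if __name__ == "__main__":
    # Stand-in sink so the sketch runs without AWS access.
    with record_operation(lambda n, v, u: print(n, v, u), "BucketCreation"):
        pass  # the real bucket creation call would go here
```

Passing the metric sink as an argument keeps the helper testable without CloudWatch; in production the sink would be `send_metric`.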
CloudWatch Dashboard¶
# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
--dashboard-name S3ProvisionerDashboard \
--dashboard-body file://dashboard.json
dashboard.json:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["S3Provisioner", "BucketCreationSuccess"],
          [".", "BucketCreationFailure"]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "us-west-1",
        "title": "Bucket Creation Status"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["S3Provisioner", "BucketCreationTime"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-west-1",
        "title": "Average Creation Time"
      }
    }
  ]
}
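If the dashboard is managed from Python rather than a static file, the body can be generated and serialized in code. The following is a sketch under assumptions: `metric_widget` and `build_dashboard_body` are hypothetical helpers, and the output is a two-widget body using the same metrics, period, and region as `dashboard.json`.

```python
import json


def metric_widget(title, metrics, stat, region="us-west-1", period=300):
    """Build one CloudWatch metric widget definition."""
    return {
        "type": "metric",
        "properties": {
            "metrics": metrics,
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def build_dashboard_body():
    """Return the dashboard body as a JSON string."""
    widgets = [
        metric_widget(
            "Bucket Creation Status",
            [["S3Provisioner", "BucketCreationSuccess"], [".", "BucketCreationFailure"]],
            "Sum",
        ),
        metric_widget(
            "Average Creation Time",
            [["S3Provisioner", "BucketCreationTime"]],
            "Average",
        ),
    ]
    return json.dumps({"widgets": widgets}, indent=2)


if __name__ == "__main__":
    print(build_dashboard_body())
```

With boto3 installed and credentials configured, the string can be published with `boto3.client("cloudwatch").put_dashboard(DashboardName="S3ProvisionerDashboard", DashboardBody=build_dashboard_body())`.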
Log Monitoring¶
Log Analysis¶
#!/bin/bash
# analyze-logs.sh
REPORTS_DIR="reports"
DATE=$(date +%Y%m%d)
echo "=== Log Analysis for $DATE ==="
# Count operations
echo "Operations today:"
grep -h "Action.*completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
awk '{print $6}' | sort | uniq -c
# Count errors
echo ""
echo "Errors today:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | wc -l
# Recent errors
echo ""
echo "Recent error messages:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | tail -5
# Performance stats
echo ""
echo "Average execution time:"
grep -h "completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
awk '{sum+=$NF; count++} END {if (count) print sum/count " seconds"; else print "n/a"}'
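The same analysis can be sketched in Python when the results need to be reused, for example as metric values. The log format is not specified in this document, so the sketch relies on the same assumptions as the shell script: the operation name is the sixth whitespace-separated field and the duration in seconds is the last field. `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(lines):
    """Count operations and errors and average the durations in log lines."""
    operations = Counter()
    durations = []
    errors = 0
    for line in lines:
        fields = line.split()
        if "ERROR" in line:
            errors += 1
        if "completed successfully" in line and len(fields) >= 6:
            operations[fields[5]] += 1  # sixth field, like awk's $6
            try:
                durations.append(float(fields[-1]))  # last field, like $NF
            except ValueError:
                pass  # line does not end in a number; skip its duration
    average = sum(durations) / len(durations) if durations else None
    return {
        "operations": dict(operations),
        "errors": errors,
        "average_seconds": average,
    }
```

Returning `None` for the average when no operation completed avoids the division by zero that an empty log would otherwise cause.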
Real-time Log Monitoring¶
# Monitor logs in real-time
tail -f reports/*.log | grep --line-buffered -E "ERROR|WARNING|SUCCESS"
Alerting¶
CloudWatch Alarms¶
Create alarms for critical events:
# Alarm for high error rate
aws cloudwatch put-metric-alarm \
--alarm-name s3-provisioner-high-error-rate \
--alarm-description "Alert when error rate exceeds threshold" \
--metric-name BucketCreationFailure \
--namespace S3Provisioner \
--statistic Sum \
--period 300 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-west-1:123456789012:alerts
# Alarm for slow performance
aws cloudwatch put-metric-alarm \
--alarm-name s3-provisioner-slow-performance \
--alarm-description "Alert when creation time is too slow" \
--metric-name BucketCreationTime \
--namespace S3Provisioner \
--statistic Average \
--period 300 \
--threshold 180 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-west-1:123456789012:alerts
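The two alarms can also be defined as data and applied through boto3, which keeps thresholds in one reviewable place. This is a sketch, not the tool's own code: `alarm_definitions` and `apply_alarms` are hypothetical helpers, and the values simply repeat the CLI commands above. The client is passed in so the functions can be exercised without AWS access.

```python
def alarm_definitions(sns_topic_arn):
    """Return keyword arguments for CloudWatch put_metric_alarm calls."""
    common = {
        "Namespace": "S3Provisioner",
        "Period": 300,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
    return [
        {
            **common,
            "AlarmName": "s3-provisioner-high-error-rate",
            "AlarmDescription": "Alert when error rate exceeds threshold",
            "MetricName": "BucketCreationFailure",
            "Statistic": "Sum",
            "Threshold": 5,
            "EvaluationPeriods": 2,
        },
        {
            **common,
            "AlarmName": "s3-provisioner-slow-performance",
            "AlarmDescription": "Alert when creation time is too slow",
            "MetricName": "BucketCreationTime",
            "Statistic": "Average",
            "Threshold": 180,
            "EvaluationPeriods": 1,
        },
    ]


def apply_alarms(cloudwatch, definitions):
    """Create or update each alarm; cloudwatch is a boto3 CloudWatch client."""
    for definition in definitions:
        cloudwatch.put_metric_alarm(**definition)
    return [definition["AlarmName"] for definition in definitions]
```

A typical call would be `apply_alarms(boto3.client("cloudwatch"), alarm_definitions("arn:aws:sns:us-west-1:123456789012:alerts"))`.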
Email Alerts¶
#!/bin/bash
# alert-on-error.sh
LOG_FILE="reports/latest.log"
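# grep -c already prints 0 when nothing matches; default to 0 only if the file is missing
ERROR_COUNT=$(grep -c ERROR "$LOG_FILE" 2>/dev/null)
ERROR_COUNT=${ERROR_COUNT:-0}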
if [ "$ERROR_COUNT" -gt 5 ]; then
# Send email alert
echo "High error count detected: $ERROR_COUNT errors" | \
mail -s "S3 Provisioner Alert" ops-team@company.com
fi
Performance Monitoring¶
Execution Time Tracking¶
# performance.py
import functools
import time

from metrics import send_metric  # send_metric is defined in metrics.py


def track_time(func):
    """Decorator to track function execution time"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        duration = time.perf_counter() - start
        # Log performance
        print(f"{func.__name__} took {duration:.2f} seconds")
        # Send to CloudWatch; the metric name derives from the
        # function name (here "create_bucketTime")
        send_metric(f"{func.__name__}Time", duration, 'Seconds')
        return result
    return wrapper


# Usage
@track_time
def create_bucket(config):
    # ... Bucket creation logic ...
    pass
Resource Usage Monitoring¶
#!/bin/bash
# monitor-resources.sh
while true; do
# CPU usage
CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
# Memory usage
MEM=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')
# Disk usage
DISK=$(df -h /home/user | awk 'NR==2{print $5}' | sed 's/%//')
# Log to file
echo "$(date),CPU:$CPU,MEM:$MEM%,DISK:$DISK%" >> /var/log/s3-provisioner-resources.log
# Send to CloudWatch
aws cloudwatch put-metric-data \
--namespace S3Provisioner \
--metric-name CPUUsage \
--value "$CPU" \
--unit Percent
sleep 60
done
Availability Monitoring¶
Uptime Check¶
#!/bin/bash
# uptime-check.sh
ENDPOINT="http://localhost:8080/health"
MAX_RETRIES=3
RETRY_COUNT=0
while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
if curl -f -s "$ENDPOINT" > /dev/null; then
echo "✓ Service is UP"
exit 0
else
RETRY_COUNT=$((RETRY_COUNT + 1))
echo "Retry $RETRY_COUNT/$MAX_RETRIES"
sleep 5
fi
done
echo "✗ Service is DOWN"
exit 1
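For environments without `curl`, the retry loop can be sketched with the Python standard library. `http_probe` and `is_up` are hypothetical names, and the defaults copy the script above (3 attempts, 5 seconds apart). Unlike the shell version, this sketch does not sleep after the final failed attempt.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def is_up(probe, max_retries=3, delay=5, sleep=time.sleep):
    """Call probe() up to max_retries times, pausing between attempts."""
    for attempt in range(1, max_retries + 1):
        if probe():
            return True
        if attempt < max_retries:
            sleep(delay)
    return False
```

A caller would use it as `is_up(lambda: http_probe("http://localhost:8080/health"))`; injecting `probe` and `sleep` keeps the retry logic testable without a running service.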
Service Status Page¶
# status.py
import subprocess
import sys
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)


@app.route('/health')
def health():
    """Health check endpoint"""
    try:
        # Check Python environment (same interpreter that runs this app)
        subprocess.run([sys.executable, '-m', 's3_provisioner.cli', '--version'],
                       check=True, capture_output=True, timeout=30)
        # Check AWS credentials
        subprocess.run(['aws', 'sts', 'get-caller-identity'],
                       check=True, capture_output=True, timeout=30)
        return jsonify({
            'status': 'healthy',
            'timestamp': datetime.now(timezone.utc).isoformat()
        }), 200
    except Exception as e:
        # Covers non-zero exit codes, timeouts, and missing executables
        return jsonify({
            'status': 'unhealthy',
            'error': str(e),
            'timestamp': datetime.now(timezone.utc).isoformat()
        }), 503


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Reporting¶
Daily Report¶
#!/bin/bash
# daily-report.sh
DATE=$(date +%Y-%m-%d)
REPORT_FILE="reports/daily-report-$DATE.txt"
{
echo "=== S3 Provisioner Daily Report: $DATE ==="
echo ""
echo "Operations Summary:"
grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
awk '{print $6}' | sort | uniq -c
echo ""
echo "Error Summary:"
grep -h "ERROR" reports/*$(date +%Y%m%d)*.log | wc -l
echo ""
echo "Performance:"
echo "Average execution time: $(grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
awk '{sum+=$NF; count++} END {if (count) print sum/count " seconds"; else print "n/a"}')"
echo ""
echo "Resource Usage:"
echo "Buckets created: $(grep -h "Bucket created" reports/*$(date +%Y%m%d)*.log | wc -l)"
echo "Templates generated: $(grep -h "Template generated" reports/*$(date +%Y%m%d)*.log | wc -l)"
} | tee "$REPORT_FILE"
# Email report
mail -s "S3 Provisioner Daily Report" ops-team@company.com < "$REPORT_FILE"
Weekly Report¶
#!/bin/bash
# weekly-report.sh
WEEK_START=$(date -d "7 days ago" +%Y%m%d)
WEEK_END=$(date +%Y%m%d)
echo "=== S3 Provisioner Weekly Report: $WEEK_START to $WEEK_END ==="
# Only read logs modified in the last 7 days
mapfile -t LOGS < <(find reports -name '*.log' -mtime -7)
if [ "${#LOGS[@]}" -eq 0 ]; then
    echo "No logs found for this period"
    exit 0
fi
# Total operations
SUCCESS=$(cat "${LOGS[@]}" | grep -c "completed successfully" || true)
echo "Total operations: $SUCCESS"
# Success rate (skipped when there is nothing to divide by)
TOTAL=$(cat "${LOGS[@]}" | grep -c "Action" || true)
if [ "$TOTAL" -gt 0 ]; then
    echo "Success rate: $(echo "scale=2; $SUCCESS*100/$TOTAL" | bc)%"
else
    echo "Success rate: n/a"
fi
# Most common operations
echo ""
echo "Most common operations:"
cat "${LOGS[@]}" | grep "completed successfully" | \
    awk '{print $6}' | sort | uniq -c | sort -rn | head -5
Automation¶
Cron Schedule¶
# Edit crontab
crontab -e
# Health check every 5 minutes
*/5 * * * * /home/user/scripts/health-check.sh
# Detailed health check every hour
0 * * * * /home/user/scripts/health-check-detailed.sh
# Log analysis every 6 hours
0 */6 * * * /home/user/scripts/analyze-logs.sh
# Daily report at 8 AM
0 8 * * * /home/user/scripts/daily-report.sh
# Weekly report on Monday at 9 AM
0 9 * * 1 /home/user/scripts/weekly-report.sh
Monitoring Checklist¶
CloudWatch metrics configured
CloudWatch alarms set up
Log monitoring in place
Performance tracking enabled
Resource monitoring active
Health checks scheduled
Alerting configured
Daily reports automated
Weekly reports automated
Dashboard created