Monitoring and Health Checks
This page covers monitoring strategies and health check procedures for vpc-provisioner-tool.
Health Check Endpoints
Basic Health Check
#!/bin/bash
# health-check.sh
set -e
echo "=== VPC Provisioner Health Check ==="
# 1. Check Python environment
if python -m vpc_provisioner.cli --version > /dev/null 2>&1; then
    echo "✓ Python environment OK"
else
    echo "✗ Python environment FAILED"
    exit 1
fi
# 2. Check AWS credentials
if aws sts get-caller-identity > /dev/null 2>&1; then
    echo "✓ AWS credentials OK"
else
    echo "✗ AWS credentials FAILED"
    exit 1
fi
# 3. Check EC2/VPC access (describe-vpcs rejects --max-results values below 5)
if aws ec2 describe-vpcs --max-results 5 > /dev/null 2>&1; then
    echo "✓ VPC access OK"
else
    echo "✗ VPC access FAILED"
    exit 1
fi
# 4. Check CloudFormation access
if aws cloudformation list-stacks --max-items 1 > /dev/null 2>&1; then
    echo "✓ CloudFormation access OK"
else
    echo "✗ CloudFormation access FAILED"
    exit 1
fi
echo "=== Health Check PASSED ==="
exit 0
Detailed Health Check
#!/bin/bash
# health-check-detailed.sh
HEALTH_LOG="/var/log/vpc-provisioner-health.log"
{
    echo "=== Health Check: $(date) ==="
    # System resources
    echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')"
    echo "Memory Usage: $(free -m | awk 'NR==2{printf "%.2f%%", $3*100/$2 }')"
    echo "Disk Usage: $(df -h /home/user | awk 'NR==2{print $5}')"
    # Python environment
    echo "Python Version: $(python --version 2>&1)"
    echo "Package Version: $(pip show vpc-provisioner | awk '/^Version:/{print $2}')"
    # AWS connectivity
    echo "AWS Account: $(aws sts get-caller-identity --query Account --output text)"
    echo "AWS Region: $(aws configure get region)"
    # VPC resources
    echo "VPC Count: $(aws ec2 describe-vpcs --query 'length(Vpcs)' --output text)"
    # Recent errors: one total across all logs (grep -c on several files prints one count per file)
    ERROR_COUNT=$(cat reports/*.log 2>/dev/null | grep -c ERROR)
    echo "Recent Errors: ${ERROR_COUNT:-0}"
    echo "=== Health Check Complete ==="
} | tee -a "$HEALTH_LOG"
CloudWatch Metrics
Custom Metrics
Send metrics to CloudWatch:
# metrics.py
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

def send_metric(metric_name, value, unit='Count'):
    """Send custom metric to CloudWatch"""
    cloudwatch.put_metric_data(
        Namespace='VPCProvisioner',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
                'Timestamp': datetime.now(timezone.utc)
            }
        ]
    )

# Usage (guarded so that importing this module does not send metrics)
if __name__ == '__main__':
    send_metric('VPCCreationSuccess', 1)
    send_metric('VPCCreationTime', 125.8, 'Seconds')
    send_metric('TemplateGenerationSuccess', 1)
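CloudWatch accepts several data points in one `put_metric_data` call, so a run that emits multiple metrics can batch them. The following is a minimal sketch, not part of the tool: `build_metric_data`, `send_metrics`, and the `Environment` dimension are hypothetical names chosen for illustration, and the client is passed in so the payload can be inspected without AWS access.

```python
from datetime import datetime, timezone

NAMESPACE = "VPCProvisioner"  # same namespace used by send_metric


def build_metric_data(metrics, environment="dev"):
    """Build a MetricData list from (name, value, unit) tuples.

    `environment` is a hypothetical dimension used only for illustration.
    """
    now = datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Value": value,
            "Unit": unit,
            "Timestamp": now,
            "Dimensions": [{"Name": "Environment", "Value": environment}],
        }
        for name, value, unit in metrics
    ]


def send_metrics(client, metrics, environment="dev"):
    """Send all metrics in a single put_metric_data call.

    `client` is expected to be a boto3 CloudWatch client.
    """
    client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(metrics, environment),
    )
```

A caller would pass `boto3.client("cloudwatch")` and a list such as `[("VPCCreationSuccess", 1, "Count"), ("VPCCreationTime", 125.8, "Seconds")]`. Note that CloudWatch treats each combination of dimensions as a separate metric, so dashboards and alarms would have to name the same dimension to see these data points.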
Key Metrics to Monitor
1. Operation Success Rate
   - VPC creation success/failure
   - Template generation success/failure
   - Deployment success/failure
2. Performance Metrics
   - VPC creation time
   - Template generation time
   - Deployment time
3. Resource Metrics
   - Number of VPCs created
   - Number of subnets created
   - Number of route tables created
4. Error Metrics
   - API errors
   - Validation errors
   - Permission errors
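The success-rate figures in this list are derived from paired success and failure counts. A small sketch of that arithmetic, using a hypothetical `success_rate` helper that is not part of the tool:

```python
def success_rate(successes, failures):
    """Return the success percentage, or None when no operations ran."""
    total = successes + failures
    if total == 0:
        # Avoid dividing by zero on a quiet period
        return None
    return 100.0 * successes / total
```

Returning `None` for an empty period keeps "no data" distinct from "0% success", which matters when the value feeds an alert.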
CloudWatch Dashboard
# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
--dashboard-name VPCProvisionerDashboard \
--dashboard-body file://dashboard.json
dashboard.json:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["VPCProvisioner", "VPCCreationSuccess"],
          [".", "VPCCreationFailure"]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "us-west-2",
        "title": "VPC Creation Status"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["VPCProvisioner", "VPCCreationTime"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-west-2",
        "title": "Average Creation Time"
      }
    }
  ]
}
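When the same dashboard is needed in more than one region, the body can be generated instead of edited by hand. This is a sketch under that assumption; `metric_widget` and `build_dashboard` are hypothetical helper names, and the widget fields mirror the `dashboard.json` shown above.

```python
import json


def metric_widget(title, metrics, stat, region="us-west-2", period=300):
    """Return one metric widget in the same shape as dashboard.json."""
    return {
        "type": "metric",
        "properties": {
            "metrics": metrics,
            "period": period,
            "stat": stat,
            "region": region,
            "title": title,
        },
    }


def build_dashboard(region="us-west-2"):
    """Return the dashboard body as a JSON string."""
    widgets = [
        metric_widget(
            "VPC Creation Status",
            [["VPCProvisioner", "VPCCreationSuccess"], [".", "VPCCreationFailure"]],
            "Sum",
            region,
        ),
        metric_widget(
            "Average Creation Time",
            [["VPCProvisioner", "VPCCreationTime"]],
            "Average",
            region,
        ),
    ]
    return json.dumps({"widgets": widgets}, indent=2)
```

Writing the result of `build_dashboard()` to `dashboard.json` yields a file that can be passed to `aws cloudwatch put-dashboard` as before.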
Log Monitoring
Log Analysis
#!/bin/bash
# analyze-logs.sh
REPORTS_DIR="reports"
DATE=$(date +%Y%m%d)
echo "=== Log Analysis for $DATE ==="
# Count operations
echo "Operations today:"
grep -h "Action.*completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
awk '{print $6}' | sort | uniq -c
# Count errors
echo ""
echo "Errors today:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | wc -l
# Recent errors
echo ""
echo "Recent error messages:"
grep -h "ERROR" "$REPORTS_DIR"/*$DATE*.log | tail -5
# Performance stats
echo ""
echo "Average execution time:"
grep -h "completed successfully" "$REPORTS_DIR"/*$DATE*.log | \
awk '{sum+=$NF; count++} END {if (count) print sum/count " seconds"; else print "n/a (no completed operations)"}'
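The same analysis can be sketched in Python, which makes the field assumptions explicit. Like the `awk` commands above, it assumes the sixth whitespace-separated field of a "completed successfully" line is the operation name and the last field is the duration in seconds. The `summarize` function is a hypothetical illustration, not part of the tool.

```python
from collections import Counter


def summarize(lines):
    """Count operations and errors and average the durations.

    Assumes field 6 is the operation name and the last field is the
    duration in seconds, mirroring the awk commands in analyze-logs.sh.
    """
    operations = Counter()
    durations = []
    errors = 0
    for line in lines:
        if "ERROR" in line:
            errors += 1
        if "completed successfully" in line:
            fields = line.split()
            if len(fields) >= 6:
                operations[fields[5]] += 1
            try:
                durations.append(float(fields[-1]))
            except ValueError:
                pass  # last field is not numeric; skip it
    average = sum(durations) / len(durations) if durations else None
    return {
        "operations": dict(operations),
        "errors": errors,
        "avg_seconds": average,
    }
```

The function takes any iterable of lines, so it can be fed an open log file directly, for example `summarize(open(path))`.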
Real-time Log Monitoring
# Monitor logs in real-time
tail -f reports/*.log | grep --line-buffered -E "ERROR|WARNING|SUCCESS"
Alerting
CloudWatch Alarms
Create alarms for critical events:
# Alarm for repeated failures: more than 5 VPC creation failures in each of 2 consecutive 5-minute periods
aws cloudwatch put-metric-alarm \
--alarm-name vpc-provisioner-high-error-rate \
--alarm-description "Alert when error rate exceeds threshold" \
--metric-name VPCCreationFailure \
--namespace VPCProvisioner \
--statistic Sum \
--period 300 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-west-2:123456789012:alerts
# Alarm for slow performance: average creation time above 180 seconds in a 5-minute period
aws cloudwatch put-metric-alarm \
--alarm-name vpc-provisioner-slow-performance \
--alarm-description "Alert when creation time is too slow" \
--metric-name VPCCreationTime \
--namespace VPCProvisioner \
--statistic Average \
--period 300 \
--threshold 180 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-west-2:123456789012:alerts
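The two alarms can also be created from Python, which keeps their definitions in one place. This is a sketch rather than the tool's own code: `alarm_kwargs`, `ALARMS`, and `create_alarms` are hypothetical names, and `client` is expected to be a boto3 CloudWatch client supplied by the caller.

```python
def alarm_kwargs(name, description, metric, statistic, threshold, periods, topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": name,
        "AlarmDescription": description,
        "MetricName": metric,
        "Namespace": "VPCProvisioner",
        "Statistic": statistic,
        "Period": 300,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": periods,
        "AlarmActions": [topic_arn],
    }


# (name, description, metric, statistic, threshold, evaluation periods)
ALARMS = [
    ("vpc-provisioner-high-error-rate",
     "Alert when error rate exceeds threshold",
     "VPCCreationFailure", "Sum", 5, 2),
    ("vpc-provisioner-slow-performance",
     "Alert when creation time is too slow",
     "VPCCreationTime", "Average", 180, 1),
]


def create_alarms(client, topic_arn):
    """Create every alarm in ALARMS, sending notifications to topic_arn."""
    for name, description, metric, statistic, threshold, periods in ALARMS:
        client.put_metric_alarm(
            **alarm_kwargs(name, description, metric, statistic,
                           threshold, periods, topic_arn)
        )
```

Calling `create_alarms(boto3.client("cloudwatch"), "arn:aws:sns:us-west-2:123456789012:alerts")` would create the same two alarms as the CLI commands above.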
Email Alerts
#!/bin/bash
# alert-on-error.sh
LOG_FILE="reports/latest.log"
# grep -c already prints 0 when nothing matches; default to 0 only if the file is missing
ERROR_COUNT=$(grep -c ERROR "$LOG_FILE" 2>/dev/null)
ERROR_COUNT=${ERROR_COUNT:-0}
if [ "$ERROR_COUNT" -gt 5 ]; then
    # Send email alert
    echo "High error count detected: $ERROR_COUNT errors" | \
        mail -s "VPC Provisioner Alert" ops-team@company.com
fi
Performance Monitoring
Execution Time Tracking
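# performance.py
import time
import functools

from metrics import send_metric  # send_metric as defined in metrics.py (Custom Metrics)

def track_time(func):
    """Decorator to track function execution time"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic clock, so the duration is immune to system clock changes
        start = time.perf_counter()
        result = func(*args, **kwargs)
        duration = time.perf_counter() - start
        # Log performance
        print(f"{func.__name__} took {duration:.2f} seconds")
        # Send to CloudWatch
        send_metric(f"{func.__name__}Time", duration, 'Seconds')
        return result
    return wrapper

# Usage
# Note: this emits a metric named "create_vpcTime", not the "VPCCreationTime"
# metric referenced by the dashboard and alarms.
@track_time
def create_vpc(config):
    # ... VPC creation logic ...
    pass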
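The decorator above records a duration only when the wrapped function returns normally. A variant that also times failed calls can be sketched with `try`/`finally`. The `report` callback is a hypothetical parameter introduced here so the sink (CloudWatch, a log, a test double) can be swapped.

```python
import functools
import time


def track_time(report):
    """Decorator factory: report(metric_name, seconds) receives each timing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exception, so failed calls are timed too
                report(f"{func.__name__}Time", time.perf_counter() - start)
        return wrapper
    return decorator
```

With the `send_metric` function from `metrics.py`, usage would look like `@track_time(lambda name, secs: send_metric(name, secs, 'Seconds'))`.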
Resource Usage Monitoring
#!/bin/bash
# monitor-resources.sh
while true; do
    # CPU usage
    CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
    # Memory usage
    MEM=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')
    # Disk usage
    DISK=$(df -h /home/user | awk 'NR==2{print $5}' | sed 's/%//')
    # Log to file
    echo "$(date),CPU:$CPU,MEM:$MEM%,DISK:$DISK%" >> /var/log/vpc-provisioner-resources.log
    # Send to CloudWatch
    aws cloudwatch put-metric-data \
        --namespace VPCProvisioner \
        --metric-name CPUUsage \
        --value "$CPU" \
        --unit Percent
    sleep 60
done
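Where parsing `top` and `df` output is too fragile, part of the same snapshot can be taken with the Python standard library. This is a sketch under stated assumptions: `disk_percent` and `snapshot` are hypothetical helpers, `os.getloadavg` is available on Unix-like systems only, and load average per CPU is a stand-in for CPU usage rather than the same measurement.

```python
import os
import shutil


def disk_percent(path="/"):
    """Return used disk space for `path` as a percentage of total."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def snapshot(path="/"):
    """Return a small resource snapshot using only the standard library."""
    load_1min, _, _ = os.getloadavg()  # Unix-like systems only
    cpus = os.cpu_count() or 1
    return {
        "disk_percent": round(disk_percent(path), 2),
        "load_per_cpu": round(load_1min / cpus, 2),
    }
```

The disk figure may differ slightly from the `df` percentage, because `df` accounts for blocks reserved for the superuser.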
Availability Monitoring
Uptime Check
#!/bin/bash
# uptime-check.sh
ENDPOINT="http://localhost:8080/health"
MAX_RETRIES=3
RETRY_COUNT=0
while [ "$RETRY_COUNT" -lt "$MAX_RETRIES" ]; do
    if curl -f -s "$ENDPOINT" > /dev/null; then
        echo "✓ Service is UP"
        exit 0
    else
        RETRY_COUNT=$((RETRY_COUNT + 1))
        echo "Retry $RETRY_COUNT/$MAX_RETRIES"
        sleep 5
    fi
done
echo "✗ Service is DOWN"
exit 1
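The retry loop can be expressed in Python with the probe separated from the retry policy, which makes the policy testable without a running service. The names `http_ok` and `wait_until_up` are hypothetical; the defaults mirror the three retries and five-second delay used in the script above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if `url` answers with a 2xx status (like `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and 4xx/5xx responses
        return False


def wait_until_up(check, max_retries=3, delay=5, sleep=time.sleep):
    """Call `check` up to `max_retries` times, pausing between attempts."""
    for attempt in range(1, max_retries + 1):
        if check():
            return True
        if attempt < max_retries:
            sleep(delay)  # no pause after the final failed attempt
    return False
```

A caller would run `wait_until_up(lambda: http_ok("http://localhost:8080/health"))` and exit non-zero when it returns `False`.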
Service Status Page
# status.py
from flask import Flask, jsonify
import subprocess
from datetime import datetime, timezone

app = Flask(__name__)

@app.route('/health')
def health():
    """Health check endpoint"""
    try:
        # Check Python environment (the timeout keeps a hung command from blocking the endpoint)
        subprocess.run(['python', '-m', 'vpc_provisioner.cli', '--version'],
                       check=True, capture_output=True, timeout=30)
        # Check AWS credentials
        subprocess.run(['aws', 'sts', 'get-caller-identity'],
                       check=True, capture_output=True, timeout=30)
        return jsonify({
            'status': 'healthy',
            'timestamp': datetime.now(timezone.utc).isoformat()
        }), 200
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e),
            'timestamp': datetime.now(timezone.utc).isoformat()
        }), 503

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
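The endpoint above stops at the first failing check, so the response names only one problem. A sketch of a per-check report follows, using a hypothetical `run_checks` helper that takes a mapping of check names to callables and returns a body and HTTP status in the same style as the handler.

```python
def run_checks(checks):
    """Run every check and return (body, http_status).

    `checks` maps a name to a zero-argument callable that raises on failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": results,
    }
    return body, 200 if healthy else 503
```

Inside a Flask handler, the result could be returned as `jsonify(body), status`, with each `subprocess.run(...)` call wrapped in its own callable.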
Reporting
Daily Report
#!/bin/bash
# daily-report.sh
DATE=$(date +%Y-%m-%d)
REPORT_FILE="reports/daily-report-$DATE.txt"
{
echo "=== VPC Provisioner Daily Report: $DATE ==="
echo ""
echo "Operations Summary:"
grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
awk '{print $6}' | sort | uniq -c
echo ""
echo "Error Summary:"
grep -h "ERROR" reports/*$(date +%Y%m%d)*.log | wc -l
echo ""
echo "Performance:"
echo "Average execution time: $(grep -h "completed successfully" reports/*$(date +%Y%m%d)*.log | \
awk '{sum+=$NF; count++} END {if (count) print sum/count " seconds"; else print "n/a"}')"
echo ""
echo "Resource Usage:"
echo "VPCs created: $(grep -h "VPC created" reports/*$(date +%Y%m%d)*.log | wc -l)"
echo "Templates generated: $(grep -h "Template generated" reports/*$(date +%Y%m%d)*.log | wc -l)"
} | tee "$REPORT_FILE"
# Email report
mail -s "VPC Provisioner Daily Report" ops-team@company.com < "$REPORT_FILE"
Weekly Report
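#!/bin/bash
# weekly-report.sh
WEEK_START=$(date -d "7 days ago" +%Y%m%d)
WEEK_END=$(date +%Y%m%d)
echo "=== VPC Provisioner Weekly Report: $WEEK_START to $WEEK_END ==="
# Restrict the report to logs from the reporting window. This assumes log file
# names contain a YYYYMMDD date, as the *YYYYMMDD*.log globs in the daily scripts do.
shopt -s nullglob
LOGS=()
for i in $(seq 0 7); do
    LOGS+=(reports/*"$(date -d "$i days ago" +%Y%m%d)"*.log)
done
if [ ${#LOGS[@]} -eq 0 ]; then
    echo "No log files found for this period"
    exit 0
fi
# Total operations
SUCCESS=$(grep -h "completed successfully" "${LOGS[@]}" | wc -l)
echo "Total operations: $SUCCESS"
# Success rate
TOTAL=$(grep -h "Action" "${LOGS[@]}" | wc -l)
if [ "$TOTAL" -gt 0 ]; then
    echo "Success rate: $(echo "scale=2; $SUCCESS*100/$TOTAL" | bc)%"
else
    echo "Success rate: n/a (no operations logged)"
fi
# Most common operations
echo ""
echo "Most common operations:"
grep -h "completed successfully" "${LOGS[@]}" | \
    awk '{print $6}' | sort | uniq -c | sort -rn | head -5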
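The report header names a seven-day window, so the input should be limited to log files from that window. A sketch of that selection follows, assuming log file names embed a `YYYYMMDD` date (as the `*YYYYMMDD*.log` globs in the daily scripts suggest). The `logs_in_window` helper is hypothetical.

```python
import re
from datetime import date, timedelta

DATE_IN_NAME = re.compile(r"(\d{8})")


def logs_in_window(filenames, end, days=7):
    """Return the file names whose embedded YYYYMMDD date falls in the window.

    The window runs from `end - days` through `end`, inclusive.
    """
    start = end - timedelta(days=days)
    low, high = start.strftime("%Y%m%d"), end.strftime("%Y%m%d")
    selected = []
    for name in filenames:
        match = DATE_IN_NAME.search(name)
        # Zero-padded YYYYMMDD strings compare correctly as plain strings
        if match and low <= match.group(1) <= high:
            selected.append(name)
    return selected
```

For example, `logs_in_window(glob.glob("reports/*.log"), date.today())` would return the files to analyze for the current week.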
Automation
Cron Schedule
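# Edit crontab (cron starts jobs in the user's home directory, so relative paths such as reports/ resolve against $HOME)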
crontab -e
# Health check every 5 minutes
*/5 * * * * /home/user/scripts/health-check.sh
# Detailed health check every hour
0 * * * * /home/user/scripts/health-check-detailed.sh
# Log analysis every 6 hours
0 */6 * * * /home/user/scripts/analyze-logs.sh
# Daily report at 8 AM
0 8 * * * /home/user/scripts/daily-report.sh
# Weekly report on Monday at 9 AM
0 9 * * 1 /home/user/scripts/weekly-report.sh
Monitoring Checklist
CloudWatch metrics configured
CloudWatch alarms set up
Log monitoring in place
Performance tracking enabled
Resource monitoring active
Health checks scheduled
Alerting configured
Daily reports automated
Weekly reports automated
Dashboard created