Performance Tuning Guide¶

Performance optimization strategies for VPC infrastructure provisioned by the VPC Provisioner.

Table of Contents¶

Network Performance
NAT Gateway Performance
Subnet and AZ Optimization
VPC Endpoint Performance
ML Workload Optimization
Provisioner Performance
Monitoring Performance

Network Performance¶

Bandwidth Limits¶

Resource	Bandwidth	Notes
NAT Gateway	Up to 100 Gbps	Starts at 5 Gbps and scales automatically
VPC Peering	No limit	Uses AWS backbone
Internet Gateway	No limit	Scales automatically
EC2 Instance	Varies by type	m5.xlarge (SageMaker ml.m5.xlarge) = up to 10 Gbps
S3 Gateway Endpoint	No limit	Free, no NAT overhead

Latency Optimization¶

Traffic Path	Typical Latency	Optimization
Same AZ	< 1 ms	Co-locate communicating resources
Cross-AZ	1-2 ms	Use for HA, avoid for latency-sensitive paths
To S3 (same region)	1-5 ms	Use VPC Gateway Endpoint
To S3 via NAT	5-15 ms	Avoid — use VPC Endpoint instead
Cross-region	20-100 ms	Use Transfer Acceleration for S3

NAT Gateway Performance¶

Throughput¶

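A NAT Gateway starts at 5 Gbps and scales automatically up to 100 Gbps. For sustained high throughput, the per-destination connection limits matter as much as bandwidth:

Each NAT Gateway supports up to 55,000 simultaneous connections to a single destination
It can open about 900 new connections per second to a single destination
Beyond these limits, new connections fail with port allocation errors (the ErrorPortAllocation metric in CloudWatch increases)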
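The limits above can be turned into a quick headroom check. The following is a minimal sketch, assuming you can estimate your workload's peak concurrent connections and new-connection rate toward its busiest single destination. The names `NatHeadroom` and `nat_headroom` are illustrative and not part of the provisioner.

```python
from dataclasses import dataclass

# Per-destination limits quoted above for a single NAT Gateway.
MAX_CONCURRENT_PER_DESTINATION = 55_000
MAX_NEW_CONNECTIONS_PER_SECOND = 900


@dataclass
class NatHeadroom:
    concurrent_utilization: float  # fraction of the 55,000 limit in use
    rate_utilization: float        # fraction of the 900/s limit in use
    within_limits: bool


def nat_headroom(peak_concurrent: int, peak_new_per_second: float) -> NatHeadroom:
    """Compare a workload's peak load to one destination against the limits."""
    concurrent = peak_concurrent / MAX_CONCURRENT_PER_DESTINATION
    rate = peak_new_per_second / MAX_NEW_CONNECTIONS_PER_SECOND
    return NatHeadroom(concurrent, rate, concurrent <= 1.0 and rate <= 1.0)


# Hypothetical workload: 11,000 open connections, 450 new connections/s.
print(nat_headroom(peak_concurrent=11_000, peak_new_per_second=450))
```

If either utilization approaches 1.0, spreading traffic across one NAT Gateway per AZ or moving AWS service traffic to VPC Endpoints reduces the load on each gateway.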

High Availability Configuration¶

# Single NAT — all traffic through one gateway
vpc:
  nat_gateway:
    enabled: true
    high_availability: false    # Single point of failure

# HA NAT — one per AZ, traffic stays local
vpc:
  nat_gateway:
    enabled: true
    high_availability: true     # Better performance + resilience

HA NAT Gateways improve performance because traffic stays within the same AZ — no cross-AZ hop.

When to Avoid NAT Gateway¶

For AWS service traffic (S3, DynamoDB), use VPC Gateway Endpoints instead of NAT Gateway:

Lower latency (direct path vs NAT hop)
Higher throughput (no NAT bottleneck)
Zero cost (Gateway Endpoints are free)

Subnet and AZ Optimization¶

Co-Location Strategy¶

Place resources that communicate frequently in the same AZ:

us-west-2a:
  private-app-subnet-1:
    - SageMaker training instances
    - Lambda inference functions
  database-subnet-1:
    - RDS primary instance

us-west-2b:
  private-app-subnet-2:
    - SageMaker endpoint instances (HA)
  database-subnet-2:
    - RDS standby instance (HA)

Subnet Sizing¶

Size subnets based on expected resource count:

Subnet CIDR	Usable IPs	Best For
/24	251	Application subnets (EC2, ECS, Lambda)
/26	59	Database subnets (RDS, ElastiCache)
/28	11	Small utility subnets
/20	4,091	Large-scale EKS or SageMaker workloads

The provisioner defaults (/24 for app, /26 for database) work well for most ML workloads.
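The usable-IP figures in the table follow from the fact that AWS reserves five addresses in every subnet. A minimal sketch using Python's standard `ipaddress` module reproduces them; apart from `10.0.11.0/24`, the CIDRs below are hypothetical examples.

```python
import ipaddress

# AWS reserves 5 addresses per subnet: network address, VPC router,
# DNS, one held for future use, and the broadcast address.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return the number of addresses AWS leaves assignable in a subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


for cidr in ("10.0.11.0/24", "10.0.21.0/26", "10.0.31.0/28", "10.0.64.0/20"):
    print(cidr, usable_ips(cidr))
```

When sizing for SageMaker or EKS, compare the result against the peak number of network interfaces you expect, not just the instance count, since each instance can consume more than one address.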

Multiple Private Subnets¶

For ML workloads that need isolation:

vpc:
  subnets:
    private:
      - name: private-app-subnet-1      # Application tier
        cidr: 10.0.11.0/24
        az: us-west-2a
      - name: private-ml-subnet-1       # ML training tier
        cidr: 10.0.13.0/24
        az: us-west-2a

Separate subnets keep ML training jobs from exhausting the application tier's IP addresses, and let each tier have its own route table and network ACL.
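Before adding a new tier, it helps to confirm that its CIDR fits inside the VPC and does not overlap an existing subnet. The sketch below is one way to check that offline; the VPC CIDR `10.0.0.0/16` and the function name `check_subnets` are assumptions for illustration, and the provisioner's own validate-config action may already perform similar checks.

```python
import ipaddress
from itertools import combinations


def check_subnets(vpc_cidr: str, subnet_cidrs: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet must sit inside the VPC range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    # No two subnets may share addresses.
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems


layout = {
    "private-app-subnet-1": "10.0.11.0/24",
    "private-ml-subnet-1": "10.0.13.0/24",
}
print(check_subnets("10.0.0.0/16", layout))
```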

VPC Endpoint Performance¶

Gateway Endpoints (S3, DynamoDB)¶

Gateway Endpoints route traffic directly to the service without NAT:

Without Endpoint:  Instance → NAT Gateway → Internet Gateway → S3 (public endpoint)
With Endpoint:     Instance → VPC Endpoint → S3 (direct)

Performance improvement:

Latency: 50-70% reduction (no NAT hop)
Throughput: No NAT Gateway bottleneck
Cost: Free (no NAT data processing charges)

Configure via S3 Provisioner:

s3:
  vpc_id: "vpc-0a1b2c3d4e5f6g7h8"
  route_table_ids: "rtb-0a1b2c3d,rtb-4e5f6g7h"
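For readers who want to see what a Gateway Endpoint amounts to at the API level, the sketch below builds the arguments for the EC2 `CreateVpcEndpoint` call from the same two values. It is not a description of how the S3 Provisioner works internally; the helper names are hypothetical, and `create_gateway_endpoint` needs boto3 and AWS credentials to run.

```python
def gateway_endpoint_params(vpc_id: str, route_table_ids: str, region: str,
                            service: str = "s3") -> dict:
    """Build keyword arguments for EC2 CreateVpcEndpoint (Gateway type)."""
    # The config above passes route tables as one comma-separated string.
    tables = [rtb.strip() for rtb in route_table_ids.split(",") if rtb.strip()]
    if not tables:
        raise ValueError("at least one route table ID is required")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "RouteTableIds": tables,
    }


def create_gateway_endpoint(params: dict, region: str) -> str:
    """Create the endpoint with boto3 and return its ID."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]


params = gateway_endpoint_params(
    "vpc-0a1b2c3d4e5f6g7h8", "rtb-0a1b2c3d,rtb-4e5f6g7h", "us-west-2"
)
print(params)
```

Every route table listed receives a route to the service's prefix list, so include the route tables of all private subnets that should bypass the NAT Gateway.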

Interface Endpoints (SageMaker, ECR, CloudWatch)¶

For other AWS services, consider Interface Endpoints:

SageMaker API and Runtime
ECR (for container image pulls)
CloudWatch Logs
SSM Parameter Store

Each Interface Endpoint costs about $0.01/hour per AZ, plus a per-GB data processing charge, but removes the NAT dependency for AWS service traffic.
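To budget for Interface Endpoints, multiply the hourly rate by the number of endpoints and the number of AZs they span. The sketch below is a rough estimate only: it assumes a 730-hour month, uses the hourly rate quoted above as a default, and ignores data processing charges. The function name is illustrative.

```python
HOURS_PER_MONTH = 730  # average month length; an assumption for estimation


def interface_endpoint_monthly_cost(endpoints: int, azs: int,
                                    hourly_rate: float = 0.01) -> float:
    """Estimate the fixed monthly charge in USD, excluding data processing."""
    return round(endpoints * azs * hourly_rate * HOURS_PER_MONTH, 2)


# Hypothetical setup: four endpoints, each deployed in two AZs.
print(interface_endpoint_monthly_cost(endpoints=4, azs=2))
```

Comparing this fixed cost with the NAT Gateway data processing charges for the same traffic shows whether an endpoint pays for itself.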

ML Workload Optimization¶

SageMaker Training in VPC¶

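from sagemaker.estimator import Estimator

# Place training in private subnets with S3 VPC Endpoint
estimator = Estimator(
    image_uri="<training-image-uri>",        # placeholder
    role="<sagemaker-execution-role-arn>",   # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",            # example instance type
    subnets=["subnet-private-app-1", "subnet-private-app-2"],
    security_group_ids=["sg-ml-training"],
    # S3 data access goes through the VPC Endpoint: fast and free
)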

SageMaker Distributed Training¶

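For multi-instance training, keep all instances in one AZ to minimize inter-node latency. The security group must allow inbound traffic from itself so the training nodes can reach each other:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",        # placeholder
    role="<sagemaker-execution-role-arn>",   # placeholder
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    subnets=["subnet-private-ml-1"],         # Single AZ for low latency
    security_group_ids=["sg-ml-training"],   # Must be set together with subnets
)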

Lambda in VPC¶

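Lambda functions attached to a VPC no longer pay a large network attachment penalty at cold start: since 2019, the network interfaces are created when the function is created or its VPC settings change, not on invocation. The remaining cold start time comes mostly from runtime and code initialization. Mitigate with:

Provisioned Concurrency for latency-sensitive functions
Scheduled invocations to keep functions warm (best effort only)
Larger memory sizes, which also allocate more CPU and speed up initialization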

Provisioner Performance¶

Typical Operation Times¶

Action	Typical Duration	Notes
validate-config	< 1 second	Local only
create-policy	< 1 second	Local only
create-prov-template	< 1 second	Local only
validate-prov-template	< 1 second	Local only
show-changes	10-30 seconds	Creates and deletes ChangeSet
check-drift	15-60 seconds	Depends on resource count
test-deploy	90-180 seconds	Full stack with NAT Gateways
create-vpc	90-180 seconds	NAT Gateway creation is slowest
delete-vpc	60-120 seconds	Longer with NAT Gateways; their deletion can take 5-10 min

Why NAT Gateway Is Slow¶

NAT Gateway creation can take 2-5 minutes per gateway. With HA (one gateway in each of 2-3 AZs), this is the primary bottleneck in VPC deployment. The delay is AWS provisioning time, not provisioner overhead.

Monitoring Performance¶

VPC Flow Logs¶

Enable Flow Logs to analyze traffic patterns:

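# <flow-logs-iam-role-arn> is a placeholder for an IAM role that can write to CloudWatch Logs
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0a1b2c3d4e5f6g7h8 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/edge-prod-b001-us-west-2-vpc/flow-logs \
  --deliver-logs-permission-arn <flow-logs-iam-role-arn>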

Analyze for:

High cross-AZ traffic (co-location opportunity)
Traffic to AWS services without VPC Endpoints
Rejected traffic (security group or NACL issues)
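The first and third checks can be approximated with a short script over exported flow log records. The sketch below assumes the default (version 2) record format and a hand-maintained map from subnet CIDR to AZ; the sample records, the `10.0.12.0/24` subnet, and the function name `summarize` are made up for illustration.

```python
import ipaddress
from collections import Counter

# Field order of the default (version 2) flow log format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def summarize(lines, subnet_az):
    """Tally rejected records and cross-AZ bytes from default-format flow logs."""
    nets = [(ipaddress.ip_network(cidr), az) for cidr, az in subnet_az.items()]

    def az_of(addr):
        ip = ipaddress.ip_address(addr)
        return next((az for net, az in nets if ip in net), None)

    stats = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers or custom-format records
        rec = dict(zip(FIELDS, parts))
        if rec["log_status"] != "OK":
            continue  # NODATA / SKIPDATA records carry "-" placeholders
        if rec["action"] == "REJECT":
            stats["rejected_records"] += 1
        src_az, dst_az = az_of(rec["srcaddr"]), az_of(rec["dstaddr"])
        if src_az and dst_az and src_az != dst_az:
            stats["cross_az_bytes"] += int(rec["bytes"])
    return stats


subnet_az = {"10.0.11.0/24": "us-west-2a", "10.0.12.0/24": "us-west-2b"}
sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.11.15 10.0.12.20 49152 5432 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.11.15 10.0.11.40 49153 443 6 5 1200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.9 10.0.11.15 40000 22 6 1 40 1700000000 1700000060 REJECT OK",
]
print(summarize(sample, subnet_az))
```

A VPC-level flow log records the same flow at both the source and destination network interfaces, so treat the byte totals as relative indicators rather than exact volumes.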

NAT Gateway Metrics¶

Monitor in CloudWatch:

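BytesOutToDestination: throughput
PacketsDropCount: dropped packets; sustained nonzero values point to capacity issues (consider HA)
ErrorPortAllocation: port exhaustion from too many concurrent connections to one destination
ConnectionAttemptCount: connection rate
ActiveConnectionCount: concurrent connections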
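These metrics can also be pulled programmatically. The sketch below builds CloudWatch `GetMetricStatistics` requests for the `AWS/NATGateway` namespace; the choice of statistic per metric is a suggestion, the helper names are hypothetical, and `fetch` needs boto3 and AWS credentials to run.

```python
from datetime import datetime, timedelta, timezone

# Metric name -> statistic that is usually most meaningful for it.
NAT_METRICS = {
    "BytesOutToDestination": "Sum",
    "PacketsDropCount": "Sum",
    "ConnectionAttemptCount": "Sum",
    "ActiveConnectionCount": "Maximum",
}


def nat_metric_queries(nat_gateway_id, hours=1, period=300, now=None):
    """Build one GetMetricStatistics request per NAT Gateway metric."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return [
        {
            "Namespace": "AWS/NATGateway",
            "MetricName": name,
            "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
            "StartTime": start,
            "EndTime": end,
            "Period": period,
            "Statistics": [stat],
        }
        for name, stat in NAT_METRICS.items()
    ]


def fetch(nat_gateway_id, region):
    """Run the queries with boto3 and return datapoints keyed by metric name."""
    import boto3  # lazy import: the query builder above has no dependencies

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    return {
        q["MetricName"]: cloudwatch.get_metric_statistics(**q)["Datapoints"]
        for q in nat_metric_queries(nat_gateway_id)
    }
```

Dividing the `ConnectionAttemptCount` sum by the period in seconds gives an average connection rate to compare against the per-destination limit described under Throughput.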

Network Performance Testing¶

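# Install iperf3 on two EC2 instances in different subnets
sudo apt install -y iperf3    # Debian/Ubuntu; use your distribution's package manager otherwise

# Server (instance in subnet A); its security group must allow TCP 5201 from the client
iperf3 -s

# Client (instance in subnet B): 30-second test with 10 parallel streams
iperf3 -c <server-private-ip> -t 30 -P 10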
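iperf3 measures throughput; to compare latency across the traffic paths in the Latency Optimization table, timing a TCP handshake is a simple proxy. The sketch below is an assumption-laden approximation (it measures connection setup, not application round trips), and the function name is illustrative.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Median time to complete a TCP handshake, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # The connection is closed as soon as the handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)
```

Call it with the private IP of the peer instance and any TCP port that instance is listening on, once from the same AZ and once from a different AZ, to see the difference for your own subnets.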

For cost optimization, see Cost Optimization. For architecture patterns, see User Guide.