Performance Tuning Guide

Performance optimization strategies for VPC infrastructure provisioned by the VPC Provisioner.

Network Performance

Bandwidth Limits

| Resource | Bandwidth | Notes |
|---|---|---|
| NAT Gateway | Up to 100 Gbps | Starts at 5 Gbps and scales automatically |
| VPC Peering | No limit | Uses AWS backbone |
| Internet Gateway | No limit | Scales automatically |
| EC2 Instance | Varies by type | e.g. m5.xlarge / ml.m5.xlarge: up to 10 Gbps |
| S3 Gateway Endpoint | No limit | Free, no NAT overhead |

Latency Optimization

| Traffic Path | Typical Latency | Optimization |
|---|---|---|
| Same AZ | < 1 ms | Co-locate communicating resources |
| Cross-AZ | 1-2 ms | Use for HA; avoid for latency-sensitive paths |
| To S3 (same region) | 1-5 ms | Use VPC Gateway Endpoint |
| To S3 via NAT | 5-15 ms | Avoid; use a VPC Gateway Endpoint instead |
| Cross-region | 20-100 ms | Use Transfer Acceleration for S3 |
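
The latency figures above are typical values, and it can be worth checking them in your own VPC. The following is a minimal, standard-library sketch that times TCP handshakes between two hosts. The function name `tcp_connect_latency_ms` and its defaults are illustrative, not part of the provisioner, and a TCP connect measures roughly one network round trip plus connection setup, not application latency.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host: str, port: int, samples: int = 5, timeout: float = 2.0) -> dict:
    """Time `samples` TCP handshakes to host:port and summarize them in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns only after the TCP handshake has completed
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        timings.append(elapsed * 1000.0)
    return {
        "min_ms": min(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }
```

Run it from an instance in one subnet against a listening port on an instance in another (same AZ, then cross-AZ) and compare the medians with the table.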

NAT Gateway Performance

Throughput

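A NAT Gateway starts at 5 Gbps and scales automatically up to 100 Gbps. For sustained high throughput, the per-destination connection limits usually matter more than raw bandwidth:

  • Each NAT Gateway supports 55,000 simultaneous connections to a single destination

  • That corresponds to roughly 900 new connections per second to a single destination

  • If you exceed these limits, new connections fail with port allocation errors (visible as the ErrorPortAllocation metric in CloudWatch)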
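
As a rough illustration of how these limits can be checked against observed load, the sketch below expresses current usage as a fraction of each per-destination limit. The two constants come from the list above; the `NatLoad` type, the function names, and the 80% warning threshold are assumptions made for this example.

```python
from dataclasses import dataclass

# Per-destination limits quoted in the text above
MAX_CONNECTIONS_PER_DESTINATION = 55_000
MAX_NEW_CONNECTIONS_PER_SECOND = 900


@dataclass
class NatLoad:
    destination: str
    active_connections: int
    new_connections_per_second: float


def nat_utilization(load: NatLoad) -> dict:
    """Return utilization of each per-destination limit as a fraction (1.0 = at the limit)."""
    return {
        "destination": load.destination,
        "connections": load.active_connections / MAX_CONNECTIONS_PER_DESTINATION,
        "connection_rate": load.new_connections_per_second / MAX_NEW_CONNECTIONS_PER_SECOND,
    }


def at_risk(load: NatLoad, threshold: float = 0.8) -> bool:
    """True when either limit is at or above `threshold` of its capacity."""
    usage = nat_utilization(load)
    return max(usage["connections"], usage["connection_rate"]) >= threshold
```

The inputs would typically come from the ActiveConnectionCount and ConnectionAttemptCount metrics, broken down by destination using flow logs.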

High Availability Configuration

# Single NAT — all traffic through one gateway
vpc:
  nat_gateway:
    enabled: true
    high_availability: false    # Single point of failure

# HA NAT — one per AZ, traffic stays local
vpc:
  nat_gateway:
    enabled: true
    high_availability: true     # Better performance + resilience

HA NAT Gateways improve performance because traffic stays within the same AZ — no cross-AZ hop.
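
The effect of the `high_availability` flag on traffic paths can be sketched in a few lines. This is an illustration of the routing idea only, not the provisioner's implementation; the function names and the gateway names used in the usage note are hypothetical.

```python
def default_routes(subnet_az: dict, nat_az: dict) -> dict:
    """Map each private subnet to a NAT gateway, preferring one in the same AZ."""
    if not nat_az:
        raise ValueError("at least one NAT gateway is required")
    fallback = sorted(nat_az)[0]  # deterministic fallback when no same-AZ gateway exists
    # Iterate in reverse name order so the lowest-named gateway wins within each AZ
    by_az = {az: nat for nat, az in sorted(nat_az.items(), reverse=True)}
    return {subnet: by_az.get(az, fallback) for subnet, az in subnet_az.items()}


def cross_az_subnets(subnet_az: dict, nat_az: dict) -> list:
    """List subnets whose outbound traffic crosses an AZ boundary to reach its NAT gateway."""
    routes = default_routes(subnet_az, nat_az)
    return sorted(s for s, nat in routes.items() if nat_az[nat] != subnet_az[s])
```

With subnets in us-west-2a and us-west-2b and a single gateway in us-west-2a, the us-west-2b subnet is reported as crossing AZs; adding a second gateway in us-west-2b empties the list.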

When to Avoid NAT Gateway

For AWS service traffic (S3, DynamoDB), use VPC Gateway Endpoints instead of NAT Gateway:

  • Lower latency (direct path vs NAT hop)

  • Higher throughput (no NAT bottleneck)

  • Zero cost (Gateway Endpoints are free)

Subnet and AZ Optimization

Co-Location Strategy

Place resources that communicate frequently in the same AZ:

us-west-2a:
  private-app-subnet-1:
    - SageMaker training instances
    - Lambda inference functions
  database-subnet-1:
    - RDS primary instance

us-west-2b:
  private-app-subnet-2:
    - SageMaker endpoint instances (HA)
  database-subnet-2:
    - RDS standby instance (HA)

Subnet Sizing

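Size subnets based on expected resource count. AWS reserves 5 addresses in every subnet, so usable IPs are the CIDR size minus 5:

| Subnet CIDR | Usable IPs | Best For |
|---|---|---|
| /24 | 251 | Application subnets (EC2, ECS, Lambda) |
| /26 | 59 | Database subnets (RDS, ElastiCache) |
| /28 | 11 | Small utility subnets |
| /20 | 4,091 | Large-scale EKS or SageMaker workloads |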

The provisioner defaults (/24 for app, /26 for database) work well for most ML workloads.
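
The usable-IP figures follow from the subnet size minus the 5 addresses AWS reserves per subnet. A small sketch using Python's `ipaddress` module reproduces them and picks a prefix for a given resource count. The helper names and the 20% headroom default are assumptions for illustration.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Addresses available to resources in an AWS subnet with the given CIDR."""
    network = ipaddress.ip_network(cidr, strict=True)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def smallest_prefix_for(resource_count: int, headroom: float = 0.2) -> int:
    """Longest IPv4 prefix (smallest subnet) that fits resource_count plus headroom."""
    needed = int(resource_count * (1 + headroom)) + AWS_RESERVED_PER_SUBNET
    for prefix in range(28, 15, -1):  # AWS allows subnets from /28 to /16
        if 2 ** (32 - prefix) >= needed:
            return prefix
    raise ValueError("resource count does not fit in a /16 subnet")
```

For example, `usable_ips("10.0.11.0/24")` gives 251, matching the table, and `smallest_prefix_for(40)` suggests a /26.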

Multiple Private Subnets

For ML workloads that need isolation:

vpc:
  subnets:
    private:
      - name: private-app-subnet-1      # Application tier
        cidr: 10.0.11.0/24
        az: us-west-2a
      - name: private-ml-subnet-1       # ML training tier
        cidr: 10.0.13.0/24
        az: us-west-2a

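Separating ML training from application workloads keeps large training jobs from exhausting the application tier's IP addresses and lets each tier have its own route tables and network ACLs.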
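
When adding subnets by hand, it helps to confirm that the new CIDRs fall inside the VPC range and do not overlap each other. The check below is a sketch using the standard `ipaddress` module; the function name is illustrative, and the 10.0.0.0/16 VPC CIDR in the usage note is an assumption, since the VPC range is not shown above.

```python
import ipaddress
from itertools import combinations


def validate_subnets(vpc_cidr: str, subnets: dict) -> list:
    """Return a list of problems: subnets outside the VPC or overlapping each other."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside VPC {vpc}")
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems
```

Calling `validate_subnets("10.0.0.0/16", {"private-app-subnet-1": "10.0.11.0/24", "private-ml-subnet-1": "10.0.13.0/24"})` returns an empty list for the two subnets in the config above.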

VPC Endpoint Performance

Gateway Endpoints (S3, DynamoDB)

Gateway Endpoints route traffic directly to the service without NAT:

Without Endpoint:  Instance → NAT Gateway → Internet Gateway → S3 (public endpoint)
With Endpoint:     Instance → VPC Endpoint → S3 (direct)

Performance improvement:

  • Latency: 50-70% reduction (no NAT hop)

  • Throughput: No NAT Gateway bottleneck

  • Cost: Free (no NAT data processing charges)

Configure via S3 Provisioner:

s3:
  vpc_id: "vpc-0a1b2c3d4e5f6g7h8"
  route_table_ids: "rtb-0a1b2c3d,rtb-4e5f6g7h"
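
For reference, the raw EC2 API call that creates a Gateway Endpoint takes the same inputs. The sketch below only builds the request arguments for `CreateVpcEndpoint`; it is not a description of what the S3 Provisioner does internally, and the function name is hypothetical. It assumes the comma-separated `route_table_ids` format shown above.

```python
def gateway_endpoint_request(vpc_id: str, route_table_ids: str, region: str,
                             service: str = "s3") -> dict:
    """Build keyword arguments for EC2 CreateVpcEndpoint (Gateway type)."""
    tables = [rtb.strip() for rtb in route_table_ids.split(",") if rtb.strip()]
    if not tables:
        # Without a route table association the endpoint carries no traffic
        raise ValueError("a Gateway Endpoint needs at least one route table to take effect")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "RouteTableIds": tables,
    }


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2", region_name="us-west-2").create_vpc_endpoint(**request)
```

Every private route table that should reach S3 directly has to be listed; subnets whose route table is missing keep using the NAT path.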

Interface Endpoints (SageMaker, ECR, CloudWatch)

For other AWS services, consider Interface Endpoints:

  • SageMaker API and Runtime

  • ECR (for container image pulls)

  • CloudWatch Logs

  • SSM Parameter Store

These cost about $0.01/hour per endpoint per AZ, plus a per-GB data processing charge (rates vary by region), but eliminate NAT dependency for AWS service traffic.
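
The fixed hourly charge adds up across services and AZs, so a quick estimate is useful before enabling many endpoints. The sketch below uses the $0.01/hour figure from the text; the 730-hour month and the function name are assumptions, and data processing charges are not included.

```python
HOURS_PER_MONTH = 730  # average month length, an assumption for estimation


def interface_endpoint_monthly_cost(endpoints: int, azs: int, hourly_rate: float = 0.01) -> float:
    """Fixed monthly cost in USD of `endpoints` Interface Endpoints deployed in `azs` AZs."""
    return round(endpoints * azs * hourly_rate * HOURS_PER_MONTH, 2)
```

Note that some services in the list need more than one endpoint: SageMaker API and Runtime are separate endpoints, as are the ECR API and ECR Docker registry, so the list above corresponds to about six endpoints.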

ML Workload Optimization

SageMaker Training in VPC

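from sagemaker.estimator import Estimator

# Place training in private subnets with S3 VPC Endpoint
estimator = Estimator(
    image_uri=image_uri,            # placeholder: training image URI
    role=role,                      # placeholder: SageMaker execution role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-private-app-1", "subnet-private-app-2"],
    security_group_ids=["sg-ml-training"],
    # S3 data access goes through the VPC Endpoint: fast and free
)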

SageMaker Distributed Training

For multi-instance training, use the same AZ to minimize inter-node latency:

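estimator = Estimator(
    image_uri=image_uri,  # placeholder: training image URI
    role=role,            # placeholder: SageMaker execution role ARN
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    subnets=["subnet-private-ml-1"],  # Single AZ for low latency
    security_group_ids=["sg-ml-training"],  # Required with subnets; must allow node-to-node traffic
)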

Lambda in VPC

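Since 2019, Lambda attaches VPC functions through shared Hyperplane ENIs that are created when the function is configured rather than on each cold start, so the old multi-second VPC attachment penalty largely no longer applies. Ordinary cold starts (runtime and code initialization) remain. Mitigate with:

  • Provisioned Concurrency for latency-sensitive functions

  • Scheduled invocations to keep functions warm

  • More memory rather than less: CPU scales with memory, so initialization usually gets faster; also keep deployment packages small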

Provisioner Performance

Typical Operation Times

| Action | Typical Duration | Notes |
|---|---|---|
| validate-config | < 1 second | Local only |
| create-policy | < 1 second | Local only |
| create-prov-template | < 1 second | Local only |
| validate-prov-template | < 1 second | Local only |
| show-changes | 10-30 seconds | Creates and deletes ChangeSet |
| check-drift | 15-60 seconds | Depends on resource count |
| test-deploy | 90-180 seconds | Full stack with NAT Gateways |
| create-vpc | 90-180 seconds | NAT Gateway creation is slowest |
| delete-vpc | 60-120 seconds | NAT Gateway deletion is slowest; can stretch to 5-10 min |

Why NAT Gateway Is Slow

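NAT Gateway creation is the slowest step in VPC deployment: each gateway can take 2-5 minutes to become available, which is why create-vpc and test-deploy run for minutes while the local actions finish in under a second. With HA (2-3 AZs), this is the primary bottleneck. This is an AWS limitation, not a provisioner limitation.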

Monitoring Performance

VPC Flow Logs

Enable Flow Logs to analyze traffic patterns:

aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0a1b2c3d4e5f6g7h8 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/edge-prod-b001-us-west-2-vpc/flow-logs \
  --deliver-logs-permission-arn <flow-logs-iam-role-arn>

Analyze for:

  • High cross-AZ traffic (co-location opportunity)

  • Traffic to AWS services without VPC Endpoints

  • Rejected traffic (security group or NACL issues)
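
As a starting point for this analysis, the sketch below parses flow log records in the default (version 2) format, totals bytes that cross AZs, and counts rejected flows per destination port. The function names and the `az_cidrs` mapping (AZ name to subnet CIDRs) are assumptions for the example; custom log formats would need a different field list.

```python
import ipaddress
from collections import Counter

# Default (version 2) flow log record fields, space separated
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def parse_record(line: str):
    """Parse one default-format record; return None for headers, NODATA, or SKIPDATA lines."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    if record["log_status"] != "OK":
        return None
    record["bytes"] = int(record["bytes"])
    return record


def summarize(lines, az_cidrs: dict) -> dict:
    """Total bytes that cross AZs and count rejected flows per destination port."""
    nets = {az: [ipaddress.ip_network(c) for c in cidrs] for az, cidrs in az_cidrs.items()}

    def az_of(addr: str):
        ip = ipaddress.ip_address(addr)
        return next((az for az, ns in nets.items() if any(ip in n for n in ns)), None)

    cross_az_bytes = 0
    rejected_ports = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        src_az, dst_az = az_of(rec["srcaddr"]), az_of(rec["dstaddr"])
        if src_az and dst_az and src_az != dst_az:
            cross_az_bytes += rec["bytes"]
        if rec["action"] == "REJECT":
            rejected_ports[rec["dstport"]] += 1
    return {"cross_az_bytes": cross_az_bytes, "rejected_ports": dict(rejected_ports)}
```

Addresses outside the listed CIDRs (for example public IPs) are ignored for the cross-AZ total but still counted when rejected.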

NAT Gateway Metrics

Monitor in CloudWatch:

  • BytesOutToDestination — throughput

  • PacketsDropCount — capacity issues (consider HA)

  • ConnectionAttemptCount — connection rate

  • ActiveConnectionCount — concurrent connections
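
These metrics live in the `AWS/NATGateway` CloudWatch namespace with a `NatGatewayId` dimension. The sketch below builds the arguments for a `GetMetricStatistics` call for one metric; the helper name, the one-hour window, and the 5-minute period are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

NAT_METRICS = ("BytesOutToDestination", "PacketsDropCount",
               "ConnectionAttemptCount", "ActiveConnectionCount")


def nat_metric_request(nat_gateway_id: str, metric: str, hours: int = 1,
                       period: int = 300, now=None) -> dict:
    """Build keyword arguments for CloudWatch GetMetricStatistics on one NAT Gateway metric."""
    if metric not in NAT_METRICS:
        raise ValueError(f"unexpected metric: {metric}")
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/NATGateway",
        "MetricName": metric,
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        # ActiveConnectionCount is a gauge; the other three are counters
        "Statistics": ["Maximum" if metric == "ActiveConnectionCount" else "Sum"],
    }


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**request)
```

A non-zero PacketsDropCount sum over several periods is the signal to look at first.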

Network Performance Testing

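# Install iperf3 on two EC2 instances in different subnets
sudo apt install iperf3        # Debian/Ubuntu; use yum or dnf on Amazon Linux/RHEL

# Server (instance in subnet A); its security group must allow TCP 5201 from the client
iperf3 -s

# Client (instance in subnet B): 30-second test with 10 parallel streams
iperf3 -c <server-private-ip> -t 30 -P 10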
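
To compare runs over time, the client can be run with `--json` and the result reduced to a few numbers. The sketch below assumes the JSON layout iperf3 produces for TCP tests (an `end` object with `sum_sent` and `sum_received`); the function name is illustrative.

```python
import json


def summarize_iperf3(json_text: str) -> dict:
    """Extract sender/receiver throughput in Gbps from `iperf3 --json` TCP output."""
    end = json.loads(json_text)["end"]
    return {
        "sent_gbps": end["sum_sent"]["bits_per_second"] / 1e9,
        "received_gbps": end["sum_received"]["bits_per_second"] / 1e9,
        "retransmits": end["sum_sent"].get("retransmits", 0),
    }
```

For example, save the output with `iperf3 -c <server-private-ip> -t 30 -P 10 --json > result.json` and pass the file contents to `summarize_iperf3`. A high retransmit count alongside low throughput usually points at the network path, not the instances.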

For cost optimization, see COST_OPTIMIZATION.md. For architecture patterns, see USER_GUIDE.md.