AWS Solutions Architect Associate: Domain 5 - Define Operationally Excellent Architectures
Complete guide to Domain 5: designing operationally excellent architectures using infrastructure as code, monitoring, automation, and best practices.
Introduction
Domain 5 focuses on designing architectures that are easy to operate and maintain. This includes Infrastructure as Code (IaC), monitoring, automation, documentation, and following AWS best practices. This domain represents approximately 16% of the exam.
Infrastructure as Code (IaC)
CloudFormation Basics
CloudFormation enables you to define AWS infrastructure in templates (JSON or YAML).
```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: "Production web application stack"

Parameters:
  InstanceType:
    Type: String
    Default: t3.medium
    AllowedValues: [t3.small, t3.medium, t3.large]
  EnvironmentName:
    Type: String
    Default: production
  LatestAmiId:
    # Resolve the latest Amazon Linux 2023 AMI from the public SSM parameter
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64

Resources:
  # VPC
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub "${EnvironmentName}-vpc"

  # Public Subnet
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs ""]
      MapPublicIpOnLaunch: true

  # Internet Gateway
  InternetGateway:
    Type: AWS::EC2::InternetGateway

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  # Security Group
  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Web server security group
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0

  # EC2 Instance
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref LatestAmiId
      InstanceType: !Ref InstanceType
      SubnetId: !Ref PublicSubnet
      SecurityGroupIds:
        - !Ref WebSecurityGroup
      UserData:
        Fn::Base64: |
          #!/bin/bash
          yum update -y
          yum install -y httpd
          systemctl start httpd
          systemctl enable httpd
      Tags:
        - Key: Name
          Value: !Sub "${EnvironmentName}-web-server"

Outputs:
  InstanceId:
    Value: !Ref WebServer
    Description: Instance ID
  PublicIP:
    Value: !GetAtt WebServer.PublicIp
    Description: Public IP address
```
Stack Operations
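For production stacks it is safer to preview an update as a change set before executing it, so you see exactly which resources will be modified or replaced. A minimal sketch (the stack and change-set names are illustrative):

```shell
# Create a change set describing what an update would modify
aws cloudformation create-change-set \
  --stack-name prod-web-app \
  --change-set-name bump-instance-type \
  --template-body file://template.yaml \
  --parameters ParameterKey=InstanceType,ParameterValue=t3.large

# Review the proposed changes
aws cloudformation describe-change-set \
  --stack-name prod-web-app \
  --change-set-name bump-instance-type

# Apply only once the diff looks right
aws cloudformation execute-change-set \
  --stack-name prod-web-app \
  --change-set-name bump-instance-type
```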
```shell
# Create stack
aws cloudformation create-stack \
  --stack-name prod-web-app \
  --template-body file://template.yaml \
  --parameters ParameterKey=InstanceType,ParameterValue=t3.medium

# Update stack
aws cloudformation update-stack \
  --stack-name prod-web-app \
  --template-body file://template-updated.yaml

# Delete stack
aws cloudformation delete-stack \
  --stack-name prod-web-app

# Monitor stack events
aws cloudformation describe-stack-events \
  --stack-name prod-web-app \
  --query 'StackEvents[0:5]'
```
Terraform Alternative
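Where CloudFormation tracks state for you, Terraform keeps a state file (here in the configured S3 backend) and applies changes through an explicit plan/apply cycle:

```shell
terraform init      # download providers, configure the S3 backend
terraform plan      # show the execution plan without changing anything
terraform apply     # apply the planned changes after confirmation
terraform destroy   # tear everything down when no longer needed
```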
```hcl
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "terraform-state-prod"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.aws_region
}

# Availability zones referenced by the subnet below
data "aws_availability_zones" "available" {
  state = "available"
}

# VPC
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true

  tags = {
    Name = "${var.environment}-vpc"
  }
}

# Subnet
resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.public_subnet_cidr
  availability_zone = data.aws_availability_zones.available.names[0]

  tags = {
    Name = "${var.environment}-public-subnet"
  }
}

# Output
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "VPC ID"
}
```
Monitoring and Logging
CloudWatch Dashboards
```python
import json
import boto3

cloudwatch = boto3.client('cloudwatch')

# Create custom dashboard
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "properties": {
                "metrics": [
                    ["AWS/EC2", "CPUUtilization", {"stat": "Average"}],
                    ["AWS/RDS", "DatabaseConnections"],
                    ["AWS/ApplicationELB", "TargetResponseTime"]
                ],
                "period": 300,
                "stat": "Average",
                "region": "us-east-1",
                "title": "Application Performance"
            }
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName='production-monitoring',
    DashboardBody=json.dumps(dashboard_body)
)
```
CloudWatch Logs Insights
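Logs Insights queries are asynchronous: you start a query, then poll for results. A sketch in Python (the log group name and time window are illustrative):

```python
import time
import boto3

logs = boto3.client('logs')

# Kick off the query; results are collected asynchronously
query = logs.start_query(
    logGroupName='/aws/lambda/my-function',
    startTime=1604000000,
    endTime=1604050000,
    queryString='fields @timestamp, @message | filter @message like /ERROR/',
)

# Poll until the query finishes, then read the rows
while True:
    results = logs.get_query_results(queryId=query['queryId'])
    if results['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in results['results']:
    print({field['field']: field['value'] for field in row})
```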
```shell
# Query recent errors
aws logs start-query \
  --log-group-name /aws/lambda/my-function \
  --start-time 1604000000 \
  --end-time 1604050000 \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/'

# Analyze response times (start/end times are required)
aws logs start-query \
  --log-group-name /aws/applicationelb/prod \
  --start-time 1604000000 \
  --end-time 1604050000 \
  --query-string 'fields response_time | stats avg(response_time) as avg_response_time by target_status_code'
```
X-Ray Tracing
```python
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

patch_all()  # Patch boto3, requests, etc. so downstream calls are traced

s3_client = boto3.client('s3')

@xray_recorder.capture('process_order')
def process_order(order_id):
    # X-Ray records a subsegment for this function
    xray_recorder.put_annotation('order_id', order_id)

    # Calls made through the patched client are traced automatically
    response = s3_client.get_object(Bucket='orders', Key=order_id)
    return response
```
Automation with Systems Manager
Parameter Store for Configuration
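Plain String parameters suit endpoints and feature flags; secrets such as database passwords belong in SecureString parameters, which are encrypted with KMS. A sketch (the parameter name and value are illustrative):

```python
import boto3

ssm = boto3.client('ssm')

# Store a secret, encrypted with the account's default SSM KMS key
ssm.put_parameter(
    Name='/prod/database/password',
    Value='example-password',
    Type='SecureString',
    Overwrite=True,
)

# WithDecryption is required to read the plaintext back
response = ssm.get_parameter(
    Name='/prod/database/password',
    WithDecryption=True,
)
password = response['Parameter']['Value']
```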
```python
import boto3

ssm = boto3.client('ssm')

# Store configuration
ssm.put_parameter(
    Name='/prod/database/endpoint',
    Value='prod-db.123456.us-east-1.rds.amazonaws.com',
    Type='String',
    Tags=[
        {'Key': 'Environment', 'Value': 'production'},
        {'Key': 'Application', 'Value': 'web-app'}
    ]
)

# Retrieve configuration
response = ssm.get_parameter(Name='/prod/database/endpoint')
db_endpoint = response['Parameter']['Value']

# Get all parameters for the application
response = ssm.get_parameters_by_path(
    Path='/prod/',
    Recursive=True
)
parameters = {p['Name']: p['Value'] for p in response['Parameters']}
```
Session Manager for Secure Access
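Session Manager only works when the instance runs the SSM Agent and its instance profile is allowed to talk to Systems Manager; attaching the AWS managed policy covers the required permissions (the role name below is illustrative):

```shell
# Grant the instance role the permissions Session Manager needs
aws iam attach-role-policy \
  --role-name web-server-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
```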
```shell
# Start interactive session (no SSH keys needed)
aws ssm start-session \
  --target i-1234567890abcdef0

# Run a command
aws ssm send-command \
  --instance-ids i-1234567890abcdef0 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["sudo yum update -y"]'

# Get command output
aws ssm get-command-invocation \
  --command-id 12a34b56-78cd-90ef-ghij-1234567890k \
  --instance-id i-1234567890abcdef0
```
OpsWorks for Configuration Management
An OpsWorks stack organizes instances into layers, each with its own Chef recipes. Conceptually:

```yaml
Stack:
  Name: production-stack
  VPC: prod-vpc

Layers:
  - Type: custom
    Name: web-servers
    CustomRecipes:
      Deploy: recipes/deploy.rb
      Configure: recipes/configure.rb
  - Type: db-master
    Name: database
    Engine: mysql
    EngineVersion: "8.0"

Instances:
  - InstanceType: t3.medium
    AvailabilityZone: us-east-1a
    Layer: web-servers
    AutoScaling: true
```
Well-Architected Framework Review
Operational Excellence Pillar
Design principles:

1. Perform operations as code: CloudFormation for infrastructure, Systems Manager for automation, Lambda for event-driven automation
2. Annotate documentation: runbooks for common tasks, architecture diagrams, troubleshooting guides
3. Monitor and alert: CloudWatch metrics, CloudWatch alarms, X-Ray tracing
4. Improve through lessons learned: post-incident reviews, a continuous improvement process, regular architecture reviews

Best practices:

- Automate everything possible
- Document all procedures
- Use version control for IaC
- Test changes in non-prod first
- Maintain runbooks for operations

Incident Response and Troubleshooting
Building Runbooks
```markdown
# Database Connection Failure Runbook

## Symptoms
- Application reports "Cannot connect to database"
- CloudWatch shows RDS connection errors

## Investigation Steps
1. Check RDS instance status:
   aws rds describe-db-instances --db-instance-identifier prod-db
2. Verify security group rules:
   aws ec2 describe-security-groups --group-ids sg-db
3. Check application logs:
   aws logs tail /aws/lambda/app --follow

## Resolution Steps
1. Reboot the RDS instance if its status is degraded:
   aws rds reboot-db-instance --db-instance-identifier prod-db
2. Increase max_connections if needed
3. Scale read replicas if under high load

## Prevention
- Set CloudWatch alarms on connection count
- Monitor RDS metrics continuously
- Load test before production deployment
```
Troubleshooting Commands
```shell
# Check ELB target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...

# Analyze VPC Flow Logs for rejected traffic
aws logs start-query \
  --log-group-name /aws/vpc/flowlogs \
  --start-time 1604000000 \
  --end-time 1604050000 \
  --query-string 'fields srcAddr, dstAddr, action | filter action = "REJECT" | stats count() as reject_count by srcAddr, dstAddr'

# Check IAM permissions
aws iam simulate-custom-policy \
  --policy-input-list file://policy.json \
  --action-names s3:GetObject \
  --resource-arns arn:aws:s3:::my-bucket/*
```
Change Management
Deployment Strategies
Blue-Green Deployment
Blue (current):
- Production environment running v1.0
- Handles 100% of traffic

Green (new):
- New environment running v2.0
- Receives no traffic yet

Switch:
- All traffic is redirected to Green
- Instant rollback to Blue if issues appear

Canary Deployment
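Canary traffic shifting on an Application Load Balancer can be expressed as a weighted forward action across two target groups; a sketch (the listener and target group ARNs below are placeholders):

```python
import boto3

elbv2 = boto3.client('elbv2')

# Send 90% of traffic to the current version, 10% to the canary
elbv2.modify_listener(
    ListenerArn='arn:aws:elasticloadbalancing:...:listener/...',
    DefaultActions=[{
        'Type': 'forward',
        'ForwardConfig': {
            'TargetGroups': [
                {'TargetGroupArn': 'arn:...:targetgroup/app-v1/...', 'Weight': 90},
                {'TargetGroupArn': 'arn:...:targetgroup/app-v2/...', 'Weight': 10},
            ]
        }
    }]
)
```

Promoting the canary is just another `modify_listener` call with the weights shifted further toward v2.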
Deployment process:
1. Deploy v2.0 to 10% of instances
2. Monitor error rates and latency
3. If healthy, increase to 25%
4. Continue until 100%, or roll back

Benefits:
- Early detection of issues
- Minimal blast radius
- Automatic rollback

Change Control with AWS Systems Manager
```python
import json
import boto3

ssm = boto3.client('ssm')

# Define an Automation runbook (schema 0.3) that deploys via Run Command
change_request = ssm.create_document(
    Content=json.dumps({
        "schemaVersion": "0.3",
        "description": "Deploy application v2.0",
        "mainSteps": [
            {
                "name": "deploy",
                "action": "aws:runCommand",
                "inputs": {
                    "DocumentName": "AWS-RunShellScript",
                    "InstanceIds": ["i-12345", "i-67890"],
                    "Parameters": {
                        "commands": ["bash /opt/deploy.sh"]
                    }
                }
            }
        ]
    }),
    Name='deploy-v2.0',
    DocumentType='Automation'
)
```
Common Exam Questions
Q: You need to deploy infrastructure repeatably. What’s the best approach? A: Use Infrastructure as Code (CloudFormation or Terraform) with version control
Q: How do you ensure configuration consistency across environments? A: Use Systems Manager Parameter Store for centralized configuration management
Q: What’s the best way to access EC2 instances without SSH keys? A: Use AWS Systems Manager Session Manager for secure shell access
Key Takeaways
- Implement everything as code (IaC)
- Automate operational tasks
- Implement comprehensive monitoring
- Create and maintain runbooks
- Use version control for all configurations
- Implement change management processes
- Conduct regular architecture reviews
- Document all procedures thoroughly