AWS Solutions Architect Associate: Domain 5 - Define Operationally Excellent Architectures

Complete guide to Domain 5: designing operationally excellent architectures using infrastructure as code, monitoring, automation, and best practices.

Introduction

Domain 5 focuses on designing architectures that are easy to operate and maintain. This covers Infrastructure as Code (IaC), monitoring, automation, documentation, and AWS best practices. In the SAA-C01 exam guide, which defines this domain, it accounts for approximately 6% of the exam.

Infrastructure as Code (IaC)

CloudFormation Basics

CloudFormation enables you to define AWS infrastructure in templates (JSON or YAML).

    AWSTemplateFormatVersion: "2010-09-09"
    Description: "Production web application stack"

    Parameters:
      InstanceType:
        Type: String
        Default: t3.medium
        AllowedValues: [t3.small, t3.medium, t3.large]
      EnvironmentName:
        Type: String
        Default: production

    Resources:
      # VPC
      VPC:
        Type: AWS::EC2::VPC
        Properties:
          CidrBlock: 10.0.0.0/16
          EnableDnsHostnames: true
          EnableDnsSupport: true
          Tags:
            - Key: Name
              Value: !Sub "${EnvironmentName}-vpc"

      # Public Subnet
      PublicSubnet:
        Type: AWS::EC2::Subnet
        Properties:
          VpcId: !Ref VPC
          CidrBlock: 10.0.1.0/24
          AvailabilityZone: !Select [0, !GetAZs ""]
          MapPublicIpOnLaunch: true

      # Internet Gateway
      InternetGateway:
        Type: AWS::EC2::InternetGateway

      AttachGateway:
        Type: AWS::EC2::VPCGatewayAttachment
        Properties:
          VpcId: !Ref VPC
          InternetGatewayId: !Ref InternetGateway

      # Security Group
      WebSecurityGroup:
        Type: AWS::EC2::SecurityGroup
        Properties:
          GroupDescription: Web server security group
          VpcId: !Ref VPC
          SecurityGroupIngress:
            - IpProtocol: tcp
              FromPort: 80
              ToPort: 80
              CidrIp: 0.0.0.0/0
            - IpProtocol: tcp
              FromPort: 443
              ToPort: 443
              CidrIp: 0.0.0.0/0

      # EC2 Instance
      WebServer:
        Type: AWS::EC2::Instance
        Properties:
          # The AMI ID is blank in the source; supply a valid AMI ID
          # (or an SSM dynamic reference) before deploying.
          ImageId: !Sub ""
          InstanceType: !Ref InstanceType
          SubnetId: !Ref PublicSubnet
          SecurityGroupIds:
            - !Ref WebSecurityGroup
          UserData:
            Fn::Base64: |
              #!/bin/bash
              yum update -y
              yum install -y httpd
              systemctl start httpd
              systemctl enable httpd
          Tags:
            - Key: Name
              Value: !Sub "${EnvironmentName}-web-server"

    Outputs:
      InstanceId:
        Value: !Ref WebServer
        Description: Instance ID
      PublicIP:
        Value: !GetAtt WebServer.PublicIp
        Description: Public IP address

Stack Operations

    # Create stack
    aws cloudformation create-stack \
      --stack-name prod-web-app \
      --template-body file://template.yaml \
      --parameters ParameterKey=InstanceType,ParameterValue=t3.medium

    # Update stack
    aws cloudformation update-stack \
      --stack-name prod-web-app \
      --template-body file://template-updated.yaml

    # Delete stack
    aws cloudformation delete-stack \
      --stack-name prod-web-app

    # Monitor stack events
    aws cloudformation describe-stack-events \
      --stack-name prod-web-app \
      --query 'StackEvents[0:5]'
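The create-stack call above can also be scripted. The following is a minimal Python sketch, assuming boto3 is installed and AWS credentials are configured. The helper names `to_cfn_parameters` and `create_stack` are hypothetical and not part of any AWS library.

```python
def to_cfn_parameters(values):
    """Convert {"InstanceType": "t3.medium"} into the list-of-dicts shape
    that the CloudFormation CreateStack and UpdateStack APIs expect."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(values.items())
    ]


def create_stack(stack_name, template_path, values):
    """Create a stack and wait for completion (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so to_cfn_parameters works without boto3

    cfn = boto3.client("cloudformation")
    with open(template_path) as f:
        template_body = f.read()
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=to_cfn_parameters(values),
    )
    # The waiter polls DescribeStacks and raises if creation fails
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)


# Example (not executed here):
# create_stack("prod-web-app", "template.yaml", {"InstanceType": "t3.medium"})
```

Keeping the parameter conversion separate from the API call makes it easy to unit test without touching an AWS account.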

Terraform Alternative

    terraform {
      required_version = ">= 1.0"

      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }

      backend "s3" {
        bucket = "terraform-state-prod"
        key    = "prod/terraform.tfstate"
        region = "us-east-1"
      }
    }

    provider "aws" {
      region = var.aws_region
    }

    # Input variables referenced below
    variable "aws_region" {
      type = string
    }

    variable "environment" {
      type = string
    }

    variable "vpc_cidr" {
      type = string
    }

    variable "public_subnet_cidr" {
      type = string
    }

    # Availability Zones in the selected region
    data "aws_availability_zones" "available" {
      state = "available"
    }

    # VPC
    resource "aws_vpc" "main" {
      cidr_block           = var.vpc_cidr
      enable_dns_hostnames = true

      tags = {
        Name = "${var.environment}-vpc"
      }
    }

    # Subnet
    resource "aws_subnet" "public" {
      vpc_id            = aws_vpc.main.id
      cidr_block        = var.public_subnet_cidr
      availability_zone = data.aws_availability_zones.available.names[0]

      tags = {
        Name = "${var.environment}-public-subnet"
      }
    }

    # Output
    output "vpc_id" {
      value       = aws_vpc.main.id
      description = "VPC ID"
    }

Monitoring and Logging

CloudWatch Dashboards

    import json

    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # Create custom dashboard
    dashboard_body = {
        "widgets": [
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/EC2", "CPUUtilization", {"stat": "Average"}],
                        ["AWS/RDS", "DatabaseConnections"],
                        ["AWS/ApplicationELB", "TargetResponseTime"]
                    ],
                    "period": 300,
                    "stat": "Average",
                    "region": "us-east-1",
                    "title": "Application Performance"
                }
            }
        ]
    }

    # DashboardBody must be a JSON string, not a dict
    cloudwatch.put_dashboard(
        DashboardName='production-monitoring',
        DashboardBody=json.dumps(dashboard_body)
    )

CloudWatch Logs Insights

    # Query recent errors
    aws logs start-query \
      --log-group-name /aws/lambda/my-function \
      --start-time 1604000000 \
      --end-time 1604050000 \
      --query-string 'fields @timestamp, @message | filter @message like /ERROR/'

    # Analyze response times (start-query requires a time range)
    aws logs start-query \
      --log-group-name /aws/applicationelb/prod \
      --start-time 1604000000 \
      --end-time 1604050000 \
      --query-string 'fields response_time | stats avg(response_time) as avg_response_time by target_status_code'
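`start-query` only returns a query ID; the results have to be fetched separately with `get-query-results` once the query finishes. The sketch below shows one way to poll for them in Python, assuming boto3 and configured credentials. The function names `rows_to_dicts` and `run_query` are hypothetical.

```python
def rows_to_dicts(results):
    """Flatten Logs Insights rows, each a list of {"field": ..., "value": ...}
    cells, into plain dicts. The internal @ptr field is dropped."""
    return [
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in results
    ]


def run_query(log_group, query, start, end, poll_seconds=1.0):
    """Start a query, poll until it leaves the Scheduled/Running states,
    and return the rows as dicts (needs boto3 and AWS credentials)."""
    import time

    import boto3  # imported lazily so rows_to_dicts works without boto3

    logs = boto3.client("logs")
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start,  # epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```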

X-Ray Tracing

    import boto3
    from aws_xray_sdk.core import xray_recorder
    from aws_xray_sdk.core import patch_all

    patch_all()  # Patch boto3, requests, etc.

    # Create the client after patching so its calls are traced
    s3_client = boto3.client('s3')

    @xray_recorder.capture('process_order')
    def process_order(order_id):
        # capture() records this function as an X-Ray subsegment
        xray_recorder.put_annotation('order_id', order_id)

        # Make calls that will be traced
        response = s3_client.get_object(Bucket='orders', Key=order_id)
        return response

Automation with Systems Manager

Parameter Store for Configuration

    import boto3

    ssm = boto3.client('ssm')

    # Store configuration
    ssm.put_parameter(
        Name='/prod/database/endpoint',
        Value='prod-db.123456.us-east-1.rds.amazonaws.com',
        Type='String',
        Tags=[
            {'Key': 'Environment', 'Value': 'production'},
            {'Key': 'Application', 'Value': 'web-app'}
        ]
    )

    # Retrieve configuration
    response = ssm.get_parameter(Name='/prod/database/endpoint')
    db_endpoint = response['Parameter']['Value']

    # Get all parameters for the application.
    # Results are paginated, so iterate over every page.
    paginator = ssm.get_paginator('get_parameters_by_path')
    parameters = {}
    for page in paginator.paginate(Path='/prod/', Recursive=True):
        for p in page['Parameters']:
            parameters[p['Name']] = p['Value']

Session Manager for Secure Access

    # Start interactive session (no SSH keys needed)
    aws ssm start-session \
      --target i-1234567890abcdef0

    # Run command
    aws ssm send-command \
      --instance-ids i-1234567890abcdef0 \
      --document-name "AWS-RunShellScript" \
      --parameters 'commands=["sudo yum update -y"]'

    # Get command output (use the CommandId returned by send-command;
    # the value below is a placeholder)
    aws ssm get-command-invocation \
      --command-id 12a34b56-78cd-90ef-ghij-1234567890k \
      --instance-id i-1234567890abcdef0
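Run Command is asynchronous, so a script usually has to poll `get-command-invocation` until the command reaches a final status. A minimal Python sketch of that loop follows. The function `wait_for_command` and its parameters are hypothetical, and the status fetcher is passed in so the loop can be exercised without AWS access.

```python
import time

# Final statuses reported by get-command-invocation; other values such as
# Pending, InProgress, and Delayed mean the command is still running.
TERMINAL_STATUSES = {"Success", "Cancelled", "TimedOut", "Failed"}


def wait_for_command(fetch_status, timeout_seconds=300, poll_seconds=2.0,
                     sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within timeout_seconds.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"command still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example with boto3 (not executed here; IDs are placeholders):
# ssm = boto3.client("ssm")
# status = wait_for_command(
#     lambda: ssm.get_command_invocation(
#         CommandId=command_id, InstanceId="i-1234567890abcdef0"
#     )["Status"]
# )
```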

OpsWorks for Configuration Management

    Stack:
      Name: production-stack
      VPC: prod-vpc

    Layers:
      - Type: custom
        Name: web-servers
        CustomRecipes:
          Deploy: recipes/deploy.rb
          Configure: recipes/configure.rb
      - Type: db-master
        Name: database
        Engine: mysql
        EngineVersion: "8.0"

    Instances:
      - InstanceType: t3.medium
        AvailabilityZone: us-east-1a
        Layer: web-servers
        AutoScaling: true

Well-Architected Framework Review

Operational Excellence Pillar

    Design_Principles:
      1_Perform_operations_as_code:
        - CloudFormation for infrastructure
        - Systems Manager for automation
        - Lambda for event-driven automation
      2_Annotate_documentation:
        - Runbooks for common tasks
        - Architecture diagrams
        - Troubleshooting guides
      3_Monitor_and_alert:
        - CloudWatch metrics
        - CloudWatch alarms
        - X-Ray tracing
      4_Improve_through_lessons_learned:
        - Post-incident reviews
        - Continuous improvement process
        - Regular architecture reviews

    Best_Practices:
      - Automate everything possible
      - Document all procedures
      - Use version control for IaC
      - Test changes in non-prod first
      - Maintain runbooks for operations

Incident Response and Troubleshooting

Building Runbooks

    # Database Connection Failure Runbook

    ## Symptoms
    - Application reports "Cannot connect to database"
    - CloudWatch shows RDS connection errors

    ## Investigation Steps
    1. Check RDS instance status
       aws rds describe-db-instances --db-instance-identifier prod-db
    2. Verify security group rules
       aws ec2 describe-security-groups --group-ids sg-db
    3. Check application logs
       aws logs tail /aws/lambda/app --follow

    ## Resolution Steps
    1. Restart RDS instance if status is "degraded"
       aws rds reboot-db-instance --db-instance-identifier prod-db
    2. Increase max_connections if needed
    3. Scale read replicas if under high load

    ## Prevention
    - Set CloudWatch alarms on connection count
    - Monitor RDS metrics continuously
    - Load test before production deployment
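The runbook's first prevention item, an alarm on connection count, could look roughly like the Python sketch below. The helper `connection_alarm_kwargs`, the alarm name pattern, the threshold, the evaluation settings, and the SNS topic ARN are all illustrative assumptions; only the `AWS/RDS` namespace, the `DatabaseConnections` metric, and the `put_metric_alarm` parameters are standard CloudWatch names.

```python
def connection_alarm_kwargs(db_instance_id, threshold, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm that fire when
    average DatabaseConnections stays above threshold for 15 minutes."""
    return {
        "AlarmName": f"{db_instance_id}-high-connections",  # hypothetical naming
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
        ],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # three consecutive breaching datapoints
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Example with boto3 (not executed here; the topic ARN is a placeholder):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **connection_alarm_kwargs(
#         "prod-db", 80, "arn:aws:sns:us-east-1:123456789012:ops-alerts"
#     )
# )
```

A sensible threshold depends on the instance's `max_connections` setting, so the value 80 above is only a stand-in.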

Troubleshooting Commands

    # Check ELB health
    aws elbv2 describe-target-health \
      --target-group-arn arn:aws:elasticloadbalancing:...

    # Analyze VPC Flow Logs for rejected traffic in the last hour
    # (start-query requires a time range; date -d is GNU date syntax)
    aws logs start-query \
      --log-group-name /aws/vpc/flowlogs \
      --start-time "$(date -d '1 hour ago' +%s)" \
      --end-time "$(date +%s)" \
      --query-string 'fields srcAddr, dstAddr, action | filter action = "REJECT" | stats count() as reject_count by srcAddr, dstAddr'

    # Check IAM permissions (ARN quoted so the shell does not expand the *)
    aws iam simulate-custom-policy \
      --policy-input-list file://policy.json \
      --action-names s3:GetObject \
      --resource-arns 'arn:aws:s3:::my-bucket/*'

Change Management

Deployment Strategies

Blue-Green Deployment

    Blue (Current):
      - Production environment running v1.0
      - Handles 100% traffic

    Green (New):
      - New environment running v2.0
      - No traffic yet

    Switch:
      - All traffic redirects to Green
      - Instant rollback to Blue if issues
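One way to implement the switch, assuming both environments sit behind the same Application Load Balancer as separate target groups (an assumption, since the post does not say how traffic is routed), is to change the weights on the listener's forward action. The helper `forward_action` below is a hypothetical name.

```python
def forward_action(blue_tg_arn, green_tg_arn, green_weight):
    """Build an ALB listener forward action that sends green_weight percent
    of traffic to Green and the remainder to Blue."""
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_tg_arn, "Weight": 100 - green_weight},
                {"TargetGroupArn": green_tg_arn, "Weight": green_weight},
            ]
        },
    }


# Example with boto3 (not executed here; ARNs are placeholders):
# import boto3
# elbv2 = boto3.client("elbv2")
# # Cut over to Green
# elbv2.modify_listener(
#     ListenerArn=listener_arn,
#     DefaultActions=[forward_action(blue_arn, green_arn, 100)],
# )
# # Roll back to Blue
# elbv2.modify_listener(
#     ListenerArn=listener_arn,
#     DefaultActions=[forward_action(blue_arn, green_arn, 0)],
# )
```

Because Blue stays registered with weight 0, rollback is a single listener update rather than a redeployment.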

Canary Deployment

    Deployment Process:
      1. Deploy v2.0 to 10% of instances
      2. Monitor error rates and latency
      3. If healthy, deploy to 25%
      4. Continue until 100% or rollback

    Benefits:
      - Early detection of issues
      - Minimal blast radius
      - Automatic rollback
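The canary process above can be sketched as a small control loop. This is an illustrative Python outline, not a deployment tool: `canary_rollout` is a hypothetical name, and the `deploy`, `is_healthy`, and `rollback` callables stand in for whatever deployment and monitoring tooling is in use. The 50% stage in the usage example is also an assumption, since the post only names 10%, 25%, and 100%.

```python
def canary_rollout(stages, deploy, is_healthy, rollback):
    """Deploy to each percentage in stages, checking health after each step.

    Rolls back and returns 0 at the first unhealthy stage; otherwise returns
    the last percentage reached.
    """
    for percent in stages:
        deploy(percent)
        if not is_healthy(percent):
            rollback()
            return 0
    return stages[-1] if stages else 0


# Example usage with stand-in callables
deployed = []
final = canary_rollout(
    stages=[10, 25, 50, 100],
    deploy=deployed.append,           # record each stage instead of deploying
    is_healthy=lambda percent: True,  # pretend metrics always look good
    rollback=lambda: deployed.clear(),
)
```

In practice `is_healthy` would wait for a monitoring window and compare error rate and latency against thresholds before returning.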

Change Control with AWS Systems Manager

    import json

    import boto3

    ssm = boto3.client('ssm')

    # Create the Automation runbook that carries out the change
    document = ssm.create_document(
        Content=json.dumps({
            "schemaVersion": "0.3",  # Automation runbooks use schema version 0.3
            "description": "Deploy application v2.0",
            "parameters": {},
            "mainSteps": [
                {
                    "name": "deploy",
                    "action": "aws:runCommand",
                    "inputs": {
                        "DocumentName": "AWS-RunShellScript",
                        "InstanceIds": ["i-12345", "i-67890"],
                        "Parameters": {
                            "commands": ["bash /opt/deploy.sh"]
                        }
                    }
                }
            ]
        }),
        Name='deploy-v2.0',
        DocumentType='Automation',
        DocumentFormat='JSON'
    )

Common Exam Questions

Q: You need to deploy infrastructure repeatably. What’s the best approach?
A: Use Infrastructure as Code (CloudFormation or Terraform) with version control.

Q: How do you ensure configuration consistency across environments?
A: Use Systems Manager Parameter Store for centralized configuration management.

Q: What’s the best way to access EC2 instances without SSH keys?
A: Use AWS Systems Manager Session Manager for secure shell access.

Key Takeaways

  1. Implement everything as code (IaC)
  2. Automate operational tasks
  3. Implement comprehensive monitoring
  4. Create and maintain runbooks
  5. Use version control for all configurations
  6. Implement change management processes
  7. Conduct regular architecture reviews
  8. Document all procedures thoroughly

This post is licensed under CC BY 4.0 by the author.