Problem
When deploying a new version of an application, you want to verify it works as expected before routing all traffic to it. If you push bad code straight to production, every user sees the failure at once.
Solution
A Canary Deployment solves this problem by routing a small percentage of traffic to the new version of the application while the rest goes to the stable version. Project Canary is my take on this pattern: a full canary deployment environment on AWS, built with Terraform, from VPC networking and security group chaining to ALB weighted routing that splits traffic 90/10 between stable and canary versions. The goal was to demonstrate the infrastructure decisions behind a real deployment pattern.
Architecture Overview
The focus of this architecture was to build a resilient system that routes traffic to two separate application versions. To ensure high availability, I deployed across two availability zones so that if one AZ goes down, the stable version continues serving from the surviving AZ. Using ALB weighted target groups, 90% of traffic routes to the stable version and 10% to the canary. This allows a controlled rollout where the new version can be validated against real traffic before a full promotion. The EC2 instances sit in private subnets with no direct internet access. Security group chaining ensures they only accept traffic from the ALB on port 80 and SSH from the bastion host. Responses flow back through the ALB to the end user.
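The 90/10 split can be sketched in Terraform with a weighted `forward` action on the ALB listener. This is a minimal sketch; resource names like `stable`, `canary`, and `app` are illustrative, not the project's actual identifiers:

```hcl
# Two target groups, one per application version.
resource "aws_lb_target_group" "stable" {
  name     = "stable-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path    = "/"
    matcher = "200"
  }
}

resource "aws_lb_target_group" "canary" {
  name     = "canary-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id
}

# Listener sends 90% of requests to stable, 10% to canary.
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.stable.arn
        weight = 90
      }
      target_group {
        arn    = aws_lb_target_group.canary.arn
        weight = 10
      }
    }
  }
}
```

Promoting the canary is then a one-line change: adjust the weights and run `terraform apply`.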
Design Decisions and Trade-offs
- Terraform: The entire infrastructure is defined as code with Terraform, which allows for easy deployment, rollback, and version control. Building all these resources through the console would be a lot of manual work and extremely error prone; infrastructure as code gives a far more reproducible and scalable solution.
- Security Groups: Security was the main focus for the resources in this architecture. I wanted each resource to be reachable only by the components that need it, with no direct internet access that could expose sensitive data. Security group chaining ensures that the app instances accept HTTP traffic only from the ALB’s security group and SSH only from the bastion’s security group. Because no IP addresses are referenced, the rules survive IP changes automatically.
- ALB: The Application Load Balancer is pivotal to the design because it routes traffic to either the stable or the canary version and acts as the single point of entry to the application. Weighted target groups control the percentage of traffic sent to each version, and health checks ensure the ALB only routes traffic to healthy instances.
- Bastion Host: The bastion host serves as a “jump server” into the private subnets. It sits in a public subnet and accepts SSH on port 22 only from my IP address; from there I can SSH into the EC2 instances in the private subnets. If something goes wrong, I can hop through the bastion to the instances, pull the logs, and troubleshoot quickly.
- EC2 Instances: Even though this is a simple web server that could have been served from S3 / CloudFront, I went with EC2 instances because they give me SSH access, can be registered in the ALB target groups, and can talk to DynamoDB and respond to ALB health checks; none of that is possible with a static S3 / CloudFront setup.
- DynamoDB: Although this project doesn’t use it heavily, the application reaches its DynamoDB table through a VPC Endpoint so that traffic never crosses the public internet. A NAT Gateway would have worked too, but it adds cost and complexity; speed and cost both factored into this decision. RDS or Aurora were options as well, but DynamoDB is serverless and charges only for usage, especially with the PAY_PER_REQUEST billing mode, whereas RDS costs more because an instance keeps running even when idle.
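The security group chaining described above boils down to referencing security groups instead of CIDR blocks in ingress rules. A minimal sketch, assuming `alb` and `bastion` security groups are defined elsewhere:

```hcl
# App instances accept HTTP only from the ALB's security group
# and SSH only from the bastion's security group -- no IP-based rules.
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion.id]
  }
}
```

If an instance is replaced and gets a new private IP, the rules still hold, because membership in the referenced security group is what grants access.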
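The DynamoDB pieces can be sketched as a gateway VPC endpoint attached to the private route table plus a pay-per-request table. Names, the region, and the key schema here are illustrative assumptions:

```hcl
# Gateway endpoint keeps DynamoDB traffic on the AWS network,
# so the private subnets need no NAT Gateway for this.
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.dynamodb"
  route_table_ids = [aws_route_table.private.id]
}

# Serverless table: charged per request, nothing billed while idle.
resource "aws_dynamodb_table" "app" {
  name         = "app-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"

  attribute {
    name = "id"
    type = "S"
  }
}
```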
What I’d do differently
- Terraform: Because this was siloed to my local machine, I didn’t need a remote backend for my Terraform state. In a team setting, though, I’d store the state in S3 so multiple developers can share it without passing local state files around.
- Metrics: I would implement metrics tracking to monitor the health of the system: CloudWatch metrics and alarms that alert me on threshold breaches. I could even run checks on a schedule for continuous monitoring.
- SSM Session Manager: Instead of a bastion host, I could have used SSM Session Manager to access the private EC2 instances. That has trade-offs too: it requires additional VPC endpoints. However, it would be more secure, since port 22 never needs to be opened through a bastion host, and access is governed by IAM roles rather than SSH keys.
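Moving state to S3 is a small change to the Terraform configuration. A minimal sketch, with a hypothetical bucket name and region:

```hcl
terraform {
  backend "s3" {
    bucket  = "project-canary-tfstate" # hypothetical bucket name
    key     = "canary/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}
```

With this in place, every developer running `terraform init` reads and writes the same shared state instead of a local file.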
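The alerting idea could start with a single alarm on the canary target group's unhealthy-host count. A sketch, assuming the `canary` target group, `app` load balancer, and an `alerts` SNS topic from elsewhere in the configuration:

```hcl
# Alarm fires if the canary target group reports any unhealthy
# host for two consecutive one-minute periods.
resource "aws_cloudwatch_metric_alarm" "canary_unhealthy" {
  alarm_name          = "canary-unhealthy-hosts"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    TargetGroup  = aws_lb_target_group.canary.arn_suffix
    LoadBalancer = aws_lb.app.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```

A breach here would be the signal to dial the canary weight back to zero before many users are affected.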