The Problem
Teams running infrastructure on AWS don't have a great all-in-one way to see what's happening across their environment without paying a decent chunk of change. CloudWatch has the metric data and Cost Explorer has the spend numbers, but they live in separate corners of the console: no unified view, no custom alerting across both, no single place to watch historical trends. I want to catch cost spikes early, not during a monthly billing review.
The Solution
Nimbus is a self-hosted, lightweight monitoring and spend dashboard designed as a low-cost alternative to third-party tooling like Datadog or New Relic. It’s built entirely with AWS serverless services, managed via Terraform, and designed to run as close to $0/month as possible with a hard cap of $10/month.
The core idea: nothing accumulates cost while idle.
Features:
- Collect CloudWatch metrics and Cost Explorer data on an automated schedule
- Evaluate configurable thresholds and send email alerts before problems escalate
- Present a dashboard that consolidates infrastructure health and spend data
- Reproducible, version-controlled, and deployable via CI/CD
Architecture
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ EventBridge  │────▶│    Lambda    │────▶│   DynamoDB   │
│ (scheduled)  │     │ (collector)  │     │  (metrics)   │
└──────────────┘     └──────┬───────┘     └──────────────┘
                            │
                     ┌──────▼───────┐     ┌──────────────┐
                     │    Lambda    │────▶│     SNS      │
                     │  (alerting)  │     │   (alerts)   │
                     └──────────────┘     └──────┬───────┘
                                                 │
                                          ┌──────▼───────┐
                                          │    Email     │
                                          └──────────────┘
All data encrypted at rest with KMS (customer-managed key)
Write path:
EventBridge (scheduled cron)
  ↓
Lambda (Python)
  → Pull CloudWatch metrics and Cost Explorer data via Boto3
  → Evaluate thresholds
  → Write to DynamoDB
  → Trigger SNS if alert condition met
  ↓
SNS → Email notification

Read path:
API Gateway → Lambda (read metrics)
  ↓
React Frontend (S3 + CloudFront)
Deployment
I’m deploying this project in phased releases so that I can isolate bugs to a specific layer rather than debugging a monolithic push.
Phase 0: Project Scaffolding & Terraform Backend
Before writing any infrastructure, Terraform needs somewhere to store its state. This phase sets up the S3 backend, state locking, and the project structure that everything else builds on.
Decisions:
- S3 backend with native state locking (use_lockfile). This is HashiCorp's recommended approach. The Terraform state lives in a versioned, encrypted S3 bucket, and locking uses conditional S3 writes.
- Partial configuration for credentials. No credentials live in any .tf file. Terraform picks them up from the AWS CLI profile locally and from environment variables in CI/CD. Backend config is stored in plaintext by Terraform, so hardcoding secrets there would be a serious exposure risk.
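For reference, here is a minimal sketch of what that backend block can look like; the bucket name, key, and region below are placeholders, not the project's actual values:

```hcl
# backend.tf -- state storage only. Credentials come from the AWS CLI
# profile locally or CI environment variables, never from this file.
terraform {
  backend "s3" {
    bucket       = "nimbus-terraform-state"   # placeholder bucket name
    key          = "nimbus/terraform.tfstate" # placeholder state key
    region       = "us-east-1"                # placeholder region
    encrypt      = true                       # server-side encryption on state objects
    use_lockfile = true                       # native S3 locking via conditional writes
  }
}
```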
Phase 1: IAM Roles & KMS Encryption
This phase lays the security foundation. AWS denies everything by default, so the Lambda functions need an IAM role before they can read metrics, write to DynamoDB, or publish to SNS. Phase 1 creates that identity (IAM role) and a shared encryption key (KMS).
Policies:
- CloudWatch Logs: write execution logs (AWS managed policy)
- CloudWatch Metrics + Cost Explorer: read infrastructure and spend data
- DynamoDB: read/write to the metrics table only (exact table ARN)
- SNS: publish to the alerts topic only (exact topic ARN)
- KMS: encrypt/decrypt with the project key only (exact key ARN)
Every policy follows the principle of least privilege — each permission is scoped to the exact resource ARN it needs. If this role is compromised, the blast radius is limited to exactly the resources the app was designed to touch.
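To make that concrete, here's roughly what the scoped statements look like in Terraform; the resource names (aws_dynamodb_table.metrics, aws_sns_topic.alerts, aws_kms_key.nimbus) are illustrative, not the project's actual identifiers:

```hcl
# Each statement grants only the actions the app uses, on exact ARNs.
data "aws_iam_policy_document" "lambda_permissions" {
  statement {
    sid       = "MetricsTableAccess"
    actions   = ["dynamodb:PutItem", "dynamodb:Query"]
    resources = [aws_dynamodb_table.metrics.arn] # exact table ARN only
  }

  statement {
    sid       = "AlertsTopicPublish"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.alerts.arn] # exact topic ARN only
  }

  statement {
    sid       = "ProjectKeyUsage"
    actions   = ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"]
    resources = [aws_kms_key.nimbus.arn] # exact key ARN only
  }
}
```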
The KMS key provides a single encryption key shared across services (SNS, DynamoDB) with automatic annual rotation and a 30-day deletion safety window. A customer-managed key gives us full control over the key policy, deciding exactly who can use the key and how, which AWS managed keys don't offer.
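The key itself is only a few lines of Terraform; a sketch, with an assumed resource name:

```hcl
resource "aws_kms_key" "nimbus" {
  description             = "Shared encryption key for Nimbus (DynamoDB, SNS)"
  enable_key_rotation     = true # automatic annual rotation
  deletion_window_in_days = 30   # 30-day safety window before destruction
}
```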
Phase 2: DynamoDB Storage Layer
This is the central data store; it's where all the records live. The collector Lambda writes metrics here on a scheduled cron job, and the API Lambda reads from here to serve the frontend dashboard. The table uses a composite key: metric_name as the partition key, timestamp as the sort key. This gives us the query pattern we need: "give me all records for metric X between time A and time B".
Decisions:
- DynamoDB with on-demand billing: Went with on-demand because the permanent free tier should cover a workload this small; provisioning capacity would be overkill for this solution.
- TTL with 90-day expiry: Each record has an expiry_epoch field, and DynamoDB automatically deletes expired items at no cost.
- No point-in-time recovery: The data coming from CloudWatch and Cost Explorer can be rebuilt on the next Lambda run if the table were ever wiped, so PITR doesn't make sense for this use case.
- Customer-managed KMS encryption: The table uses the same key from Phase 1, keeping encryption governance consistent across services.
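Here's a sketch of the table definition under those decisions; the table name, the string typing of the keys, and the resource references are assumptions:

```hcl
resource "aws_dynamodb_table" "metrics" {
  name         = "nimbus-metrics"  # placeholder table name
  billing_mode = "PAY_PER_REQUEST" # on-demand, nothing provisioned

  hash_key  = "metric_name" # partition key
  range_key = "timestamp"   # sort key, enabling time-range queries per metric

  attribute {
    name = "metric_name"
    type = "S"
  }
  attribute {
    name = "timestamp"
    type = "S" # assuming ISO-8601 strings; epoch numbers would use "N"
  }

  ttl {
    attribute_name = "expiry_epoch" # collector sets this to now + 90 days
    enabled        = true
  }

  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.nimbus.arn # the shared key from Phase 1
  }
}
```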
Phase 3: SNS Notifications
SNS is the notification layer. When the collector Lambda detects a threshold breach, it publishes a message to the SNS topic, and SNS delivers that message to every subscriber of the topic. The Lambda only publishes; SNS handles who receives the alert and how it's delivered.
Decisions:
- SNS vs SES or SQS + SES: The focus was on simplicity and the ability to send emails directly to subscribers. SNS provides this without domain verification or SMTP configuration. SQS + SES would add an async queue consumer, which is complexity we don't currently need.
- KMS encryption on the topic: Alert messages contain account-specific data (metric values, cost thresholds, etc.), so the topic uses the same key from Phase 1.
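In Terraform this comes down to a topic, a key reference, and a subscription; the names and email address below are placeholders:

```hcl
resource "aws_sns_topic" "alerts" {
  name              = "nimbus-alerts"        # placeholder topic name
  kms_master_key_id = aws_kms_key.nimbus.arn # same CMK from Phase 1
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "ops@example.com" # placeholder; SNS sends a confirmation email
}
```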
Phase 4: Writing Lambda Functions (Collector + Alerting)
This is where the pipeline comes alive: two Python functions sharing one IAM role.
Collector Lambda: runs on a schedule (wired up in Phase 5). It pulls metrics from CloudWatch and cost data from Cost Explorer, evaluates them against configurable thresholds, writes everything to DynamoDB, and publishes to SNS if something's wrong.
API Reader Lambda: runs on demand behind API Gateway (Phase 6). When the dashboard requests data, it queries DynamoDB by metric name and time range and returns JSON.
Decisions:
- Python 3.13 runtime: The Lambda runtime includes boto3 (the AWS SDK) out of the box, so no external dependencies, no Lambda layers, no requirements.txt. The deployment package is a single zipped .py file.
- archive_file for packaging: Terraform's archive_file data source zips the Python code at plan time and computes a hash. If the code changes, the hash changes and Terraform updates the function; if it hasn't changed, nothing happens.
- Explicit CloudWatch Log Groups with 14-day retention: Lambda auto-creates log groups if you don't, but with no retention policy, so logs accumulate forever. Creating them in Terraform lets us set retention_in_days = 14.
- Environment variables for configuration: The table name and SNS topic ARN are passed as env vars, not hardcoded. The same code works across dev/staging/prod; only the Terraform config changes.
- Clients initialized outside the handler: AWS SDK clients are created at module level so they're reused across invocations. Lambda reuses execution environments, and creating clients inside the handler would add latency on every call.
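A sketch of how the Terraform side of those decisions fits together; the file paths, function name, handler, and role reference are illustrative:

```hcl
# Zip the single-file collector at plan time; the hash drives redeploys.
data "archive_file" "collector" {
  type        = "zip"
  source_file = "${path.module}/src/collector.py"   # placeholder path
  output_path = "${path.module}/build/collector.zip"
}

resource "aws_lambda_function" "collector" {
  function_name    = "nimbus-collector"       # placeholder name
  role             = aws_iam_role.lambda.arn  # the shared role from Phase 1
  runtime          = "python3.13"             # boto3 included, no layers
  handler          = "collector.handler"      # assumed module.function entry point
  filename         = data.archive_file.collector.output_path
  source_code_hash = data.archive_file.collector.output_base64sha256

  environment {
    variables = {
      TABLE_NAME      = aws_dynamodb_table.metrics.name # no hardcoding
      ALERT_TOPIC_ARN = aws_sns_topic.alerts.arn
    }
  }
}

# Create the log group ourselves so retention is capped, not infinite.
resource "aws_cloudwatch_log_group" "collector" {
  name              = "/aws/lambda/${aws_lambda_function.collector.function_name}"
  retention_in_days = 14
}
```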