
Building Nimbus: A Cloud Monitoring Dashboard from Scratch with Terraform

Brandon · April 02, 2026
cloud · terraform · aws · devops · project

The Problem

Teams running infrastructure on AWS don’t have a great all-in-one platform to see what’s happening across their environment without paying a decent chunk of change. CloudWatch has the data and Cost Explorer has the spend numbers, but they live separately in the console — no custom alerting, no historical trends. I want to catch cost spikes early, not during a monthly billing review.

The Solution

Nimbus is a self-hosted, lightweight monitoring and spend dashboard designed as a low-cost alternative to third-party tooling like Datadog or New Relic. It’s built entirely with AWS serverless services, managed via Terraform, and designed to run as close to $0/month as possible with a hard cap of $10/month.

The core idea: nothing accumulates cost while idle.

Features:

Architecture


┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  EventBridge │────▶│   Lambda     │────▶│  DynamoDB    │
│  (scheduled) │     │  (collector) │     │  (metrics)   │
└──────────────┘     └──────┬───────┘     └──────────────┘
                            │
                     ┌──────▼───────┐     ┌──────────────┐
                     │   Lambda     │────▶│     SNS      │
                     │  (alerting)  │     │  (alerts)    │
                     └──────────────┘     └──────┬───────┘
                                                 │
                                          ┌──────▼───────┐
                                          │    Email     │
                                          └──────────────┘

All data is encrypted at rest with KMS (customer-managed key).

Write path:

CloudWatch Metrics
  ↓
EventBridge (scheduled cron)
  ↓
Lambda (Python)
  → Pull metrics via Boto3
  → Evaluate thresholds
  → Write to DynamoDB
  → Trigger SNS if alert condition met
  ↓
SNS → Email notification

Read path:

React Frontend (S3 + CloudFront)
  ↓
API Gateway
  ↓
Lambda (read metrics)

Deployment

I’m deploying this project in phased releases so that I can isolate bugs to a specific layer rather than debugging a monolithic push.

Phase 0: Project Scaffolding & Terraform Backend

Before writing any infrastructure, Terraform needs somewhere to store its state. This phase sets up the S3 backend, state locking, and the project structure that everything else builds on.

Decisions:

Phase 1: IAM Roles & KMS Encryption

This phase sets up the security foundation. AWS denies everything by default, so the Lambda functions need permission before they can read metrics, write to DynamoDB, or publish to SNS. This phase creates that identity (an IAM role) and a shared encryption key (KMS).

Policies:

  1. CloudWatch Logs: write execution logs (AWS managed policy)
  2. CloudWatch Metrics + Cost Explorer: read infrastructure and spend data
  3. DynamoDB: read/write to the metrics table only (exact table ARN)
  4. SNS: publish to the alerts topic only (exact topic ARN)
  5. KMS: encrypt/decrypt with the project key only (exact key ARN)

Every policy follows the principle of least privilege — each permission is scoped to the exact resource ARN it needs. If this role is compromised, the blast radius is limited to exactly the resources the app was designed to touch.
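As a sketch, the DynamoDB statement from that list might render like the dict below (the ARN and action list are placeholders; in the real project Terraform injects the exact table ARN):

```python
import json

# Hypothetical ARN -- the real value comes from the Terraform DynamoDB resource.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/nimbus-metrics"

# One least-privilege statement: read/write on the metrics table only.
dynamodb_statement = {
    "Effect": "Allow",
    "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query",
    ],
    "Resource": TABLE_ARN,  # the exact ARN, never "*"
}

policy = {"Version": "2012-10-17", "Statement": [dynamodb_statement]}
print(json.dumps(policy, indent=2))
```

The SNS and KMS statements follow the same shape, each scoped to its own resource ARN.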

The KMS key provides a single encryption key shared across services (SNS, DynamoDB) with automatic annual rotation and a 30-day deletion safety window. A customer-managed key gives us control over the key policy, rotation, and deletion, which AWS managed keys don't offer.

Phase 2: DynamoDB Storage Layer

This is the central data store; it's where all the records live. The collector Lambda writes metrics here on a cron schedule, and the API Lambda reads from here to serve the frontend dashboard. The table uses a composite primary key: metric_name as the partition key and timestamp as the sort key. This gives us the core query pattern: "give me all records for metric X between time A and time B."
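That query pattern can be sketched against the low-level DynamoDB client (the table and metric names here are hypothetical). One wrinkle worth noting: timestamp is a DynamoDB reserved word, so it has to go through an expression attribute name:

```python
def build_metric_query(table_name: str, metric: str, start_ts: int, end_ts: int) -> dict:
    """Build kwargs for DynamoDB Query: all records for one metric in [start_ts, end_ts]."""
    return {
        "TableName": table_name,
        # Partition key equality + sort key range = one efficient Query, no Scan.
        "KeyConditionExpression": "metric_name = :m AND #ts BETWEEN :a AND :b",
        # "timestamp" is a DynamoDB reserved word, so alias it.
        "ExpressionAttributeNames": {"#ts": "timestamp"},
        "ExpressionAttributeValues": {
            ":m": {"S": metric},
            ":a": {"N": str(start_ts)},
            ":b": {"N": str(end_ts)},
        },
    }

# Usage sketch: boto3.client("dynamodb").query(**build_metric_query(...))
```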

Decisions:

Phase 3: SNS Notifications

SNS is the notification layer. When the collector Lambda detects a threshold breach, it publishes a message to the SNS topic, and SNS delivers that message to every subscriber of the topic. The Lambda only publishes; SNS decides who receives the alert and how it's delivered.
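A sketch of what the publish payload might look like. The subject line and message fields are assumptions, and the topic ARN would come from an environment variable rather than being hardcoded:

```python
import json

def format_alert(metric: str, value: float, threshold: float) -> dict:
    """Build Subject/Message kwargs for sns.publish; the topic ARN is added by the caller."""
    return {
        "Subject": f"[Nimbus] {metric} breached threshold",
        "Message": json.dumps({
            "metric": metric,
            "value": value,
            "threshold": threshold,
        }),
    }

# Usage sketch:
#   sns.publish(TopicArn=os.environ["ALERT_TOPIC_ARN"],
#               **format_alert("estimated_monthly_cost", 12.4, 10.0))
```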

Decisions:

Phase 4: Writing Lambda Functions (Collector + Alerting)

This is where the pipeline comes alive: two Python functions sharing one IAM role.

Collector Lambda: runs on a schedule (wired up in Phase 5). It pulls metrics from CloudWatch and cost data from Cost Explorer, evaluates them against configurable thresholds, writes everything to DynamoDB, and publishes to SNS if something's wrong.
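The threshold check at the heart of the collector can be sketched as a pure function. The THRESHOLDS values and metric names here are illustrative; in the real function they come from configuration:

```python
# Hypothetical thresholds -- in practice these come from env vars or config.
THRESHOLDS = {"cpu_utilization": 80.0, "estimated_monthly_cost": 10.0}

def breaches(metrics: dict) -> list:
    """Return (name, value, threshold) for every metric exceeding its threshold."""
    return [
        (name, value, THRESHOLDS[name])
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]
```

Keeping the evaluation separate from the CloudWatch and SNS calls makes it trivially unit-testable without AWS credentials.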

API Reader Lambda: runs on demand behind API Gateway (Phase 6). When the dashboard requests data, it queries DynamoDB by metric name and time range and returns JSON.
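A minimal sketch of that read path, with the DynamoDB query stubbed out so only the handler's response shape is shown (the query parameter and item field names are assumptions):

```python
import json

def api_handler(event, context):
    """Sketch of the read path: query params -> DynamoDB query -> JSON response."""
    params = event.get("queryStringParameters") or {}
    metric = params.get("metric", "cpu_utilization")
    # Real code would run the DynamoDB query here; a stub stands in for the result.
    items = [{"metric_name": metric, "timestamp": 1700000000, "value": 42.0}]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(items),
    }
```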

Decisions:

  1. Python 3.13 runtime. The Lambda runtime includes boto3 (the AWS SDK) out of the box, so no external dependencies, no Lambda layers, no requirements.txt. The deployment package is a single zipped .py file.
  2. archive_file for packaging. Terraform's archive_file data source zips the Python code at plan time and computes a hash. If the code changes, the hash changes, and Terraform updates the function. If it hasn't changed, nothing happens.
  3. Explicit CloudWatch Log Groups with 14-day retention. Lambda auto-creates log groups if you don't, but with no retention policy, so logs accumulate forever. Creating them in Terraform lets us set retention_in_days = 14.
  4. Environment variables for configuration. The table name and SNS topic ARN are passed as env vars, not hardcoded. The same code works across dev/staging/prod; only the Terraform config changes.
  5. Clients initialized outside the handler. AWS SDK clients are created at module level so they're reused across invocations. Lambda reuses execution environments, so creating clients inside the handler would add latency on every call.
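The client-reuse decision can be demonstrated with a counter standing in for the boto3 client (the names here are illustrative, not the project's actual code):

```python
import os

# Module scope: executed once per Lambda execution environment (cold start),
# then reused by every invocation that lands on the same environment.
INIT_COUNT = 0

def _init_client():
    global INIT_COUNT
    INIT_COUNT += 1          # counts how many times "client setup" actually runs
    return object()          # stand-in for boto3.client("dynamodb")

CLIENT = _init_client()
TABLE_NAME = os.environ.get("TABLE_NAME", "nimbus-metrics")

def handler(event, context):
    # No client construction here -- CLIENT was built at import time.
    return {"table": TABLE_NAME, "client_inits": INIT_COUNT}
```

However many times the handler is invoked on a warm environment, the setup cost is paid once.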