Software systems fail. Not sometimes, not rarely, but consistently and in ways you cannot predict. The question is never whether something will go wrong, but whether you will know about it before your users do, and whether you will have enough information to fix it quickly.

That is what observability is about. Not dashboards for the sake of dashboards, not alerts firing into a void, but genuine operational awareness: knowing what your system is doing, why it is behaving that way, and what to do when it stops.

This post walks through how we built a complete observability platform from scratch on two Ubuntu servers, using Prometheus, Loki, Tempo, and Grafana, with every service running as a hardened systemd unit, every dashboard provisioned as code, and every alert routed to Slack with a runbook link attached. We will cover the architecture decisions, the configuration, the dashboards, the alerting, and the Game Day scenarios where we deliberately broke things to prove it all worked.

There will be real errors along the way. The kind that does not appear in tutorials.

Final architecture diagram showing app server and monitoring server with all data flows labeled

Part 1: Before We Write a Single Config File

What we are actually trying to solve

Most monitoring setups answer one question reasonably well: is the server up? They fall apart when you need to answer harder questions. Which service caused the degradation? Was it slow before it failed? Did a recent deployment trigger it?

A properly built observability stack answers four questions at all times:

  • Is my system healthy right now?

  • When did it start degrading?

  • Which component caused it?

  • What does the data show happened in the moments before it broke?

The difference between a system that can answer those questions and one that cannot is the difference between a fifteen-minute incident and a four-hour one.

The vocabulary you need

Five concepts underpin everything in this post. None of them are complicated.

  1. A metric is a number measured over time. CPU at 73%. Requests per second at 42. Memory used at 81%. Numbers that change and that you track because they tell you something about system health.

  2. A log is a text record of something that happened. A request came in, was processed, and returned an error. A service started. A configuration file was read. Logs are the narrative of your system.

  3. A trace is the journey of a single request through your system. When a user hits /api/checkout, that request might touch five services before returning a response. A trace captures every hop, every duration, every failure along the way.

  4. An SLO (Service Level Objective) is a promise about reliability. We commit to 99.5% of HTTP probes returning a successful response over any rolling 30-day window. That is an SLO.

  5. An error budget is the arithmetic consequence of an SLO. If you promise 99.5% availability, you are allowed 0.5% failure. Over 30 days, that is 3.6 hours of downtime. That is your budget. Spend it carefully.

Those five concepts connect everything else. Every dashboard panel, every alert rule, and every PromQL expression in this post traces back to one of them.

Part 2: The Architecture Decision

Why the monitoring stack lives on a separate server

The most common mistake in observability setups is running the monitoring stack on the same machine as the application it monitors. The flaw in that approach is obvious once you say it out loud: if the application server goes down, so does your monitoring. You lose visibility precisely when you need it most.

We run two servers in the same AWS VPC, connected over private networking.

Final architecture diagram showing app server and monitoring server with all data flows labeled

The app server runs the application (our fake-service simulator), node-exporter for system metrics, and an OTel Collector that ships logs and traces to the monitoring server.

The monitoring server runs Prometheus, Loki, Tempo, Grafana, Alertmanager, and Blackbox Exporter. Nothing on the monitoring server depends on the app server being alive. If the app server crashes, the monitoring server is the first place that knows.

The three network flows

Three types of traffic cross between the servers. Understanding the direction of each one matters for firewall configuration.

Prometheus uses a pull model. It reaches out to the app server every 15 seconds and collects whatever metrics are exposed. The OTel Collector on the app server uses a push model. It batches logs and traces and sends them to Loki and Tempo on the monitoring server.

Data flow across the architecture

Blackbox probes endpoints from the monitoring server’s perspective, which means it catches outages that originate anywhere between the monitoring server and the target.

Why systemd instead of Docker Compose

We ran the stack under Docker Compose initially. The debugging experience taught us why systemd is the better choice for a production-grade setup.

With Docker Compose, services like Prometheus and Loki are processes inside containers, managed by the Docker daemon. If the daemon crashes, everything crashes. Port conflicts between containers and the host are opaque. By default, logging goes through Docker’s json-file driver rather than journald, which means you cannot use journalctl to follow service output unless you change the log driver.

With systemd, each service is a first-class OS citizen. systemctl status prometheus tells you exactly what is running and why. journalctl -u loki -f streams logs directly. Security hardening directives like ProtectSystem=full, NoNewPrivileges=yes, and ReadWritePaths= are baked into each unit file, giving you kernel-enforced isolation at zero additional cost. And services survive reboots by design. systemctl enable registers them in the boot target.

Terraform provisions the servers

Both EC2 instances are provisioned with Terraform, with Elastic IPs attached so the addresses never change between restarts. The security group rules are explicit: Loki (3100), Tempo (4317, 4318), and the Grafana UI (3000) are the only ports open to traffic, and each is restricted to exactly the source that needs it.

Terraform plan output showing EC2 instances, Elastic IPs, and security group rules

Part 3: Building the Foundation

Five phases, one command each

Rather than a series of manual steps, the entire setup is automated by a script across five phases, preceded by a preflight check (phase 0). Each one validates its own output before exiting. If a phase exits non-zero, the next one does not run.

Phase 0 and 1: Preflight and layout

Before any binary is installed, the preflight script checks OS version, systemd version, available disk space, required kernel parameters, and ulimits. It then creates the full directory structure: binaries under /opt/lgtm/, configurations under /etc/lgtm/, persistent data under /var/lib/lgtm/, and secrets in /etc/lgtm/secrets at mode 600.

The secrets file is where the Slack webhook URL and Grafana admin password live. Nothing sensitive is hardcoded in config files, unit files, or command-line arguments. systemd loads the secrets file via EnvironmentFile= at service start, so the values reach each process as environment variables without ever appearing in a world-readable file.
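As a rough illustration of the two checks involved (file mode and leftover placeholders), here is a minimal Python sketch. It is not the script from our repository: the placeholder values and the function names are assumptions made for the example, and the parser only approximates how systemd reads an EnvironmentFile=.

```python
import os
import stat

# Hypothetical placeholder values; adjust to whatever your template ships with.
PLACEHOLDERS = {"", "changeme", "CHANGE_ME", "xxx"}

def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blanks and comments (an approximation)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values

def audit_secrets(path):
    """Return a list of problems found with the secrets file at `path`."""
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        problems.append(f"mode is {mode:o}, expected 600")
    with open(path) as fh:
        for key, value in parse_env_file(fh.read()).items():
            if value in PLACEHOLDERS:
                problems.append(f"{key} still holds a placeholder")
    return problems
```

An empty list means the file passes both checks; anything else is a reason to stop before starting services.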

Phase 2: Binary installation

Every binary is pinned to an explicit version. There is no latest anywhere in the install scripts. Prometheus 3.5.3, Alertmanager 0.32.1, Loki 3.7.2, Tempo 3.0.0, Grafana Enterprise 13.0.1, and OTel Collector 0.152.0. Each binary is verified with --version before the script moves on.

Terminal output showing binaries installed with version confirmations

Phase 3: Configuration

Every config file is written by the script and validated before the script exits. Prometheus rules are checked with promtool check rules. Alertmanager config is validated with amtool check-config. Loki config runs with -verify-config. OTel Collector config runs through otelcol validate.

Phase 4: systemd units

Eight unit files are installed, one per service. Every unit includes the same security hardening block:

NoNewPrivileges=yes
PrivateTmp=yes
ProtectHome=yes
ProtectSystem=full
ReadWritePaths=/var/lib/lgtm/<service>
CapabilityBoundingSet=
LimitNOFILE=65536

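ProtectSystem=full makes /usr, /boot, and /etc read-only for the service process. ReadWritePaths= explicitly names the one directory each service needs to write to. Note that full leaves paths outside those directories writable, subject to ordinary file permissions; ProtectSystem=strict is the setting that mounts the entire file system hierarchy read-only apart from /dev, /proc, /sys, and the ReadWritePaths= exceptions. Either way, the restriction is enforced by the kernel through mount namespaces, not by the service itself.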

Phase 5: Hardening and bring-up

The final script audits permissions, applies kernel parameters, validates that secrets are not placeholders, then starts services in dependency order. Each service must pass its health check before the next one starts. Node-exporter must respond at :9100/metrics before Blackbox Exporter starts. Loki must return ready before OTel Collector starts. Prometheus must show all scrape targets as UP before Grafana starts.
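The ordering logic is simple enough to sketch. The Python below is an illustration, not our bring-up script: start and is_healthy are hypothetical hooks that a caller would wire to systemctl and to each service's readiness endpoint.

```python
import time

def bring_up(services, start, is_healthy, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Start services in order; each must become healthy before the next starts.

    `services` is an ordered list of names. `start` and `is_healthy` are
    caller-supplied callables (hypothetical hooks). Returns the list of
    services that came up. Raises TimeoutError on the first service that
    never passes its health check, so later services are never started.
    """
    started = []
    for name in services:
        start(name)
        waited = 0.0
        while not is_healthy(name):
            if waited >= timeout:
                raise TimeoutError(f"{name} failed its health check after {timeout}s")
            sleep(interval)
            waited += interval
        started.append(name)
    return started
```

Failing fast on the first unhealthy service keeps the failure obvious: you debug one service, not a cascade of dependents that started against a broken upstream.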

Terminal output showing all services passing health checks sequentially

What actually broke

Tutorials skip the errors. We will not.

Smart quotes broke PromQL. When copying the CPU query from a formatted document, the quotes around "idle" came through as typographic curly quotes rather than straight ASCII quotes. Grafana's PromQL parser rejected the query with a cryptic parse error. The fix was retyping the quotes manually from the keyboard.
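If queries pass through formatted documents often, a small guard can catch this before Grafana does. A minimal Python sketch follows; the function names are ours for illustration, and the query in the usage note is an assumed example rather than the exact one from our dashboard.

```python
# Map typographic quotes to their ASCII equivalents.
SMART_QUOTES = {
    "\u201c": '"', "\u201d": '"',  # left/right double quotation marks
    "\u2018": "'", "\u2019": "'",  # left/right single quotation marks
}

def has_smart_quotes(query):
    """True if the query contains any typographic quote character."""
    return any(ch in SMART_QUOTES for ch in query)

def straighten_quotes(query):
    """Return the query with typographic quotes replaced by ASCII quotes."""
    return query.translate(str.maketrans(SMART_QUOTES))
```

For example, straighten_quotes applied to rate(node_cpu_seconds_total{mode=“idle”}[5m]) returns the same expression with plain double quotes around idle.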

The Alertmanager | default pipe does not exist. The Slack alert template used {{ .Labels.instance | default "N/A" }} which works in Helm and Sprig template engines but not in Go's standard text/template library that Alertmanager uses. The replacement was {{ if .Labels.instance }}{{ .Labels.instance }}{{ else }}N/A{{ end }}.

ReadOnlyPaths=/proc /sys / blocked node-exporter. The systemd unit had an explicit ReadOnlyPaths= directive listing /proc, /sys, and /. When systemd applied this, it remounted those paths in a restricted namespace that dropped the specific files node-exporter reads for CPU metrics. Removing the directive and relying on ProtectSystem=full alone fixed it.

The security group silently dropped OTLP traffic. After splitting to two servers, traces and logs stopped arriving in Tempo and Loki. The OTel Collector on the app server was correctly pointed at the monitoring server’s private IP, but ports 4317 and 3100 were not open in the monitoring server’s security group. The collector retried every 30 seconds with i/o timeout errors in journald. Adding three inbound rules to the Terraform security group and running terraform apply fixed it without touching either server.

Tempo and Loki were bound to 127.0.0.1. Even after opening the security group, connections were refused. Both services were configured to listen only on loopback, so traffic from the app server passed the security group and then found nothing listening on the monitoring server's private interface. Updating both configs to bind on 0.0.0.0 and restarting the services resolved it.

Part 4: The Four Golden Signals

The only four things that matter when a service is degrading

Google’s SRE team established the four golden signals as the minimum viable set of metrics for understanding service health. The argument is that no matter how complex your system, user-facing degradation nearly always shows up as a change in latency, traffic, errors, or saturation. If you measure those four things well, you have decent coverage of most classes of failure.

Latency measures how long it takes to serve a request. The key distinction is between successful request latency and error request latency. Errors that return instantly are not the same as errors that time out after 30 seconds. Tracking both separately tells you whether your system is failing fast or failing slow.

Traffic measures the demand on your system. Requests per second, active connections, jobs processed per minute. Traffic gives you context for everything else. A 5% error rate at 10 requests per second is a different problem than a 5% error rate at 10,000 requests per second.

Errors measures the rate of failed requests, including explicit failures like 5xx responses, implicit failures like returning the wrong content, and policy failures like requests that succeed but exceed a latency threshold.

Saturation measures how full your system is. CPU, memory, disk, connection pool utilisation. Saturation is a leading indicator. A service at 95% memory is not yet failing, but it is close.

The PromQL expressions for each signal:

# Latency — p50, p95, p99
histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))
# Traffic - requests per second
sum(rate(http_requests_total[1m]))
# Errors - error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# Saturation - CPU and memory
process_cpu_usage_percent
process_memory_usage_percent

Golden Signals dashboard showing all four panels with live data from fake-service

One thing that will catch you out: the latency query needs sum by(le) before the quantile function, not sum by(le, endpoint). If you include endpoint in the grouping, you get one line per endpoint per percentile. With five endpoints and three percentiles, that is fifteen lines on a single panel. The dashboard becomes unreadable. Collapse to sum by(le) for the overview panel, then use a separate panel with sum by(le, endpoint) if you want the per-endpoint breakdown.
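To see why the grouping matters, it helps to look at what the quantile function does with the buckets. The Python below is a simplified sketch of the idea (sum cumulative bucket counts per le, then interpolate linearly inside the bucket that contains the target rank). It is our approximation for illustration, not Prometheus's implementation, and it ignores edge cases such as negative bucket bounds.

```python
import math
from collections import defaultdict

def sum_by_le(series):
    """Merge per-endpoint bucket lists [(le, cumulative_count), ...] into one."""
    merged = defaultdict(float)
    for buckets in series:
        for le, count in buckets:
            merged[le] += count
    return sorted(merged.items())

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets; the +Inf bucket must be present."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Calling histogram_quantile on sum_by_le of all endpoints gives one number per percentile, which is the overview panel. Calling it on each endpoint's buckets separately gives the per-endpoint breakdown.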

Part 5: SLOs and Error Budgets

Turning reliability into a number you can act on

The question “how reliable is reliable enough?” sounds philosophical until you put numbers to it. An SLO forces that conversation to happen before an incident, not during one.

Our availability SLO is 99.5%. That number was not picked arbitrarily. It reflects the actual availability we can sustain given our infrastructure, our deployment frequency, and our team’s capacity to respond to incidents.

The error budget calculation is straightforward:

Error budget = (1 - SLO target) x measurement window
             = (1 - 0.995) x 30 days
             = 0.005 x 43,200 minutes
             = 216 minutes
             = 3.6 hours per month

That is the total amount of downtime we can have before we breach our SLO. Every minute of degradation consumes from that budget.
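The same arithmetic as a small Python sketch, useful if you want to try other targets or windows. The function names are ours for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an SLO target over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the budget still unspent; a negative value means the SLO is breached."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - downtime_minutes / budget
```

With a 99.5% target over 30 days the budget comes out at 216 minutes, and 54 minutes of downtime leaves 75% of it.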

Burn rate: the metric that changes how you alert

Tracking whether you are within budget is useful. Tracking how fast you are consuming it is what enables proactive response.

A burn rate of 1x means you are consuming budget at exactly the rate your SLO allows. A burn rate of 14.4x means you are consuming it 14.4 times faster than that, and at that rate you will exhaust the entire 30-day budget in about 50 hours (720 hours / 14.4).

We alert on burn rate rather than on raw availability. Two thresholds:

  • Fast burn (critical): burn rate above 14.4x for two minutes. Act immediately.

  • Slow burn (warning): burn rate above 5x for fifteen minutes. Needs attention before it escalates.

The fast burn threshold sounds aggressive, but consider what it catches. Against a 0.5% error allowance, a 14.4x burn rate means roughly 7.2% of probes are failing. If you wait for a raw availability threshold alert, you might catch it after the budget is already gone. Burn rate alerts catch it while you still have room to respond.

# probe_success is a 0/1 gauge, so average it over the window instead of using rate()
# Fast burn rate (1h window): error ratio divided by the 0.5% error budget
(1 - avg(avg_over_time(probe_success{job="blackbox-http"}[1h]))) / 0.005
# Slow burn rate (6h window)
(1 - avg(avg_over_time(probe_success{job="blackbox-http"}[6h]))) / 0.005
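For intuition, burn rate is just the observed error ratio divided by the error ratio the SLO allows. A minimal Python sketch of that relationship, with function names chosen for this example:

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the budget is being consumed."""
    return error_ratio / (1 - slo_target)

def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate
```

At a 99.5% target, a 7.2% probe failure ratio is a 14.4x burn, and a constant 14.4x burn empties a 30-day budget in 50 hours. At 1x the budget lasts exactly the full 720 hours.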

SLO and Error Budget dashboard showing availability gauge, budget remaining gauge, and burn rate time series with threshold lines

The error budget policy

A budget without a policy is just a number. Our policy defines what happens at each threshold:

  • 0 to 50% consumed: Normal operations. Feature work continues.

  • 50 to 75% consumed: Review ongoing changes. Increase monitoring attention.

  • 75 to 99% consumed: Pause non-critical deployments. Incident team on standby.

  • 100% consumed: Feature freeze. Reliability sprint only. SLO review meeting within 24 hours.

The policy makes the budget real. It connects a technical metric to an engineering team’s behaviour.
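If you want the policy to show up in tooling (a dashboard annotation or a deployment gate, say), it reduces to a lookup. The sketch below is illustrative only; how it treats the exact boundaries (50, 75, and anything between 99 and 100) is our assumption, since the policy text does not spell that out.

```python
def policy_action(consumed_percent):
    """Map error budget consumption (in percent) to the policy tier above."""
    if consumed_percent >= 100:
        return "Feature freeze. Reliability sprint only."
    if consumed_percent >= 75:
        return "Pause non-critical deployments. Incident team on standby."
    if consumed_percent >= 50:
        return "Review ongoing changes. Increase monitoring attention."
    return "Normal operations. Feature work continues."
```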

Part 6: The Dashboards

What good looks like before you explain how to build it

Every dashboard in this stack was built in the Grafana UI, validated against live data, then exported as JSON and dropped into /etc/lgtm/grafana/dashboards/. Grafana picks up new files within 30 seconds without a restart.


The dashboards are organised by audience, not by data source. Four folders, four audiences.

  • Infrastructure (operations team): Is the machine healthy?

  • Reliability (SRE team): Is the service keeping its promises?

  • Delivery (engineering leads): Is the team shipping well?

  • Observability (anyone debugging): Where did it break?

Infrastructure dashboards

The Node Exporter dashboard covers CPU usage by core and total, memory broken into used, cached, and available, disk I/O read and write rates, network receive and transmit, and system load averages at one, five, and fifteen minute windows.

Node Exporter dashboard showing all six panels with live system data

The Blackbox Exporter dashboard shows probe success status as a stat panel (UP in green, DOWN in red), HTTP response time by phase (DNS lookup, TCP connect, TLS handshake, server processing, content transfer), SSL certificate expiry in days remaining, and probe success rate averaged over one hour.

Blackbox Exporter dashboard with probe status, response time breakdown, and SSL expiry countdown

Reliability dashboards

The Golden Signals dashboard surfaces the four signals discussed in Part 4. One panel per signal, clean lines, thresholds coloured to match severity.

The SLO and Error Budget dashboard is built around four gauge panels at the top (availability SLI percentage, error budget remaining, latency SLI percentage, and budget consumed) followed by the burn rate time series with the fast and slow burn threshold lines drawn as reference lines.

SLO dashboard with four gauges and burn rate time series

The HTTP Errors dashboard breaks down 5xx and 4xx rates separately, then shows the same data summed by endpoint. That second view is what tells you whether /api/checkout is responsible for 80% of your errors.

The unified observability dashboard

This is the dashboard that justifies building the whole stack.

The entry point is the error rate spike panel. A spike appears. You drag to select the time window. The correlated logs panel below it automatically filters to that window, showing every log line from the fake-service during the spike. Each log line carries a traceID field. You click it. Grafana opens Tempo directly to that trace.

Unified dashboard showing error spike panel, correlated log lines with clickable traceIDs, and Tempo trace view

In Tempo, you see the full request journey. The parent span shows a 2.1 second request. The child db.query span consumed 1.3 seconds of that. The slow database call caused the latency. You have gone from a metric spike to the exact operation responsible, without leaving Grafana.

This drill-down works because every log line from the fake-service includes a traceID field that matches the trace stored in Tempo. The Loki datasource in Grafana has a derived field configured to detect that pattern and render it as a clickable link. The configuration lives in datasources.yml:

derivedFields:
  - datasourceUid: tempo
    matcherRegex: "traceID=(\\w+)"
    name: TraceID
    url: "$${__value.raw}"
    urlDisplayLabel: "Open in Tempo"
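The matcherRegex is worth testing against a real log line before you rely on it. Here is a quick Python check of the same pattern; the sample log lines are invented for illustration and are not taken from the fake-service.

```python
import re

# Same pattern as matcherRegex above, without the YAML escaping.
TRACE_ID = re.compile(r"traceID=(\w+)")

def extract_trace_id(line):
    """Return the captured trace ID, or None if the line does not match."""
    match = TRACE_ID.search(line)
    return match.group(1) if match else None
```

Note that the pattern expects a key=value style field. A JSON log line that writes "traceID":"..." would not match and would need a different regex.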

DORA metrics dashboard

DORA dashboard showing deployment frequency classification, lead time breakdown, CFR gauge, and MTTR stat panel

The DORA dashboard shows four stat panels at the top with colour-coded benchmark classifications (Elite, High, Medium, Low per DORA research benchmarks), followed by time series for deployment frequency over the measured period and lead time broken into its sub-intervals: commit to pipeline trigger, trigger to build complete, build complete to deployment confirmed.

Part 7: Alerting That Does Not Lie

What a useful alert looks like

Most alert fatigue comes from alerts that tell you something is wrong without telling you what to do about it. A page that says “CPU high on prod-server-1” at 3am is not useful. A Slack message that tells you the severity, the affected host, the current metric value, a link to the dashboard showing the spike, and a link to the runbook explaining the first investigation steps is useful.

Slack alert showing firing notification with all fields annotated: name, severity, host, value, dashboard link, runbook link

Slack alert showing resolved notification with duration of the incident

Every alert in this stack includes:

  • Alert name and severity label

  • Affected host or instance

  • Current metric value at fire time

  • A direct link to the Grafana dashboard panel

  • A direct link to the Markdown runbook in the repository

The Slack template that produces this payload uses Go’s text/template syntax. One thing to note: the | default pipe function does not exist in this template engine. If you are copying templates from Helm chart examples, this will catch you.

Infrastructure alerts

Four alert categories cover the infrastructure signals:

CPU warning fires when usage exceeds 80% for five minutes. CPU critical fires when it exceeds 90% for ten minutes. The for: duration is intentional. Without it, a momentary spike from a cron job triggers a page every night.

Memory follows the same 80% warning and 90% critical pattern. Disk warns at 75% and goes critical at 90%. Service downtime fires when the Blackbox probe fails for two consecutive minutes.
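The effect of the for: duration is easiest to see on a timeline. The Python below is a simplified model of the pending and firing states, not Prometheus's evaluation loop: it assumes one sample per evaluation and ignores details such as missing samples.

```python
def alert_states(samples, threshold, for_seconds):
    """Classify each (timestamp_seconds, value) sample as inactive, pending, or firing.

    An alert is pending while the condition holds and becomes firing only
    once it has held continuously for `for_seconds`.
    """
    states = []
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            held = ts - breach_start
            states.append("firing" if held >= for_seconds else "pending")
        else:
            breach_start = None  # any recovery resets the clock
            states.append("inactive")
    return states
```

A two-minute cron spike above 80% never gets past pending with a five-minute for: window, while a sustained breach starts firing once the five minutes have elapsed.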

Burn rate alerts

Two burn rate alerts replace what would otherwise be a dozen availability threshold alerts:

# Fast burn - critical
# probe_success is a 0/1 gauge, so avg_over_time gives the success ratio
expr: |
  (1 - avg(avg_over_time(probe_success{job="blackbox-http"}[1h])))
  > (14.4 * 0.005)
for: 2m
labels:
  severity: critical
# Slow burn - warning
expr: |
  (1 - avg(avg_over_time(probe_success{job="blackbox-http"}[6h])))
  > (5 * 0.005)
for: 15m
labels:
  severity: warning

Inhibition rules

When a host is completely unreachable, Alertmanager suppresses the CPU, memory, latency, and disk alerts for that same host. The reasoning: if the server is down, the symptom alerts are noise. Alert on the cause (service down) and silence everything downstream.

inhibit_rules:
  - source_match:
      alertname: "ServiceDown"
      severity: "critical"
    target_match_re:
      alertname: "CPU.*|Memory.*|Disk.*|Latency.*"
    equal: ["instance"]
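To make the matching semantics concrete, here is a small Python model of how a rule like this decides whether to suppress an alert. It is a sketch of the logic as we understand it, not Alertmanager's code, and the alert names in the usage note are hypothetical.

```python
import re

def is_inhibited(target, firing, rule):
    """Return True if `target` (a dict of labels) is suppressed by `rule`.

    `firing` is a list of label dicts for currently firing alerts. `rule`
    mirrors the YAML keys: source_match, target_match_re, equal.
    """
    # The target must match every regex; patterns are anchored to the whole value.
    for name, pattern in rule["target_match_re"].items():
        if not re.fullmatch(pattern, target.get(name, "")):
            return False
    for source in firing:
        if source is target:
            continue
        source_ok = all(source.get(k) == v for k, v in rule["source_match"].items())
        same_scope = all(source.get(k) == target.get(k) for k in rule["equal"])
        if source_ok and same_scope:
            return True
    return False
```

With the rule above, a CPUCritical alert on app-1 is suppressed while ServiceDown is firing for app-1, but the same alert on a different instance still goes through, because the equal list requires matching instance labels.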

Runbooks

Every alert has a corresponding Markdown runbook in the repository. A runbook answers six questions:

  • What is this alert?

  • What is the likely cause?

  • What are the first four investigation steps?

  • How do I resolve it?

  • Should I roll back, and when?

  • Who do I escalate to and when?

The runbook link is embedded in the Alertmanager template so it appears in every Slack notification automatically.

Part 8: DORA Metrics

Connecting reliability to how well your team ships

The four DORA metrics measure your engineering team’s delivery performance. They complement the reliability signals by answering a different set of questions: not “is the system healthy” but “is the team shipping sustainably.”

Deployment frequency measures how often you deploy to production. Elite performers deploy multiple times per day. The DORA research shows that higher deployment frequency correlates with lower change failure rate, not higher. Teams that deploy often are forced to keep changes small and reversible.

Lead time for changes measures the time from a commit to that commit running in production. It breaks into three sub-intervals: commit to pipeline trigger, trigger to build complete, build complete to deployment confirmed. Long lead times indicate bottlenecks in the pipeline. Short lead times require small, well-tested changes.

Change failure rate measures the percentage of deployments that cause a degradation requiring a rollback or hotfix. The DORA elite benchmark is below 5%. Above 15% indicates a testing or deployment process problem.

Mean time to restore measures how quickly you recover from a failure. This is where your observability stack has a direct impact. Teams with good dashboards, correlated logs, and trace data restore service faster than teams guessing from raw server logs.
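Two of these metrics fall straight out of a list of deployment records. The Python sketch below shows the arithmetic; the record fields (committed_at, deployed_at, failed) are assumptions for the example, not the schema our pipeline uses.

```python
from datetime import datetime
from statistics import median

def change_failure_rate(deployments):
    """Percentage of deployments flagged as failed (0.0 for an empty list)."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d["failed"])
    return 100 * failures / len(deployments)

def median_lead_time_minutes(deployments):
    """Median minutes from commit to confirmed deployment."""
    return median(
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 60
        for d in deployments
    )
```

Using the median rather than the mean keeps one unusually slow release from dominating the lead time figure.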

How metrics get into Prometheus

GitHub Actions pushes DORA metrics to Pushgateway after every deployment. Prometheus scrapes Pushgateway every 15 seconds. The metrics appear in Grafana within a minute of a deployment completing.
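A push to Pushgateway is a plain HTTP request with a text exposition body sent to /metrics/job/<job>. The Python sketch below shows the shape of it using only the standard library. The metric names and the gateway URL in the usage note are hypothetical, and this is not the step from our workflow.

```python
import urllib.request

def exposition(metrics):
    """Render {name: value} as text exposition lines (no labels, no type hints)."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

def push(gateway_url, job, metrics, timeout=5):
    """PUT the metrics to Pushgateway, replacing the existing group for `job`."""
    request = urllib.request.Request(
        f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=exposition(metrics).encode(),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call might look like push("http://monitoring.internal:9091", "dora", {"dora_deployment_failed": 0}). PUT replaces every metric in the group, so send the full set each time; POST would replace only metrics with the same names.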

GitHub Actions workflow run showing the DORA metrics push step completing successfully

The workflow has three jobs in dependency order. The infrastructure job runs Terraform to provision or update servers and captures the output IPs. The deploy job runs after infrastructure and deploys the application. The DORA metrics job runs after both and always runs, even if the deploy job failed, because a failed deployment is exactly the kind of event you want recorded.

dora-metrics:
  needs: [infrastructure, deploy]
  if: always()

DORA dashboard showing deployment frequency classification, lead time breakdown, CFR gauge, and MTTR stat panel

Part 9: Game Day

Breaking things on purpose

Game Day is where you prove the stack works. Not by checking that dashboards load, but by triggering real failure scenarios and verifying that the right alerts fire, the right data appears in the right places, and the recovery path is visible.

One scenario, documented with screenshots at each step.

Scenario: Latency injection

We ran the chaos script against the fake-service, forcing all requests to the /api/checkout endpoint into the error path where latency runs between 200ms and 2 seconds:

./chaos.sh error-burst

Within one minute, the p99 latency panel showed the spike. The SLO burn rate climbed above the fast burn threshold.

Golden Signals dashboard showing latency spike on p99 panel

SLO dashboard showing latency SLI depletion

The fast burn critical alert fired in Slack.

Slack showing SLO Fast Burn critical alert

We clicked through to the unified observability dashboard. The error rate panel showed the spike. The correlated logs panel showed the error log lines. We clicked a traceID.

Tempo showing the full trace with parent span and slow db.query child span highlighted

The trace showed the db.query span consuming 1.4 seconds of a 1.8 second request. Root cause identified in under two minutes from alert to trace.

Part 10: What We Learned

What worked, what surprised us, and what we would do differently

What worked well from the start:

The systemd security hardening was the right call. ProtectSystem=full and ReadWritePaths= meant that even when we made configuration mistakes, the blast radius was contained. A misconfigured Prometheus could not modify anything under /usr, /boot, or /etc.

The two-server architecture made debugging significantly cleaner. When traces stopped arriving, we immediately knew the problem was between the servers, not inside either one. The separation of concerns made the failure domain obvious.

Burn rate alerting is genuinely better than threshold alerting. The first time the fast burn alert fired during a chaos test before any threshold alert would have triggered, the value was immediately clear.

What surprised us:

The OTel Collector binary path changes between the core and contrib builds. The installation script assumed /usr/bin/otelcol but the actual binary was at a different path depending on which package was installed. Every systemd unit and every validation call in the configure script had to reference the correct path.

Terraform security group rules need to exist before services start. We provisioned servers and ran scripts before adding the inter-server firewall rules. The OTel Collector spent its first hour retrying connections that were being silently dropped by the security group.

What we would do differently at scale:

Elastic IPs from day one. We added them later and had to update several config files, GitHub secrets, and Terraform outputs to reflect the new stable addresses. Starting with them would have saved the work.

Object storage (S3) backends for Loki and Tempo. Local filesystem works for a 30-day retention window on a single server, but it does not survive an instance replacement. Object storage backends decouple the data from the server.

Tail-based trace sampling instead of 100% capture. We capture every trace the fake-service emits. At production traffic volumes, this would be expensive. Tail-based sampling lets you keep all traces for errors and slow requests while discarding most traces for fast, successful requests.
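The decision rule behind tail-based sampling fits in a few lines. The Python below is a sketch of the policy only; in practice the OTel Collector's contrib distribution provides a tail_sampling processor for this. The threshold, the baseline rate, and the span fields are assumptions for the example.

```python
import zlib

def keep_trace(trace_id, spans, slow_ms=1000, baseline_rate=0.05):
    """Decide whether to keep a completed trace.

    `spans` is a list of dicts with "duration_ms" and "error" keys.
    Errors and slow traces are always kept; the rest are sampled at
    `baseline_rate` using a hash of the trace ID, so the decision is
    deterministic for a given trace.
    """
    if not spans:
        return False
    if any(span["error"] for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < baseline_rate * 10_000
```

The decision has to wait until the trace is complete, which is why it happens in the collector rather than in the application.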

A self-hosted GitHub Actions runner inside the VPC. The DORA metrics push uses SSH from a public runner to the monitoring server. A runner inside the VPC would push directly over private networking without exposing the monitoring server to any public traffic.

Part 11: What Is Next

Three concrete steps if you want to build this yourself:

Start with the preflight script on a single server. Clone the repository, provision a single Ubuntu server with terraform init && terraform apply, then run the preflight script on it. The preflight validates your environment and creates the directory structure without installing anything. If it exits with errors, fix them before going further.

Add your own application endpoints to Blackbox. Once the stack is running, edit /etc/lgtm/prometheus/prometheus.yml and replace the example targets with your actual endpoints. Reload Prometheus with sudo lgtm-reload prometheus. The Blackbox dashboard will start showing real probe data within 15 seconds.

Instrument a real service with the OTel SDK. The fake-service is a simulator. The real value comes from instrumenting your actual application. The OpenTelemetry SDK exists for Python, Go, Node.js, Java, and most other languages. Point it at the OTel Collector’s gRPC endpoint and your traces will appear in Tempo automatically.
