ADR-013: Cloudflare-Native Observability with Incremental LGTM
Status: Accepted
Date: 2026-03-08
Context
The application runs on Cloudflare Workers and includes payment, webhook, licensing, and sharing flows where operational visibility is critical.
We need an observability strategy that works reliably in the Workers runtime, keeps cost under control, and is quick to operate.
Decision
Use a Cloudflare-first observability baseline, then add the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) incrementally:
- Start with Cloudflare native logs/analytics/alerts as the default foundation.
- Forward structured logs to Loki for centralized search and retention.
- Add sampled OpenTelemetry traces to Tempo only for critical paths:
  - /api/v1/payments/*
  - /api/v1/payments/webhook
  - /api/v1/download/*
- Add minimal low-cardinality metrics for service health and business-critical flows.
- Use Grafana dashboards and alerting for incident response.
Rationale
- Cloudflare-native tooling is the best fit for the Workers runtime.
- Lowest implementation risk with immediate operational value.
- Avoids over-instrumenting early and controls telemetry egress cost.
- Preserves flexibility to deepen observability as production load grows.
Scope and Rollout
Phase 1: Baseline (Now)
- Cloudflare native logs + alerts.
- Structured JSON logs with request correlation IDs.
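A minimal sketch of what a structured log line with a request correlation ID could look like. The field names, the `x-request-id` header, and both helper functions are assumptions for illustration, not an agreed schema.

```typescript
// Field names below are illustrative assumptions, not an agreed log schema.
interface LogFields {
  level: "info" | "warn" | "error";
  message: string;
  requestId: string;
  route: string;
  status?: number;
  durationMs?: number;
}

// Minimal view of a header collection, so the helper accepts the standard
// Headers object as well as a plain test double.
interface HeaderReader {
  get(name: string): string | null;
}

// Reuse the caller's correlation ID when one is supplied; otherwise mint one.
// "x-request-id" is an assumed header name.
function correlationId(headers: HeaderReader, generate: () => string): string {
  return headers.get("x-request-id") ?? generate();
}

// Serialize one log event as a single JSON line with an ISO-8601 timestamp.
function formatLogLine(fields: LogFields, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), ...fields });
}
```

In a Worker, `generate` could be `() => crypto.randomUUID()`, and the resulting line could be written with `console.log` so that it appears in the Cloudflare logs.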
Phase 2: LGTM Logs
- Ship logs to Loki.
- Build dashboards for error rate, latency, and route-level failures.
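As a sketch of the shipping step, the function below builds the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`). The `LogEntry` type, the function name, and the example labels are assumptions; how the lines leave Cloudflare (for example a forwarding Worker or a log pipeline) is not decided here.

```typescript
interface LogEntry {
  tsMs: number; // event time, integer milliseconds since the Unix epoch
  line: string; // the already-serialized JSON log line
}

interface LokiPushBody {
  streams: Array<{
    stream: Record<string, string>;
    values: Array<[string, string]>;
  }>;
}

// Build one stream whose labels stay low-cardinality (for example service
// and environment); per-request data such as the correlation ID stays
// inside the log line itself.
function buildLokiPush(labels: Record<string, string>, entries: LogEntry[]): LokiPushBody {
  return {
    streams: [
      {
        stream: labels,
        // Loki expects nanosecond timestamps encoded as strings, so the
        // millisecond value is padded with six zeros.
        values: entries.map((e): [string, string] => [`${Math.trunc(e.tsMs)}000000`, e.line]),
      },
    ],
  };
}
```

The body would be sent with `Content-Type: application/json`. Batching several entries per push keeps the number of outbound requests low.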
Phase 3: Critical Tracing
- Export traces to Tempo over OTLP/HTTP.
- Enable sampling to limit runtime overhead and telemetry egress cost.
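One way to express "trace only critical paths, and only a sample of them" is a head-based probabilistic check made at the start of each request. The sketch below assumes the critical routes are the payment, webhook, and download paths named in the Decision; the function names and the sample rate are hypothetical.

```typescript
// Prefixes covering the critical routes. "/api/v1/payments/" also covers
// "/api/v1/payments/webhook".
const CRITICAL_PREFIXES = ["/api/v1/payments/", "/api/v1/download/"];

function isCriticalPath(pathname: string): boolean {
  return CRITICAL_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}

// Head-based sampling: decide once per request whether to record a trace.
// `roll` is injectable so the decision can be tested deterministically.
function shouldTrace(pathname: string, sampleRate: number, roll: number = Math.random()): boolean {
  return isCriticalPath(pathname) && roll < sampleRate;
}
```

With a hypothetical `sampleRate` of `0.1`, roughly one in ten requests on the critical paths would be traced, and no other route would produce spans.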
Phase 4: Minimal Metrics
- Add:
  - request count
  - error count
  - p95 latency
  - webhook success/failure rate
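To make the low-cardinality requirement concrete, the sketch below maps raw paths onto a small fixed set of route labels and pins down the p95 and success-rate definitions. All names are hypothetical, and in practice the percentile would more likely be computed in the metrics backend than inside a Worker.

```typescript
// Collapse raw paths into a bounded set of labels so that IDs embedded in
// URLs never become metric label values.
function routeLabel(pathname: string): string {
  if (pathname === "/api/v1/payments/webhook") return "/api/v1/payments/webhook";
  if (pathname.startsWith("/api/v1/payments/")) return "/api/v1/payments/*";
  if (pathname.startsWith("/api/v1/download/")) return "/api/v1/download/*";
  return "other";
}

// p95 by the nearest-rank method: the smallest sample such that at least
// 95% of samples are less than or equal to it.
function p95(samples: number[]): number | undefined {
  if (samples.length === 0) return undefined;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Webhook success rate as a fraction in [0, 1]; undefined when no webhook
// deliveries were observed.
function successRate(success: number, failure: number): number | undefined {
  const total = success + failure;
  return total === 0 ? undefined : success / total;
}
```

Keeping `routeLabel` as the only source of route label values bounds the number of time series by the number of branches in that function.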
Consequences
Positive
- Fast, Cloudflare-compatible visibility.
- Clear path from minimal to advanced observability.
- Better incident triage for payment/webhook regressions.
Negative
- Two control planes initially (Cloudflare + Grafana).
- Requires discipline to keep metric cardinality low.
- Tracing depth is intentionally limited at first.
Non-Goals
- Full-system deep tracing for all routes from day one.
- High-cardinality metrics across every domain entity.