ADR-013: Cloudflare-Native Observability with Incremental LGTM

Status: Accepted
Date: 2026-03-08

Context

The application runs on Cloudflare Workers and includes payment, webhook, licensing, and sharing flows where operational visibility is critical.
We need an observability strategy that works reliably in the Workers runtime, keeps telemetry cost under control, and is quick to operate.

Decision

Use a Cloudflare-first observability baseline, then add LGTM incrementally:

  1. Start with Cloudflare-native logs, analytics, and alerts as the default foundation.
  2. Forward structured logs to Loki for centralized search and retention.
  3. Add sampled OpenTelemetry traces to Tempo only for critical paths:
    • /api/v1/payments/*
    • /api/v1/payments/webhook
    • /api/v1/download/*
  4. Add minimal low-cardinality metrics for service health and business-critical flows.
  5. Use Grafana dashboards and alerting for incident response.
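
As a minimal sketch of how step 3 could be gated in code, the helper below checks whether a request path falls on one of the critical routes listed above. The route patterns are taken from this ADR; the names `TRACED_ROUTE_PATTERNS`, `matchesPattern`, and `isTracedRoute` are hypothetical, and the trailing `/*` is assumed to mean "anything under this prefix".

```typescript
// Route patterns copied from the decision above. The helper names are
// illustrative assumptions, not part of any existing codebase.
const TRACED_ROUTE_PATTERNS: readonly string[] = [
  "/api/v1/payments/*",
  "/api/v1/payments/webhook",
  "/api/v1/download/*",
];

// Assumption: a trailing "/*" matches any path under that prefix;
// every other pattern must match the path exactly.
function matchesPattern(pathname: string, pattern: string): boolean {
  if (pattern.endsWith("/*")) {
    const prefix = pattern.slice(0, -1); // drop "*", keep the trailing "/"
    return pathname.startsWith(prefix);
  }
  return pathname === pattern;
}

// True when the path is on one of the routes that should carry traces.
function isTracedRoute(pathname: string): boolean {
  return TRACED_ROUTE_PATTERNS.some((pattern) => matchesPattern(pathname, pattern));
}
```

Under this reading, the webhook route is already covered by the payments wildcard. Keeping it as a separate entry mirrors the list above and keeps the webhook visible as its own critical path.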

Rationale

  • Cloudflare-native tooling is the closest fit for the Workers runtime, which cannot host a sidecar or agent process.
  • Lowest implementation risk with immediate operational value.
  • Avoids over-instrumenting early and controls telemetry egress cost.
  • Preserves flexibility to deepen observability as production load grows.

Scope and Rollout

Phase 1: Baseline (Now)

  • Cloudflare-native logs and alerts.
  • Structured JSON logs with request correlation IDs.
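
A structured log line with a correlation ID could look like the sketch below. The field names (`ts`, `level`, `msg`, `request_id`) and the `formatLogLine` helper are assumptions made for illustration; the ADR does not prescribe a schema.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Extra context fields; values are kept to JSON primitives on purpose.
interface LogFields {
  [key: string]: string | number | boolean | null;
}

// Builds one JSON log line. `requestId` is the correlation ID shared by
// every line emitted while handling the same request.
// Hypothetical schema: ts, level, msg, request_id.
function formatLogLine(
  level: LogLevel,
  message: string,
  requestId: string,
  fields: LogFields = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    // Caller fields are spread first so the reserved keys below always win.
    ...fields,
    ts: now.toISOString(),
    level,
    msg: message,
    request_id: requestId,
  });
}
```

In a Worker, the correlation ID might be generated once per request (for example with `crypto.randomUUID()`) or reused from an incoming header, and the resulting line written with `console.log`. Which source to use is left open here.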

Phase 2: LGTM Logs

  • Ship logs to Loki.
  • Build dashboards for error rate, latency, and route-level failures.
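
To illustrate the log shipping step, the sketch below builds a request body in the JSON shape accepted by Loki's push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and `[timestamp in nanoseconds, line]` pairs. The `LogEntry` type, the `buildLokiPushBody` helper, and the example labels are assumptions. Transport, batching, and authentication are deliberately left out.

```typescript
// One stream in Loki's push format: a label set plus [ns timestamp, line] pairs.
interface LokiStream {
  stream: Record<string, string>;
  values: [string, string][];
}

interface LokiPushBody {
  streams: LokiStream[];
}

// Hypothetical in-app representation of a buffered log line.
interface LogEntry {
  timestampMs: number; // Unix epoch in milliseconds
  line: string; // the already-serialized JSON log line
}

// Builds a push body with a single stream. Labels should stay
// low-cardinality (service, environment), never request or user IDs.
function buildLokiPushBody(
  labels: Record<string, string>,
  entries: LogEntry[],
): LokiPushBody {
  return {
    streams: [
      {
        stream: labels,
        values: entries.map((entry): [string, string] => [
          // Loki expects nanoseconds as a string; appending six zeros
          // converts whole milliseconds without losing integer precision.
          `${Math.trunc(entry.timestampMs)}000000`,
          entry.line,
        ]),
      },
    ],
  };
}
```

The body would then be sent as `application/json` in a POST to the push endpoint. Values such as the request ID stay inside the log line rather than in labels, so they remain searchable without multiplying streams.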

Phase 3: Critical Tracing

  • OTLP HTTP export to Tempo.
  • Sampling enabled to reduce overhead.
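
One way to implement the sampling mentioned above is a deterministic, trace-ID-based ratio check, sketched below. This is a hand-rolled simplification for illustration, not the OpenTelemetry SDK's own sampler. The `shouldSample` name and the choice to use the first 8 hex characters of the trace ID are assumptions.

```typescript
// Decides whether a trace is kept, based only on its ID and a target ratio.
// The same trace ID always yields the same decision, so every span of a
// trace is either exported or dropped together.
function shouldSample(traceIdHex: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;

  // Interpret the first 8 hex characters as an integer in [0, 2^32 - 1].
  const bucket = parseInt(traceIdHex.slice(0, 8), 16);
  if (Number.isNaN(bucket)) return false; // malformed ID: do not sample

  // Keep the trace when its bucket falls in the lowest `ratio` share.
  return bucket < ratio * 0x100000000;
}
```

For example, a ratio of `0.1` keeps roughly one in ten traces when trace IDs are uniformly random. Sampled spans would then be exported over OTLP/HTTP, whose default traces path is `/v1/traces`.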

Phase 4: Minimal Metrics

  • Add:
    • request count
    • error count
    • p95 latency
    • webhook success/failure rate
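
The four metrics above can be kept low-cardinality by collapsing concrete paths into a small fixed set of route labels before counting. The sketch below shows one such approach. The label names, the `RouteMetrics` class, the nearest-rank percentile method, and the rule that only 5xx responses count as errors are all assumptions for illustration.

```typescript
// Collapses concrete paths into a fixed, small set of labels so that IDs in
// the URL never become metric dimensions. Label names are hypothetical.
function routeLabel(pathname: string): string {
  if (pathname === "/api/v1/payments/webhook") return "payments_webhook";
  if (pathname.startsWith("/api/v1/payments/")) return "payments";
  if (pathname.startsWith("/api/v1/download/")) return "download";
  return "other";
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

interface RouteSnapshot {
  requestCount: number;
  errorCount: number;
  p95LatencyMs: number | null;
}

// Illustrative in-memory aggregator keyed by route label.
class RouteMetrics {
  private readonly requests = new Map<string, number>();
  private readonly errors = new Map<string, number>();
  private readonly latencies = new Map<string, number[]>();

  record(pathname: string, status: number, latencyMs: number): void {
    const route = routeLabel(pathname);
    this.requests.set(route, (this.requests.get(route) ?? 0) + 1);
    // Assumption: only 5xx responses count as errors.
    if (status >= 500) {
      this.errors.set(route, (this.errors.get(route) ?? 0) + 1);
    }
    const samples = this.latencies.get(route) ?? [];
    samples.push(latencyMs);
    this.latencies.set(route, samples);
  }

  snapshot(route: string): RouteSnapshot {
    const samples = this.latencies.get(route) ?? [];
    return {
      requestCount: this.requests.get(route) ?? 0,
      errorCount: this.errors.get(route) ?? 0,
      p95LatencyMs: samples.length > 0 ? percentile(samples, 95) : null,
    };
  }
}
```

The webhook failure rate then falls out as `errorCount / requestCount` for the `payments_webhook` label. Worker isolates do not share memory and can be recycled at any time, so this aggregator only shows the shape of the data. In practice the same values would be emitted per request or derived from the logs.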

Consequences

Positive

  • Fast, Cloudflare-compatible visibility.
  • Clear path from minimal to advanced observability.
  • Better incident triage for payment/webhook regressions.

Negative

  • Two control planes initially (Cloudflare + Grafana).
  • Requires discipline to keep metric cardinality low.
  • Tracing depth is intentionally limited at first.

Non-Goals

  • Full-system deep tracing for all routes from day one.
  • High-cardinality metrics across every domain entity.