ADR-013: Cloudflare-Native Observability with Incremental LGTM

Status: Accepted
Date: 2026-03-08

Context

The application runs on Cloudflare Workers and includes payment, webhook, licensing, and sharing flows where operational visibility is critical.
We need an observability strategy that is reliable in Workers runtime, cost-aware, and quick to operate.

Decision

Use a Cloudflare-first observability baseline, then add LGTM incrementally:

Start with Cloudflare native logs/analytics/alerts as the default foundation.
Forward structured logs to Loki for centralized search and retention.
Add sampled OpenTelemetry traces to Tempo only for critical paths:
- /api/v1/payments/*
- /api/v1/payments/webhook
- /api/v1/download/*
Add minimal low-cardinality metrics for service health and business-critical flows.
Use Grafana dashboards and alerting for incident response.

Rationale

Best runtime fit for Cloudflare Workers.
Lowest implementation risk with immediate operational value.
Avoids over-instrumenting early and controls telemetry egress cost.
Preserves flexibility to deepen observability as production load grows.

Scope and Rollout

Phase 1: Baseline (Now)

Cloudflare native logs + alerts.
Structured JSON logs with request correlation IDs.

Phase 2: LGTM Logs

Ship logs to Loki.
Build dashboards for error rate, latency, and route-level failures.

Phase 3: Critical Tracing

OTLP HTTP export to Tempo.
Sampling enabled to reduce overhead.

Phase 4: Minimal Metrics

Add:
- request count
- error count
- p95 latency
- webhook success/failure rate

Consequences

Positive

Fast, Cloudflare-compatible visibility.
Clear path from minimal to advanced observability.
Better incident triage for payment/webhook regressions.

Negative

Two control planes initially (Cloudflare + Grafana).
Requires discipline to keep metric cardinality low.
Tracing depth is intentionally limited at first.

Non-Goals

Full-system deep tracing for all routes from day one.
High-cardinality metrics across every domain entity.

Context​

Decision​

Rationale​

Scope and Rollout​

Phase 1: Baseline (Now)​

Phase 2: LGTM Logs​

Phase 3: Critical Tracing​

Phase 4: Minimal Metrics​

Consequences​

Positive​

Negative​

Non-Goals​

Related​