Cloudflare Observability Baseline Runbook
Scope
Baseline observability for the Cloudflare Workers deployment before full OTel/LGTM rollout.
This runbook covers:
- Cloudflare logs and analytics setup
- Initial alert rules
- Incident triage workflow
- Handoff path to Loki/Tempo/Grafana
Prerequisites
- Worker deployed with
observability.enabled = trueinwrangler.jsonc. - Structured JSON logging enabled in backend middleware/use-cases.
- Error context policy applied:
docs/observability/error-context-policy.md
Baseline Signals
1) Request Error Rate
- Source: Worker logs (
request.error,request.error.unhandled) - Target: detect backend regressions quickly
2) Payment/Webhook Reliability
- Source: domain events and error logs
- Monitor events:
domain.eventwitheventType=payment.createddomain.eventwitheventType=payment.status.transitioneddomain.eventwitheventType=license.activatedrequest.erroron/api/v1/payments/*
3) API Latency
- Source: request-end logs (
request.end) withdurationMs - Monitor p95 latency for:
/api/v1/payments/create/api/v1/payments/webhook/api/v1/download/*
Initial Alert Rules
Rule A: 5xx spike
- Condition: 5xx responses > 2% for 5 minutes
- Scope: all
/api/* - Severity: High
- Action: page/on-call notification
Rule B: Payment webhook failures
- Condition:
request.erroror 5xx for/api/v1/payments/webhook>= 3 in 5 minutes - Severity: High
- Action: investigate payment activation flow immediately
Rule C: Payment create degradation
- Condition: p95 latency on
/api/v1/payments/create> 2000ms for 10 minutes - Severity: Medium
- Action: inspect upstream provider health and network path
Rule D: Download route errors
- Condition: 4xx/5xx on
/api/v1/download/*> 5% for 10 minutes - Severity: Medium
- Action: inspect signed URL expiry/signature and storage availability
Triage Procedure
- Identify affected route and time window from alert.
- Filter logs by:
requestIdpathcode- domain IDs (
paymentId,customerId) when available
- Check nearest
domain.eventsequence for the same aggregate (aggregateId/paymentId). - Classify issue:
- Input/validation
- Provider/downstream dependency
- Internal code regression
- Apply mitigation:
- Rollback if release regression
- Retry/reconcile payment state if provider issue
- Create incident note with affected IDs
LGTM Handoff (Incremental)
- Logs -> Loki:
- Forward Cloudflare logs to centralized retention/search.
- Traces -> Tempo:
- Add sampled OTel spans for payment/webhook/download routes.
- Worker env:
OTEL_EXPORTER_OTLP_ENDPOINT(secret; OTLP HTTP endpoint)OTEL_SERVICE_NAME(default:photobooth-api)OTEL_TRACE_SAMPLE_RATE(0.0 - 1.0, default recommendation:0.2)
- Dashboards/alerts -> Grafana:
- Recreate baseline rules with richer querying and longer retention.
Ownership
- Primary: Backend/API owner
- Secondary: Ops/infra owner
- Review cadence: every 2 weeks while payment rollout is active