Skip to main content

Cloudflare Observability Baseline Runbook

Scope

Baseline observability for the Cloudflare Workers deployment before full OTel/LGTM rollout.

This runbook covers:

  • Cloudflare logs and analytics setup
  • Initial alert rules
  • Incident triage workflow
  • Handoff path to Loki/Tempo/Grafana

Prerequisites

  1. Worker deployed with observability.enabled = true in wrangler.jsonc.
  2. Structured JSON logging enabled in backend middleware/use-cases.
  3. Error context policy applied:
    • docs/observability/error-context-policy.md

Baseline Signals

1) Request Error Rate

  • Source: Worker logs (request.error, request.error.unhandled)
  • Target: detect backend regressions quickly

2) Payment/Webhook Reliability

  • Source: domain events and error logs
  • Monitor events:
    • domain.event with eventType=payment.created
    • domain.event with eventType=payment.status.transitioned
    • domain.event with eventType=license.activated
    • request.error on /api/v1/payments/*

3) API Latency

  • Source: request-end logs (request.end) with durationMs
  • Monitor p95 latency for:
    • /api/v1/payments/create
    • /api/v1/payments/webhook
    • /api/v1/download/*

Initial Alert Rules

Rule A: 5xx spike

  • Condition: 5xx responses > 2% for 5 minutes
  • Scope: all /api/*
  • Severity: High
  • Action: page/on-call notification

Rule B: Payment webhook failures

  • Condition: request.error or 5xx for /api/v1/payments/webhook >= 3 in 5 minutes
  • Severity: High
  • Action: investigate payment activation flow immediately

Rule C: Payment create degradation

  • Condition: p95 latency on /api/v1/payments/create > 2000ms for 10 minutes
  • Severity: Medium
  • Action: inspect upstream provider health and network path

Rule D: Download route errors

  • Condition: 4xx/5xx on /api/v1/download/* > 5% for 10 minutes
  • Severity: Medium
  • Action: inspect signed URL expiry/signature and storage availability

Triage Procedure

  1. Identify affected route and time window from alert.
  2. Filter logs by:
    • requestId
    • path
    • code
    • domain IDs (paymentId, customerId) when available
  3. Check nearest domain.event sequence for the same aggregate (aggregateId/paymentId).
  4. Classify issue:
    • Input/validation
    • Provider/downstream dependency
    • Internal code regression
  5. Apply mitigation:
    • Rollback if release regression
    • Retry/reconcile payment state if provider issue
    • Create incident note with affected IDs

LGTM Handoff (Incremental)

  1. Logs -> Loki:
    • Forward Cloudflare logs to centralized retention/search.
  2. Traces -> Tempo:
    • Add sampled OTel spans for payment/webhook/download routes.
    • Worker env:
      • OTEL_EXPORTER_OTLP_ENDPOINT (secret; OTLP HTTP endpoint)
      • OTEL_SERVICE_NAME (default: photobooth-api)
      • OTEL_TRACE_SAMPLE_RATE (0.0 - 1.0, default recommendation: 0.2)
  3. Dashboards/alerts -> Grafana:
    • Recreate baseline rules with richer querying and longer retention.

Ownership

  • Primary: Backend/API owner
  • Secondary: Ops/infra owner
  • Review cadence: every 2 weeks while payment rollout is active