Webhook Delivery Reliability: Retry Architectures That Never Lose an Event

Traffic Orchestrator Team
Engineering
April 19, 2026 14 min read 949 words

Webhooks are the backbone of modern event-driven architectures. When a license is activated, a payment succeeds, or a subscription changes, your system needs to notify downstream services reliably. But delivery over HTTP is inherently unreliable: networks drop packets, servers restart, and DNS lookups fail. A webhook delivery system that doesn't account for these failures will silently lose events.

The Reliability Spectrum

Guarantee     | What It Means                           | Implementation Complexity   | Use Case
At-most-once  | Fire and forget. Event may be lost.     | Low                         | Analytics, logging
At-least-once | Retry until acknowledged. May duplicate. | Medium                      | License events, payments
Exactly-once  | Delivered once, processed once.         | High (requires idempotency) | Financial transactions

Most webhook systems target at-least-once delivery: the sender retries until the recipient acknowledges the event, and recipients handle the resulting duplicates via idempotency keys.

Exponential Backoff with Jitter

The naive approach to retries, a fixed interval (e.g., retry every 60 seconds), creates thundering herd problems. When a recipient recovers from an outage, all queued retries hit it at the same moment and can cause another outage.

Exponential backoff with jitter solves this:

// Exponential backoff with full jitter
const calculateDelay = (attempt, baseDelay = 60, maxDelay = 86400) => {
  // Exponential: 60s, 120s, 240s, 480s, 960s, 1920s, 3840s, 7680s...
  const exponential = baseDelay * Math.pow(2, attempt - 1)

  // Cap at maxDelay (24 hours)
  const capped = Math.min(exponential, maxDelay)

  // Full jitter: random between 0 and capped delay
  // This spreads retries evenly across the window
  return Math.floor(Math.random() * capped)
}

// Retry schedule (approximate):
// Attempt 1: 0-60s after failure
// Attempt 2: 0-120s
// Attempt 3: 0-240s
// Attempt 4: 0-480s
// Attempt 5: 0-960s (16 min)
// Attempt 6: 0-1920s (32 min)
// Attempt 7: 0-3840s (64 min)
// Attempt 8: 0-7680s (2.1 hrs)
// After 8 failures: move to Dead Letter Queue
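
To see how long a failing endpoint stays in the retry loop, the delay windows in the schedule above can be summed. The sketch below is illustrative only: `delayCap` and `retryWindow` are hypothetical helper names, and it assumes the same defaults as `calculateDelay` (60-second base, 24-hour cap).

```javascript
// Upper bound of the jitter window for a given attempt number, in seconds.
// Mirrors the capped exponential inside calculateDelay (assumed defaults).
const delayCap = (attempt, baseDelay = 60, maxDelay = 86400) =>
  Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay)

// Total scheduled waiting time across `retries` retry delays.
const retryWindow = (retries, baseDelay = 60, maxDelay = 86400) => {
  let worstCaseSeconds = 0
  for (let attempt = 1; attempt <= retries; attempt++) {
    worstCaseSeconds += delayCap(attempt, baseDelay, maxDelay)
  }
  // Full jitter draws uniformly from [0, cap), so the mean delay is
  // roughly half of each cap.
  return { worstCaseSeconds, expectedSeconds: worstCaseSeconds / 2 }
}

console.log(retryWindow(8)) // { worstCaseSeconds: 15300, expectedSeconds: 7650 }
```

With these defaults, eight delay windows add up to at most 15,300 seconds (4.25 hours) of scheduled waiting, and about half that on average. An endpoint that stays down longer than that would see its events land in the dead letter queue, which is worth weighing when choosing `baseDelay` and the attempt limit.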

The Delivery Pipeline

A production-grade webhook delivery system has five stages:

  1. Event Ingestion: Business logic emits an event (e.g., "license.activated")
  2. Fanout: The event is duplicated for each registered webhook endpoint
  3. Delivery Attempt: HTTP POST to the endpoint with a signed payload
  4. Response Processing: 2xx = success; 4xx = permanent failure (except 429); 5xx, 429, timeouts, and network errors = retry
  5. Retry or DLQ: Failed deliveries are re-queued or moved to the dead letter queue

// Webhook delivery pipeline
const deliverWebhook = async (event, endpoint, db) => {
  const delivery = {
    id: crypto.randomUUID(),
    eventId: event.id,
    endpointUrl: endpoint.url,
    attempt: 1,
    maxAttempts: 8,
    status: 'pending',
    createdAt: Date.now()
  }

  // Store delivery record (idempotency + audit trail)
  await db.prepare(
    'INSERT INTO webhook_deliveries (id, event_id, endpoint_url, attempt, status, created_at) VALUES (?, ?, ?, ?, ?, ?)'
  ).bind(delivery.id, delivery.eventId, delivery.endpointUrl, delivery.attempt, delivery.status, delivery.createdAt).run()

  // Attempt delivery
  return attemptDelivery(delivery, event, endpoint, db)
}

const attemptDelivery = async (delivery, event, endpoint, db) => {
  const payload = JSON.stringify({
    id: event.id,
    type: event.type,
    data: event.data,
    timestamp: event.timestamp,
    deliveryId: delivery.id,
    attempt: delivery.attempt
  })

  // Sign the payload
  const signature = await signPayload(payload, endpoint.secret)

  try {
    const response = await fetch(endpoint.url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-Signature': signature,
        'X-Webhook-ID': delivery.id,
        'X-Webhook-Timestamp': String(Date.now()),
        'User-Agent': 'WebhookDelivery/1.0'
      },
      body: payload,
      signal: AbortSignal.timeout(30000) // 30s timeout
    })

    if (response.ok) {
      await updateDelivery(db, delivery.id, 'delivered', response.status)
      return { success: true }
    }

    // 4xx = client error, don't retry (except 429)
    if (response.status >= 400 && response.status < 500 && response.status !== 429) {
      await updateDelivery(db, delivery.id, 'failed_permanent', response.status)
      return { success: false, permanent: true }
    }

    // 5xx or 429 = retry
    return scheduleRetry(delivery, event, endpoint, db, response.status)
  } catch (error) {
    // Network error, timeout, DNS failure = retry
    return scheduleRetry(delivery, event, endpoint, db, 0)
  }
}
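
The pipeline above calls `scheduleRetry`, which is not shown here. A minimal sketch of what it might do follows, assuming a delivery moves to the DLQ once `attempt` reaches `maxAttempts`. The `createScheduleRetry` factory and the `enqueue(job, delaySeconds)` function are assumptions made for illustration, and `calculateDelay` and `moveToDLQ` are passed in so the snippet runs on its own.

```javascript
// Hypothetical factory that builds a scheduleRetry function with the same
// call signature used by attemptDelivery. All three dependencies are
// injected; `enqueue` stands in for whatever delayed queue is available.
const createScheduleRetry = ({ calculateDelay, moveToDLQ, enqueue }) =>
  async (delivery, event, endpoint, db, lastStatus) => {
    // Status 0 is what attemptDelivery passes for network errors and timeouts.
    const lastError = lastStatus === 0
      ? 'network error or timeout'
      : `HTTP ${lastStatus}`

    // Out of attempts: hand the event to the dead letter queue.
    if (delivery.attempt >= delivery.maxAttempts) {
      await moveToDLQ(delivery, event, endpoint, db, lastError)
      return { success: false, deadLettered: true }
    }

    // Otherwise compute a jittered delay and queue the next attempt.
    // The delivery ID is kept so recipients can deduplicate retries.
    const delaySeconds = calculateDelay(delivery.attempt)
    const nextDelivery = { ...delivery, attempt: delivery.attempt + 1 }
    await enqueue({ delivery: nextDelivery, event, endpoint }, delaySeconds)

    return { success: false, retryInSeconds: delaySeconds }
  }

// Possible wiring (the queue object is an assumption):
// const scheduleRetry = createScheduleRetry({
//   calculateDelay,
//   moveToDLQ,
//   enqueue: (job, delaySeconds) => retryQueue.send(job, { delaySeconds })
// })
```

Under this interpretation a delivery with `maxAttempts: 8` makes one initial attempt plus up to seven retries, so the last delay window in the schedule is never used. A worker consuming the queue would call `attemptDelivery` with the queued delivery, and would also want to persist the new attempt number.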

Dead Letter Queues

After exhausting all retry attempts, events move to a Dead Letter Queue (DLQ). The DLQ serves three purposes:

  • Data preservation: Events are never lost, even after all retries fail
  • Manual replay: Operators can inspect and manually re-deliver DLQ events
  • Alerting: DLQ depth triggers alerts to the operations team

// Dead Letter Queue management
const moveToDLQ = async (delivery, event, endpoint, db, lastError) => {
  await db.prepare(`
    INSERT INTO webhook_dlq (delivery_id, event_id, event_type, event_data,
      endpoint_url, attempts, last_error, created_at)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
  `).bind(
    delivery.id, event.id, event.type, JSON.stringify(event.data),
    endpoint.url, delivery.attempt, lastError, Date.now()
  ).run()

  // Update delivery status
  await updateDelivery(db, delivery.id, 'dead_letter', 0)

  // Alert operations
  await sendAlert({
    severity: 'warning',
    title: 'Webhook delivery failed permanently',
    details: `Event ${event.id} (${event.type}) failed after ${delivery.attempt} attempts to ${endpoint.url}`
  })
}

// Manual replay from DLQ
const replayDLQ = async (deliveryId, db) => {
  const dlqEntry = await db.prepare(
    'SELECT * FROM webhook_dlq WHERE delivery_id = ?'
  ).bind(deliveryId).first()

  if (!dlqEntry) throw new Error('DLQ entry not found')

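  // Rebuild the event from the DLQ row. deliverWebhook() creates a fresh
  // delivery record, so the attempt counter restarts at 1.
  const event = {
    id: dlqEntry.event_id,
    type: dlqEntry.event_type,
    data: JSON.parse(dlqEntry.event_data),
    timestamp: Date.now() // New timestamp for replay
  }

  // Load the endpoint's signing secret; the DLQ row stores only the URL.
  // ASSUMPTION: a webhook_endpoints table with url and secret columns,
  // which is not shown in this article.
  const endpoint = await db.prepare(
    'SELECT url, secret FROM webhook_endpoints WHERE url = ?'
  ).bind(dlqEntry.endpoint_url).first()

  if (!endpoint) throw new Error('Endpoint not found for DLQ entry')

  // Re-deliver first, then remove the DLQ row, so a crash between the
  // two steps leaves a duplicate rather than a lost event
  const result = await deliverWebhook(event, endpoint, db)

  await db.prepare('DELETE FROM webhook_dlq WHERE delivery_id = ?')
    .bind(deliveryId).run()

  return result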
}

Idempotency: Making Duplicates Safe

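At-least-once delivery makes duplicates possible: a retry can arrive after the original request was processed but before its acknowledgment reached the sender. Recipients must handle them gracefully using idempotency keys: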

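// Recipient-side idempotency (db is the recipient's own database handle)
const processWebhook = async (request, db) => {
  const webhookId = request.headers.get('X-Webhook-ID')
  if (!webhookId) return new Response('Missing X-Webhook-ID', { status: 400 })
  const payload = await request.json()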

  // Check if we've already processed this delivery
  const exists = await db.prepare(
    'SELECT 1 FROM processed_webhooks WHERE webhook_id = ?'
  ).bind(webhookId).first()

  if (exists) {
    // Already processed — return 200 to stop retries
    return new Response('OK (duplicate)', { status: 200 })
  }

  // Process the event
  await handleEvent(payload)

  // Record as processed (processed_at lets a cleanup job expire old rows)
  await db.prepare(
    'INSERT INTO processed_webhooks (webhook_id, processed_at) VALUES (?, ?)'
  ).bind(webhookId, Date.now()).run()

  return new Response('OK', { status: 200 })
}
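
The check-then-insert sequence above leaves a window in which two concurrent copies of the same delivery can both pass the check and both be processed. One possible way to close it, sketched here under assumptions, is to claim the ID before processing and release the claim if the handler fails. `createMemoryStore` and `processOnce` are hypothetical names, and the in-memory `Map` stands in for the `processed_webhooks` table; in SQL the claim would typically be an insert guarded by a unique constraint on `webhook_id`.

```javascript
// In-memory stand-in for the processed_webhooks table (illustration only).
const createMemoryStore = () => {
  const state = new Map() // webhookId -> 'processing' | 'done'
  return {
    // Returns 'new' for the first caller, otherwise the current state.
    begin: async (id) => {
      const current = state.get(id)
      if (current) return current
      state.set(id, 'processing')
      return 'new'
    },
    finish: async (id) => { state.set(id, 'done') },
    abort: async (id) => { state.delete(id) }
  }
}

const processOnce = async (webhookId, payload, store, handleEvent) => {
  const claim = await store.begin(webhookId)

  // Already processed: acknowledge so the sender stops retrying.
  if (claim === 'done') return { status: 200, body: 'OK (duplicate)' }

  // Another request is still handling this ID. A 5xx asks the sender to
  // retry later instead of treating the event as delivered.
  if (claim === 'processing') return { status: 503, body: 'In progress' }

  try {
    await handleEvent(payload)
    await store.finish(webhookId)
    return { status: 200, body: 'OK' }
  } catch (error) {
    // Release the claim so a later retry can be processed.
    await store.abort(webhookId)
    return { status: 500, body: 'Handler failed' }
  }
}
```

The 503 for in-flight duplicates is a design choice rather than a requirement. It relies on the sender retrying 5xx responses, as the delivery pipeline in this article does.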

Monitoring Webhook Health

Key metrics to track for webhook delivery systems:

Metric                     | Healthy Threshold | Alert Threshold
First-attempt success rate | >98%              | <95%
Retry success rate         | >99.5%            | <98%
DLQ depth                  | 0                 | >10
P95 delivery latency       | <5 seconds        | >30 seconds
Average attempts per event | <1.1              | >1.5
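
As a rough illustration of how some of these thresholds could be checked, the sketch below computes three of the metrics from a batch of delivery records and flags the ones past their alert threshold. `checkWebhookHealth` and `ALERT_THRESHOLDS` are hypothetical names, and the record shape (`status` plus the final `attempt` number) is assumed from the delivery record used earlier. Retry success rate and P95 latency are left out because they need data the records above do not carry.

```javascript
// Alert thresholds taken from the table above.
const ALERT_THRESHOLDS = {
  firstAttemptSuccessRate: 0.95, // alert when below
  averageAttempts: 1.5,          // alert when above
  dlqDepth: 10                   // alert when above
}

// deliveries: array of { status, attempt } where attempt is the final
// attempt number reached by that delivery (assumed shape).
const checkWebhookHealth = (deliveries, dlqDepth) => {
  const total = deliveries.length
  const firstTry = deliveries.filter(
    (d) => d.status === 'delivered' && d.attempt === 1
  ).length
  const attempts = deliveries.reduce((sum, d) => sum + d.attempt, 0)

  const metrics = {
    firstAttemptSuccessRate: total ? firstTry / total : 1,
    averageAttempts: total ? attempts / total : 0,
    dlqDepth
  }

  const alerts = []
  if (metrics.firstAttemptSuccessRate < ALERT_THRESHOLDS.firstAttemptSuccessRate) {
    alerts.push('firstAttemptSuccessRate')
  }
  if (metrics.averageAttempts > ALERT_THRESHOLDS.averageAttempts) {
    alerts.push('averageAttempts')
  }
  if (metrics.dlqDepth > ALERT_THRESHOLDS.dlqDepth) {
    alerts.push('dlqDepth')
  }
  return { metrics, alerts }
}
```

For example, a batch of ten deliveries in which nine succeed on the first attempt and one succeeds on the third gives a first-attempt success rate of 90% and an average of 1.2 attempts, so only the first metric would alert.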

A well-architected webhook delivery system with exponential backoff, dead letter queues, and idempotency keys can reach 99.99% delivery reliability, meaning at most 1 undelivered event per 10,000, and any event that does exhaust its retries stays in the DLQ for replay. Combined with HMAC-SHA256 payload signing and TLS encryption, it becomes a trustworthy foundation for event-driven license management, payment processing, and real-time integrations.

The engineering team behind Traffic Orchestrator, building enterprise-grade software licensing infrastructure used by developers worldwide.
