Webhooks are the backbone of modern event-driven architectures. When a license is activated, a payment succeeds, or a subscription changes, your system needs to notify downstream services reliably. But HTTP is inherently unreliable — networks drop packets, servers restart, and DNS fails. A webhook delivery system that doesn't account for these failures will silently lose events.
The Reliability Spectrum
| Guarantee | What It Means | Implementation Complexity | Use Case |
|---|---|---|---|
| At-most-once | Fire and forget. Event may be lost. | Low | Analytics, logging |
| At-least-once | Retry until acknowledged. May duplicate. | Medium | License events, payments |
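| Exactly-once (effectively) | Processed once; built from at-least-once delivery plus deduplication, since true exactly-once delivery is impossible over an unreliable network. | High (requires idempotency) | Financial transactions |
Most webhook systems target at-least-once delivery. The sender keeps retrying until the recipient acknowledges the event (or the retry budget runs out and the event goes to a dead letter queue), and recipients handle the resulting duplicates via idempotency keys.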
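The relationship between the last two rows of the table can be shown with a toy, in-memory model. There is no real network or database here, and `receive`, `send`, and `deliverAtLeastOnce` are illustrative names. The simulated network drops the first acknowledgment even though the recipient already applied the event, so the sender retries; deduplicating on the event ID keeps the side effect from being applied twice.

```javascript
// Toy model: at-least-once delivery + idempotent processing = effectively-once processing.
const processed = new Set()
let balance = 0

// Recipient: apply the side effect only the first time an event ID is seen
const receive = (event) => {
  if (!processed.has(event.id)) {
    processed.add(event.id)
    balance += event.amount
  }
  return 'ack'
}

// Simulated network that loses the first acknowledgment
// (the recipient has already processed the event when this happens)
let dropNextAck = true
const send = (event) => {
  const ack = receive(event)
  if (dropNextAck) {
    dropNextAck = false
    return null // sender sees a timeout
  }
  return ack
}

// At-least-once sender: keep retrying until acknowledged or out of attempts
const deliverAtLeastOnce = (event, maxAttempts = 8) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (send(event) === 'ack') return attempt
  }
  return null // attempts exhausted: hand off to a dead letter queue
}

const attempts = deliverAtLeastOnce({ id: 'evt_1', amount: 100 })
console.log(attempts, balance) // 2 100
```

The event is delivered twice but applied once, which is what "exactly-once" means in practice for webhooks.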
Exponential Backoff with Jitter
The naive approach to retries, a fixed interval (e.g., retry every 60 seconds), creates a thundering herd problem. Deliveries that failed together during an outage stay synchronized, so when the recipient recovers, all of the queued retries hit it at the same moment and can cause another outage.
Exponential backoff with jitter solves this:
```javascript
// Exponential backoff with full jitter (delays in seconds)
const calculateDelay = (attempt, baseDelay = 60, maxDelay = 86400) => {
  // Exponential: 60s, 120s, 240s, 480s, 960s, 1920s, 3840s, 7680s...
  const exponential = baseDelay * Math.pow(2, attempt - 1)
  // Cap at maxDelay (24 hours); with the defaults the cap first applies at attempt 12
  const capped = Math.min(exponential, maxDelay)
  // Full jitter: random between 0 and capped delay
  // This spreads retries evenly across the window
  return Math.floor(Math.random() * capped)
}

// Retry schedule with maxAttempts = 8 (delay before the next attempt):
// After attempt 1 fails: 0-60s
// After attempt 2 fails: 0-120s (2 min)
// After attempt 3 fails: 0-240s (4 min)
// After attempt 4 fails: 0-480s (8 min)
// After attempt 5 fails: 0-960s (16 min)
// After attempt 6 fails: 0-1920s (32 min)
// After attempt 7 fails: 0-3840s (64 min)
// After attempt 8 fails: move to Dead Letter Queue
// Worst case, the last attempt happens about 2.1 hours (7620s) after the first.
```
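To make the thundering-herd argument concrete, the following sketch simulates 1,000 deliveries that all fail at the same moment during an outage and counts how many retries land in the busiest one-second window. The fixed-interval case is the naive 60-second policy; the jittered case reuses `calculateDelay` (copied so the block runs on its own). The scenario and the 1,000 figure are illustrative assumptions, not measurements.

```javascript
// Compare retry load on a recovering endpoint: fixed interval vs. full jitter.
// Assumption: 1,000 deliveries fail at the same instant (t = 0) during an outage.
const calculateDelay = (attempt, baseDelay = 60, maxDelay = 86400) => {
  const capped = Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay)
  return Math.floor(Math.random() * capped)
}

// Largest number of retries landing in any single one-second bucket
// (delays are whole seconds, so the delay value itself is the bucket)
const peakPerSecond = (delays) => {
  const buckets = new Map()
  for (const d of delays) buckets.set(d, (buckets.get(d) || 0) + 1)
  return Math.max(...buckets.values())
}

const failedDeliveries = 1000
const fixedDelays = Array.from({ length: failedDeliveries }, () => 60)
const jitteredDelays = Array.from({ length: failedDeliveries }, () => calculateDelay(1))

console.log('fixed interval peak/sec:', peakPerSecond(fixedDelays)) // 1000: every retry at t = 60s
console.log('full jitter peak/sec:', peakPerSecond(jitteredDelays)) // varies per run
```

With a fixed interval every retry lands in the same second. With full jitter the same retries are spread across the 60-second window, so the peak typically drops by well over an order of magnitude.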
The Delivery Pipeline
A production-grade webhook delivery system has five stages:
- Event Ingestion — Business logic emits an event (e.g., "license.activated")
- Fanout — The event is duplicated for each registered webhook endpoint
- Delivery Attempt — HTTP POST to the endpoint with signed payload
- Response Processing — 2xx = success; 4xx = permanent failure, except 429 (rate limited), which is retried; 5xx, timeouts, and network errors = retry
- Retry or DLQ — Failed deliveries are re-queued or moved to the dead letter queue
```javascript
// Webhook delivery pipeline (db is a D1-style database binding).
// signPayload, updateDelivery and scheduleRetry are helpers not shown here.
const deliverWebhook = async (event, endpoint, db) => {
  const delivery = {
    id: crypto.randomUUID(),
    eventId: event.id,
    endpointUrl: endpoint.url,
    attempt: 1,
    maxAttempts: 8,
    status: 'pending',
    createdAt: Date.now()
  }
  // Store delivery record (idempotency + audit trail)
  await db.prepare(
    'INSERT INTO webhook_deliveries (id, event_id, endpoint_url, attempt, status, created_at) VALUES (?, ?, ?, ?, ?, ?)'
  ).bind(delivery.id, delivery.eventId, delivery.endpointUrl, delivery.attempt, delivery.status, delivery.createdAt).run()
  // Attempt delivery
  return attemptDelivery(delivery, event, endpoint, db)
}

const attemptDelivery = async (delivery, event, endpoint, db) => {
  const payload = JSON.stringify({
    id: event.id,
    type: event.type,
    data: event.data,
    timestamp: event.timestamp,
    deliveryId: delivery.id,
    attempt: delivery.attempt
  })
  // Sign the exact string that will be sent as the body
  const signature = await signPayload(payload, endpoint.secret)

  let response
  try {
    response = await fetch(endpoint.url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-Signature': signature,
        'X-Webhook-ID': delivery.id,
        'X-Webhook-Timestamp': String(Date.now()),
        'User-Agent': 'WebhookDelivery/1.0'
      },
      body: payload,
      signal: AbortSignal.timeout(30000) // 30s timeout
    })
  } catch (error) {
    // Network error, timeout, DNS failure = retry.
    // Only fetch() is inside the try, so a failed DB write after a
    // successful delivery cannot trigger a duplicate retry.
    return scheduleRetry(delivery, event, endpoint, db, 0)
  }

  if (response.ok) {
    await updateDelivery(db, delivery.id, 'delivered', response.status)
    return { success: true }
  }
  // 4xx = client error, don't retry (except 429 Too Many Requests)
  if (response.status >= 400 && response.status < 500 && response.status !== 429) {
    await updateDelivery(db, delivery.id, 'failed_permanent', response.status)
    return { success: false, permanent: true }
  }
  // 5xx or 429 = retry
  return scheduleRetry(delivery, event, endpoint, db, response.status)
}
```
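`attemptDelivery` hands failures to `scheduleRetry`, which is not shown. The sketch below is one possible shape for it, assuming a queue that supports delayed messages, represented by a hypothetical `enqueue(message, delaySeconds)` function, plus the `calculateDelay`, `moveToDLQ`, and `updateDelivery` helpers from this article. The `makeScheduleRetry` factory and the `'retrying'` status are illustrative; injecting the dependencies keeps the retry-or-DLQ decision testable without a database or queue.

```javascript
// Sketch of scheduleRetry (not from the article). enqueue(message, delaySeconds)
// stands in for a hypothetical delayed-queue API; the other helpers are injected.
const makeScheduleRetry = ({ enqueue, calculateDelay, moveToDLQ, updateDelivery }) =>
  async (delivery, event, endpoint, db, lastStatus) => {
    // lastStatus is the HTTP status of the failed attempt, or 0 for a network error/timeout
    const lastError = lastStatus === 0 ? 'network error or timeout' : `HTTP ${lastStatus}`

    // Attempt budget exhausted: park the event in the DLQ instead of retrying
    if (delivery.attempt >= delivery.maxAttempts) {
      await moveToDLQ(delivery, event, endpoint, db, lastError)
      return { success: false, deadLettered: true }
    }

    // The delay depends on the attempt that just failed (1 -> 0-60s, 2 -> 0-120s, ...)
    const delaySeconds = calculateDelay(delivery.attempt)
    await updateDelivery(db, delivery.id, 'retrying', lastStatus)

    // Enqueue IDs rather than the endpoint object so the signing secret never sits in the queue
    const next = { ...delivery, attempt: delivery.attempt + 1 }
    await enqueue({ delivery: next, eventId: event.id, endpointUrl: endpoint.url }, delaySeconds)
    return { success: false, retryInSeconds: delaySeconds }
  }
```

A queue consumer would then reload the event and endpoint (including its secret) and call `attemptDelivery` with the incremented attempt number.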
Dead Letter Queues
After exhausting all retry attempts, events move to a Dead Letter Queue (DLQ). The DLQ serves three purposes:
- Data preservation — Events are never lost, even after all retries fail
- Manual replay — Operators can inspect and manually re-deliver DLQ events
- Alerting — DLQ depth triggers alerts to the operations team
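The alerting purpose refers to DLQ depth, while `moveToDLQ` below alerts on each individual event. A minimal sketch of a depth check that a scheduled job could run, assuming the same D1-style `db` binding and a `sendAlert` helper passed in as a parameter; `checkDLQDepth`, the `'critical'` severity, and the default threshold (taken from the monitoring table's alert value of more than 10) are illustrative choices.

```javascript
// Sketch: periodic DLQ depth check, e.g. run from a scheduled job.
// db is a D1-style binding; sendAlert is a hypothetical alert helper.
const checkDLQDepth = async (db, sendAlert, threshold = 10) => {
  const row = await db.prepare('SELECT COUNT(*) AS depth FROM webhook_dlq').first()
  const depth = row ? row.depth : 0
  if (depth > threshold) {
    await sendAlert({
      severity: 'critical', // assumed level: a backlog is worse than a single failure
      title: 'Webhook DLQ depth above threshold',
      details: `${depth} events in webhook_dlq (threshold ${threshold})`
    })
  }
  return depth
}
```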
```javascript
// Dead Letter Queue management
const moveToDLQ = async (delivery, event, endpoint, db, lastError) => {
  await db.prepare(`
    INSERT INTO webhook_dlq (delivery_id, event_id, event_type, event_data,
      endpoint_url, attempts, last_error, created_at)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
  `).bind(
    delivery.id, event.id, event.type, JSON.stringify(event.data),
    endpoint.url, delivery.attempt, lastError, Date.now()
  ).run()
  // Update delivery status
  await updateDelivery(db, delivery.id, 'dead_letter', 0)
  // Alert operations
  await sendAlert({
    severity: 'warning',
    title: 'Webhook delivery failed permanently',
    details: `Event ${event.id} (${event.type}) failed after ${delivery.attempt} attempts to ${endpoint.url}`
  })
}

// Manual replay from DLQ
const replayDLQ = async (deliveryId, db) => {
  const dlqEntry = await db.prepare(
    'SELECT * FROM webhook_dlq WHERE delivery_id = ?'
  ).bind(deliveryId).first()
  if (!dlqEntry) throw new Error('DLQ entry not found')

  // The DLQ row stores only the URL, but signing needs the endpoint secret.
  // Assumes a webhook_endpoints table keyed by url (hypothetical schema).
  const endpoint = await db.prepare(
    'SELECT url, secret FROM webhook_endpoints WHERE url = ?'
  ).bind(dlqEntry.endpoint_url).first()
  if (!endpoint) throw new Error('Endpoint no longer registered')

  // Re-create the event; webhook_dlq does not store the original timestamp,
  // so the replay carries the replay time instead
  const event = {
    id: dlqEntry.event_id,
    type: dlqEntry.event_type,
    data: JSON.parse(dlqEntry.event_data),
    timestamp: Date.now()
  }

  // Re-deliver first (new delivery ID, attempts reset), then remove the DLQ
  // entry, so a failure here never deletes the only copy of the event.
  // The new X-Webhook-ID differs from the original; recipients that dedupe
  // on the event ID (payload.id) will still recognize a replayed event.
  const result = await deliverWebhook(event, endpoint, db)
  await db.prepare('DELETE FROM webhook_dlq WHERE delivery_id = ?')
    .bind(deliveryId).run()
  return result
}
```
Idempotency: Making Duplicates Safe
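At-least-once delivery makes duplicates unavoidable: if the recipient processes an event but its 200 response is lost or times out, the sender retries and the event arrives again. Recipients must therefore handle duplicates gracefully using idempotency keys: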
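```javascript
// Recipient-side idempotency (db is a D1-style binding; handleEvent is your business logic)
const processWebhook = async (request, db) => {
  const webhookId = request.headers.get('X-Webhook-ID')
  if (!webhookId) {
    return new Response('Missing X-Webhook-ID', { status: 400 })
  }
  const payload = await request.json()

  // Atomically claim this delivery ID. Requires a PRIMARY KEY or UNIQUE
  // constraint on webhook_id; a separate SELECT-then-INSERT would let two
  // concurrent duplicates both pass the check.
  const claim = await db.prepare(
    'INSERT OR IGNORE INTO processed_webhooks (webhook_id, processed_at) VALUES (?, ?)'
  ).bind(webhookId, Date.now()).run()
  if (claim.meta.changes === 0) {
    // Already processed (or in progress): return 200 to stop retries
    return new Response('OK (duplicate)', { status: 200 })
  }

  try {
    await handleEvent(payload)
  } catch (error) {
    // Release the claim so the sender's next retry can process the event
    await db.prepare('DELETE FROM processed_webhooks WHERE webhook_id = ?')
      .bind(webhookId).run()
    return new Response('Processing failed', { status: 500 })
  }
  // processed_at lets a periodic job purge rows older than the sender's retry window
  return new Response('OK', { status: 200 })
}
```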
Monitoring Webhook Health
Key metrics to track for webhook delivery systems:
| Metric | Healthy Threshold | Alert Threshold |
|---|---|---|
| First-attempt success rate | >98% | <95% |
| Retry success rate | >99.5% | <98% |
| DLQ depth | 0 | >10 |
| P95 delivery latency | <5 seconds | >30 seconds |
| Average attempts per event | <1.1 | >1.5 |
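A sketch of how these metrics might be computed over a window of delivery records, assuming each record carries `attempts`, `status`, and `latencyMs` (time from event creation to successful delivery; a hypothetical field, since `webhook_deliveries` above does not store it). "Retry success rate" is interpreted here as the share of deliveries that eventually succeed, and P95 uses the nearest-rank method; `webhookHealth` is an illustrative name.

```javascript
// Sketch: compute the health metrics above and list which alert thresholds are breached.
// Hypothetical record shape: { attempts, status, latencyMs }, with status values
// as written by updateDelivery ('delivered', 'failed_permanent', 'dead_letter', ...).
const webhookHealth = (records, dlqDepth) => {
  if (records.length === 0) return { dlqDepth, alerts: dlqDepth > 10 ? ['dlqDepth'] : [] }
  const total = records.length
  const delivered = records.filter((r) => r.status === 'delivered')
  const firstAttemptOk = delivered.filter((r) => r.attempts === 1).length

  // P95 latency (nearest-rank) over successful deliveries
  const latencies = delivered.map((r) => r.latencyMs).sort((a, b) => a - b)
  const p95LatencyMs = latencies.length ? latencies[Math.ceil(0.95 * latencies.length) - 1] : 0

  const metrics = {
    firstAttemptSuccessRate: firstAttemptOk / total,
    retrySuccessRate: delivered.length / total, // share that eventually succeeded
    dlqDepth,
    p95LatencyMs,
    avgAttempts: records.reduce((sum, r) => sum + r.attempts, 0) / total
  }

  // Alert thresholds from the table above
  const alerts = []
  if (metrics.firstAttemptSuccessRate < 0.95) alerts.push('firstAttemptSuccessRate')
  if (metrics.retrySuccessRate < 0.98) alerts.push('retrySuccessRate')
  if (metrics.dlqDepth > 10) alerts.push('dlqDepth')
  if (metrics.p95LatencyMs > 30000) alerts.push('p95LatencyMs')
  if (metrics.avgAttempts > 1.5) alerts.push('avgAttempts')
  return { ...metrics, alerts }
}
```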
A well-architected webhook delivery system with exponential backoff, dead letter queues, and idempotency keys can target 99.99% delivery reliability, meaning fewer than 1 event in 10,000 exhausts its retries; those events land in the DLQ for inspection and replay rather than being silently dropped. Combined with HMAC-SHA256 payload signing and TLS encryption, it becomes a trustworthy foundation for event-driven license management, payment processing, and real-time integrations.