Scaling Webhook Delivery at Buildkite
Webhooks power Buildkite’s integrations, notifying external systems of builds, jobs, and artifacts. Here’s how we scaled our delivery system to handle millions of events daily.
The Challenge
Our original webhook system was straightforward: when an event occurred, we’d immediately attempt to deliver the webhook. This worked fine at small scale, but as our customer base grew, we encountered several problems:
- Thundering herd: Popular repositories could generate thousands of events in seconds
- Slow receivers: Some endpoints took seconds to respond, blocking delivery workers
- Retry storms: Failed deliveries would pile up during outages
- No backpressure: We had no way to slow down during high load
The Solution: Event-Driven Architecture
We redesigned the system around an event-driven architecture with several key components:
1. Event Queue with Partitioning
Instead of delivering webhooks synchronously, we now publish events to a partitioned queue:
```go
type WebhookEvent struct {
    ID             string    `json:"id"`
    OrganizationID string    `json:"organization_id"`
    Type           string    `json:"type"`
    Payload        []byte    `json:"payload"`
    CreatedAt      time.Time `json:"created_at"`
}

func (p *Publisher) Publish(ctx context.Context, event WebhookEvent) error {
    // Partition by organization to maintain ordering guarantees
    partition := hash(event.OrganizationID) % p.numPartitions
    return p.queue.Publish(ctx, partition, event)
}
```

Partitioning by organization ensures that webhooks for the same organization are processed in order, while allowing parallel processing across organizations.
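To make the ordering property concrete, here’s a minimal sketch of a per-partition consumer. The `Queue` interface, `consumePartition`, and `deliver` callback below are hypothetical stand-ins rather than our actual internals; the point is simply that each partition is drained by a single goroutine, so events for one organization never overtake each other, while separate partitions are processed in parallel.

```go
// Queue is a hypothetical consumer-side view of the partitioned queue.
type Queue interface {
    // Consume blocks until the next event on the partition is available.
    Consume(ctx context.Context, partition int) (WebhookEvent, error)
}

// consumePartition drains one partition serially, preserving per-organization
// ordering. Run one goroutine per partition to process partitions in parallel.
func consumePartition(ctx context.Context, q Queue, partition int, deliver func(context.Context, WebhookEvent) error) error {
    for {
        event, err := q.Consume(ctx, partition)
        if err != nil {
            return err // context cancelled or queue shut down
        }
        // Delivery errors are handled downstream (retries, rate limiting,
        // circuit breaking); a failing endpoint must not stall the partition.
        _ = deliver(ctx, event)
    }
}
```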
2. Adaptive Rate Limiting
We implemented per-endpoint rate limiting that adapts based on the receiver’s behavior:
```go
type AdaptiveRateLimiter struct {
    mu           sync.Mutex // guards currentRate; the counters use atomics
    baseRate     float64
    currentRate  float64
    successCount int64
    failureCount int64
}

func (r *AdaptiveRateLimiter) OnSuccess() {
    atomic.AddInt64(&r.successCount, 1)
    r.mu.Lock()
    defer r.mu.Unlock()
    // Gradually increase rate on success, capped at twice the base rate
    r.currentRate = min(r.currentRate*1.1, r.baseRate*2)
}

func (r *AdaptiveRateLimiter) OnFailure() {
    atomic.AddInt64(&r.failureCount, 1)
    r.mu.Lock()
    defer r.mu.Unlock()
    // Quickly back off on failure, but never below 10% of the base rate
    r.currentRate = max(r.currentRate*0.5, r.baseRate*0.1)
}
```
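The snippet above only shows how the rate adapts; the delivery worker still has to respect `currentRate` between sends. A minimal way to do that, assuming a hypothetical `Wait` method rather than anything lifted from our codebase, is to pause roughly `1/currentRate` seconds before each delivery:

```go
// Wait blocks until the limiter permits the next delivery to this endpoint.
// A production limiter would use a token bucket (for example
// golang.org/x/time/rate); sleeping for 1/currentRate seconds between sends
// is the simplest possible approximation.
func (r *AdaptiveRateLimiter) Wait(ctx context.Context) error {
    r.mu.Lock()
    rate := r.currentRate
    r.mu.Unlock()

    interval := time.Duration(float64(time.Second) / rate)
    select {
    case <-time.After(interval):
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}
```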
3. Circuit Breakers
For endpoints that are completely down, we use circuit breakers to avoid wasting resources:
```go
type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    state        State
    failures     int
    threshold    int
    resetTimeout time.Duration
    lastFailure  time.Time
}

func (cb *CircuitBreaker) Allow() bool {
    switch cb.state {
    case StateClosed:
        return true
    case StateOpen:
        // After the reset timeout, let a probe request through
        if time.Since(cb.lastFailure) > cb.resetTimeout {
            cb.state = StateHalfOpen
            return true
        }
        return false
    case StateHalfOpen:
        return true
    }
    return false
}
```
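The snippet above omits how the breaker trips and recovers. A minimal version, sketched below rather than lifted from our code (and, like the snippet above, ignoring the synchronization a concurrent breaker would need), records each delivery outcome and opens the circuit once failures cross the threshold:

```go
// RecordSuccess closes the breaker after a successful delivery, for example
// when a half-open probe gets through.
func (cb *CircuitBreaker) RecordSuccess() {
    cb.failures = 0
    cb.state = StateClosed
}

// RecordFailure counts a failed delivery and opens the breaker once the
// failure threshold is reached, or immediately if a half-open probe fails.
func (cb *CircuitBreaker) RecordFailure() {
    cb.failures++
    cb.lastFailure = time.Now()
    if cb.failures >= cb.threshold || cb.state == StateHalfOpen {
        cb.state = StateOpen
    }
}
```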
Results
After deploying the new system, we saw dramatic improvements:
| Metric | Before | After |
|---|---|---|
| P99 Delivery Latency | 12s | 200ms |
| Failed Deliveries | 2.3% | 0.1% |
| Resource Usage | High variance | Stable |
Lessons Learned
- Design for failure: Assume receivers will be slow or unavailable
- Partition wisely: Choose partition keys that balance load while maintaining necessary ordering
- Monitor everything: We added detailed metrics for every stage of the pipeline
- Backpressure is essential: Systems without backpressure will eventually fail (see the sketch below)
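To illustrate that last point (and only to illustrate it; the channel-based buffer below is not how our pipeline is built), backpressure can be as simple as publishing into a bounded buffer and making the producer wait, or fail fast, when consumers can’t keep up:

```go
// boundedPublish applies backpressure with a bounded buffer: when the buffer
// is full, the producer blocks until a consumer frees space or the context
// expires, instead of piling up unbounded work in memory.
func boundedPublish(ctx context.Context, buf chan WebhookEvent, event WebhookEvent) error {
    select {
    case buf <- event:
        return nil
    case <-ctx.Done():
        return fmt.Errorf("publish aborted (buffer full or shutting down): %w", ctx.Err())
    }
}
```

Here `buf` would be created with a fixed capacity (for example `make(chan WebhookEvent, 1024)`), and the context deadline decides whether slow consumers make producers wait or shed load.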
What’s Next
We’re now working on:
- Webhook replay: Allowing customers to replay failed webhooks
- Filtering: Let customers subscribe to specific event types
- Batching: Combine multiple events into single deliveries for high-volume customers
Interested in solving challenges like this? We’re hiring! Check out our open positions.