Scaling Webhook Delivery at Buildkite

Webhooks power Buildkite's integrations, notifying external systems of builds, jobs, and artifacts. Here's how we scaled our delivery system to handle millions of events daily.

Jane Smith · 3 min read

The Challenge

Our original webhook system was straightforward: when an event occurred, we’d immediately attempt to deliver the webhook. This worked fine at small scale, but as our customer base grew, we encountered several problems:

  1. Thundering herd: Popular repositories could generate thousands of events in seconds
  2. Slow receivers: Some endpoints took seconds to respond, blocking delivery workers
  3. Retry storms: Failed deliveries would pile up during outages
  4. No backpressure: We had no way to slow down during high load

The Solution: Event-Driven Architecture

We redesigned the system around an event-driven architecture with several key components:

1. Event Queue with Partitioning

Instead of delivering webhooks synchronously, we now publish events to a partitioned queue:

go
type WebhookEvent struct {
    ID             string    `json:"id"`
    OrganizationID string    `json:"organization_id"`
    Type           string    `json:"type"`
    Payload        []byte    `json:"payload"`
    CreatedAt      time.Time `json:"created_at"`
}

func (p *Publisher) Publish(ctx context.Context, event WebhookEvent) error {
    // Partition by organization to maintain ordering guarantees
    partition := hash(event.OrganizationID) % p.numPartitions

    return p.queue.Publish(ctx, partition, event)
}

Partitioning by organization ensures that webhooks for the same organization are processed in order, while allowing parallel processing across organizations.
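
On the consuming side, each partition gets its own worker, so events for one organization are delivered strictly in order while partitions drain in parallel. Here’s a minimal sketch of that pattern; the Queue interface and Consumer type are illustrative stand-ins, not our actual internals:

go
// Queue is an assumed interface over the partitioned queue.
type Queue interface {
    // Next blocks until the next event is available on the given partition.
    Next(ctx context.Context, partition int) (WebhookEvent, error)
}

type Consumer struct {
    queue         Queue
    numPartitions int
    deliver       func(context.Context, WebhookEvent) error
}

// Run starts one goroutine per partition. Each goroutine processes its
// partition sequentially, preserving per-organization ordering.
func (c *Consumer) Run(ctx context.Context) {
    for p := 0; p < c.numPartitions; p++ {
        go func(partition int) {
            for {
                event, err := c.queue.Next(ctx, partition)
                if err != nil {
                    return // context cancelled or queue shut down
                }
                // One event at a time per partition; other partitions
                // keep making progress concurrently. Failures feed into
                // the rate limiting and circuit breaking described below.
                _ = c.deliver(ctx, event)
            }
        }(p)
    }
}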

2. Adaptive Rate Limiting

We implemented per-endpoint rate limiting that adapts based on the receiver’s behavior:

go
type AdaptiveRateLimiter struct {
    mu           sync.Mutex // guards currentRate
    baseRate     float64
    currentRate  float64
    successCount int64
    failureCount int64
}

func (r *AdaptiveRateLimiter) OnSuccess() {
    atomic.AddInt64(&r.successCount, 1)

    r.mu.Lock()
    defer r.mu.Unlock()
    // Gradually increase the rate on success, capped at twice the base rate
    r.currentRate = min(r.currentRate*1.1, r.baseRate*2)
}

func (r *AdaptiveRateLimiter) OnFailure() {
    atomic.AddInt64(&r.failureCount, 1)

    r.mu.Lock()
    defer r.mu.Unlock()
    // Back off quickly on failure, but never below 10% of the base rate
    r.currentRate = max(r.currentRate*0.5, r.baseRate*0.1)
}
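
To act on the current rate, a delivery worker needs a way to read it and pace its attempts. The sketch below is one way to wire that up; Rate() and deliverWithPacing() are hypothetical helpers for illustration, not part of the system described above:

go
// Rate returns the current allowed attempts per second for this endpoint.
func (r *AdaptiveRateLimiter) Rate() float64 {
    r.mu.Lock()
    defer r.mu.Unlock()
    return r.currentRate
}

// deliverWithPacing waits roughly 1/rate seconds before each attempt and
// feeds the outcome back into the limiter.
func deliverWithPacing(ctx context.Context, r *AdaptiveRateLimiter, send func() error) error {
    wait := time.Duration(float64(time.Second) / r.Rate())
    select {
    case <-time.After(wait):
    case <-ctx.Done():
        return ctx.Err()
    }
    if err := send(); err != nil {
        r.OnFailure()
        return err
    }
    r.OnSuccess()
    return nil
}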

3. Circuit Breakers

For endpoints that are completely down, we use circuit breakers to avoid wasting resources:

go
type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    state        State
    failures     int
    threshold    int
    resetTimeout time.Duration
    lastFailure  time.Time
}

func (cb *CircuitBreaker) Allow() bool {
    switch cb.state {
    case StateClosed:
        // Endpoint is healthy: deliver as normal
        return true
    case StateOpen:
        // After the reset timeout, let a single probe through
        if time.Since(cb.lastFailure) > cb.resetTimeout {
            cb.state = StateHalfOpen
            return true
        }
        return false
    case StateHalfOpen:
        // Probe attempt: its result decides whether we close or re-open
        return true
    }
    return false
}
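
Allow() only decides whether to attempt a delivery; the breaker also needs to record outcomes to move between states. Here’s a sketch of those transitions, consistent with the struct above (the exact rules are simplified for illustration):

go
// OnSuccess closes the breaker after a successful delivery and clears
// the failure count.
func (cb *CircuitBreaker) OnSuccess() {
    cb.failures = 0
    cb.state = StateClosed
}

// OnFailure records the failure and trips the breaker once the threshold
// is reached, or immediately if a half-open probe fails.
func (cb *CircuitBreaker) OnFailure() {
    cb.failures++
    cb.lastFailure = time.Now()
    if cb.state == StateHalfOpen || cb.failures >= cb.threshold {
        cb.state = StateOpen
    }
}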

Results

After deploying the new system, we saw dramatic improvements:

Metric                 Before          After
P99 Delivery Latency   12s             200ms
Failed Deliveries      2.3%            0.1%
Resource Usage         High variance   Stable

Lessons Learned

  1. Design for failure: Assume receivers will be slow or unavailable
  2. Partition wisely: Choose partition keys that balance load while maintaining necessary ordering
  3. Monitor everything: We added detailed metrics for every stage of the pipeline (see the sketch after this list)
  4. Backpressure is essential: Systems without backpressure will eventually fail
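
On the monitoring point above: the signals that mattered most were per-stage counters and latency histograms. As a hypothetical example (we’re not prescribing a library here), Prometheus-style instrumentation for the delivery stage might look like this, with illustrative metric names:

go
var (
    deliveriesTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "webhook_deliveries_total",
            Help: "Webhook delivery attempts by outcome.",
        },
        []string{"outcome"}, // e.g. "success", "failure", "circuit_open"
    )

    deliveryDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "webhook_delivery_duration_seconds",
            Help:    "Latency from event publish to receiver response.",
            Buckets: prometheus.DefBuckets,
        },
        []string{"event_type"},
    )
)

func init() {
    // Register with the default registry so the metrics get scraped.
    prometheus.MustRegister(deliveriesTotal, deliveryDuration)
}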

What’s Next

We’re now working on:

  • Webhook replay: Allowing customers to replay failed webhooks
  • Filtering: Let customers subscribe to specific event types
  • Batching: Combine multiple events into single deliveries for high-volume customers

Interested in solving challenges like this? We’re hiring! Check out our open positions.