Scaling Webhook Delivery at Buildkite
Webhooks power Buildkite’s integrations, notifying external systems of builds, jobs, and artifacts. Here’s how we scaled our delivery system to handle millions of events daily.
The Challenge
Our original webhook system was straightforward: when an event occurred, we’d immediately attempt to deliver the webhook. This worked fine at small scale, but as our customer base grew, we encountered several problems:
- Thundering herd: Popular repositories could generate thousands of events in seconds
- Slow receivers: Some endpoints took seconds to respond, blocking delivery workers
- Retry storms: Failed deliveries would pile up during outages
- No backpressure: We had no way to slow down during high load
The Solution: Event-Driven Architecture
We redesigned the system around an event-driven architecture with several key components:
1. Event Queue with Partitioning
Instead of delivering webhooks synchronously, we now publish events to a partitioned queue:
```go
type WebhookEvent struct {
    ID             string    `json:"id"`
    OrganizationID string    `json:"organization_id"`
    Type           string    `json:"type"`
    Payload        []byte    `json:"payload"`
    CreatedAt      time.Time `json:"created_at"`
}

func (p *Publisher) Publish(ctx context.Context, event WebhookEvent) error {
    // Partition by organization to maintain ordering guarantees
    partition := hash(event.OrganizationID) % p.numPartitions
    return p.queue.Publish(ctx, partition, event)
}
```

Partitioning by organization ensures that webhooks for the same organization are processed in order, while allowing parallel processing across organizations.
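To make the ordering property concrete, here’s a minimal sketch of a per-partition consumer. The `Queue` interface, `consumePartition`, and `deliver` callback below are hypothetical stand-ins rather than our actual internals; the point is simply that each partition is drained by a single goroutine, so events for one organization never overtake each other, while separate partitions are processed in parallel.

```go
// Queue is a hypothetical consumer-side view of the partitioned queue.
type Queue interface {
    // Consume blocks until the next event on the partition is available.
    Consume(ctx context.Context, partition int) (WebhookEvent, error)
}

// consumePartition drains one partition serially, preserving per-organization
// ordering. Run one goroutine per partition to process partitions in parallel.
func consumePartition(ctx context.Context, q Queue, partition int, deliver func(context.Context, WebhookEvent) error) error {
    for {
        event, err := q.Consume(ctx, partition)
        if err != nil {
            return err // context cancelled or queue shut down
        }
        // Delivery errors are handled downstream (retries, rate limiting,
        // circuit breaking); a failing endpoint must not stall the partition.
        _ = deliver(ctx, event)
    }
}
```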
2. Adaptive Rate Limiting
We implemented per-endpoint rate limiting that adapts based on the receiver’s behavior:
```go
type AdaptiveRateLimiter struct {
    mu           sync.Mutex // guards currentRate; the counters use atomics
    baseRate     float64
    currentRate  float64
    successCount int64
    failureCount int64
}

func (r *AdaptiveRateLimiter) OnSuccess() {
    atomic.AddInt64(&r.successCount, 1)
    r.mu.Lock()
    defer r.mu.Unlock()
    // Gradually increase rate on success, capped at twice the base rate
    r.currentRate = min(r.currentRate*1.1, r.baseRate*2)
}

func (r *AdaptiveRateLimiter) OnFailure() {
    atomic.AddInt64(&r.failureCount, 1)
    r.mu.Lock()
    defer r.mu.Unlock()
    // Quickly back off on failure, but never below 10% of the base rate
    r.currentRate = max(r.currentRate*0.5, r.baseRate*0.1)
}
```
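The snippet above only shows how the rate adapts; the delivery worker still has to respect `currentRate` between sends. A minimal way to do that, assuming a hypothetical `Wait` method rather than anything lifted from our codebase, is to pause roughly `1/currentRate` seconds before each delivery:

```go
// Wait blocks until the limiter permits the next delivery to this endpoint.
// A production limiter would use a token bucket (for example
// golang.org/x/time/rate); sleeping for 1/currentRate seconds between sends
// is the simplest possible approximation.
func (r *AdaptiveRateLimiter) Wait(ctx context.Context) error {
    r.mu.Lock()
    rate := r.currentRate
    r.mu.Unlock()

    interval := time.Duration(float64(time.Second) / rate)
    select {
    case <-time.After(interval):
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}
```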
3. Circuit Breakers
For endpoints that are completely down, we use circuit breakers to avoid wasting resources:
```go
type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    state        State
    failures     int
    threshold    int
    resetTimeout time.Duration
    lastFailure  time.Time
}

func (cb *CircuitBreaker) Allow() bool {
    switch cb.state {
    case StateClosed:
        return true
    case StateOpen:
        // After the reset timeout, let a probe request through
        if time.Since(cb.lastFailure) > cb.resetTimeout {
            cb.state = StateHalfOpen
            return true
        }
        return false
    case StateHalfOpen:
        return true
    }
    return false
}
```
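The snippet above omits how the breaker trips and recovers. A minimal version, sketched below rather than lifted from our code (and, like the snippet above, ignoring the synchronization a concurrent breaker would need), records each delivery outcome and opens the circuit once failures cross the threshold:

```go
// RecordSuccess closes the breaker after a successful delivery, for example
// when a half-open probe gets through.
func (cb *CircuitBreaker) RecordSuccess() {
    cb.failures = 0
    cb.state = StateClosed
}

// RecordFailure counts a failed delivery and opens the breaker once the
// failure threshold is reached, or immediately if a half-open probe fails.
func (cb *CircuitBreaker) RecordFailure() {
    cb.failures++
    cb.lastFailure = time.Now()
    if cb.failures >= cb.threshold || cb.state == StateHalfOpen {
        cb.state = StateOpen
    }
}
```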
Results
After deploying the new system, we saw dramatic improvements:
| Metric | Before | After |
|---|---|---|
| P99 Delivery Latency | 12s | 200ms |
| Failed Deliveries | 2.3% | 0.1% |
| Resource Usage | High variance | Stable |
Lessons Learned
- Design for failure: Assume receivers will be slow or unavailable
- Partition wisely: Choose partition keys that balance load while maintaining necessary ordering
- Monitor everything: We added detailed metrics for every stage of the pipeline
- Backpressure is essential: Systems without backpressure will eventually fail (see the sketch below)
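To illustrate that last point (and only to illustrate it; the channel-based buffer below is not how our pipeline is built), backpressure can be as simple as publishing into a bounded buffer and making the producer wait, or fail fast, when consumers can’t keep up:

```go
// boundedPublish applies backpressure with a bounded buffer: when the buffer
// is full, the producer blocks until a consumer frees space or the context
// expires, instead of piling up unbounded work in memory.
func boundedPublish(ctx context.Context, buf chan WebhookEvent, event WebhookEvent) error {
    select {
    case buf <- event:
        return nil
    case <-ctx.Done():
        return fmt.Errorf("publish aborted (buffer full or shutting down): %w", ctx.Err())
    }
}
```

Here `buf` would be created with a fixed capacity (for example `make(chan WebhookEvent, 1024)`), and the context deadline decides whether slow consumers make producers wait or shed load.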
What’s Next
We’re now working on:
- Webhook replay: Allowing customers to replay failed webhooks
- Filtering: Let customers subscribe to specific event types
- Batching: Combine multiple events into single deliveries for high-volume customers
Interested in solving challenges like this? We’re hiring! Check out our open positions.