How We Engineered an 8ms Median Latency for Bot Execution

Sarah Jenkins Oct 24, 2023 8 min read

Most AI orchestration platforms handle latency by throwing more hardware at the problem. They spin up containers on every webhook, wait for models to warm up, and accept 200ms+ p50s as "normal." In high-frequency trading and real-time support, that margin is a cliff.

The Approach: Pre-warmed Slots

We don't spin up. We provision. When a pipeline is deployed, our runtime reserves a pool of lightweight goroutines alongside the containerized worker. These slots are pre-initialized with the SDK and ready to accept context immediately.
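To make the idea concrete, here is a minimal sketch of a pre-warmed pool in Go. It is not our production runtime: the `Slot`, `Pool`, `NewPool`, `Acquire`, and `Release` names are hypothetical, and the `Ready` flag merely stands in for SDK initialization. The point it illustrates is that all setup happens at deploy time, so the request path only takes a slot off a buffered channel.

```go
package main

import (
	"errors"
	"fmt"
)

// Slot is a hypothetical pre-initialized execution slot.
type Slot struct {
	ID    int
	Ready bool // set once at provisioning time; stands in for SDK initialization
}

// Pool holds slots that were initialized when the pipeline was deployed.
type Pool struct {
	free chan *Slot
}

// ErrNoSlot is returned when every slot is currently busy.
var ErrNoSlot = errors.New("no pre-warmed slot available")

// NewPool provisions n slots up front so no initialization runs per request.
func NewPool(n int) *Pool {
	p := &Pool{free: make(chan *Slot, n)}
	for i := 0; i < n; i++ {
		p.free <- &Slot{ID: i, Ready: true}
	}
	return p
}

// Acquire hands out a ready slot without blocking.
func (p *Pool) Acquire() (*Slot, error) {
	select {
	case s := <-p.free:
		return s, nil
	default:
		return nil, ErrNoSlot
	}
}

// Release returns a slot to the pool for reuse.
func (p *Pool) Release(s *Slot) {
	p.free <- s
}

func main() {
	pool := NewPool(2)
	s, err := pool.Acquire()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("handling request on slot %d (ready=%v)\n", s.ID, s.Ready)
	pool.Release(s)
}
```

Returning an error when the pool is empty, instead of blocking, is one possible design choice: it keeps the caller in control of whether to queue, shed load, or fall back to a cold path.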

Combined with predictive allocation, which uses a small per-pipeline state machine to predict when traffic will spike, we reduce "cold start" variance to near zero. Instead of allocating a new container on every request, we hand the request off to a slot that is already warm and adds only 0.5ms of overhead.
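A per-pipeline state machine of this kind could look something like the sketch below. Everything in it is an assumption made for illustration: the three states, the rule that two consecutive windows of growth count as a spike, and the `ExtraSlots` policy numbers are all hypothetical and not taken from our runtime.

```go
package main

import "fmt"

// State is the predicted traffic condition for one pipeline.
type State int

const (
	Steady State = iota
	Rising
	Spiking
)

func (s State) String() string {
	return [...]string{"steady", "rising", "spiking"}[s]
}

// Predictor is a hypothetical per-pipeline state machine.
type Predictor struct {
	state State
	last  int // request count seen in the previous window
}

// Observe feeds the request count for the latest window and returns the new
// state. One window of growth moves Steady to Rising, a second consecutive
// window of growth moves to Spiking, and any non-growth resets to Steady.
func (p *Predictor) Observe(count int) State {
	growing := count > p.last
	p.last = count
	switch {
	case !growing:
		p.state = Steady
	case p.state == Steady:
		p.state = Rising
	default:
		p.state = Spiking
	}
	return p.state
}

// ExtraSlots is an illustrative policy: how many additional slots to pre-warm.
func ExtraSlots(s State) int {
	switch s {
	case Rising:
		return 2
	case Spiking:
		return 8
	default:
		return 0
	}
}

func main() {
	var p Predictor
	for _, c := range []int{10, 10, 25, 60, 40} {
		s := p.Observe(c)
		fmt.Printf("count=%d state=%s extra=%d\n", c, s, ExtraSlots(s))
	}
}
```

The appeal of a state machine over a heavier forecasting model is that each observation costs a comparison and a branch, so it can run for every pipeline without adding measurable latency of its own.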

[Figure: Latency comparison graph showing our 8ms median vs. 180ms for competitors]
Methodology

Benchmarking & Metrics

We measure latency at the execution boundary, from webhook receipt to the final state update. Our methodology accounts for network jitter and queue depth, ensuring our numbers reflect real-world production behavior.

// Sample output from our internal load testing suite
p50_latency: 8.2ms
p95_latency: 14.5ms
p99_latency: 32.1ms
throughput: 450k req/s
      

The Scheduler Loop

The scheduler is written entirely in Go and runs as a tight, non-blocking loop in user space. Rather than waiting on a lock, it polls a lock-free ring buffer of tasks.

We removed the mutex contention found in standard worker pools by using atomic operations for queue head/tail pointers. This allows the runtime to schedule thousands of micro-batches per millisecond without context switching overhead.

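const RingSize = 1024 // illustrative capacity, not the production value

type Task struct { ID uint64; Payload []byte }
var ring [RingSize]Task
var head, tail uint64 // read and written only through sync/atomic, never under a mutex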
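The declarations above leave out the enqueue and dequeue logic. As a hedged illustration, here is a minimal single-producer, single-consumer ring built on `sync/atomic` (it uses `atomic.Uint64`, so it assumes Go 1.19 or later). The `Ring` type, its `Push` and `Pop` methods, and the capacity of 8 are hypothetical. Restricting it to one producer and one consumer is a simplifying assumption: a multi-producer queue would additionally need compare-and-swap and per-slot sequencing, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

const ringSize = 8 // illustrative; a power of two so indices can be masked

type Task struct {
	ID      uint64
	Payload []byte
}

// Ring is a bounded single-producer, single-consumer queue.
type Ring struct {
	buf  [ringSize]Task
	head atomic.Uint64 // next index to read; written only by the consumer
	tail atomic.Uint64 // next index to write; written only by the producer
}

// Push enqueues t and reports false if the ring is full. Producer only.
func (r *Ring) Push(t Task) bool {
	tail := r.tail.Load()
	if tail-r.head.Load() == ringSize {
		return false
	}
	r.buf[tail&(ringSize-1)] = t
	r.tail.Store(tail + 1) // publish the slot only after it is written
	return true
}

// Pop dequeues the oldest task and reports false if the ring is empty.
// Consumer only.
func (r *Ring) Pop() (Task, bool) {
	head := r.head.Load()
	if head == r.tail.Load() {
		return Task{}, false
	}
	t := r.buf[head&(ringSize-1)]
	r.head.Store(head + 1) // free the slot only after it is read
	return t, true
}

func main() {
	var r Ring
	done := make(chan struct{})

	go func() { // consumer: polls instead of blocking on a lock
		defer close(done)
		for received := uint64(0); received < 100; {
			if t, ok := r.Pop(); ok {
				if t.ID != received {
					panic("task received out of order")
				}
				received++
			} else {
				runtime.Gosched() // yield while the ring is empty
			}
		}
	}()

	for i := uint64(0); i < 100; { // producer
		if r.Push(Task{ID: i}) {
			i++
		} else {
			runtime.Gosched() // yield while the ring is full
		}
	}
	<-done
	fmt.Println("consumed 100 tasks in order")
}
```

Because each index has exactly one writer, plain atomic loads and stores are enough here. Neither side ever takes a mutex, which is the property the paragraph above relies on.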

[Figure: Architecture diagram of the lock-free queue scheduler]

Real-World Impact

This architecture powers 12 million monthly executions across our Growth and Enterprise tiers. The distribution is tight: in our load tests, p99 sits at 32.1ms, roughly four times the 8.2ms median.

> "Our support chatbot now responds in real-time. Latency dropped from 180ms to 8ms."

— Engineering Lead, FinTech Startup

What's Next

We're targeting a 5ms p50 in the next runtime release by using WebAssembly (Wasm) for isolated worker execution. We're also exploring FUSE-based filesystem caching for heavy data pipelines.

Discussion

Have thoughts on lock-free queues or Go runtime tuning? Join our Discord or open an issue on GitHub.
