Handling long-running LLM streams in a stateful backend

André Eriksson

Leap is a new AI app builder for developers building real-world applications, not just prototypes. We've had a few people ask us how it works under the hood, so this is the first part in a series of posts digging into the technology behind Leap.

There's already a lot of ink spilled out there on how LLMs work, so we figured it would be interesting to look at the other side: what it takes to build a large-scale, high-throughput application that uses AI, and the unique challenges that brings.

The challenges with AI

When using AI/LLMs to generate large amounts of code, one of the first things you'll note is that it takes time. When creating a new application, the initial code generation can take 5-10 minutes.

For an end-user this isn't that bad, but it has huge implications for how to architect the system to handle this at scale. Most of the advice you'll find online for building backends focuses on stateless backends, where no state is stored between requests and each request completes quickly (typically on the order of milliseconds to single-digit seconds, at worst). If you can get away with it, structuring your backend this way gives you easy horizontal scaling, simple deployment strategies, and straightforward load balancing.

Our reality when building Leap is the complete opposite: requests can take minutes, and the backend is heavily stateful with real-time demands. This series of blog posts digs into what implications those requirements have, and how we solve for them while maintaining high shipping velocity.

To start off, we're delving into two problems we faced early on that we knew were important to get right: real-time collaboration at scale, and zero downtime deployments. The next article will tackle how we manage a large fleet of virtual machines for building and serving Leap's preview environments.

Real-time collaboration

Since we're designing for building real-world applications and not just simple prototypes, we realized from the start that we needed to support real-time multiplayer collaboration, as most ambitious applications will be built by multiple developers.

If you've ever built something like that, you know it poses a few distinct and tricky problems to solve:

  • How to avoid out-of-sync issues, where different clients see different things?
  • How to avoid race conditions, where the application breaks if multiple clients take conflicting actions concurrently?
  • How to scale the system, when the tried-and-true "everything is stateless and therefore can be scaled and load-balanced in the most naïve way possible" paradigm breaks down?
  • How to achieve all of these things while still keeping the system easy to reason about and fast to iterate and improve upon?

To accomplish all these things at the same time we had to get a bit more creative. The architecture we settled on involves stateful, protocol-aware load balancing, designing an event-driven state update protocol, and taking inspiration from functional programming.

Correctness

To achieve correctness (meaning no out-of-sync issues or race conditions) while keeping the system easy to reason about (to keep maintainability, quality, and development velocity high), we opted to connect all active clients for the same Leap project to a single in-memory object representing the project's current state. In our domain model we call this an active Session.

The Session keeps track of the current state of the project, as well as all connected clients. When a client connects to begin working on a project, we first check if an existing Session is active for that project. If not, we instantiate a new Session, which loads the up-to-date project state from our PostgreSQL database. If a Session already exists, we instead connect the client to that Session.
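As a rough sketch of that get-or-create step (the names here, like sessionManager, ProjectID, loadProjectState, and newSession, are our placeholders rather than Leap's actual code), the logic might look something like this:

// Illustrative sketch: a registry that returns the active Session for a
// project, creating one if none exists yet.
type sessionManager struct {
	mu       sync.Mutex
	sessions map[ProjectID]*Session // ProjectID and Session are assumed types
}

func (m *sessionManager) getOrCreate(ctx context.Context, id ProjectID) (*Session, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	// An active Session already exists: connect the client to it.
	if s, ok := m.sessions[id]; ok {
		return s, nil
	}

	// No active Session: load the current project state from PostgreSQL
	// and spin up a new one. loadProjectState and newSession are
	// hypothetical helpers.
	state, err := loadProjectState(ctx, id)
	if err != nil {
		return nil, err
	}
	s := newSession(id, state)
	m.sessions[id] = s
	return s, nil
}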

To support real-time collaboration multiple clients need to be kept in complete sync with one another. We do this by having the Session contain a State Machine that is responsible for handling all mutations to the project's state.

When a client wants to take an action, it sends a message to the Session; messages are enqueued and processed by the State Machine one at a time. As the State Machine processes actions it updates its own internal state and then broadcasts the state changes to all connected clients.

When a new client connects to an existing Session it first receives a "full sync" event representing the current up-to-date project state, and then begins receiving state updates along with all other clients. By linearizing all events in this way we ensure there are no race conditions where different clients observe different states.

All of Leap is written in Go, so this is implemented as a single goroutine that reads incoming action events on a channel and handles them in a giant switch:

for msg := range m.messages {
  switch msg := msg.(type) {
  case registerClientMsg:
      m.registerClient(msg)
  case deregisterClientMsg:
      m.deregisterClient(msg)
  case beginWorkflowMsg:
      m.beginWorkflow(msg)
  case abortMsg:
      m.abort(msg)
  case buildResultMsg:
      m.buildResult(msg)
  // many more cases...
  }
}

This simple approach achieves static type safety (every message implements the Message interface) while allowing each message type to carry its own data. (For the Rust crowd: yes, we get it, this part of the code would be even nicer in Rust with an enum a.k.a. sum type! Other parts would not be.)
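For illustration (these exact type definitions are ours, not necessarily Leap's), the Message interface can be as small as a marker method, with each message type carrying its own payload, and the registerClient handler is where the "full sync" from earlier happens:

// Message is a marker interface implemented by every action the state
// machine can process. (Illustrative sketch.)
type Message interface{ isMessage() }

type registerClientMsg struct {
	client *Client // the newly connected client (assumed type)
}

func (registerClientMsg) isMessage() {}

type buildResultMsg struct {
	buildID string
	err     error
}

func (buildResultMsg) isMessage() {}

// registerClient illustrates the full-sync handshake: the new client
// first receives a snapshot of the current state, then joins the set of
// clients that receive subsequent state updates. fullSyncEvent and the
// clients set are assumed here.
func (m *stateMachine) registerClient(msg registerClientMsg) {
	msg.client.send(fullSyncEvent{State: m.state})
	m.clients[msg.client] = struct{}{}
}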

One major benefit of this approach is that the state machine is completely synchronous and single-threaded. This means it never needs to worry about data races or locking, dramatically simplifying the code and improving developer productivity and quality through a simpler programming model that's easier to reason about.

The main downside is that we must be careful when taking actions that are slow, like waiting for a build to complete. If we do that synchronously in the state machine's goroutine it will block the state machine from processing other messages, which makes for a poor user experience. Those actions must instead be processed asynchronously (without accessing the state machine, to avoid data races) and communicate the result back to the state machine via message passing (like the buildResultMsg case above).
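Concretely, a slow action gets kicked off in its own goroutine, which never touches the state machine's state directly and only reports back over the same message channel. A rough sketch (buildRequest and runBuild are stand-ins for the real types and work):

// startBuild kicks off a slow build asynchronously so the state
// machine's goroutine stays free to process other messages.
func (m *stateMachine) startBuild(req buildRequest) {
	go func() {
		// This goroutine must not read or write m's internal state;
		// it only communicates via message passing.
		err := runBuild(req) // hypothetical: blocks until the build finishes
		m.messages <- buildResultMsg{buildID: req.ID, err: err}
	}()
}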

Scalability and zero-downtime deployments

Beyond the programming paradigm, this stateful architecture also comes with several operations-related hurdles:

  • How do we scale this architecture to handle our quickly-growing user base?
  • How do we deploy new releases in a graceful way, when actions like streaming responses from LLMs can take 5+ minutes?

We could scale either horizontally (spreading the load over multiple servers) or vertically (getting bigger servers). Perhaps the most relevant aspect here is that if we go with a solution where we need to run multiple processes, we need to start keeping track of where each active session is hosted (on which server and which process).

At first blush it seems easier to just scale vertically with a single, giant server and skip this problem entirely. But a single server with a single process makes graceful, zero-downtime deployments virtually impossible, as we'd need to shift all active sessions from one process to another at the same time. Given that we have many LLM streams running at every second of every day, that's not going to work. Not to mention that a single server makes for a reliability nightmare.

Introducing the WebSocket Gateway

We decided instead to introduce a stateful, API-aware proxy we call the WebSocket Gateway. The proxy keeps track of which Leap projects are running where (on which machine, and in which instance on that machine).

When a project-specific request or WebSocket connection comes in, the proxy routes it to the specific instance that is managing that project. If the project isn't already managed by any instance, the proxy assigns it to one of the machines using a load-balancing strategy. As soon as there are no open connections for a given project, the proxy drops the mapping.
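The core of that routing logic is conceptually just a map from project to instance plus a fallback assignment step. A sketch under our own naming (the gateway struct, instance type, and pickInstance strategy are illustrative, and the real gateway also has to track connection counts):

// gateway keeps the project-to-instance routing table. (Illustrative.)
type gateway struct {
	mu          sync.Mutex
	assignments map[ProjectID]*instance // instance is an assumed type
}

// route returns the backend instance managing a project, assigning one
// via the load-balancing strategy if the project isn't placed anywhere yet.
func (g *gateway) route(projectID ProjectID) *instance {
	g.mu.Lock()
	defer g.mu.Unlock()

	if inst, ok := g.assignments[projectID]; ok {
		return inst // already managed: keep routing to the same instance
	}
	inst := g.pickInstance() // hypothetical load-balancing strategy
	g.assignments[projectID] = inst
	return inst
}

// release drops the mapping once the last open connection for a project
// closes, so the project can be re-assigned on its next connection.
func (g *gateway) release(projectID ProjectID) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.assignments, projectID)
}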

To handle zero-downtime deployments, the proxy tracks multiple processes per machine. To deploy a new release we notify the proxy that a new binary exists. The proxy then spins up a new process (on each server) running the new version and begins allocating new projects to those processes. Crucially, the existing processes are kept around, possibly indefinitely, until there are no more open connections to them, at which point they can be shut down.
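A simplified view of that deploy flow, where a per-process connection count decides when an old version can finally be reaped (again, illustrative names only; startProcess, current, and the process type are assumptions):

// deploy spins up a process running the new binary and makes it the
// target for newly assigned projects. Existing processes keep serving
// their open connections until they drain.
func (g *gateway) deploy(binaryPath string) error {
	proc, err := g.startProcess(binaryPath) // hypothetical: exec the new version
	if err != nil {
		return err
	}
	g.mu.Lock()
	g.current = proc // new projects are allocated to the new process from now on
	g.mu.Unlock()
	return nil
}

// maybeShutdown reaps an old process once its last connection has closed.
func (g *gateway) maybeShutdown(p *process) {
	if p != g.current && p.openConnections() == 0 {
		p.stop()
	}
}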

In this way we achieve a horizontally scalable, highly stateful backend with zero-downtime deployments. The main downside to this approach is that we need to be careful when rolling out modifications to the database schema, since old releases can keep running for several hours after a deployment.

So far this trade-off has been worth it, since we already make sure to structure schema changes to be backwards-compatible, but your mileage may vary.

Up next: Firecracker at scale

Next up we'll be diving into how Leap uses Firecracker to build and serve full-stack applications, complete with infrastructure resources (backend services, SQL Databases, Pub/Sub, Object Storage, etc.) for tens of thousands of projects per day.