Merged · Size S · Change breakdown: Feature 70% · Performance 25% · Config 5%
#3423 feat(run-engine): flag to route getSnapshotsSince through read replica

Snapshot reads now route through database replica

A feature flag lets snapshot-polling queries hit the read-only replica instead of the primary database — reducing writer load during high-concurrency task execution.

Snapshot-polling queries — fired by every running task runner, multiple times per second — have been hammering the primary database even though the data they read tolerates slight staleness.

A new feature flag routes those queries to the read-only replica instead. When disabled (the default), behavior is unchanged. When enabled, the method uses the replica client, offloading snapshot reads from the primary entirely.
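The flag-gated selection can be sketched as follows. The flag name comes from the PR description; the client objects and function name are illustrative, not the actual RunEngine code:

```typescript
// Minimal sketch of flag-gated client selection, assuming two pre-constructed
// database clients. Only the env-var name is taken from the PR; everything
// else is hypothetical.
type DbClient = { readonly label: "primary" | "replica" };

const primary: DbClient = { label: "primary" };
const readReplica: DbClient = { label: "replica" };

// Default "0" (disabled): snapshot reads stay on the primary, behavior unchanged.
function snapshotReadClient(env: Record<string, string | undefined>): DbClient {
  const enabled =
    env["RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED"] === "1";
  return enabled ? readReplica : primary;
}
```

Keeping the decision in one small function means the rest of the read path is unchanged regardless of which client it receives.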

The flag ships disabled to allow gradual rollout with monitoring, since a lagging replica can briefly serve stale snapshots during replication catch-up windows. Aurora deployments shrink those windows to single-digit milliseconds, making the risk minimal in production environments.

The change lives in the run-engine package, a core component of the task runner infrastructure.

Original GitHub Description

Summary

Adds RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED (default "0"). When enabled, the Prisma reads inside RunEngine.getSnapshotsSince run against the read-only replica client instead of the primary, offloading the snapshot-polling queries fired by every running task runner from the writer.

Why

getSnapshotsSince is called from the managed runner's fetch-and-process loop (once per poll interval, plus on every snapshot-change notification). It runs four sequential reads per call — one findFirst by snapshot id, one findMany on snapshots with createdAt > X, one raw SQL query against _completedWaitpoints, and chunked findMany calls on waitpoint — per concurrent run, every few seconds. It's read-only, tolerates a small amount of staleness, and is an obvious candidate for the replica.
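The shape of those four reads can be outlined against a tiny in-memory interface rather than the real Prisma client; the method and type names below are illustrative, not the actual RunEngine API:

```typescript
// Hypothetical outline of the four sequential reads described above.
interface SnapshotRow { id: string; createdAt: number }

interface SnapshotReader {
  findFirstSnapshot(id: string): SnapshotRow | undefined;  // read 1: findFirst by id
  findSnapshotsCreatedAfter(t: number): SnapshotRow[];     // read 2: createdAt > X
  completedWaitpointIdsFor(snapshotId: string): string[];  // read 3: raw SQL on _completedWaitpoints
  findWaitpointsByIds(ids: string[]): string[];            // read 4: chunked findMany on waitpoint
}

function getSnapshotsSinceSketch(db: SnapshotReader, sinceId: string) {
  const since = db.findFirstSnapshot(sinceId);
  // Missing "since" snapshot is surfaced as an error, not swallowed.
  if (!since) throw new Error("since snapshot not found");
  return db.findSnapshotsCreatedAfter(since.createdAt).map((snapshot) => ({
    snapshot,
    completedWaitpoints: db.findWaitpointsByIds(
      db.completedWaitpointIdsFor(snapshot.id)
    ),
  }));
}
```

Because every read is against the same client, swapping that client for the replica moves the whole call off the writer at once.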

Replica-lag considerations

  • Step 1 "since snapshot not found": if the runner just received a snapshot id from the primary and asks the replica before it replicates, the function throws and the caller treats the response as an error (runner falls back to a metadata refresh). Self-correcting, not silent.
  • Step 2 missing newly-created snapshots: the next poll's createdAt > sinceSnapshot.createdAt filter still picks them up once the replica catches up.
  • Waitpoint junction race: the riskiest path — if a latest snapshot is replicated but its _completedWaitpoints join rows aren't yet, the runner could advance past that snapshot with completedWaitpoints: []. WAL/storage-level replication replays commits in order, so in practice both should appear atomically on the reader, but the race window is why the flag ships disabled.
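The self-correcting path in the first bullet amounts to a try/catch around the poll. The function names here are illustrative, not the actual runner API:

```typescript
// Sketch of the runner's fallback: if the replica hasn't replicated the
// snapshot yet, getSnapshotsSince throws, the runner refreshes its metadata,
// and the next poll retries — the lag surfaces as an error, never silently.
async function pollOnce(
  getSnapshotsSince: (sinceId: string) => Promise<unknown[]>,
  refreshMetadata: () => Promise<void>,
  sinceId: string
): Promise<"advanced" | "refreshed"> {
  try {
    await getSnapshotsSince(sinceId); // replica read; may lag the primary
    return "advanced";
  } catch {
    await refreshMetadata(); // explicit fallback path
    return "refreshed";
  }
}
```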

Aurora reader shrinks all three windows to single-digit ms in typical conditions, and its storage-level replication gives atomic visibility of committed transactions on the reader.

Test plan

  • Flip the flag on in a non-prod environment, confirm snapshot polling behaves normally and getSnapshotsSince errors in Sentry stay flat.
  • Verify writer query volume drops and reader query volume rises on the snapshot-polling queries.
  • Keep an eye on AuroraReplicaLag (or equivalent) during rollout.
© 2026 · via Gitpulse