Merged
Size: XL
Change Breakdown: Bug Fix 70% · Performance 30%
#3453 fix(run-engine): debounce hot-key lock contention and 5xx feedback loop

Debounce lock contention resolved

The run engine quantizes debounce delays and handles lock contention gracefully, eliminating 5xx errors and retry storms on hot keys.

Heavy concurrent triggering on a single debounce key previously caused lock contention, producing 5xx errors that the SDK retried and thereby amplified the load. The debounce engine now handles these "thundering herd" scenarios gracefully.

The engine now buckets debounce delay targets into one-second intervals so concurrent triggers on the same key converge on a shared target time. That shared target lets an unlocked fast-path check skip most lock acquisitions entirely, and when lock contention still occurs, the engine falls back gracefully to the existing run instead of throwing an error. A sketch of the bucketing follows.
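A minimal sketch of the bucketing idea, in TypeScript. The helper name is illustrative; only the 1-second default, the "set to 0 to disable" behaviour, and the fact that runs may fire up to 1s early (i.e. rounding down) come from the PR description.

```ts
// Illustrative helper (not the PR's actual code): snap a computed delayUntil
// timestamp down to a bucket boundary so triggers arriving within the same
// second converge on one shared target time.
function quantizeDelayUntil(delayUntilMs: number, bucketMs = 1_000): number {
  if (bucketMs <= 0) return delayUntilMs; // 0 disables quantization
  return Math.floor(delayUntilMs / bucketMs) * bucketMs;
}

// Triggers a few hundred milliseconds apart share a bucket, so the later ones
// can short-circuit on the unlocked fast-path check described below.
quantizeDelayUntil(1_730_000_000_123); // => 1_730_000_000_000
quantizeDelayUntil(1_730_000_000_842); // => 1_730_000_000_000
```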

This breaks the 5xx retry feedback loop. The engine stays stable under heavy debounce load without dropping updates or overwhelming the database. These changes are isolated to the run-engine package and can be tuned via new environment variables.

Original GitHub Description

Changes

Three changes in internal-packages/run-engine/src/engine/systems/debounceSystem.ts, in order of impact (a condensed sketch follows the list):

  1. Fast-path skip before the lock. In handleExistingRun, do an unlocked read of delayUntil (and createdAt for the max-duration check) from the run row before entering runLock.lock("handleDebounce", ...). If newDelayUntil <= currentDelayUntil and the run is still within its max-duration window, return the existing run immediately without taking the lock. Safe because debounce is monotonic-forward only — a stale read either matches reality or undershoots, both of which decay correctly (re-checked properly inside the lock by whichever caller is actually pushing forward). Trailing-mode triggers carrying updateData still take the lock so the data update is applied.

  2. Quantize newDelayUntil. Round the computed newDelayUntil to 1-second buckets (configurable via quantizeNewDelayUntilMs, set to 0 to disable). Without quantization, every call has a slightly larger newDelayUntil than the last and they all pass the fast-path check. With it, concurrent callers on the same key share a target time and ~95% short-circuit. User-visible effect: a debounced run might fire up to 1s earlier than the strict spec — non-issue for typical debounce use cases (chat summarization, batched notifications, etc.).

  3. Graceful lock-contention fallback. Wrap the runLock.lock(...) call so LockAcquisitionTimeoutError and Redlock ExecutionError / ResourceLockedError return the existing run id with success instead of propagating a 5xx. Debounce is best-effort: if we can't take the lock, the herd is already updating it for us; fall in line. This kills the 5xx → SDK-retry feedback loop. With (1)+(2) this rarely fires; without them it's the difference between 5xx and 200.
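A condensed sketch of how (1) and (3) might fit together around the existing lock call, assuming the delay has already been quantized as sketched earlier. The runLock.lock("handleDebounce", ...) shape, the delayUntil / createdAt / updateData fields, and the error class names come from the description above; readRunSnapshot, the Deps shape, and the return value are assumptions made for illustration.

```ts
interface RunSnapshot {
  id: string;
  delayUntil: Date;
  createdAt: Date;
}

interface Deps {
  // Hypothetical unlocked read of the run row (fast path, no lock taken).
  readRunSnapshot(runId: string): Promise<RunSnapshot>;
  // Lock shape inferred from the PR's runLock.lock("handleDebounce", ...) call.
  runLock: { lock<T>(name: string, resources: string[], fn: () => Promise<T>): Promise<T> };
  maxDurationMs: number;
}

// The PR names LockAcquisitionTimeoutError and Redlock's ExecutionError /
// ResourceLockedError; matching on the error name keeps the sketch dependency-free.
function isLockContentionError(err: unknown): boolean {
  return (
    err instanceof Error &&
    ["LockAcquisitionTimeoutError", "ExecutionError", "ResourceLockedError"].includes(err.name)
  );
}

async function handleExistingRun(
  deps: Deps,
  runId: string,
  newDelayUntilMs: number, // already quantized into a shared bucket
  updateData?: Record<string, unknown> // trailing-mode payload forces the locked path
): Promise<{ runId: string }> {
  // (1) Unlocked fast-path: if we would not push delayUntil forward and the run
  // is still inside its max-duration window, skip the lock entirely.
  const snapshot = await deps.readRunSnapshot(runId);
  const withinMaxDuration = Date.now() - snapshot.createdAt.getTime() < deps.maxDurationMs;
  if (!updateData && withinMaxDuration && newDelayUntilMs <= snapshot.delayUntil.getTime()) {
    // A stale read only undershoots; the caller actually pushing the delay
    // forward re-checks under the lock.
    return { runId };
  }

  // (3) Graceful fallback: lock contention means the herd is already updating
  // this run, so return the existing run instead of surfacing a 5xx.
  try {
    return await deps.runLock.lock("handleDebounce", [runId], async () => {
      // ...re-read, push delayUntil forward, apply updateData (elided)...
      return { runId };
    });
  } catch (err) {
    if (isLockContentionError(err)) return { runId };
    throw err;
  }
}
```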

Defaults preserve current behaviour aside from quantization (1s) and fast-path (on). Both are configurable via RunEngineOptions.debounce.
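For illustration, the new knobs might be passed like this. Only quantizeNewDelayUntilMs is named in the PR; the fast-path flag below is a hypothetical placeholder for whatever the real option is called.

```ts
// Illustrative RunEngineOptions.debounce slice; names other than
// quantizeNewDelayUntilMs are placeholders, not the real option names.
const debounce = {
  quantizeNewDelayUntilMs: 1_000, // default 1-second buckets; set to 0 to disable
  // fastPathEnabled: true,       // hypothetical name; the PR says the fast-path is on by default
};
```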

✅ Checklist

  • I have followed every step in the contributing guide
  • The PR title follows the convention.
  • I ran and tested that the code works

Changelog

Reduce 5xx feedback loops on hot debounce keys by quantizing delayUntil, adding an unlocked fast-path skip before the redlock, and gracefully handling redlock contention in handleDebounce so the SDK no longer retries into a herd.
