Lock contention eliminated for batchTriggerAndWait processing

ericallam · Mar 17, 2026 · #3232 · fix(engine): lockless waitpoint insert for batch items to eliminate lock contention

High-concurrency batchTriggerAndWait processing no longer triggers LockAcquisitionTimeoutError. The error had been occurring about 880 times per 24 hours in production, and it left parent runs stuck.

When processing batchTriggerAndWait items at scale, each batch item was unnecessarily acquiring a Redis lock on the parent run to insert a TaskRunWaitpoint row. With processingConcurrency set to 50, many items competed for the same parent-run lock at once, and lock acquisitions timed out, leaving runs orphaned and parent runs stuck.
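To make the contention concrete, here is a rough back-of-the-envelope model, not the engine's actual locking code. It assumes a strict FIFO queue on a single lock and a fixed hold time per item; the function name, the hold time, and the timeout are all illustrative assumptions, and only the concurrency of 50 comes from the description above.

```typescript
// Simplified model: every waiter arrives at t=0 and queues for ONE lock.
// holdMs and timeoutMs are hypothetical values, not the engine's real settings.
function countTimedOutWaiters(
  concurrency: number,
  holdMs: number,
  timeoutMs: number
): number {
  let timedOut = 0;
  for (let position = 0; position < concurrency; position++) {
    // Earliest time this waiter could get the lock if everyone ahead of it succeeded.
    const earliestAcquireMs = position * holdMs;
    if (earliestAcquireMs > timeoutMs) {
      timedOut++;
    }
  }
  return timedOut;
}

// 50 concurrent batch items, assumed 20 ms hold time, assumed 500 ms acquisition timeout.
console.log(countTimedOutWaiters(50, 20, 500));
```

Under these assumed numbers, 24 of the 50 items give up before acquiring the lock. A real Redis lock retries rather than queueing in order, so this is only meant to show why serializing every batch item on one parent-run lock scales poorly.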

The fix bypasses the lock entirely for batch items. Since blockRunWithCreatedBatch already transitions the parent run to EXECUTING_WITH_WAITPOINTS before any items are processed, the per-item lock is redundant. A new lockless method, blockRunWithWaitpointLockless, performs only the idempotent CTE insert (using ON CONFLICT DO NOTHING to safely handle concurrent inserts) and timeout scheduling, without touching the parent run lock.
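The idempotency property that makes the lock unnecessary can be sketched with an in-memory stand-in for the table. This is not the engine's code: WaitpointTable, insertIfAbsent, and blockBatchItems are hypothetical names, and the uniqueness key of (run ID, waitpoint ID) is an assumption made for illustration.

```typescript
// Hypothetical in-memory stand-in for the TaskRunWaitpoint table.
class WaitpointTable {
  // Each entry models one row; the key plays the role of an assumed unique constraint.
  private readonly rows = new Set<string>();

  // Mimics INSERT ... ON CONFLICT DO NOTHING:
  // returns true if a row was inserted, false if it already existed.
  insertIfAbsent(runId: string, waitpointId: string): boolean {
    const key = `${runId}:${waitpointId}`;
    if (this.rows.has(key)) {
      return false; // conflict: do nothing
    }
    this.rows.add(key);
    return true;
  }

  get size(): number {
    return this.rows.size;
  }
}

// Blocks the parent run on each waitpoint without taking any lock.
// Returns how many rows were actually inserted.
function blockBatchItems(
  table: WaitpointTable,
  parentRunId: string,
  waitpointIds: string[]
): number {
  let inserted = 0;
  for (const waitpointId of waitpointIds) {
    if (table.insertIfAbsent(parentRunId, waitpointId)) {
      inserted++;
    }
  }
  return inserted;
}

const table = new WaitpointTable();
const ids = Array.from({ length: 50 }, (_, i) => `waitpoint_${i}`);
const firstPass = blockBatchItems(table, "run_parent", ids);
const replay = blockBatchItems(table, "run_parent", ids); // duplicate attempts are no-ops
console.log(firstPass, replay, table.size);
```

Because a repeated insert changes nothing, duplicate or overlapping attempts cannot corrupt the parent run's waitpoint set, which is the property the lockless path relies on.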

In the run-engine, batch items now route through this optimized path while single triggerAndWait operations continue using the original locking method. This eliminates the root cause of the timeout errors while preserving safety for non-batch scenarios.
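The routing decision might look roughly like the sketch below. The BlockRequest shape, the batchId field, and selectBlockingPath are hypothetical names used only for illustration; the description does not say how the engine distinguishes the two cases internally.

```typescript
// Hypothetical request shape: batchId is assumed to be present only for batch items.
type BlockRequest = {
  runId: string;
  waitpointId: string;
  batchId?: string;
};

type BlockingPath = "lockless" | "locked";

// Batch items take the lockless insert; single triggerAndWait calls keep the lock.
function selectBlockingPath(request: BlockRequest): BlockingPath {
  return request.batchId !== undefined ? "lockless" : "locked";
}

const batchItemPath = selectBlockingPath({
  runId: "run_parent",
  waitpointId: "waitpoint_1",
  batchId: "batch_1",
});
const singlePath = selectBlockingPath({
  runId: "run_parent",
  waitpointId: "waitpoint_2",
});
console.log(batchItemPath, singlePath);
```

Keeping the locked path for single triggerAndWait calls matters because, unlike the batch case, nothing has moved the parent run to EXECUTING_WITH_WAITPOINTS ahead of time.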

The change lives in the run-engine package, targeting the waitpointSystem that manages task run coordination.

View Original GitHub Description

When processing batchTriggerAndWait items, each batch item was acquiring a Redis lock on the parent run to insert a TaskRunWaitpoint row. With high concurrency (processingConcurrency=50), this caused LockAcquisitionTimeoutError (880 errors/24h in prod), orphaned runs, and stuck parent runs.

Since blockRunWithCreatedBatch already transitions the parent to EXECUTING_WITH_WAITPOINTS before items are processed, the per-item lock is unnecessary. The new blockRunWithWaitpointLockless method performs only the idempotent CTE insert and timeout scheduling without acquiring the lock.