Status: Merged · Size: M · Change breakdown: Performance 80%, Refactor 20%
#3395 perf(run-engine): merge dequeue snapshot creation into taskRun.update transaction [TRI-8450]

Database commits halved for task dequeues

Database transactions are being merged in the task dequeue flow, cutting commits per operation from two to one to reduce database pressure.

In the engine service, the task dequeue flow now creates the execution snapshot inside the primary task update transaction. The snapshot ID is pre-generated locally, so event payloads can be built without a follow-up database read.

Consolidating these calls cuts the database commit load in half per dequeue operation, significantly reducing overhead on the system's highest-volume execution path. This represents the first phase of a broader initiative targeting five distinct operational flows to lower database resource consumption.

Original GitHub Description

Summary

Nests the TaskRunExecutionSnapshot creation inside the taskRun.update() Prisma call in the dequeue flow, reducing 2 DB commits → 1 per dequeue operation. This is the highest-volume of the five unmerged flows identified in TRI-8450 (~9,200 commits/sec on the engine service).

Pattern: Follows the same nested-write approach already used in the completion path (runAttemptSystem.ts:735) and trigger path (engine/index.ts:674).
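
To make the 2 → 1 commit claim concrete, here is a minimal sketch that models it with an in-memory stand-in rather than Prisma. `FakeDb`, `dequeueSeparate`, `dequeueMerged`, and `commitsFor` are hypothetical names invented for illustration, and the initial run status is a placeholder; only the `DEQUEUED` and `PENDING_EXECUTING` values come from this description.

```typescript
// Hypothetical in-memory stand-in for the database. It only counts commits.
type SnapshotRow = { id: string; runId: string; executionStatus: string };
type RunRow = { id: string; status: string; snapshots: SnapshotRow[] };

class FakeDb {
  commits = 0;
  runs = new Map<string, RunRow>();

  // Each call models one transaction, so it costs exactly one commit.
  transaction<T>(work: (runs: Map<string, RunRow>) => T): T {
    const result = work(this.runs);
    this.commits += 1;
    return result;
  }
}

// Old shape: update the run, then create the snapshot in a second transaction.
function dequeueSeparate(db: FakeDb, runId: string, snapshotId: string): void {
  db.transaction((runs) => {
    runs.get(runId)!.status = "DEQUEUED";
  });
  db.transaction((runs) => {
    runs.get(runId)!.snapshots.push({
      id: snapshotId,
      runId,
      executionStatus: "PENDING_EXECUTING",
    });
  });
}

// New shape: the snapshot is created in the same write as the run update.
function dequeueMerged(db: FakeDb, runId: string, snapshotId: string): void {
  db.transaction((runs) => {
    const run = runs.get(runId)!;
    run.status = "DEQUEUED";
    run.snapshots.push({
      id: snapshotId,
      runId,
      executionStatus: "PENDING_EXECUTING",
    });
  });
}

// Runs one dequeue against a fresh FakeDb and reports how many commits it cost.
function commitsFor(
  dequeue: (db: FakeDb, runId: string, snapshotId: string) => void
): number {
  const db = new FakeDb();
  // "QUEUED" is a placeholder starting status, not taken from the real schema.
  db.runs.set("run_1", { id: "run_1", status: "QUEUED", snapshots: [] });
  dequeue(db, "run_1", "snap_1");
  return db.commits;
}

console.log(commitsFor(dequeueSeparate), commitsFor(dequeueMerged)); // prints "2 1"
```

Both shapes leave the run and snapshot in the same final state; only the number of transactions differs, which is the property the staging test plan is meant to confirm.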

Changes:

  • dequeueSystem.ts: Moved snapshot creation into executionSnapshots: { create: {...} } within the existing taskRun.update(). Pre-generates the snapshot ID via generateInternalId() (plain cuid, matching what Prisma's @default(cuid()) produces) so the event emission, heartbeat enqueue, and return value can all be constructed from data already in scope — no extra DB read needed after the merged write. SnapshotId.toFriendlyId() is used only for the return value's friendlyId field, matching the original createExecutionSnapshot behavior.
  • executionSnapshotSystem.ts: Added public enqueueHeartbeatIfNeeded() method that exposes the heartbeat scheduling logic (previously only available internally via createExecutionSnapshot). This is needed because PENDING_EXECUTING requires a heartbeat, unlike the FINISHED status in the completion reference pattern. This method is reusable by future merge targets (retry-immediate, checkpoint, cancel, requeue).
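
The two changes above can be sketched together as follows: because the snapshot ID is known before the write, the event payload and the heartbeat decision need nothing read back from the database. This is an illustration under stated assumptions, not the code in the PR. `makeLocalId`, `enqueueHeartbeatIfNeededSketch`, `buildDequeueArtifacts`, and the `SnapshotEvent` shape are hypothetical; the real `generateInternalId()` produces a cuid, and the real `enqueueHeartbeatIfNeeded()` signature is not shown in this description.

```typescript
type ExecutionStatus = "PENDING_EXECUTING" | "FINISHED";

// Hypothetical, reduced event shape. The real payload carries more fields.
type SnapshotEvent = {
  snapshotId: string;
  runId: string;
  executionStatus: ExecutionStatus;
  runStatus: string;
  attemptNumber: number | null;
};

// Hypothetical stand-in for generateInternalId(); it is NOT a cuid generator.
let localIdCounter = 0;
function makeLocalId(): string {
  localIdCounter += 1;
  return `local_${localIdCounter}`;
}

// Per the description, PENDING_EXECUTING needs a heartbeat and FINISHED does not.
const HEARTBEAT_STATUSES: ReadonlySet<ExecutionStatus> =
  new Set<ExecutionStatus>(["PENDING_EXECUTING"]);

// Sketch of the heartbeat gate: enqueue only for statuses that need one.
function enqueueHeartbeatIfNeededSketch(
  status: ExecutionStatus,
  snapshotId: string,
  queue: string[]
): boolean {
  if (!HEARTBEAT_STATUSES.has(status)) return false;
  queue.push(snapshotId);
  return true;
}

// Everything below is derived from values already in scope before the write.
function buildDequeueArtifacts(
  run: { id: string; attemptNumber: number | null },
  heartbeatQueue: string[]
): { snapshotId: string; event: SnapshotEvent; heartbeatScheduled: boolean } {
  const snapshotId = makeLocalId(); // generated locally, before the merged write
  const event: SnapshotEvent = {
    snapshotId,
    runId: run.id,
    executionStatus: "PENDING_EXECUTING",
    runStatus: "PENDING",
    attemptNumber: run.attemptNumber,
  };
  const heartbeatScheduled = enqueueHeartbeatIfNeededSketch(
    event.executionStatus,
    snapshotId,
    heartbeatQueue
  );
  return { snapshotId, event, heartbeatScheduled };
}
```

The same ID would be passed into the nested create, so the written row, the emitted event, and the heartbeat job all refer to one snapshot without a read in between.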

Net DB change per dequeue: eliminates 1 write transaction (the separate TaskRunExecutionSnapshot.create). No extra reads added — the snapshot ID is pre-generated and the executionSnapshotCreated event payload is constructed inline from values already available in the closure.

Review & Testing Checklist for Human

  • Verify manually-constructed event payload matches DB state: The executionSnapshotCreated event is now built inline (not read back from DB). Confirm the field values (runStatus: "PENDING", attemptNumber, checkpointId, workerId, runnerId, completedWaitpointIds) match what Prisma actually writes. A mismatch here would be silent — event consumers would get stale/wrong data.
  • Verify attemptNumber source is equivalent: Old code used lockedTaskRun.attemptNumber (post-update result). New code uses result.run.attemptNumber (pre-update). The taskRun.update() data payload does NOT include attemptNumber, so they should be identical — but confirm this assumption holds for all dequeue scenarios (e.g. retried runs).
  • Verify isValid defaults to true in schema: The old createExecutionSnapshot explicitly set isValid: error ? false : true. The nested create omits isValid (no error in the dequeue happy path). Confirm the Prisma schema default for TaskRunExecutionSnapshot.isValid is true.
  • Verify runStatus: "PENDING" hardcoding matches the mapping: The old code passed lockedTaskRun.status ("DEQUEUED") to createExecutionSnapshot, which mapped it to "PENDING" via run.status === "DEQUEUED" ? "PENDING" : run.status. The new code hardcodes "PENDING" directly. This is correct but brittle if status ever changes from "DEQUEUED" to something else upstream.
  • Spot-check completedWaitpoints connect + order logic: The nested create replicates the connect/order logic from createExecutionSnapshot (lines 387-393). Verify the snapshot.completedWaitpoints type provides id and index fields compatible with this usage.
  • Verify checkpoint in return value: The return now uses snapshot.checkpoint (from the previous snapshot) instead of reading the newly-created snapshot's checkpoint relation. Since checkpointId is passed through unchanged, they should be identical — but worth a sanity check.
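
For the hardcoded runStatus item above, one way to catch drift is a small guard that re-applies the quoted mapping and compares it with the hardcoded value. This is a sketch only: `snapshotRunStatus` restates the ternary quoted in the checklist, while `assertHardcodedStatusStillValid` and the reduced `RunStatus` union (including `"EXECUTING"`) are assumptions made for illustration.

```typescript
// Reduced status union for illustration. The real enum has more members.
type RunStatus = "DEQUEUED" | "PENDING" | "EXECUTING";

// Mapping quoted in the checklist: DEQUEUED is reported as PENDING,
// every other status passes through unchanged.
function snapshotRunStatus(status: RunStatus): RunStatus {
  return status === "DEQUEUED" ? "PENDING" : status;
}

// Hypothetical guard: fail loudly if the status written by the update no longer
// maps to the value hardcoded in the nested create.
function assertHardcodedStatusStillValid(
  writtenStatus: RunStatus,
  hardcoded: RunStatus
): void {
  const mapped = snapshotRunStatus(writtenStatus);
  if (mapped !== hardcoded) {
    throw new Error(
      `hardcoded runStatus "${hardcoded}" no longer matches mapped "${mapped}"`
    );
  }
}

// Passes today: the dequeue update writes DEQUEUED, which maps to PENDING.
assertHardcodedStatusStillValid("DEQUEUED", "PENDING");
```

A check like this would fit in a unit test rather than the hot path, since the point of the PR is to keep the dequeue flow as lean as possible.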

Recommended test plan: deploy to staging, run the sample_pg_activity.py sampler for a 5-minute window, and verify the COMMIT count drop on the engine service + proportional IO:XactSync reduction.

Notes

  • This only covers the dequeue flow (flow #1 from TRI-8450). The remaining four flows (retry-immediate, checkpoint, requeue, cancel) are separate follow-ups.
  • The new enqueueHeartbeatIfNeeded method is deliberately designed for reuse by those follow-up PRs.
  • CI note: the priority.test.ts failure in shard 7 is a flaky ordering assertion unrelated to this change (it compares friendlyId values in dequeue order). The audit check is also pre-existing/unrelated.

Link to Devin session: https://app.devin.ai/sessions/034fe0e7224f49278a2de260203e1377
Requested by: @ericallam
