Merged
#3427 · fix: handle fast-completion race in batch streaming seal check

Fast-completing batch streams no longer trigger false errors

Large batch payloads should no longer fail with API errors when background workers finish processing before the initial streaming loop completes.

A race condition occasionally caused large batch payloads to throw false errors upon completion. If background workers processed the enqueued items faster than the initial streaming loop could finish, the system would clean up temporary tracking data early. This caused the API endpoint to report the batch as incomplete, leading the SDK to retry and eventually fail, even though all child runs executed successfully.

The fallback logic in the streaming service has been updated to recognize a COMPLETED batch status in the database, alongside the standard sealed flag. This ensures that fast-completing batches are properly acknowledged, preventing unnecessary SDK retries and false failures in the parent run.


Problem

When batchTrigger() is called with large payloads, each item's payload is uploaded to R2 server-side during the streaming loop before being enqueued. This makes the loop slow — around 3 seconds per item. Workers pick up and execute each item as it's enqueued, running concurrently with the ongoing stream.
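The loop described above can be sketched as follows. This is a simplified stand-in, not the real service code: `enqueueBatchItem` is named in the description, but `uploadPayloadToR2`, the `enqueued` array, and the item shape are hypothetical.

```typescript
// Hypothetical stand-ins for the real upload and enqueue calls.
const enqueued: string[] = [];

async function uploadPayloadToR2(id: string, payload: unknown): Promise<void> {
  // Placeholder for the slow (~3s per item) server-side R2 upload.
}

async function enqueueBatchItem(id: string): Promise<void> {
  // From this point a waiting worker may execute the item immediately,
  // running concurrently with the rest of the streaming loop.
  enqueued.push(id);
}

async function streamBatchItems(
  items: { id: string; payload: unknown }[]
): Promise<void> {
  for (const item of items) {
    await uploadPayloadToR2(item.id, item.payload); // slow step
    await enqueueBatchItem(item.id); // workers race ahead from here
  }
}
```

Because each item is live as soon as it is enqueued, workers can finish earlier items (and even the last one) while the loop is still running.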

For the last item in the batch, a race exists between the streaming loop finishing and the batch completion cleanup:

  1. The loop enqueues the last item and returns from enqueueBatchItem()
  2. A waiting worker picks up the item almost instantly and executes it
  3. recordSuccess() fires, processedCount hits the expected total, finalizeBatch() runs
  4. cleanup() deletes all Redis keys for the batch, including enqueuedItemsKey
  5. The streaming loop exits and calls getBatchEnqueuedCount() — reads the now-deleted key — returns 0
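The failure in step 5 can be reproduced with an in-memory sketch, where a `Map` stands in for Redis. `cleanup()` and `getBatchEnqueuedCount()` are named in the description; the key string and `recordEnqueue` helper are hypothetical.

```typescript
// In-memory stand-in for the batch's Redis keys.
const redisKeys = new Map<string, number>();
const enqueuedItemsKey = "batch:{id}:enqueuedItems"; // hypothetical key shape

function recordEnqueue(): void {
  redisKeys.set(enqueuedItemsKey, (redisKeys.get(enqueuedItemsKey) ?? 0) + 1);
}

// finalizeBatch()'s cleanup(): deletes every Redis key for the batch.
function cleanup(): void {
  redisKeys.delete(enqueuedItemsKey);
}

// The streaming loop's final read; a deleted key reads as 0.
function getBatchEnqueuedCount(): number {
  return redisKeys.get(enqueuedItemsKey) ?? 0;
}

recordEnqueue(); // step 1: loop enqueues the last item
cleanup();       // steps 2-4: fast worker completes, finalizeBatch() cleans up
// step 5: getBatchEnqueuedCount() now returns 0, not the real count
```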

The count check finds enqueuedCount (0) !== batch.runCount and falls through to a Postgres fallback, but that fallback checks only sealed. The BatchQueue completion path sets status = COMPLETED in Postgres without setting sealed = true (sealing is the streaming endpoint's job), so the fallback misses it too.

This causes the endpoint to return sealed: false. The SDK treats this as retryable and retries up to 5 times with exponential backoff. Each retry calls enqueueBatchItem(), which reads the batch meta key from Redis — also deleted by cleanup() — and throws "Batch not found or not initialized" (500). The final retry gets a 422 because the batch is already COMPLETED, which the SDK does not retry, causing an ApiError to be thrown from await batchTrigger() in the parent run — even though all child runs completed successfully.
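The SDK-side behavior described above can be sketched as a simple retry loop. The helper name, response shape, and backoff base are hypothetical; only the 5-retry limit and exponential backoff come from the description.

```typescript
type SealResponse = { sealed: boolean };

// Retries a seal check with exponential backoff, throwing once every
// attempt has seen sealed: false.
async function waitForSeal(
  check: () => Promise<SealResponse>,
  maxRetries = 5,
  baseDelayMs = 1
): Promise<SealResponse> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await check();
    if (res.sealed) return res;
    // Exponential backoff between attempts.
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  // Surfaces as an ApiError from `await batchTrigger()` in the parent run.
  throw new Error("batch never reported sealed");
}
```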

Fix

In the Postgres fallback inside StreamBatchItemsService, also check status === "COMPLETED" alongside sealed. This covers the fast-completion path where the BatchQueue finishes all runs before the streaming endpoint gets to seal the batch normally.
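The change reduces to one extra condition in the fallback check. This is a sketch with a hypothetical row shape and function name; the real check lives in StreamBatchItemsService against the Postgres batch row.

```typescript
type BatchRow = {
  sealed: boolean;
  status: "PENDING" | "PROCESSING" | "COMPLETED"; // simplified status enum
};

function fallbackConsidersBatchDone(batch: BatchRow): boolean {
  // Before the fix only `sealed` was checked. The fast-completion path sets
  // status = COMPLETED without sealed = true, so it slipped through.
  return batch.sealed || batch.status === "COMPLETED";
}
```

Either signal now counts as done, so a batch completed entirely by the BatchQueue is acknowledged even though the streaming endpoint never sealed it.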

Also switches findUnique to findFirst per webapp convention.

© 2026 · via Gitpulse