Boundary error logs downgraded to warnings
Expected system conditions like client disconnects and quota limits no longer trigger severe error alerts, making genuine bugs easier to spot.
The system was treating expected, handled conditions—like client disconnects, quota limits, or customer data validation failures—as severe errors. This buried genuine bugs under operational noise.
Error logs are now filtered so that only true system failures trigger alerts; expected boundary catches and client aborts are logged as warnings instead. Signals that remain operationally useful, such as billing call failures or slow database queries, are routed into OpenTelemetry metrics, so dashboards and alerts can be tuned independently of the main error feed across the web application and run-engine layers.
Summary
Several boundary catches and customer-input validation paths were logging at error level for failures the system already handles gracefully — disconnect on auth failure, return undefined, skip retries, etc. This batch routes them to warn (which stays in stdout) or counts them as OTel metrics, so visibility is preserved without surfacing them as alerts.
Changes
New helper / pattern:
- `apiBuilder.server.ts`: `logBoundaryError(message, error, url)` inspects the inner error type at loader/action boundary catches; downgrades to `warn` for `AbortError`, `ServiceValidationError`, and `EngineServiceValidationError`.
- `platform.v3.server.ts`: `platform_client.failures_total` OTel counter with `{function, kind}` labels; the helper `recordPlatformFailure(fn, kind)` replaces the previous error-level logging across all `BillingClient` wrappers.
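The helper's behavior can be sketched as follows. The `logBoundaryError(message, error, url)` signature and the three downgraded error classes come from this PR; the logger object and class bodies here are stand-ins for the real webapp implementations, not the actual code.

```typescript
// Stand-in error classes; in the webapp these are the real exported types.
class AbortError extends Error {}
class ServiceValidationError extends Error {}
class EngineServiceValidationError extends Error {}

type Level = "warn" | "error";

// Minimal logger stub; the real one is the webapp's structured logger.
const logger = {
  warn: (msg: string, meta: object) => console.log("WARN", msg, meta),
  error: (msg: string, meta: object) => console.log("ERROR", msg, meta),
};

// Expected, system-handled conditions log at warn; everything else stays
// at error so genuine bugs keep surfacing as alerts.
function levelFor(error: unknown): Level {
  if (
    error instanceof AbortError ||
    error instanceof ServiceValidationError ||
    error instanceof EngineServiceValidationError
  ) {
    return "warn";
  }
  return "error";
}

function logBoundaryError(message: string, error: unknown, url: string): Level {
  const level = levelFor(error);
  logger[level](message, { url, error });
  return level;
}
```

With this in place, a loader catch block can call `logBoundaryError("loader failed", err, request.url)` unconditionally and let the helper decide the severity.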
Log-level downgrades:
- `handleSocketIo.server.ts`: "Worker authentication failed" → warn (system disconnects on failure; refs TRI-8863)
- `waitpointSystem.ts`: when `runStatus === "CANCELED"` in the suspended-without-checkpoint branch, skip the throw and warn instead (benign cancel-vs-resume race; nothing to resume)
- `runAttemptSystem.ts`: `flushedMetadata` parse/validate failures → warn (customer-side data shape; the system returns gracefully)
- `batch-queue/index.ts`: final-attempt failures with `result.skipRetries` → warn (callbacks already opted out of retry, e.g. queue size limit hit)
- `queryPerformanceMonitor.server.ts`: slow queries → warn (observability signal, not an application error)
- `timeoutDeployment.server.ts`: deployment-state mismatch in the timeout job → warn (timeout-vs-completion race)
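The cancel-vs-resume race handling is the least obvious of these, so here is a minimal sketch. Only `runStatus === "CANCELED"` and the `{status: "skipped"}` return shape come from the PR; the function name, parameters, and messages are illustrative.

```typescript
type ResumeResult = { status: "resumed" | "skipped" };

// Hypothetical resume path: a run suspended without a checkpoint normally
// cannot be resumed, but if it was already canceled the mismatch is a
// benign race, not a bug, so we warn and skip instead of throwing.
function resumeSuspendedRun(runStatus: string, hasCheckpoint: boolean): ResumeResult {
  if (!hasCheckpoint) {
    if (runStatus === "CANCELED") {
      console.warn("Run canceled while suspended without checkpoint; nothing to resume");
      return { status: "skipped" };
    }
    // Any other status here is still a genuine invariant violation.
    throw new Error(`Cannot resume run in status ${runStatus} without a checkpoint`);
  }
  return { status: "resumed" };
}
```

The key design choice is that the benign branch is detected by run status, so the throw (and its error-level log) is reserved for states the system does not expect.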
Inner error preservation:
- `waitpointCompletionPacket.server.ts`: `logger.error(uploadError)` before throwing the `ServiceValidationError` wrapper, so the underlying upload error stays visible.
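The pattern is simply: log the inner cause at error level before replacing it with a wrapper whose message is generic. A sketch, with hypothetical function and message names (only `ServiceValidationError` and the log-before-throw ordering come from the PR):

```typescript
class ServiceValidationError extends Error {}

// Capture error-level log lines so the ordering is observable.
const errorLog: string[] = [];
const logger = { error: (msg: string) => errorLog.push(msg) };

// Hypothetical upload path: the wrapper is what callers see, but the
// root cause is logged first so it is not lost behind the generic message.
function storeCompletionPacket(upload: () => void): void {
  try {
    upload();
  } catch (uploadError) {
    logger.error(`waitpoint packet upload failed: ${String(uploadError)}`);
    throw new ServiceValidationError("Failed to store waitpoint completion packet");
  }
}
```

Without the `logger.error` call, the only trace of the real failure would be the wrapper's fixed message, which is useless for diagnosing, say, a storage timeout.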
Why
The pattern across all of these is the same: a boundary log treated any thrown/returned error as error regardless of cause, even when the cause was an expected, system-handled condition (client disconnect, customer quota, race condition, schema validation of customer data). That made the logs noisy and made it harder to spot real bugs.
Where the underlying signal is still useful operationally (slow queries, billing call failures), we route it to OTel metrics with low-cardinality labels so dashboards and alerts can be tuned independently of error logs.
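A dependency-free sketch of what "low-cardinality labels" means for the `recordPlatformFailure(fn, kind)` counter: a plain `Map` stands in for the `platform_client.failures_total` OTel counter, and the example label values are hypothetical, not taken from the codebase.

```typescript
// Stand-in for an OTel counter: one count per {function, kind} label pair.
const failuresTotal = new Map<string, number>();

function recordPlatformFailure(fn: string, kind: string): void {
  // Labels stay low-cardinality: fn is a fixed set of client method names
  // and kind a small set of failure categories, so the label space stays
  // small enough for dashboards and alert rules to group on.
  const key = JSON.stringify({ function: fn, kind });
  failuresTotal.set(key, (failuresTotal.get(key) ?? 0) + 1);
}

// Example: two timeouts on one client method, one server error on another.
recordPlatformFailure("getCurrentPlan", "timeout");
recordPlatformFailure("getCurrentPlan", "timeout");
recordPlatformFailure("reportUsage", "http_5xx");
```

The real implementation would call `counter.add(1, { function: fn, kind })` on an OTel counter instrument instead of mutating a map, but the labeling discipline is the same.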
Test plan
- `pnpm run typecheck --filter webapp`
- `pnpm run build --filter @internal/run-engine`
- Trigger a run on hello-world and verify the task lifecycle is unaffected
- Cancel a suspended run and verify the cancel-while-suspended branch in `waitpointSystem.ts` returns `{status: "skipped"}` instead of throwing
- Confirm the `platform_client.failures_total` counter shows up in metrics with `{function, kind}` labels when the billing client errors