Boundary error logs downgraded to warnings
Expected system conditions like client disconnects and quota limits no longer trigger severe error alerts, making genuine bugs easier to spot.
The system was treating expected, handled conditions—like client disconnects, quota limits, or customer data validation failures—as severe errors. This buried genuine bugs under operational noise.
Error logs are now filtered so that only true system failures trigger alerts; expected boundary catches and client aborts are logged as warnings instead. Signals that remain operationally useful, such as billing call failures or slow database queries, are routed into OpenTelemetry metrics, so dashboards and alerts can be tuned independently of the main error feed across the web application and run-engine layers.
Summary
Several boundary catches and customer-input validation paths were logging at error level for failures the system already handles gracefully — disconnect on auth failure, return undefined, skip retries, etc. This batch routes them to warn (which stays in stdout) or counts them as OTel metrics, so visibility is preserved without surfacing them as alerts.
Changes
New helper / pattern:
- `apiBuilder.server.ts`: `logBoundaryError(message, error, url)` inspects the inner error type at loader/action boundary catches; downgrades to `warn` for `AbortError`, `ServiceValidationError`, and `EngineServiceValidationError`.
- `platform.v3.server.ts`: `platform_client.failures_total` OTel counter with `{function, kind}` labels; the helper `recordPlatformFailure(fn, kind)` replaces the previous error-level logging across all `BillingClient` wrappers.
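The helper's behavior can be sketched as follows. The `logBoundaryError(message, error, url)` signature and the three downgraded error classes come from this PR; the logger object and class bodies here are stand-ins for the real webapp implementations, not the actual code.

```typescript
// Stand-in error classes; in the webapp these are the real exported types.
class AbortError extends Error {}
class ServiceValidationError extends Error {}
class EngineServiceValidationError extends Error {}

type Level = "warn" | "error";

// Minimal logger stub; the real one is the webapp's structured logger.
const logger = {
  warn: (msg: string, meta: object) => console.log("WARN", msg, meta),
  error: (msg: string, meta: object) => console.log("ERROR", msg, meta),
};

// Expected, system-handled conditions log at warn; everything else stays
// at error so genuine bugs keep surfacing as alerts.
function levelFor(error: unknown): Level {
  if (
    error instanceof AbortError ||
    error instanceof ServiceValidationError ||
    error instanceof EngineServiceValidationError
  ) {
    return "warn";
  }
  return "error";
}

function logBoundaryError(message: string, error: unknown, url: string): Level {
  const level = levelFor(error);
  logger[level](message, { url, error });
  return level;
}
```

With this in place, a loader catch block can call `logBoundaryError("loader failed", err, request.url)` unconditionally and let the helper decide the severity.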
Log-level downgrades:
- `handleSocketIo.server.ts`: "Worker authentication failed" → warn (system disconnects on failure; refs TRI-8863)
- `waitpointSystem.ts`: when `runStatus === "CANCELED"` in the suspended-without-checkpoint branch, skip the throw and warn instead (benign cancel-vs-resume race; nothing to resume)
- `runAttemptSystem.ts`: `flushedMetadata` parse/validate failures → warn (customer-side data shape; the system returns gracefully)
- `batch-queue/index.ts`: final-attempt failures with `result.skipRetries` → warn (callbacks already opted out of retry, e.g. queue size limit hit)
- `queryPerformanceMonitor.server.ts`: slow queries → warn (observability signal, not an application error)
- `timeoutDeployment.server.ts`: deployment-state mismatch in the timeout job → warn (timeout-vs-completion race)
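The cancel-vs-resume race handling is the least obvious of these, so here is a minimal sketch. Only `runStatus === "CANCELED"` and the `{status: "skipped"}` return shape come from the PR; the function name, parameters, and messages are illustrative.

```typescript
type ResumeResult = { status: "resumed" | "skipped" };

// Hypothetical resume path: a run suspended without a checkpoint normally
// cannot be resumed, but if it was already canceled the mismatch is a
// benign race, not a bug, so we warn and skip instead of throwing.
function resumeSuspendedRun(runStatus: string, hasCheckpoint: boolean): ResumeResult {
  if (!hasCheckpoint) {
    if (runStatus === "CANCELED") {
      console.warn("Run canceled while suspended without checkpoint; nothing to resume");
      return { status: "skipped" };
    }
    // Any other status here is still a genuine invariant violation.
    throw new Error(`Cannot resume run in status ${runStatus} without a checkpoint`);
  }
  return { status: "resumed" };
}
```

The key design choice is that the benign branch is detected by run status, so the throw (and its error-level log) is reserved for states the system does not expect.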
Inner error preservation:
- `waitpointCompletionPacket.server.ts`: `logger.error(uploadError)` before throwing the `ServiceValidationError` wrapper, so the underlying upload error stays visible.
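The pattern is simply: log the inner cause at error level before replacing it with a wrapper whose message is generic. A sketch, with hypothetical function and message names (only `ServiceValidationError` and the log-before-throw ordering come from the PR):

```typescript
class ServiceValidationError extends Error {}

// Capture error-level log lines so the ordering is observable.
const errorLog: string[] = [];
const logger = { error: (msg: string) => errorLog.push(msg) };

// Hypothetical upload path: the wrapper is what callers see, but the
// root cause is logged first so it is not lost behind the generic message.
function storeCompletionPacket(upload: () => void): void {
  try {
    upload();
  } catch (uploadError) {
    logger.error(`waitpoint packet upload failed: ${String(uploadError)}`);
    throw new ServiceValidationError("Failed to store waitpoint completion packet");
  }
}
```

Without the `logger.error` call, the only trace of the real failure would be the wrapper's fixed message, which is useless for diagnosing, say, a storage timeout.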
Why
The pattern across all of these is the same: a boundary log treated any thrown/returned error as error regardless of cause, even when the cause was an expected, system-handled condition (client disconnect, customer quota, race condition, schema validation of customer data). That made the logs noisy and made it harder to spot real bugs.
Where the underlying signal is still useful operationally (slow queries, billing call failures), we route it to OTel metrics with low-cardinality labels so dashboards and alerts can be tuned independently of error logs.
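A dependency-free sketch of what "low-cardinality labels" means for the `recordPlatformFailure(fn, kind)` counter: a plain `Map` stands in for the `platform_client.failures_total` OTel counter, and the example label values are hypothetical, not taken from the codebase.

```typescript
// Stand-in for an OTel counter: one count per {function, kind} label pair.
const failuresTotal = new Map<string, number>();

function recordPlatformFailure(fn: string, kind: string): void {
  // Labels stay low-cardinality: fn is a fixed set of client method names
  // and kind a small set of failure categories, so the label space stays
  // small enough for dashboards and alert rules to group on.
  const key = JSON.stringify({ function: fn, kind });
  failuresTotal.set(key, (failuresTotal.get(key) ?? 0) + 1);
}

// Example: two timeouts on one client method, one server error on another.
recordPlatformFailure("getCurrentPlan", "timeout");
recordPlatformFailure("getCurrentPlan", "timeout");
recordPlatformFailure("reportUsage", "http_5xx");
```

The real implementation would call `counter.add(1, { function: fn, kind })` on an OTel counter instrument instead of mutating a map, but the labeling discipline is the same.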
Test plan
- `pnpm run typecheck --filter webapp`
- `pnpm run build --filter @internal/run-engine`
- Trigger a run on hello-world and verify the task lifecycle is unaffected
- Cancel a suspended run and verify the cancel-while-suspended branch in `waitpointSystem.ts` returns `{status: "skipped"}` instead of throwing
- Confirm the `platform_client.failures_total` counter shows up in metrics with `{function, kind}` labels when the billing client errors