Merged · Size: XL · Change breakdown: Feature 60% · Performance 25% · Refactor 15%
#27066 · perf(core): Make Wait node fully durable by removing in-memory execution path

Wait node made fully crash-safe with database persistence

Short waits in n8n workflows can now survive crashes and restarts — previously, waits under 65 seconds ran entirely in memory and were lost on failure. The tradeoff is up to 5 seconds of jitter on resume timing.

The Wait node in n8n workflows now persists all time-based pauses to the database, making them recoverable after crashes or restarts. Previously, waits shorter than 65 seconds ran entirely in memory via setTimeout and were invisible to crash recovery, multi-instance failover, and the internal WaitTracker. If the server went down mid-wait, those executions simply vanished.

This change removes the dual-execution-path behavior. All waits — regardless of duration — now immediately call putToWait and are tracked by the WaitTracker polling system. The poll interval was reduced from 60 seconds to 5 seconds, and the lookahead window is now anchored to the database server's clock rather than the local instance's Date.now(). This eliminates clock skew issues between n8n instances in multi-main setups.
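The DB-clock-anchored lookahead described above can be sketched as a per-dialect predicate. This is a minimal illustration; the function name and identifier quoting are assumptions, not the actual repository code:

```typescript
// Illustrative sketch: build the lookahead predicate against the database
// server's clock. `waitTillLookaheadPredicate` is a hypothetical helper,
// not the real n8n ExecutionRepository API.
type Dialect = 'postgres' | 'sqlite';

function waitTillLookaheadPredicate(dialect: Dialect, lookaheadSeconds: number): string {
  // Anchoring to NOW() / datetime('now') means every instance in a
  // multi-main setup evaluates the window against the same clock: the DB's.
  return dialect === 'postgres'
    ? `"waitTill" <= NOW() + INTERVAL '${lookaheadSeconds} seconds'`
    : `"waitTill" <= datetime('now', '+${lookaheadSeconds} seconds')`;
}
```

Because the comparison happens entirely inside the database, two instances with drifting local clocks still agree on which executions fall inside the window.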

The trade-off is precision: in-memory waits could resume within milliseconds of their target time, while DB-persisted waits now resume within ±5 seconds of the target. For most workflow automation scenarios, this jitter is an acceptable exchange for full crash durability. A clock skew warning is logged if the local instance drifts more than 2 seconds from the database server.

This work sits within a broader initiative to make n8n's execution layer fully durable and cluster-aware.

Original GitHub Description

Summary

Removes the dual-execution-path behaviour from the Wait node. Previously, waits shorter than 65 seconds ran entirely in-memory via setTimeout and were never persisted to the database, leaving them entirely invisible to crash recovery, multi-main failover, and the WaitTracker.

What changed:

  • Wait node (Wait.node.ts): removed the < 65s in-memory branch. All time-based waits now call putToWait immediately, regardless of duration.
  • ExecutionRepository (execution.repository.ts): getWaitingExecutions() now uses a DB-server-clock-anchored 15-second lookahead window (NOW() + INTERVAL '15 seconds' / datetime('now', '+15 seconds')) via createQueryBuilder. Added getServerTime() to fetch the DB server's current timestamp (PostgreSQL: CURRENT_TIMESTAMP(3), SQLite: STRFTIME).
  • WaitTracker (wait-tracker.ts): poll interval reduced from 60s → 5s. triggerTime is now computed relative to the DB server clock (via a 60s-TTL cache with elapsed-time interpolation) rather than Date.now(), removing inter-instance clock skew from resume timing. Logs a warning when skew exceeds 2s.
  • PrometheusMetricsService (prometheus-metrics.service.ts): added n8n_db_clock_skew_ms gauge, scraped live on each Prometheus pull.
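The DB-clock cache with elapsed-time interpolation can be sketched as follows. The class name, TTL handling, and warning text are assumptions for illustration, not the actual wait-tracker internals:

```typescript
// Sketch of a cached DB-clock estimate: fetch the DB server time at most
// once per TTL, then interpolate forward using locally elapsed time.
class DbClockEstimator {
  private cachedDbTimeMs: number | null = null;
  private cachedAtMs = 0;
  private static readonly TTL_MS = 60_000;
  private static readonly SKEW_WARN_MS = 2_000;

  constructor(
    private fetchDbTimeMs: () => number, // e.g. SELECT CURRENT_TIMESTAMP(3)
    private nowMs: () => number = Date.now,
  ) {}

  /** Estimated current DB-server time, refreshing at most once per TTL. */
  dbNowMs(): number {
    const now = this.nowMs();
    if (this.cachedDbTimeMs === null || now - this.cachedAtMs > DbClockEstimator.TTL_MS) {
      this.cachedDbTimeMs = this.fetchDbTimeMs();
      this.cachedAtMs = now;
      const skew = Math.abs(this.cachedDbTimeMs - now);
      if (skew > DbClockEstimator.SKEW_WARN_MS) {
        console.warn(`Local clock skew vs DB server: ${skew}ms`);
      }
    }
    // Cached DB time plus local time elapsed since the fetch.
    return this.cachedDbTimeMs + (now - this.cachedAtMs);
  }
}
```

Between refreshes, only the locally elapsed interval is added to the cached DB timestamp, so short-term timer arithmetic never mixes two different clocks.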

Why: The 65s threshold existed because the old 60s poll interval made DB-persisted short waits resume late. Reducing the poll to 5s and adding a 15s lookahead window eliminates the need for the in-memory path entirely. The trade-off is up to ~5s of jitter on short waits in exchange for full crash/restart durability.
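That jitter bound can be sanity-checked with a toy model of the polling schedule (times in ms). This is a simplification that ignores query latency and concurrent pollers; the function is purely illustrative:

```typescript
// Toy model: polls fire every `pollMs`; a poll at time t claims executions
// whose waitTill falls within t + lookaheadMs. If the claiming poll fires
// before waitTill, the tracker can time the resume precisely; otherwise
// lateness is the gap between waitTill and that poll tick.
function resumeLatenessMs(
  createdAtMs: number,
  waitTillMs: number,
  pollMs: number,
  lookaheadMs: number,
): number {
  // First poll at or after creation whose lookahead window reaches waitTill.
  let t = Math.ceil(createdAtMs / pollMs) * pollMs;
  while (waitTillMs > t + lookaheadMs) t += pollMs;
  return Math.max(0, t - waitTillMs);
}
```

With a 5s poll and 15s lookahead, only waits shorter than one poll interval can resume late, and then by less than 5s, matching the ~5s jitter figure above.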

Blast radius: Narrow — only affects time-based Wait node resume and WaitTracker scheduling. No schema changes, no API changes. Safe to revert with a single commit revert; in-flight waiting executions survive revert (they resume ~60s late via the old poll cycle).

How to test:

  1. Create a workflow: Manual Trigger → Wait (15s) → Set node
  2. Execute — verify execution enters waiting status in DB immediately (not after 15s)
  3. Verify it resumes at ~15s (±5s acceptable)
  4. Kill n8n mid-wait, restart — verify it resumes after restart
  5. Scrape /metrics and confirm n8n_db_clock_skew_ms gauge is present
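Step 5 can also be scripted. The base URL assumes a default local install with Prometheus metrics enabled, and the helper names are illustrative:

```typescript
// Check the Prometheus text format for the new gauge: one sample per line,
// metric name first (HELP/TYPE comment lines start with '#').
function metricsContainGauge(metricsText: string, gaugeName: string): boolean {
  return metricsText.split('\n').some((line) => line.startsWith(gaugeName));
}

// Assumes a default local n8n install exposing /metrics on port 5678.
async function checkClockSkewGauge(baseUrl = 'http://localhost:5678'): Promise<boolean> {
  const body = await (await fetch(`${baseUrl}/metrics`)).text();
  return metricsContainGauge(body, 'n8n_db_clock_skew_ms');
}
```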

Related Linear tickets, Github issues, and Community forum posts

<!-- Link to Linear ticket: https://linear.app/n8n/issue/[TICKET-ID] -->

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR labeled with release/backport (if the PR is an urgent fix that needs to be backported)
© 2026 · via Gitpulse