Multi-main workflow activation retries fixed
A bug in n8n's multi-main setup was preventing exponential backoff from kicking in when workflow activation failed. Retries were silently failing because the leader was publishing activation messages to itself instead of actually retrying activation.
In multi-main n8n setups, when a workflow fails to activate during startup, the system is supposed to keep retrying with exponential backoff — waiting longer between each attempt. Instead, a bug was causing those retries to silently fail and stop prematurely.
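To make "exponential backoff" concrete, here is a minimal sketch of how such a retry schedule is typically computed. The constants (1s base, doubling, 60s cap) are illustrative, not n8n's actual retry configuration.

```typescript
// Illustrative exponential backoff: each retry waits twice as long as the
// previous one, up to a cap. These constants are hypothetical.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s, ... capped at 60s
const delays = [0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt));
```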
The root cause: when retrying activation, the code called the add method without explicitly disabling publishing. Because publishing defaults to on, the leader sent a pubsub message to itself, which returned successfully without actually activating the workflow. The retry loop saw no error, assumed success, and stopped all further attempts. If the leader then failed to activate the workflow while handling its own message, the workflow stayed permanently deactivated with no additional retries.
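The buggy control flow can be sketched as follows. All class and field names here are illustrative stand-ins, not n8n's actual code; the point is that the publish path returns cleanly, so the retry loop cannot tell that nothing was activated.

```typescript
// Simplified sketch of the bug -- names are hypothetical, not n8n's code.
type AddOptions = { shouldPublish?: boolean };

class BuggyActivationSketch {
  publishedMessages = 0;
  activationErrors = 0;

  add(workflowId: string, opts: AddOptions = {}): void {
    const shouldPublish = opts.shouldPublish ?? true; // defaults to true
    if (shouldPublish) {
      // Leader publishes a pubsub message to itself and returns without
      // throwing -- the caller cannot observe any activation failure.
      this.publishedMessages += 1;
      return;
    }
    // Direct activation path (simulated here as always failing).
    this.activationErrors += 1;
    throw new Error(`activation failed for ${workflowId}`);
  }

  retryActivation(workflowId: string): string {
    try {
      this.add(workflowId); // BUG: missing { shouldPublish: false }
      return "assumed success"; // the backoff loop stops here
    } catch {
      return "will retry";
    }
  }
}
```

Running the retry once returns "assumed success" even though the workflow was never activated, which is exactly why the backoff sequence never continued.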
Two changes fix this. First, retries now explicitly set shouldPublish to false, so activation happens directly rather than through the pubsub mechanism. Second, when leadership changes hands, any queued activation timers from the former leader are now cleared, preventing stale retries from firing on a node that no longer has authority.
The result is reliable workflow activation recovery in multi-main clusters. Failed activations will now properly exhaust the exponential backoff sequence instead of giving up after the first silent failure.
View Original GitHub Description
Summary
On multi-main, when a workflow fails to activate on startup, we are supposed to keep retrying activation with exponential backoff. Instead, on retry we currently call this.add() without passing { shouldPublish: false }, so shouldPublish defaults to true, which causes the leader to publish a pubsub message to itself and return successfully without actually activating the workflow. The retry loop sees no error and stops, thinking it succeeded. If the leader then fails to activate the workflow when handling its own pubsub message, we permanently deactivate with no further retries, bypassing the exponential backoff mechanism.
This PR fixes the above and also clears queued activation timers on leader stepdown, preventing a former leader from firing stale retries.
Testing
Workflow activation is ancient and in dire need of refactoring. Mocking setTimeout, publisher, instanceSettings.isMultiMain, etc. just to assert add was called with the right 4th arg would be restating the implementation in test form, which doesn't really prove anything. We could find some way to orchestrate this in e2e, but I think it's pragmatic to go ahead as-is until we streamline workflow publication. Please let me know if you disagree.
Related Linear tickets, Github issues, and Community forum posts
Review / Merge checklist
- PR title and summary are descriptive. (conventions)
- Docs updated or follow-up ticket created.
- Tests included.
- PR Labeled with Backport to Beta, Backport to Stable, or Backport to v1 (if the PR is an urgent fix that needs to be backported).