Merged · Size: L · Change breakdown: Bug Fix 55%, Feature 45%
#67426 BlueBubbles/catchup: per-message retry cap for wedged messages (#66870)

BlueBubbles catchup now skips wedged messages after 10 failed retries

A new retry ceiling stops catchup replay from stalling indefinitely on malformed messages, and the same change fixes three silent bugs that caused duplicate replies on gateway restarts.

The catchup replay mechanism for BlueBubbles iMessage handling had a wedge problem: if a message payload failed processing repeatedly, the cursor stalled at that message's timestamp and never advanced. Every gateway restart hit the same failure again.

The fix adds a configurable retry ceiling, catchup.maxFailureRetries. A message whose failure count reaches the ceiling (default 10, clamped to 1-1000) is marked "given up" and skipped on sight. The cursor advances past it, and catchup resumes normal processing. Operators see a WARN log when a message crosses the threshold.
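
The ceiling and the per-GUID counters can be pictured with a small sketch. Only the default of 10, the clamp range, and the count >= max rule come from the PR description; the function and type names here are hypothetical and the real code may be structured differently.

```typescript
// Hypothetical names throughout. Only the default (10), the clamp range
// [1, 1000], and the "count >= max" rule come from the PR description.
const DEFAULT_MAX_FAILURE_RETRIES = 10;
const MIN_RETRIES = 1;
const MAX_RETRIES = 1000;

/** Resolve a user-supplied ceiling into an integer within [1, 1000]. */
function resolveMaxFailureRetries(raw: unknown): number {
  if (typeof raw !== "number" || !Number.isFinite(raw)) {
    return DEFAULT_MAX_FAILURE_RETRIES;
  }
  return Math.min(MAX_RETRIES, Math.max(MIN_RETRIES, Math.floor(raw)));
}

/** Per-GUID failure counts, as they might be persisted next to the cursor. */
type FailureCounts = Record<string, number>;

/**
 * Record one failed attempt for a GUID.
 * Returns true only on the attempt that reaches the ceiling, which is the
 * point where a single WARN would be emitted.
 */
function recordFailure(counts: FailureCounts, guid: string, max: number): boolean {
  const next = (counts[guid] ?? 0) + 1;
  counts[guid] = next;
  return next === max;
}

/** A GUID is "given up" once its count has reached the ceiling. */
function isGivenUp(counts: FailureCounts, guid: string, max: number): boolean {
  return (counts[guid] ?? 0) >= max;
}
```

Returning the transition as a boolean keeps the logging decision with the caller, so the warning fires once per message rather than on every later skip.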

Testing this fix surfaced three additional latent bugs. First, a re-entrant file lock in the persistent deduplication layer let concurrent callers read stale data and silently overwrite each other's writes, which produced duplicate replies after a restart. Second, a file naming change between beta versions left the dedupe store empty on upgrade, so every recently handled message was replayed. Third, balloon events (URL previews, stickers) bypassed the debouncer during catchup replay because they have different GUIDs from their parent text messages, again generating duplicate replies.

All four issues are resolved. Live testing confirmed clean behavior across stop/restart cycles with zero replayed messages.

Original GitHub Description

Summary

What started as a retry cap for #66870 uncovered and fixed three latent bugs in the catchup/dedupe plumbing from #66857 and #66230 that would have caused duplicate replies on every gateway restart for any user with catchup enabled.

1. Per-message retry cap (#66870)

  • Adds catchup.maxFailureRetries (default 10, clamped [1, 1000]) so a persistently-failing message no longer wedges the catchup cursor forever.
  • Persists per-GUID failure counts in the cursor file. count >= max marks the GUID as "given up": catchup skips it on sight without another processMessage attempt, and the cursor advances past it.
  • Correctly handles the mixed case — an earlier still-retrying GUID plus a later given-up GUID: cursor holds below the still-retrying message while the given-up one is skipped.
  • Emits a distinct WARN on give-up transitions for operator visibility.
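
The mixed case in the list above can be illustrated with a minimal sketch of cursor advancement. It assumes the cursor is an exclusive lower bound on message timestamps (catchup fetches messages newer than the cursor); the type and function names are hypothetical and not taken from the PR.

```typescript
// Hypothetical types; the PR does not show the real cursor representation.
type Outcome = "processed" | "retrying" | "givenUp";

interface ReplayResult {
  guid: string;
  timestamp: number; // message timestamp, e.g. epoch milliseconds
  outcome: Outcome;
}

/**
 * Compute the next cursor after one catchup pass.
 * Assumption: the cursor is exclusive, so the next fetch returns messages
 * with timestamp > cursor. The cursor may move past processed and given-up
 * messages, but must stay below the earliest message that is still retrying.
 */
function nextCursor(previous: number, results: ReplayResult[]): number {
  const retrying = results
    .filter((r) => r.outcome === "retrying")
    .map((r) => r.timestamp);
  const ceiling = retrying.length > 0 ? Math.min(...retrying) : Infinity;

  let cursor = previous;
  for (const r of results) {
    if (r.timestamp < ceiling && r.timestamp > cursor) {
      cursor = r.timestamp;
    }
  }
  return cursor;
}
```

With a processed message at 110, a still-retrying one at 120, and a given-up one at 130, this returns 110: the given-up message is not attempted again, yet the cursor waits for the retrying one.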

2. Lost-update race in persistent dedupe (found during live testing)

  • Root cause: the re-entrant file lock in file-lock.ts gave concurrent callers for the same file immediate access instead of serializing them. Two checkAndRecordInner calls (inbound user message + outbound agent reply) would both read the same stale file, then the last writer silently overwrote the first writer's additions. The in-memory cache masked this within a process lifetime, but after restart the lost GUID caused catchup to replay already-handled messages — producing duplicate replies.
  • Fix: added an in-process write queue per file path in persistent-dedupe.ts so read-modify-write cycles targeting the same dedupe file are serialized. The file lock continues to guard cross-process contention.
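
A per-path write queue of the kind described above can be sketched as a chain of promises keyed by file path. This is an illustration under stated assumptions, not the code in persistent-dedupe.ts: enqueue, dedupeStore, and addGuid are hypothetical names, and an in-memory map stands in for the dedupe file.

```typescript
// Hypothetical sketch: one promise chain per file path serializes tasks.
const queues = new Map<string, Promise<unknown>>();

/** Run `task` only after every earlier task for the same path has settled. */
function enqueue<T>(filePath: string, task: () => Promise<T>): Promise<T> {
  const previous = queues.get(filePath) ?? Promise.resolve();
  const run = previous.then(() => task());
  // The stored tail never rejects, so one failed task cannot block the queue.
  const tail = run.catch(() => undefined);
  queues.set(filePath, tail);
  // Drop the entry once this was the last queued task for the path.
  void tail.then(() => {
    if (queues.get(filePath) === tail) queues.delete(filePath);
  });
  return run;
}

// In-memory stand-in for the dedupe file, used only to demonstrate the race.
const dedupeStore = new Map<string, string[]>();

/** A read-modify-write cycle with simulated I/O latency between read and write. */
async function addGuid(filePath: string, guid: string): Promise<void> {
  const current = dedupeStore.get(filePath) ?? []; // read
  await new Promise<void>((resolve) => setTimeout(resolve, 5)); // simulated I/O
  dedupeStore.set(filePath, [...current, guid]); // write
}
```

Calling addGuid twice concurrently for the same path loses one GUID, because both calls read the same stale list. Wrapping each call as enqueue(path, () => addGuid(path, guid)) keeps both. The queue only covers callers inside one process, which matches the description's note that the file lock still guards cross-process contention.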

3. Dedupe file naming migration gap (found during live testing)

  • The dedupe file naming changed from ${safe}.json to ${safe}__${hash}.json between beta iterations. Upgrading started with an empty dedupe file and replayed the entire catchup window, producing duplicate replies for every recently-handled message.
  • Fix: one-time migration in inbound-dedupe.ts that renames the legacy file on first access. Also added a warmupBlueBubblesInboundDedupe call in catchup before the fetch so the migration and memory warmup run eagerly, not only when processMessage happens to be called.
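
The rename-on-first-access idea can be sketched as follows. The two file name patterns come from the description; everything else is an assumption for illustration. In particular, FileOps is a hypothetical stand-in for real filesystem calls, and the hash is taken as a parameter because the description does not say how it is derived.

```typescript
// Hypothetical abstraction over the filesystem so the sketch is self-contained.
interface FileOps {
  exists(path: string): boolean;
  rename(from: string, to: string): void;
}

/**
 * One-time migration from `${safe}.json` to `${safe}__${hash}.json`.
 * Idempotent: does nothing when the new file already exists or when there
 * is no legacy file to move.
 */
function migrateLegacyDedupeFile(
  dir: string,
  safe: string,
  hash: string,
  ops: FileOps,
): "migrated" | "skipped" {
  const legacy = `${dir}/${safe}.json`;
  const current = `${dir}/${safe}__${hash}.json`;
  if (ops.exists(current) || !ops.exists(legacy)) {
    return "skipped";
  }
  ops.rename(legacy, current);
  return "migrated";
}
```

Checking for the new file first means an existing store is never overwritten by an older legacy file, and a second call is a no-op.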

4. Balloon events bypassing debouncer (found during live testing)

  • The live webhook path coalesces text + URL-preview balloon events via the debouncer. Catchup processes each query result individually. A URL balloon has a different GUID from its parent text message and no balloonBundleId in the query API response, so catchup replayed it as a standalone message — producing a duplicate reply.
  • Fix: catchup now skips messages with associatedMessageGuid set (tapbacks, reactions, balloons). Threaded replies use threadOriginatorGuid instead and are unaffected.
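
The filter described above amounts to a one-line predicate. The field names associatedMessageGuid and threadOriginatorGuid come from the description; the CatchupMessage shape and the shouldReplay name are hypothetical, and treating an empty string as "not set" is an assumption.

```typescript
// Hypothetical, trimmed-down shape of a message returned by the catchup query.
interface CatchupMessage {
  guid: string;
  text?: string | null;
  associatedMessageGuid?: string | null; // set on tapbacks, reactions, balloons
  threadOriginatorGuid?: string | null; // set on threaded replies
}

/**
 * Decide whether catchup should replay a message.
 * Messages that point at another message via associatedMessageGuid are
 * skipped; threaded replies use threadOriginatorGuid and still pass.
 */
function shouldReplay(message: CatchupMessage): boolean {
  return !message.associatedMessageGuid;
}

/** Keep only the messages catchup should hand to processing. */
function filterReplayable(messages: CatchupMessage[]): CatchupMessage[] {
  return messages.filter(shouldReplay);
}
```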

Fixes #66870.

Live testing

Dogfooded on a live BlueBubbles install with real iMessage traffic across multiple stop/restart cycles:

  • openclaw doctor — clean after upgrade from 2026.4.14 to beta+retry-cap
  • Live iMessage send → single reply, no duplicate
  • Stop gateway → restart → catchup runs → replayed=0 (dedupe correctly recognizes the live-handled message)
  • Verified dedupe file contains both inbound and outbound GUIDs after the write queue fix (previously only the outbound survived the race)
  • Verified legacy default.json renamed to default__37a8eec1ce19.json on first startup after migration fix
  • Verified replayed=0 fetched=0 on a clean bounce with no intervening messages (cursor fully caught up, no stale leftovers)
  • Verified balloon filter: associatedMessageGuid messages are tapbacks/reactions only (checked 200 messages), threaded replies use threadOriginatorGuid and are not filtered

Automated tests

  • pnpm test extensions/bluebubbles/ — 425 passed
  • pnpm tsgo — green
  • pnpm check — 0 warnings, 0 errors
  • pnpm config:docs:check / pnpm plugin-sdk:api:check — baselines match
  • 14 new tests for retry cap (counter increment, give-up transition, skip-on-sight, stickiness, mixed earlier/later failures, counter clear on success, legacy cursor compat, stale entry pruning, config clamping, sanitization)
  • Existing 22 catchup tests + 5 dedupe persistence tests pass unchanged

🤖 Generated with Claude Code

© 2026 · via Gitpulse