Reply runs no longer crash on resolved secrets
Embedded reply runs can now keep already-resolved skill and memory secret config stable instead of discarding it for raw SecretRefs, preventing crashes from secondary services.
The problem was subtle but destructive: when embedded reply runs started with an already-resolved config from the gateway, deeper skill and memory helpers could overwrite it by re-reading raw SecretRef-backed settings. This meant a secondary integration like Whisper or memory search could crash the entire reply path even when the live environment variables were already available.
Skill and memory secret resolution logic now respects the caller's already-resolved config. When the incoming config is resolved but the runtime snapshot still contains raw SecretRefs, the resolved version takes precedence. Secondary secrets that reference environment variables use the daemon's env when available rather than re-attempting resolution. ApiKey SecretRefs are no longer eagerly resolved if the primary environment is already provided by the host.
The changes span the skill runtime, memory host SDK, and config resolution layers. This keeps replies stable even when secondary skill settings are stored as env-backed SecretRefs and the gateway already has the live values.
The fix is scoped to runtime behavior only—no doctor or path-state repairs were included, maintaining a tight boundary around the actual failure path.
View Original GitHub Description
Summary
- Problem: embedded reply runs could resolve gateway secrets once, then later throw that resolved config away and reread raw SecretRef-backed skill or memory settings.
- Why it matters: a secondary integration like Whisper or memory search could crash the whole reply path even when the gateway already had the live env vars.
- What changed: keep caller-resolved skill config stable, avoid eagerly resolving secondary skill apiKey refs when host env already provides the needed primary env, and let memory secret inputs reuse daemon env for env-backed refs.
- What did NOT change (scope boundary): no doctor/path-state repair in this PR; this is only the runtime failure-path fix.
Change Type (select all)
- Bug fix
- Feature
- Refactor required for the fix
- Docs
- Security hardening
- Chore/infra
Scope (select all touched areas)
- Gateway / orchestration
- Skills / tool execution
- Auth / tokens
- Memory / storage
- Integrations
- API / contracts
- UI / DX
- CI/CD / infra
Linked Issue/PR
- Closes #
- Related #63217
- This PR fixes a bug or regression
Root Cause (if applicable)
- Root cause: reply-time code could start from a resolved config, but deeper skill and memory helpers could overwrite it with the active runtime snapshot or eagerly resolve raw SecretRefs again.
- Missing detection / guardrail: we had tests for individual resolution helpers, but not for keeping already-resolved caller config stable through embedded reply and secondary-service paths.
- Contributing context (if known): the bug surfaced as generic Telegram failures because secondary services were allowed to hard-fail the whole reply before their env was actually needed.
Regression Test Plan (if applicable)
- Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
- Target test or file:
src/agents/skills.test.ts,src/agents/pi-embedded-runner/skills-runtime.test.ts,src/memory-host-sdk/host/secret-input.test.ts - Scenario the test should lock in: embedded reply code keeps a caller-resolved config instead of reintroducing raw SecretRefs, and env-backed secondary secrets do not crash when the daemon env already provides them.
- Why this is the smallest reliable guardrail: the failure happens in shared runtime helpers below Telegram/channel code, so tight runtime-seam tests are enough to prove the root cause.
- Existing test that already covers this (if any): prior tests covered single helpers but not this end-to-end runtime stability behavior.
- If no new test is added, why not: N/A
User-visible / Behavior Changes
- Replies no longer crash just because a secondary skill or memory secret setting is stored as an env SecretRef while the gateway already has the resolved env value.
- If a real unresolved SecretRef remains, the earlier PR's concrete error surfacing still applies.
Diagram (if applicable)
Before:
[reply run] -> [resolved config] -> [deep helper rereads raw SecretRef config] -> [secondary service throws] -> [reply fails]
After:
[reply run] -> [resolved config] -> [deep helper keeps resolved config / uses host env] -> [secondary service can run or stay dormant] -> [reply continues]
Security Impact (required)
- New permissions/capabilities? (
No) - Secrets/tokens handling changed? (
Yes) - New/changed network calls? (
No) - Command/tool execution surface changed? (
No) - Data access scope changed? (
No) - If any
Yes, explain risk + mitigation: this only changes which already-available secret source wins at runtime. It prefers the gateway/daemon env or already-resolved caller config over re-reading raw SecretRefs, and adds regression coverage for that behavior.
Repro + Verification
Environment
- OS: Ubuntu on
mb-server - Runtime/container: repo worktree with shared
node_modules - Model/provider: N/A
- Integration/channel (if any): reply-runtime path, skill runtime, memory host runtime
- Relevant config (redacted): env-backed SecretRefs for secondary skill/memory settings
Steps
- Build an embedded reply/skill runtime config that starts resolved.
- Keep the active runtime snapshot raw or set a secondary env-backed SecretRef.
- Run the skill/memory helper.
Expected
- The helper keeps the resolved caller config or uses daemon env, and does not reintroduce a raw SecretRef failure.
Actual
- Before this patch, the helper could overwrite the resolved config with raw SecretRefs and fail the reply.
Evidence
- Failing test/log before + passing after
- Trace/log snippets
- Screenshot/recording
- Perf numbers (if relevant)
Human Verification (required)
- Verified scenarios: targeted
mb-servertests for skill runtime stability, embedded-run skill runtime, and memory secret-input resolution. - Edge cases checked: raw runtime snapshot still present, host env already providing the primary env, env-backed memory SecretRef missing from env still throws.
- What you did not verify: full Telegram end-to-end in CI; this PR stays at the shared runtime seam where the bug actually lives.
Review Conversations
- I replied to or resolved every bot review conversation I addressed in this PR.
- I left unresolved only the conversations that still need reviewer or maintainer judgment.
Compatibility / Migration
- Backward compatible? (
Yes) - Config/env changes? (
No) - Migration needed? (
No) - If yes, exact upgrade steps:
Risks and Mitigations
- Risk: preferring caller config over runtime snapshot could preserve a stale caller config in cases where the caller intentionally wanted the current snapshot.
- Mitigation: the fallback only prefers caller config when the runtime snapshot still contains raw skill SecretRefs and the caller config does not; otherwise existing snapshot behavior stays intact.
- Risk: delaying
apiKeySecretRef resolution could skip a needed validation for some skill paths.- Mitigation: resolution still happens when the override is actually needed; tests cover both the injected and already-present env cases.