Merged · Size: XL
Change breakdown: Feature 40% · Testing 30% · CI/CD 20% · Refactor 10%
#65664 · qa: salvage GPT-5.4 parity proof slice

Mock-driven parity gate compares GPT-5.4 vs Opus 4.6 without real credentials

pashpashpash · Apr 13, 2026

A new CI workflow runs the GPT-5.4 / Opus 4.6 parity gate entirely against a mock server, staging placeholder credentials so the comparison can execute without burning real API budget. The mock server now routes Anthropic /v1/messages alongside OpenAI formats, and second-wave parity scenarios are guarded by tool-call assertions that catch fake progress.

Running a model parity comparison used to mean feeding real API keys to a CI job and hoping the budget held. Real providers mean real latency, real rate limits, and real cost — which made the parity gate brittle for regular use. This PR delivers a mock-driven alternative: the GitHub Actions workflow runs both model lanes against a local mock server with staged placeholder credentials, generates the parity report, and uploads the artifacts. No real providers touched.
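The credential-staging step can be pictured as a small helper along these lines. This is an illustrative sketch, not the actual qa-lab code: the `stageMockAuth` name, the `auth.json` file name, and the directory layout are assumptions; only the placeholder key value comes from this PR.

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Deliberately not shaped like a real API key, so secret scanners ignore it.
const PLACEHOLDER_KEY = "qa-mock-not-a-real-key";

// Hypothetical helper: write a placeholder auth profile per provider into
// the agent directory so the parity gate never needs real credentials.
function stageMockAuth(agentDir: string, providers: string[]): void {
  for (const provider of providers) {
    const dir = join(agentDir, provider);
    mkdirSync(dir, { recursive: true });
    writeFileSync(
      join(dir, "auth.json"),
      JSON.stringify({ apiKey: PLACEHOLDER_KEY }, null, 2),
    );
  }
}
```

In a CI workflow, a step like this would run before the mock server lanes start, e.g. `stageMockAuth(agentDir, ["openai", "anthropic"])`.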

The mock server gains an Anthropic /v1/messages route alongside the existing OpenAI lane, with proper tool_result ordering in the messages adapter. Mock auth profiles are staged per provider in each agent directory the QA suite uses — the placeholder key qa-mock-not-a-real-key is intentionally not shaped like a real API key to avoid triggering secret scanners.
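The tool_result ordering concern can be sketched as follows. The types and the `orderToolResults` helper here are hypothetical, not the actual messages adapter; the constraint being modeled is that Anthropic's Messages API expects tool_result blocks at the start of the user message that follows a tool use.

```typescript
// Simplified content-block shapes (assumed for illustration).
type ContentBlock =
  | { type: "tool_result"; tool_use_id: string; content: string }
  | { type: "text"; text: string };

interface AnthropicMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Reorder a user message so tool_result blocks lead the content array.
function orderToolResults(msg: AnthropicMessage): AnthropicMessage {
  if (msg.role !== "user") return msg;
  const toolResults = msg.content.filter((b) => b.type === "tool_result");
  const rest = msg.content.filter((b) => b.type !== "tool_result");
  return { ...msg, content: [...toolResults, ...rest] };
}
```

An adapter translating OpenAI-style tool messages into Anthropic format would apply a pass like this to each converted user message.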

Ten scenarios now make up the parity pack, with a new instruction-followthrough scenario verifying the agent reads repo instruction files before writing. Second-wave scenarios like subagent-handoff, subagent-fanout-synthesis, and config-restart-capability-flip are guarded by tool-call assertions that require real plannedToolName evidence from the mock's /debug/requests log. Image-understanding scenarios use imageInputCount as their tool-call proxy since vision processing happens inside the provider. Memory-recall stays prose-only by design — the model can pull facts either through a memory-search tool or directly from conversation context, and forcing a plannedToolName assertion would test the harness, not the models.
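The shape of those tool-call assertions might look something like this minimal sketch. The `DebugRequest` shape, `assertToolCalled` helper, and the example tool name are assumptions for illustration; only `plannedToolName` and the /debug/requests log come from the PR text.

```typescript
// Assumed shape of one entry in the mock server's /debug/requests log.
interface DebugRequest {
  scenario: string;
  plannedToolName?: string;
}

// Fail the scenario unless the log contains real tool-planning evidence,
// so prose-only "fake progress" cannot pass the gate.
function assertToolCalled(
  log: DebugRequest[],
  scenario: string,
  expectedTool: string,
): void {
  const hit = log.some(
    (r) => r.scenario === scenario && r.plannedToolName === expectedTool,
  );
  if (!hit) {
    throw new Error(
      `scenario "${scenario}" has no plannedToolName evidence for "${expectedTool}"`,
    );
  }
}
```

A scenario like subagent-handoff would pass only if the log shows the expected tool was actually planned, regardless of how convincing the model's prose output looks.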

The parity report now verifies that the run metadata on each summary matches the caller-supplied candidate/baseline labels, throwing a QaParityLabelMismatchError if the paths get swapped. Required scenario failures fail the gate regardless of baseline parity — a scenario that fails on both sides was previously silent. The report title is parametrized so it renders accurately for any provider/model pair, not only the OpenAI versus Anthropic default.
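The label check reduces to a guard along these lines. The `SuiteSummary` shape and `verifyLabels` signature are sketched from the PR description, not copied from the actual report code; only the error name `QaParityLabelMismatchError` appears in the PR.

```typescript
class QaParityLabelMismatchError extends Error {}

// Assumed minimal shape of a suite summary's run metadata.
interface SuiteSummary {
  runMetadata: { label: string };
}

// Throw if either summary's run metadata disagrees with the
// caller-supplied labels — the symptom of swapped summary paths.
function verifyLabels(
  candidate: SuiteSummary,
  baseline: SuiteSummary,
  expected: { candidate: string; baseline: string },
): void {
  if (
    candidate.runMetadata.label !== expected.candidate ||
    baseline.runMetadata.label !== expected.baseline
  ) {
    throw new QaParityLabelMismatchError(
      `summary labels ${candidate.runMetadata.label}/${baseline.runMetadata.label} ` +
        `do not match expected ${expected.candidate}/${expected.baseline}`,
    );
  }
}
```

Catching the swap at report time means a mislabeled run fails loudly instead of producing a parity verdict attributed to the wrong model.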

These changes land in the qa-lab extension and the qa-channel integration points.

Original GitHub Description

Summary

  • salvage the conflicted #65224 parity-proof work onto current main
  • keep the QA-lab parity proof slice only: scenarios, parity-report semantics, Anthropic mock lane, summary run metadata, mock auth staging, and the parity-gate workflow
  • intentionally leave the stale parity program docs out of this rescue so they can be refreshed separately against the current merged state

What this includes

  • second-wave parity scenarios and strengthened tool-call assertions
  • Anthropic /v1/messages mock routing and related mock-server/report coverage
  • qa-suite-summary.json run metadata and parity-report label/provenance checks
  • mock auth profile staging so the parity gate can run without real provider credentials
  • .github/workflows/parity-gate.yml

Validation

  • CI=1 pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/agentic-parity-report.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/qa-gateway-config.test.ts extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/gateway-child.test.ts
  • pnpm check

Context

Supersedes the proof/harness portion of #65224. The stale parity narrative docs were dropped from this rescue branch on purpose.
