Agent text claims now validated against actual workflow changes
The AI workflow builder evaluation system can now verify that agent text explanations match what actually changed in the workflow JSON, adding a new layer of validation to catch hallucinated claims.
n8n's AI workflow builder evaluation system previously assessed whether generated workflows were correct, but the agent's explanatory text was discarded during evaluation. This created a gap: an agent could claim to have "added a Slack node" or "configured POST method" without those changes actually appearing in the workflow.
A new binary check called response_matches_workflow_changes fills this gap. It compares the before and after workflow JSON to determine what actually changed, then verifies each specific claim the agent made in its text response. If the agent claims to have connected nodes that weren't connected, the check fails.
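The structural half of such a check can be sketched as a plain before/after diff over the workflow JSON (the actual check is LLM-powered, per the PR description). The names below (`Workflow`, `computeWorkflowDiff`, etc.) are illustrative assumptions, not n8n's real internals:

```typescript
// Minimal shapes for a workflow snapshot (illustrative, not n8n's schema).
interface WorkflowNode {
  name: string;
  type: string;
  parameters?: Record<string, unknown>;
}

interface Workflow {
  nodes: WorkflowNode[];
  connections: Record<string, unknown>;
}

interface WorkflowDiff {
  addedNodes: string[];
  removedNodes: string[];
  modifiedNodes: string[];
}

// Compare two workflow snapshots by node name: nodes present only in
// "after" were added, only in "before" were removed, and nodes in both
// whose serialized parameters differ were modified.
function computeWorkflowDiff(before: Workflow, after: Workflow): WorkflowDiff {
  const beforeByName = new Map(before.nodes.map((n) => [n.name, n]));
  const afterByName = new Map(after.nodes.map((n) => [n.name, n]));

  const addedNodes = after.nodes
    .filter((n) => !beforeByName.has(n.name))
    .map((n) => n.name);
  const removedNodes = before.nodes
    .filter((n) => !afterByName.has(n.name))
    .map((n) => n.name);
  const modifiedNodes = after.nodes
    .filter((n) => {
      const prev = beforeByName.get(n.name);
      return (
        prev !== undefined &&
        JSON.stringify(prev.parameters ?? {}) !==
          JSON.stringify(n.parameters ?? {})
      );
    })
    .map((n) => n.name);

  return { addedNodes, removedNodes, modifiedNodes };
}
```

A diff like this would then be handed to the LLM judge alongside the agent's text, so the judge grades claims ("added a Slack node") against concrete changes rather than raw JSON blobs.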
The agent's text response is now captured throughout the evaluation pipeline — from the initial generation through to artifact saving. Evaluation runs now produce response.txt files containing the agent's full text explanation alongside the generated workflow.
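Capturing rather than discarding the stream is the key mechanical change here. A minimal sketch, assuming a chunked async stream (the `StreamChunk` shape and `collectAgentText` helper are hypothetical, not the real pipeline's types):

```typescript
// Hypothetical chunk shape: a generation stream interleaves text with
// other output kinds.
interface StreamChunk {
  type: "text" | "workflow" | "tool";
  content: string;
}

// Instead of draining the generator and dropping everything (the old
// consumeGenerator behavior), accumulate the text chunks so they can be
// written to response.txt and passed to downstream checks.
async function collectAgentText(
  stream: AsyncIterable<StreamChunk>,
): Promise<string> {
  const parts: string[] = [];
  for await (const chunk of stream) {
    if (chunk.type === "text") {
      parts.push(chunk.content);
    }
  }
  return parts.join("");
}
```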
This matters for AI assistant reliability. As agents are evaluated on more complex tasks, their explanations become part of the user-facing experience. Users who ask "what did you change?" deserve answers that match reality.
The work lives in the ai-workflow-builder evaluation package, part of a broader initiative to build comprehensive binary checks for AI workflow builder quality assurance.
Original GitHub Description
Summary
Enables evaluating the AI workflow builder on both text output and workflow changes in the same conversation turn. Adds a new binary check (LLM-powered) that verifies all claims in the agent's text response are reflected in actual workflow JSON changes.
Changes:
- Captures the agent's text response from the stream during evaluation (instead of discarding it via `consumeGenerator`)
- Threads `agentTextResponse` through the full evaluation pipeline: `GenerationResult` → `EvaluationContext` → `BinaryCheckContext` → `ExampleResult`
- Saves agent text response to `response.txt` in evaluation artifacts
- Adds new `response_matches_workflow_changes` LLM binary check that verifies the agent's text claims match actual workflow changes by comparing before and after workflow JSON
- Passes `existingWorkflow` (pre-turn state) into `BinaryCheckContext` for diff-based evaluation
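The threading described above can be pictured as a chain of context shapes. This is a sketch under stated assumptions: only `agentTextResponse` and `existingWorkflow` are named in the PR; every other field below is illustrative.

```typescript
// Illustrative pipeline shapes. The real n8n types carry more fields;
// only agentTextResponse and existingWorkflow come from the PR text.
interface GenerationResult {
  workflowJson: object;
  agentTextResponse: string;
}

interface EvaluationContext {
  generation: GenerationResult;
  existingWorkflow: object; // pre-turn state, enables diff-based checks
}

interface BinaryCheckContext {
  agentTextResponse: string;
  workflowBefore: object;
  workflowAfter: object;
}

// Project the evaluation context down to what a binary check needs:
// the agent's claims plus both workflow snapshots.
function toBinaryCheckContext(ctx: EvaluationContext): BinaryCheckContext {
  return {
    agentTextResponse: ctx.generation.agentTextResponse,
    workflowBefore: ctx.existingWorkflow,
    workflowAfter: ctx.generation.workflowJson,
  };
}
```

The point of passing `existingWorkflow` explicitly is that a check can only validate "what changed" claims if it sees both sides of the turn, not just the final workflow.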
Related Linear tickets, Github issues, and Community forum posts
https://linear.app/n8n/issue/AI-2257/feature-create-binary-1-check-for-ai-workflow-builder
Review / Merge checklist
- PR title and summary are descriptive. (conventions)
- Docs updated or follow-up ticket created.
- Tests included.
- PR labeled with `release/backport` (if the PR is an urgent fix that needs to be backported)