Agent text claims now validated against actual workflow changes
The AI workflow builder evaluation system can now verify that agent text explanations match what actually changed in the workflow JSON, adding a new layer of validation to catch hallucinated claims.
n8n's AI workflow builder evaluation system previously assessed whether generated workflows were correct, but the agent's explanatory text was discarded during evaluation. This created a gap: an agent could claim to have "added a Slack node" or "configured POST method" without those changes actually appearing in the workflow.
A new binary check called response_matches_workflow_changes fills this gap. It compares the before and after workflow JSON to determine what actually changed, then verifies each specific claim the agent made in its text response. If the agent claims to have connected nodes that weren't connected, the check fails.
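The structural half of such a check can be sketched as a plain before/after diff over the workflow JSON (the actual check is LLM-powered, per the PR description). The names below (`Workflow`, `computeWorkflowDiff`, etc.) are illustrative assumptions, not n8n's real internals:

```typescript
// Minimal shapes for a workflow snapshot (illustrative, not n8n's schema).
interface WorkflowNode {
  name: string;
  type: string;
  parameters?: Record<string, unknown>;
}

interface Workflow {
  nodes: WorkflowNode[];
  connections: Record<string, unknown>;
}

interface WorkflowDiff {
  addedNodes: string[];
  removedNodes: string[];
  modifiedNodes: string[];
}

// Compare two workflow snapshots by node name: nodes present only in
// "after" were added, only in "before" were removed, and nodes in both
// whose serialized parameters differ were modified.
function computeWorkflowDiff(before: Workflow, after: Workflow): WorkflowDiff {
  const beforeByName = new Map(before.nodes.map((n) => [n.name, n]));
  const afterByName = new Map(after.nodes.map((n) => [n.name, n]));

  const addedNodes = after.nodes
    .filter((n) => !beforeByName.has(n.name))
    .map((n) => n.name);
  const removedNodes = before.nodes
    .filter((n) => !afterByName.has(n.name))
    .map((n) => n.name);
  const modifiedNodes = after.nodes
    .filter((n) => {
      const prev = beforeByName.get(n.name);
      return (
        prev !== undefined &&
        JSON.stringify(prev.parameters ?? {}) !==
          JSON.stringify(n.parameters ?? {})
      );
    })
    .map((n) => n.name);

  return { addedNodes, removedNodes, modifiedNodes };
}
```

A diff like this would then be handed to the LLM judge alongside the agent's text, so the judge grades claims ("added a Slack node") against concrete changes rather than raw JSON blobs.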
The agent's text response is now captured throughout the evaluation pipeline — from the initial generation through to artifact saving. Evaluation runs now produce response.txt files containing the agent's full text explanation alongside the generated workflow.
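Capturing rather than discarding the stream is the key mechanical change here. A minimal sketch, assuming a chunked async stream (the `StreamChunk` shape and `collectAgentText` helper are hypothetical, not the real pipeline's types):

```typescript
// Hypothetical chunk shape: a generation stream interleaves text with
// other output kinds.
interface StreamChunk {
  type: "text" | "workflow" | "tool";
  content: string;
}

// Instead of draining the generator and dropping everything (the old
// consumeGenerator behavior), accumulate the text chunks so they can be
// written to response.txt and passed to downstream checks.
async function collectAgentText(
  stream: AsyncIterable<StreamChunk>,
): Promise<string> {
  const parts: string[] = [];
  for await (const chunk of stream) {
    if (chunk.type === "text") {
      parts.push(chunk.content);
    }
  }
  return parts.join("");
}
```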
This matters for AI assistant reliability. As agents are evaluated on more complex tasks, their explanations become part of the user-facing experience. Users who ask "what did you change?" deserve answers that match reality.
The work lives in the ai-workflow-builder evaluation package, part of a broader initiative to build comprehensive binary checks for AI workflow builder quality assurance.
Original GitHub Description
Summary
Enables evaluating the AI workflow builder on both text output and workflow changes in the same conversation turn. Adds a new binary check (LLM-powered) that verifies all claims in the agent's text response are reflected in actual workflow JSON changes.
Changes:
- Captures the agent's text response from the stream during evaluation (instead of discarding it via `consumeGenerator`)
- Threads `agentTextResponse` through the full evaluation pipeline: `GenerationResult` → `EvaluationContext` → `BinaryCheckContext` → `ExampleResult`
- Saves agent text response to `response.txt` in evaluation artifacts
- Adds new `response_matches_workflow_changes` LLM binary check that verifies the agent's text claims match actual workflow changes by comparing before and after workflow JSON
- Passes `existingWorkflow` (pre-turn state) into `BinaryCheckContext` for diff-based evaluation
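The threading described above can be pictured as a chain of context shapes. This is a sketch under stated assumptions: only `agentTextResponse` and `existingWorkflow` are named in the PR; every other field below is illustrative.

```typescript
// Illustrative pipeline shapes. The real n8n types carry more fields;
// only agentTextResponse and existingWorkflow come from the PR text.
interface GenerationResult {
  workflowJson: object;
  agentTextResponse: string;
}

interface EvaluationContext {
  generation: GenerationResult;
  existingWorkflow: object; // pre-turn state, enables diff-based checks
}

interface BinaryCheckContext {
  agentTextResponse: string;
  workflowBefore: object;
  workflowAfter: object;
}

// Project the evaluation context down to what a binary check needs:
// the agent's claims plus both workflow snapshots.
function toBinaryCheckContext(ctx: EvaluationContext): BinaryCheckContext {
  return {
    agentTextResponse: ctx.generation.agentTextResponse,
    workflowBefore: ctx.existingWorkflow,
    workflowAfter: ctx.generation.workflowJson,
  };
}
```

The point of passing `existingWorkflow` explicitly is that a check can only validate "what changed" claims if it sees both sides of the turn, not just the final workflow.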
Related Linear tickets, Github issues, and Community forum posts
https://linear.app/n8n/issue/AI-2257/feature-create-binary-1-check-for-ai-workflow-builder
Review / Merge checklist
- PR title and summary are descriptive. (conventions)
- Docs updated or follow-up ticket created.
- Tests included.
- PR labeled with `release/backport` (if the PR is an urgent fix that needs to be backported)