Workflow evaluation framework runs AI-built automations with mock HTTP responses

A new testing framework lets developers validate n8n workflows built by Instance AI without needing real API credentials — all HTTP calls are intercepted and answered by an LLM using Context7 API documentation.
Testing whether an AI-built workflow actually works has traditionally required real credentials and live API connections, a friction-heavy process that slows iteration. A new evaluation framework removes that dependency entirely by intercepting HTTP requests and generating realistic API responses on the fly with an LLM.
The system works in three phases. First, the workflow is analyzed and consistent mock data hints are generated in a single LLM call, ensuring data flows logically through the entire workflow. Second, the workflow executes normally while every HTTP request is captured before it leaves the process — an LLM generates a contextually appropriate response using the node's configuration and real API documentation fetched from Context7. Third, an LLM verifier evaluates whether success criteria were met and categorizes failures as builder issues, mock issues, legitimate failures, or verification gaps.
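The Phase 2 interception step can be sketched roughly as follows. Everything here is illustrative: the type and function names (`InterceptedRequest`, `generateMockResponse`, `interceptedFetch`) are assumptions, and the LLM call is stubbed with a deterministic placeholder.

```typescript
// Hypothetical sketch of Phase 2: a captured request is answered in-process.
type InterceptedRequest = {
  method: string;
  url: string;
  nodeType: string; // e.g. "n8n-nodes-base.httpRequest"
  body?: unknown;
};

type MockResponse = { status: number; body: unknown };

// Stand-in for the LLM call that receives the node's configuration plus
// API documentation fetched from Context7; returns a placeholder here.
async function generateMockResponse(
  req: InterceptedRequest,
  apiDocs: string,
): Promise<MockResponse> {
  return {
    status: 200,
    body: { mocked: true, url: req.url, docsUsed: apiDocs.length > 0 },
  };
}

// The transport wrapper: the request never leaves the process.
async function interceptedFetch(req: InterceptedRequest): Promise<MockResponse> {
  const docs = `Context7 documentation for ${req.url}`; // fetched per scenario in practice
  return generateMockResponse(req, docs);
}
```

The essential design point is that the interception happens below the node's request helper, so nodes run unmodified and see what looks like a normal HTTP response.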
Developers can run test cases against workflows and receive HTML reports showing execution traces, mock responses, and diagnostic conclusions. The framework handles six HTTP interception points covering axios, legacy requests, and OAuth flows. AI root nodes and protocol-based nodes that bypass the HTTP layer receive pin data instead, generated consistently with the mock data plan.
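The pin-data path for nodes that bypass HTTP could look roughly like this. The names (`PinDataPlan`, `buildPinData`) and the node-type list are assumptions for illustration, not the framework's actual API; the one real constraint from the description is that pinned data must come from the same Phase 1 mock data plan.

```typescript
// Illustrative sketch: pin data for nodes whose traffic bypasses the HTTP layer.
type PinDataPlan = Record<string, Array<{ json: Record<string, unknown> }>>;

interface WorkflowNode {
  name: string;
  type: string;
}

// Assumed examples of protocol-based node types that never hit the HTTP helpers.
const PROTOCOL_NODE_TYPES = new Set([
  "n8n-nodes-base.webhook",
  "@n8n/n8n-nodes-langchain.agent",
]);

function buildPinData(nodes: WorkflowNode[], plan: PinDataPlan): PinDataPlan {
  const pinned: PinDataPlan = {};
  for (const node of nodes) {
    // Only pin nodes that bypass interception; HTTP-based nodes get mocks instead.
    if (PROTOCOL_NODE_TYPES.has(node.type) && plan[node.name]) {
      pinned[node.name] = plan[node.name]; // drawn from the Phase 1 plan for consistency
    }
  }
  return pinned;
}
```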
This framework supports 8 test cases with 27 scenarios spanning webhook routing, contact forms, Linear-Slack reporting, weather monitoring, and more. The contact-form-automation test case passes reliably at 5 out of 5 runs.
Original GitHub Description
Summary
Adds a complete evaluation framework for testing workflows built by Instance AI. Workflows are executed with LLM-generated mock HTTP responses — no real credentials or API connections needed.
- Phase 1: Analyzes the workflow and generates consistent mock data hints (1 Sonnet call per scenario)
- Phase 2: Executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using node configuration and API docs from Context7
- Phase 3: An LLM verifier evaluates whether success criteria were met and categorizes failures as `builder_issue`, `mock_issue`, `legitimate_failure`, or `verification_gap`
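The four failure categories can be modeled as a TypeScript union with a runtime guard. The category names come from the description above; the `isFailureCategory` helper and `Verdict` shape are illustrative assumptions.

```typescript
// Category names are from the PR; everything else is a hypothetical sketch.
const FAILURE_CATEGORIES = [
  "builder_issue",      // the AI built a structurally wrong workflow
  "mock_issue",         // the generated mock response was implausible
  "legitimate_failure", // the workflow correctly surfaced a real error
  "verification_gap",   // success criteria could not be checked
] as const;

type FailureCategory = (typeof FAILURE_CATEGORIES)[number];

interface Verdict {
  passed: boolean;
  category?: FailureCategory; // set only when passed === false
  reasoning: string;
}

// Runtime guard for validating the category string an LLM verifier emits.
function isFailureCategory(s: string): s is FailureCategory {
  return (FAILURE_CATEGORIES as readonly string[]).includes(s);
}
```

A guard like this matters because the verifier's category arrives as free-form LLM output and should be validated before it drives report diagnostics.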
Key components
- 6 HTTP interception points in `request-helper-functions.ts` covering all n8n request helpers (axios, legacy, OAuth1, OAuth2)
- Mock credential generation for OAuth flows (`eval-mock-helpers.ts`)
- 8 workflow test cases with 27 scenarios across different node types
- HTML report with execution traces, mock responses, connections JSON, and failure diagnosis
- Verification prompt with rules for accurate failure attribution (structure vs execution, chronological ordering, connection verification)
- Context7 integration for API-accurate mock response shapes
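Mock credential generation for OAuth flows could look roughly like the sketch below. The exact field names n8n's OAuth2 helpers expect are an assumption here, as is the `makeMockOAuth2Credential` name; the point is only that a plausible token object lets OAuth-authenticated nodes run without a real authorization flow.

```typescript
// Illustrative mock OAuth2 credential; field names are assumptions, not
// necessarily what eval-mock-helpers.ts actually emits.
function makeMockOAuth2Credential(provider: string) {
  return {
    oauthTokenData: {
      access_token: `mock-${provider}-access-token`,
      token_type: "Bearer",
      expires_in: 3600,
      refresh_token: `mock-${provider}-refresh-token`,
    },
  };
}
```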
Current results
- contact-form-automation: 5/5 stable across runs
- notification-router: 3/3 when agent uses Switch node (IF node `conditions.options` bug causes failures in ~40% of runs)
- Other test cases: vary based on builder non-determinism
Related
- https://linear.app/n8n/issue/AI-2298
- Depends on Instance AI module (target: `feature/instance-ai`)
Test plan
- `pnpm build` passes
- `pnpm typecheck` passes in `@n8n/instance-ai`, `packages/cli`, `packages/core`
- Run `dotenvx run -f .env.local -- pnpm eval:instance-ai workflows --verbose` from `packages/@n8n/instance-ai/`
- Verify contact-form-automation passes 5/5
- Verify HTML report generates at `.data/workflow-eval-*.html`
🤖 Generated with Claude Code