Testing and verification in the leaked Claude Code source: a brief analysis

There is no shortage of analysis of the leaked Claude Code source. This piece narrows in on just the code segments most valuable to testing and verification work.

I. A few files worth close attention

The following files are highly instructive — some parts can be lifted directly:

src/tools/AgentTool/built-in/verificationAgent.ts — itself a complete Verifier Agent; in essence, Claude Code’s concrete answer to “using AI to verify AI’s work.” Large portions of its logic are directly reusable.
src/constants/prompts.ts — the System Prompt assembly factory for Claude Code’s main Agent. It dynamically assembles a segmented system-prompt array from the runtime context (tool set, model, MCP connections, feature flags, user type, session type). The most valuable part is how it uses language to construct a “behavioral contract.”
src/tools/FileEditTool/FileEditTool.ts — a “simple” file-edit operation wrapped in 10+ layers of validation. A Code Evaluator can model its validation-chain structure on this when checking Agent-generated code.
src/coordinator/coordinatorMode.ts — the Coordinator System Prompt. This file defines Claude Code’s Coordinator mode, a complete multi-Agent orchestration framework that maps closely onto the Planner→Spec→Generator→Code pipeline.
src/tools/AgentTool/built-in/planAgent.ts — the Plan Agent is a read-only architecture-design Agent, corresponding directly to the technical-design capability of the Planner role.
src/commands/security-review.ts — Security Review is a Verifier for the security domain; its prompts can be borrowed directly for other kinds of verifiers.

1. src/tools/AgentTool/built-in/verificationAgent.ts

⟦CODE 1⟧ verificationAgent.ts — full source (incl. VERIFICATION STRATEGY / REQUIRED STEPS / OUTPUT FORMAT)

import { BASH_TOOL_NAME } from 'src/tools/BashTool/toolName.js'
import { EXIT_PLAN_MODE_TOOL_NAME } from 'src/tools/ExitPlanModeTool/constants.js'
import { FILE_EDIT_TOOL_NAME } from 'src/tools/FileEditTool/constants.js'
import { FILE_WRITE_TOOL_NAME } from 'src/tools/FileWriteTool/prompt.js'
import { NOTEBOOK_EDIT_TOOL_NAME } from 'src/tools/NotebookEditTool/constants.js'
import { WEB_FETCH_TOOL_NAME } from 'src/tools/WebFetchTool/prompt.js'
import { AGENT_TOOL_NAME } from '../constants.js'
import type { BuiltInAgentDefinition } from '../loadAgentsDir.js'

const VERIFICATION_SYSTEM_PROMPT = `You are a verification specialist. Your job is not to confirm the implementation works — it's to try to break it.

You have two documented failure patterns. First, verification avoidance: when faced with a check, you find reasons not to run it — you read code, narrate what you would test, write "PASS," and move on. Second, being seduced by the first 80%: you see a polished UI or a passing test suite and feel inclined to pass it, not noticing half the buttons do nothing, the state vanishes on refresh, or the backend crashes on bad input. The first 80% is the easy part. Your entire value is in finding the last 20%. The caller may spot-check your commands by re-running them — if a PASS step has no command output, or output that doesn't match re-execution, your report gets rejected.

=== CRITICAL: DO NOT MODIFY THE PROJECT ===
You are STRICTLY PROHIBITED from:
- Creating, modifying, or deleting any files IN THE PROJECT DIRECTORY
- Installing dependencies or packages
- Running git write operations (add, commit, push)

You MAY write ephemeral test scripts to a temp directory (/tmp or $TMPDIR) via ${BASH_TOOL_NAME} redirection when inline commands aren't sufficient — e.g., a multi-step race harness or a Playwright test. Clean up after yourself.

Check your ACTUAL available tools rather than assuming from this prompt. You may have browser automation (mcp__claude-in-chrome__*, mcp__playwright__*), ${WEB_FETCH_TOOL_NAME}, or other MCP tools depending on the session — do not skip capabilities you didn't think to check for.

=== WHAT YOU RECEIVE ===
You will receive: the original task description, files changed, approach taken, and optionally a plan file path.

=== VERIFICATION STRATEGY ===
Adapt your strategy based on what was changed:

**Frontend changes**: Start dev server → check your tools for browser automation (mcp__claude-in-chrome__*, mcp__playwright__*) and USE them to navigate, screenshot, click, and read console — do NOT say "needs a real browser" without attempting → curl a sample of page subresources (image-optimizer URLs like /_next/image, same-origin API routes, static assets) since HTML can serve 200 while everything it references fails → run frontend tests
**Backend/API changes**: Start server → curl/fetch endpoints → verify response shapes against expected values (not just status codes) → test error handling → check edge cases
**CLI/script changes**: Run with representative inputs → verify stdout/stderr/exit codes → test edge inputs (empty, malformed, boundary) → verify --help / usage output is accurate
**Infrastructure/config changes**: Validate syntax → dry-run where possible (terraform plan, kubectl apply --dry-run=server, docker build, nginx -t) → check env vars / secrets are actually referenced, not just defined
**Library/package changes**: Build → full test suite → import the library from a fresh context and exercise the public API as a consumer would → verify exported types match README/docs examples
**Bug fixes**: Reproduce the original bug → verify fix → run regression tests → check related functionality for side effects
**Mobile (iOS/Android)**: Clean build → install on simulator/emulator → dump accessibility/UI tree (idb ui describe-all / uiautomator dump), find elements by label, tap by tree coords, re-dump to verify; screenshots secondary → kill and relaunch to test persistence → check crash logs (logcat / device console)
**Data/ML pipeline**: Run with sample input → verify output shape/schema/types → test empty input, single row, NaN/null handling → check for silent data loss (row counts in vs out)
**Database migrations**: Run migration up → verify schema matches intent → run migration down (reversibility) → test against existing data, not just empty DB
**Refactoring (no behavior change)**: Existing test suite MUST pass unchanged → diff the public API surface (no new/removed exports) → spot-check observable behavior is identical (same inputs → same outputs)
**Other change types**: The pattern is always the same — (a) figure out how to exercise this change directly (run/call/invoke/deploy it), (b) check outputs against expectations, (c) try to break it with inputs/conditions the implementer didn't test. The strategies above are worked examples for common cases.

=== REQUIRED STEPS (universal baseline) ===
1. Read the project's CLAUDE.md / README for build/test commands and conventions. Check package.json / Makefile / pyproject.toml for script names. If the implementer pointed you to a plan or spec file, read it — that's the success criteria.
2. Run the build (if applicable). A broken build is an automatic FAIL.
3. Run the project's test suite (if it has one). Failing tests are an automatic FAIL.
4. Run linters/type-checkers if configured (eslint, tsc, mypy, etc.).
5. Check for regressions in related code.

Then apply the type-specific strategy above. Match rigor to stakes: a one-off script doesn't need race-condition probes; production payments code needs everything.

Test suite results are context, not evidence. Run the suite, note pass/fail, then move on to your real verification. The implementer is an LLM too — its tests may be heavy on mocks, circular assertions, or happy-path coverage that proves nothing about whether the system actually works end-to-end.

=== RECOGNIZE YOUR OWN RATIONALIZATIONS ===
You will feel the urge to skip checks. These are the exact excuses you reach for — recognize them and do the opposite:
- "The code looks correct based on my reading" — reading is not verification. Run it.
- "The implementer's tests already pass" — the implementer is an LLM. Verify independently.
- "This is probably fine" — probably is not verified. Run it.
- "Let me start the server and check the code" — no. Start the server and hit the endpoint.
- "I don't have a browser" — did you actually check for mcp__claude-in-chrome__* / mcp__playwright__*? If present, use them. If an MCP tool fails, troubleshoot (server running? selector right?). The fallback exists so you don't invent your own "can't do this" story.
- "This would take too long" — not your call.
If you catch yourself writing an explanation instead of a command, stop. Run the command.

=== ADVERSARIAL PROBES (adapt to the change type) ===
Functional tests confirm the happy path. Also try to break it:
- **Concurrency** (servers/APIs): parallel requests to create-if-not-exists paths — duplicate sessions? lost writes?
- **Boundary values**: 0, -1, empty string, very long strings, unicode, MAX_INT
- **Idempotency**: same mutating request twice — duplicate created? error? correct no-op?
- **Orphan operations**: delete/reference IDs that don't exist
These are seeds, not a checklist — pick the ones that fit what you're verifying.

=== BEFORE ISSUING PASS ===
Your report must include at least one adversarial probe you ran (concurrency, boundary, idempotency, orphan op, or similar) and its result — even if the result was "handled correctly." If all your checks are "returns 200" or "test suite passes," you have confirmed the happy path, not verified correctness. Go back and try to break something.

=== BEFORE ISSUING FAIL ===
You found something that looks broken. Before reporting FAIL, check you haven't missed why it's actually fine:
- **Already handled**: is there defensive code elsewhere (validation upstream, error recovery downstream) that prevents this?
- **Intentional**: does CLAUDE.md / comments / commit message explain this as deliberate?
- **Not actionable**: is this a real limitation but unfixable without breaking an external contract (stable API, protocol spec, backwards compat)? If so, note it as an observation, not a FAIL — a "bug" that can't be fixed isn't actionable.
Don't use these as excuses to wave away real issues — but don't FAIL on intentional behavior either.

=== OUTPUT FORMAT (REQUIRED) ===
Every check MUST follow this structure. A check without a Command run block is not a PASS — it's a skip.

\`\`\`
### Check: [what you're verifying]
**Command run:**
  [exact command you executed]
**Output observed:**
  [actual terminal output — copy-paste, not paraphrased. Truncate if very long but keep the relevant part.]
**Result: PASS** (or FAIL — with Expected vs Actual)
\`\`\`

Bad (rejected):
\`\`\`
### Check: POST /api/register validation
**Result: PASS**
Evidence: Reviewed the route handler in routes/auth.py. The logic correctly validates
email format and password length before DB insert.
\`\`\`
(No command run. Reading code is not verification.)

Good:
\`\`\`
### Check: POST /api/register rejects short password
**Command run:**
  curl -s -X POST localhost:8000/api/register -H 'Content-Type: application/json' \\
    -d '{"email":"t@t.co","password":"short"}' | python3 -m json.tool
**Output observed:**
  {
    "error": "password must be at least 8 characters"
  }
  (HTTP 400)
**Expected vs Actual:** Expected 400 with password-length error. Got exactly that.
**Result: PASS**
\`\`\`

End with exactly this line (parsed by caller):

VERDICT: PASS
or
VERDICT: FAIL
or
VERDICT: PARTIAL

PARTIAL is for environmental limitations only (no test framework, tool unavailable, server can't start) — not for "I'm unsure whether this is a bug." If you can run the check, you must decide PASS or FAIL.

Use the literal string \`VERDICT: \` followed by exactly one of \`PASS\`, \`FAIL\`, \`PARTIAL\`. No markdown bold, no punctuation, no variation.
- **FAIL**: include what failed, exact error output, reproduction steps.
- **PARTIAL**: what was verified, what could not be and why (missing tool/env), what the implementer should know.`

const VERIFICATION_WHEN_TO_USE =
  'Use this agent to verify that implementation work is correct before reporting completion. Invoke after non-trivial tasks (3+ file edits, backend/API changes, infrastructure changes). Pass the ORIGINAL user task description, list of files changed, and approach taken. The agent runs builds, tests, linters, and checks to produce a PASS/FAIL/PARTIAL verdict with evidence.'

export const VERIFICATION_AGENT: BuiltInAgentDefinition = {
  agentType: 'verification',
  whenToUse: VERIFICATION_WHEN_TO_USE,
  color: 'red',
  background: true,
  disallowedTools: [
    AGENT_TOOL_NAME,
    EXIT_PLAN_MODE_TOOL_NAME,
    FILE_EDIT_TOOL_NAME,
    FILE_WRITE_TOOL_NAME,
    NOTEBOOK_EDIT_TOOL_NAME,
  ],
  source: 'built-in',
  baseDir: 'built-in',
  model: 'inherit',
  getSystemPrompt: () => VERIFICATION_SYSTEM_PROMPT,
  criticalSystemReminder_EXPERIMENTAL:
    'CRITICAL: This is a VERIFICATION-ONLY task. You CANNOT edit, write, or create files IN THE PROJECT DIRECTORY (tmp is allowed for ephemeral test scripts). You MUST end with VERDICT: PASS, VERDICT: FAIL, or VERDICT: PARTIAL.',
}

This is an internal verification tool that is not exposed to external users. It is most likely the self-test Agent that Anthropic engineers run after finishing day-to-day development tasks with Claude Code: it runs adversarial tests independently and measures Claude Code’s task completion rate. The whole file is worth reading end to end. It is full of constraint details specific to verification Agents, and it is essentially a general-purpose template for a Code Evaluator.

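The Verification Agent’s only inputs are the task description, the list of changed files, a note on the approach taken, and an optional plan file path. It receives no concrete test cases, and no acceptance criteria beyond whatever that plan or spec file contains (the prompt tells the agent to treat such a file as the success criteria).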
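
As a minimal sketch of that handoff, the four inputs can be modeled as a small record and rendered into the message passed to the verifier. The names `VerificationRequest` and `renderHandoff` are hypothetical and do not come from the leaked source; only the four fields are taken from the prompt.

```typescript
// Hypothetical shape of the handoff described under "WHAT YOU RECEIVE".
interface VerificationRequest {
  taskDescription: string; // the ORIGINAL user task, not a paraphrase
  filesChanged: string[];
  approach: string;
  planFilePath?: string; // optional; when present it acts as the success criteria
}

// Render the request as plain text for the verifier's first message.
function renderHandoff(req: VerificationRequest): string {
  const lines: string[] = ["Task: " + req.taskDescription, "Files changed:"];
  for (const file of req.filesChanged) {
    lines.push("- " + file);
  }
  lines.push("Approach: " + req.approach);
  if (req.planFilePath !== undefined) {
    lines.push("Plan file: " + req.planFilePath);
  }
  return lines.join("\n");
}
```

Note that nothing in this record carries test cases, which is why test design has to come from somewhere else.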

📎 paste — verificationAgent.ts:24-26 (=== WHAT YOU RECEIVE ===)

The prompt shows that while the verification strategy distinguishes between targets, it stays at the level of high-level guidance — exactly how to carry it out is still left to the LLM to infer.

📎 paste — verificationAgent.ts:27-40 (=== VERIFICATION STRATEGY ===)

It reads the project’s docs (CLAUDE.md, README, package.json, etc.), uses the existing test assets (linter, build, test suite) for regression, and then runs generic adversarial probes against the changed code (boundary values, concurrency, idempotency, and other standard tests).

📎 paste — verificationAgent.ts:42-47 (=== REQUIRED STEPS ===)

So this Verification Agent is more of a verification executor. We can borrow heavily from its techniques at the test-execution layer, but we still have to build test-design capability separately.

Here are the design decisions most worth borrowing.

1.1 Role definition: Verification isn’t “confirming,” it’s “trying to break it”

Shift the Evaluator’s goal from “prove it’s right” to “find what’s wrong.” The two framings differ enormously in practice — the latter has far higher recall. The four generic adversarial probes here are also worth copying verbatim; a Code Evaluator should carry a similar, sufficiently specialized baseline set of test points:

Concurrency: parallel requests to create-if-not-exists paths — duplicate sessions? lost writes?
Boundary values: 0, -1, empty string, very long strings, unicode, MAX_INT
Idempotency: the same mutating request twice — duplicate created? error? correct no-op?
Orphan operations: delete / reference IDs that don’t exist

📎 paste — verificationAgent.ts:10 and :63-69 (role line + ADVERSARIAL PROBES)
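
One way a Code Evaluator might carry such a baseline set is as a small catalog of probe seeds filtered by change type. This is only a sketch: the probe wording follows the prompt, but the `appliesTo` tags, `seedsFor`, and `boundaryStrings` are assumptions made for illustration.

```typescript
type ProbeKind = "concurrency" | "boundary" | "idempotency" | "orphan";

interface ProbeSeed {
  kind: ProbeKind;
  appliesTo: string[]; // hypothetical change-type tags
  hint: string;        // wording adapted from the ADVERSARIAL PROBES section
}

const PROBE_SEEDS: ProbeSeed[] = [
  { kind: "concurrency", appliesTo: ["server", "api"], hint: "parallel requests to create-if-not-exists paths" },
  { kind: "boundary", appliesTo: ["server", "api", "cli", "library"], hint: "0, -1, empty string, very long strings, unicode, MAX_INT" },
  { kind: "idempotency", appliesTo: ["server", "api"], hint: "same mutating request twice" },
  { kind: "orphan", appliesTo: ["server", "api"], hint: "delete or reference IDs that do not exist" },
];

// Seeds, not a checklist: return only the probes that fit the change type.
function seedsFor(changeType: string): ProbeSeed[] {
  return PROBE_SEEDS.filter((seed) => seed.appliesTo.indexOf(changeType) !== -1);
}

// Concrete boundary strings: empty, whitespace, very long, and non-ASCII.
function boundaryStrings(longLength: number): string[] {
  let long = "";
  for (let i = 0; i < longLength; i++) {
    long += "a";
  }
  return ["", " ", long, "\u00e9\u4e2d\u6587"];
}
```

Keeping the seeds as data makes it cheap to add a new probe each time a new class of defect shows up.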

1.2 Write the “known failure modes” into the prompt

Encode the antipatterns directly into the prompt — these two can be copied as-is. The “failure modes” distilled from day-to-day quality operations are a high-value output in AI coding, and should be continuously added into the Code Agent’s workflow.

📎 paste — verificationAgent.ts:12 (two documented failure patterns)

1.3 The Verifier must not modify what it’s verifying

Structural separation + prompt reinforcement + a temp-directory escape hatch: strip the edit tools out at the tool layer, keep reinforcing it in the prompt, and offer a “you may write to /tmp” escape so it doesn’t just give up on running tests because it feels boxed in.

📎 paste — verificationAgent.ts:139-145 (disallowedTools) and :14-20 (=== CRITICAL: DO NOT MODIFY THE PROJECT ===)
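
The same two-part idea, removing tools structurally while leaving a temp-directory escape hatch, could look roughly like this in your own harness. Both helpers and the tool names used in the usage note are hypothetical; the leaked source only shows the `disallowedTools` list and the prompt text.

```typescript
// Structural separation: the verifier never sees the disallowed tools at all.
function allowedTools(allTools: string[], disallowed: string[]): string[] {
  return allTools.filter((name) => disallowed.indexOf(name) === -1);
}

// Escape hatch: writes are acceptable only under a temp root.
// Paths are assumed to be absolute POSIX paths; ".." segments are rejected outright.
function isEphemeralPath(path: string, tmpRoots: string[]): boolean {
  if (path.split("/").indexOf("..") !== -1) {
    return false;
  }
  for (const root of tmpRoots) {
    const prefix = root.charAt(root.length - 1) === "/" ? root : root + "/";
    if (path.indexOf(prefix) === 0) {
      return true;
    }
  }
  return false;
}
```

For example, `allowedTools(["Bash", "Edit", "Read"], ["Edit", "Write"])` keeps only `Bash` and `Read`, and `isEphemeralPath("/tmp/race.sh", ["/tmp"])` is accepted while a path inside the project directory is not.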

1.4 The verification output is itself “verifiable”

The report has to present a “reproducible chain of evidence”: every PASS must come with the exact command, the raw output, and an expected-vs-actual comparison.

📎 paste — verificationAgent.ts:84-91 (Check / Command run / Output observed / Result)
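
A caller that wants to enforce this format mechanically might validate each check before accepting the report. The sketch below assumes the report has already been parsed into records; `CheckRecord` and `checkProblems` are illustrative names, not part of the leaked source.

```typescript
interface CheckRecord {
  title: string;
  command: string;            // exact command executed
  output: string;             // raw terminal output, copy-pasted
  result: "PASS" | "FAIL";
  expectedVsActual?: string;  // the prompt's template asks for this on FAIL
}

// Returns the reasons a check should be rejected; an empty array means it is acceptable.
function checkProblems(check: CheckRecord): string[] {
  const problems: string[] = [];
  if (check.command.trim() === "") {
    problems.push("no command run: this is a skip, not a PASS");
  }
  if (check.output.trim() === "") {
    problems.push("no observed output");
  }
  const comparison = check.expectedVsActual === undefined ? "" : check.expectedVsActual.trim();
  if (check.result === "FAIL" && comparison === "") {
    problems.push("FAIL without Expected vs Actual");
  }
  return problems;
}
```

The "Bad (rejected)" example from the prompt, a PASS backed only by code reading, fails the first two rules.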

1.5 An explicit “no-excuses” list

List, verbatim, the excuses the model most often reaches for, and rebut them in advance — sealing off every possible escape route. Building an Evaluator calls for accumulating exactly this kind of “excuse list,” adding a line each time a new one shows up.

📎 paste — verificationAgent.ts:53-61 (=== RECOGNIZE YOUR OWN RATIONALIZATIONS ===)
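
If you maintain such an excuse list yourself, it can double as a cheap lint over the Evaluator's draft output. This is a sketch under the assumption that excuses can be caught with simple phrase patterns; the patterns are adapted from the prompt's list, and `findExcuses` is a hypothetical helper.

```typescript
interface Excuse {
  pattern: RegExp;   // phrase the model tends to reach for
  rebuttal: string;  // the counter-instruction, adapted from the prompt
}

// Grow this list each time a new excuse shows up in practice.
const EXCUSES: Excuse[] = [
  { pattern: /looks correct based on my reading/i, rebuttal: "Reading is not verification. Run it." },
  { pattern: /tests already pass/i, rebuttal: "The implementer is an LLM. Verify independently." },
  { pattern: /probably fine/i, rebuttal: "Probably is not verified. Run it." },
  { pattern: /would take too long/i, rebuttal: "Not your call." },
];

// Returns the rebuttal for every excuse found in a draft report.
function findExcuses(draft: string): string[] {
  const hits: string[] = [];
  for (const excuse of EXCUSES) {
    if (excuse.pattern.test(draft)) {
      hits.push(excuse.rebuttal);
    }
  }
  return hits;
}
```

A phrase match is a weak signal, so a hit is better treated as a prompt to re-run the check than as an automatic rejection.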

1.6 Distinguish PASS / FAIL / PARTIAL

Beyond PASS and FAIL it defines PARTIAL, but PARTIAL is strictly limited to environmental limitations (no test framework, tool unavailable, server can’t start). It is never allowed as an “I’m not sure” escape hatch: if the check can be run, the prompt requires the verifier to commit to PASS or FAIL.

📎 paste — verificationAgent.ts:125 (PARTIAL is for environmental limitations only)
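
The prompt says the final verdict line is parsed by the caller but does not show the parser. A strict parser along the following lines would match the stated rules (literal `VERDICT: ` plus exactly one of three words, no bold, no punctuation). `parseVerdict` is a hypothetical name, and trimming surrounding whitespace on the last line is an assumption.

```typescript
type Verdict = "PASS" | "FAIL" | "PARTIAL";

// Looks only at the last non-empty line and accepts nothing but the literal form.
function parseVerdict(report: string): Verdict | null {
  const lines = report.split("\n");
  let last = "";
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i];
    if (line !== undefined && line.trim() !== "") {
      last = line.trim();
      break;
    }
  }
  const match = /^VERDICT: (PASS|FAIL|PARTIAL)$/.exec(last);
  if (match === null) {
    return null;
  }
  const word = match[1];
  return word === "PASS" || word === "FAIL" || word === "PARTIAL" ? word : null;
}
```

Returning `null` for anything malformed lets the caller reject the report instead of guessing what the verifier meant.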

1.7 Two-way guardrails: think again before both PASS and FAIL

PASS and FAIL each get a “before-you-submit checklist,” and a PASS is required to include the result of at least one adversarial probe. Once more: “the happy path passing ≠ the system being correct.”

📎 paste — verificationAgent.ts:74-79 (=== BEFORE ISSUING FAIL ===) and :71-72 (=== BEFORE ISSUING PASS ===)
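
The PASS-side rule is easy to enforce in code as well as in the prompt. The gate below is a minimal sketch, assuming each executed check is tagged as adversarial or not; `ExecutedCheck` and `mayIssuePass` are hypothetical names.

```typescript
interface ExecutedCheck {
  title: string;
  adversarial: boolean; // concurrency, boundary, idempotency, orphan op, or similar
  passed: boolean;
}

// A PASS verdict is allowed only if every check passed and at least one was adversarial.
function mayIssuePass(checks: ExecutedCheck[]): { ok: boolean; reason: string } {
  if (checks.length === 0) {
    return { ok: false, reason: "no checks were run" };
  }
  for (const check of checks) {
    if (!check.passed) {
      return { ok: false, reason: "check failed: " + check.title };
    }
  }
  const probes = checks.filter((check) => check.adversarial);
  if (probes.length === 0) {
    return { ok: false, reason: "happy path only: run at least one adversarial probe" };
  }
  return { ok: true, reason: "all checks passed, including an adversarial probe" };
}
```

The FAIL-side checklist (already handled, intentional, not actionable) needs judgment and is harder to mechanize, so it is left to the prompt here.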

2. src/constants/prompts.ts

⟦CODE 2⟧ prompts.ts — full source

// ⟦CODE 2⟧ paste here: src/constants/prompts.ts

This file covers the most important System Prompt assembly for Claude Code’s main Agent, and its linguistic details are well worth borrowing.

2.1 “False-claims mitigation”

This targets a specific regression: per the source comment, Capybara v8 showed a 29–30% false-claims rate versus 16.7% for v4. The following prompt line can be lifted verbatim:

📎 paste — prompts.ts:240 (Report outcomes faithfully…)

The two symmetric metrics an Evaluator should most care about:

False Positive (false success): claims it passed, but it never ran / actually failed.
False Negative (defensive downgrade): actually succeeded, but hedges it into “probably OK.” The Defensive Hedge is an interesting concept, mentioned in several places in the CC source: the model already has ample evidence the task succeeded, yet still adds unnecessary qualifiers to its report — “probably,” “recommend manual confirmation,” “not sure it’s fully correct” — downgrading a confirmed result into an uncertain one. If you only score FP and not FN, the model may, to dodge FPs, blanket-say “might not work, please verify manually” — which looks safe but is just as unusable.
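
To score both sides, you need runs with a known ground truth. The sketch below uses one possible pair of definitions, chosen for illustration and not taken from the source: false success is measured over runs that actually failed, and defensive hedging over runs that actually succeeded.

```typescript
interface LabeledRun {
  actuallySucceeded: boolean;                    // ground truth from an independent check
  reported: "success" | "hedged" | "failure";    // what the model told the user
}

interface ReportingRates {
  falseSuccess: number;    // share of failed runs that were reported as success
  defensiveHedge: number;  // share of successful runs not reported as plain success
}

function reportingRates(runs: LabeledRun[]): ReportingRates {
  let failed = 0;
  let falselyClaimed = 0;
  let succeeded = 0;
  let downgraded = 0;
  for (const run of runs) {
    if (run.actuallySucceeded) {
      succeeded++;
      if (run.reported !== "success") {
        downgraded++;
      }
    } else {
      failed++;
      if (run.reported === "success") {
        falselyClaimed++;
      }
    }
  }
  return {
    falseSuccess: failed === 0 ? 0 : falselyClaimed / failed,
    defensiveHedge: succeeded === 0 ? 0 : downgraded / succeeded,
  };
}
```

A model that always answers "might not work, please verify manually" would score 0 on `falseSuccess` and 1 on `defensiveHedge`, which is exactly the failure the second metric exists to expose.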

2.2 The static-contract / dynamic-context boundary marker (SYSTEM_PROMPT_DYNAMIC_BOUNDARY)

An Evaluator should likewise split the prompt into a static-contract layer and a dynamic-context layer. The static contract holds the rubric, standards, and so on; the dynamic context holds the specific requirement, the changed files, and everything else that varies per run.

📎 paste — prompts.ts:114 (Everything BEFORE this marker… scope: ‘global’)
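
For an Evaluator, the same split might be sketched as a segment array with a boundary marker between the two layers. The marker value, `buildEvaluatorPrompt`, and `splitPrompt` are all hypothetical; only the idea of a boundary inside a segmented prompt array comes from the source.

```typescript
// Hypothetical marker value; the real constant's value is not shown in this article.
const DYNAMIC_BOUNDARY = "<<DYNAMIC_BOUNDARY>>";

// Static contract first (rubric, standards), then the marker, then per-run context.
function buildEvaluatorPrompt(staticContract: string[], runContext: string[]): string[] {
  return staticContract.concat([DYNAMIC_BOUNDARY], runContext);
}

// Recover the two layers. Without a marker, nothing is assumed to be stable.
function splitPrompt(segments: string[]): { staticPart: string[]; dynamicPart: string[] } {
  const index = segments.indexOf(DYNAMIC_BOUNDARY);
  if (index === -1) {
    return { staticPart: [], dynamicPart: segments.slice() };
  }
  return {
    staticPart: segments.slice(0, index),
    dynamicPart: segments.slice(index + 1),
  };
}
```

Keeping the static part byte-identical across runs is what makes it reusable, so anything that changes per run (requirement text, changed files) belongs after the marker.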

2.3 User-level authority for the feedback

The “FAIL / not passed” output of an Evaluator should be treated by downstream Agents as user-level authority, not as a “suggestion” — otherwise the model may treat it as negotiable advice and override it on its own. Granting the Evaluator user-level authority is what lets it actually “block” the pipeline.

📎 paste — prompts.ts:128 (Treat feedback from hooks… as coming from the user)

2.4 The “don’t gold-plate, don’t half-finish” dual constraint

The constraint cuts both ways: the Evaluator should neither over-build nor stop short of finishing. The guard against over-engineering deserves particular attention.

📎 paste — prompts.ts:201-203 (Don’t add features… three similar lines beats a premature abstraction)

2.5 Care about maintainability

This line is interesting — at heart it’s a rule: comments say only WHY; the origin of a change belongs in the commit message; business background belongs in the PR description. Information that is correct but in the wrong place is also a quality defect — it rots as the code evolves and eventually misleads future maintainers. A Code Evaluator shouldn’t only ask “is the code correct” and “do the tests pass,” but also attend to the code’s own maintainability.

📎 paste — prompts.ts:208 (Don’t reference the current task, fix, or callers…)

3. src/tools/FileEditTool/FileEditTool.ts

⟦CODE 3⟧ FileEditTool.ts — full source

// ⟦CODE 3⟧ paste here: src/tools/FileEditTool/FileEditTool.ts

This file shows the multi-layer validation chain of a single operation tool. A Code Evaluator’s quality gate for Agent-generated code changes can model itself on this chain:

Secret detection: checkTeamMemSecrets(fullFilePath, new_string) — block writing secrets.
No-op detection: old_string === new_string → reject.
Permission check: matchingRuleForInput(...) deny rules.
UNC path safety: prevent NTLM credential leaks.
File-size limit: > 1 GiB rejected (guard against OOM).
Read-before-edit: readFileState.get(fullFilePath) — a file that hasn’t been read can’t be edited.
Concurrent-modification detection: lastWriteTime > readTimestamp.timestamp — reject if a third party changed the file after it was read.
Exact string match: findActualString(file, old_string) — including curly-quote normalization.
Uniqueness check: old_string must be unique in the file (or use replace_all).
Anti-loop editing: old_string is a substring of a new_string from a previous edit → reject.
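
The structure of such a chain, an ordered list of cheap checks where the first failure rejects the edit, can be sketched as follows. This covers only five of the ten checks and is not the tool's actual implementation: the types, the error strings, and the plain substring matching (no curly-quote normalization) are assumptions.

```typescript
interface EditRequest {
  oldString: string;
  newString: string;
  replaceAll: boolean;
}

interface FileSnapshot {
  content: string;
  sizeBytes: number;
  lastWriteTime: number;        // ms since epoch
  readTimestamp: number | null; // null if the agent never read the file
}

// A validator returns an error message, or null when the edit may proceed.
type Validator = (req: EditRequest, file: FileSnapshot) => string | null;

const MAX_BYTES = 1024 * 1024 * 1024; // 1 GiB

function countOccurrences(haystack: string, needle: string): number {
  return needle === "" ? 0 : haystack.split(needle).length - 1;
}

const VALIDATORS: Validator[] = [
  (req) => (req.oldString === req.newString ? "no-op edit" : null),
  (_req, file) => (file.sizeBytes > MAX_BYTES ? "file too large" : null),
  (_req, file) => (file.readTimestamp === null ? "file has not been read" : null),
  (_req, file) =>
    file.readTimestamp !== null && file.lastWriteTime > file.readTimestamp
      ? "file modified since read"
      : null,
  (req, file) => (countOccurrences(file.content, req.oldString) === 0 ? "old_string not found" : null),
  (req, file) =>
    !req.replaceAll && countOccurrences(file.content, req.oldString) > 1
      ? "old_string is not unique"
      : null,
];

// Run validators in order; the first failure wins.
function validateEdit(req: EditRequest, file: FileSnapshot): string | null {
  for (const validate of VALIDATORS) {
    const error = validate(req, file);
    if (error !== null) {
      return error;
    }
  }
  return null;
}
```

Ordering matters: checks that need no file access run first, and every rejection carries a specific reason the Agent can act on.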

4. src/coordinator/coordinatorMode.ts

⟦CODE 4⟧ coordinatorMode.ts — full source

// ⟦CODE 4⟧ paste here: src/coordinator/coordinatorMode.ts

This file defines Claude Code’s Coordinator mode — a complete multi-Agent orchestration framework in its own right, implemented largely as one big prompt. It’s strongly instructive for building complex Evaluator-Agent logic.

Key takeaways:

4.1 Any task can be split into four phases: Research, Synthesis, Implementation, Verification

Following this, test analysis, test design, and test execution can all be built in the same four-phase shape.

📎 paste — coordinatorMode.ts:199-209 (## 4. Task Workflow / Phases table)

4.2 Verification must be an independent third party

When verifying code a different worker just wrote, spawn a fresh Agent so the verifier sees the code with fresh eyes rather than carrying the implementer’s assumptions.

📎 paste — coordinatorMode.ts (Verifying code a different worker just wrote → Spawn fresh)

4.3 “What Real Verification Looks Like” — precisely defining what counts as verification

Verification means proving the code works, not confirming it exists. A verifier that rubber-stamps weak work undermines everything.

📎 paste — coordinatorMode.ts (What Real Verification Looks Like)

4.4 The Continue-vs-Spawn decision matrix — when to reuse an Agent’s context, when to start fresh

Decide by how much the worker’s existing context overlaps with the next task: high overlap → continue, low overlap → spawn fresh; verifying someone else’s code, or a first attempt that took entirely the wrong approach, → always spawn fresh.

📎 paste — coordinatorMode.ts:280-293 (Choose continue vs. spawn by context overlap)
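
In the source this decision is made by the coordinator model from prose guidance. If you wanted to make it explicit in an orchestrator, a sketch might look like this; the numeric `contextOverlap` estimate and the threshold are assumptions, since the source describes overlap only qualitatively.

```typescript
interface NextTask {
  contextOverlap: number;            // hypothetical 0..1 estimate of overlap with the worker's context
  verifiesOtherWorkersCode: boolean; // verification needs fresh eyes
  previousApproachWasWrong: boolean; // do not anchor on a bad first attempt
}

function chooseWorker(task: NextTask, overlapThreshold: number): "continue" | "spawn" {
  // The two hard rules override any overlap estimate.
  if (task.verifiesOtherWorkersCode || task.previousApproachWasWrong) {
    return "spawn";
  }
  return task.contextOverlap >= overlapThreshold ? "continue" : "spawn";
}
```

For instance, with a threshold of 0.5, a follow-up fix in the same files (overlap 0.9) continues the existing worker, while verifying that fix always spawns a fresh one.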

5. src/tools/AgentTool/built-in/planAgent.ts

⟦CODE 5⟧ planAgent.ts — full source

// ⟦CODE 5⟧ paste here: src/tools/AgentTool/built-in/planAgent.ts

The Plan Agent is a read-only architecture-design specialist. Whichever Agent owns technical design within the Planner can copy it directly.

6. src/commands/security-review.ts

⟦CODE 6⟧ security-review.ts — full source (SECURITY_REVIEW_MARKDOWN)

// ⟦CODE 6⟧ paste here: src/commands/security-review.ts

This file is Claude Code’s built-in security-review command; its core is a 196-line prompt (SECURITY_REVIEW_MARKDOWN). It’s essentially an Evaluator for the security domain, reviewing for “security vulnerabilities” rather than “spec quality” or “code correctness.” Its design pattern is highly isomorphic to our Spec-eval / Code-eval.

6.1 A library of pre-judgments

:140-160 lists 17 “do not report” rules, and :163-175 lists 12 precedent rulings — both making, in advance, the domain-knowledge judgment calls the Evaluator would otherwise have to. Continuously distilling a pattern library out of day-to-day iteration is an important part of making AI coding work.

📎 paste — security-review.ts:140-160 (FALSE POSITIVE FILTERING / HARD EXCLUSIONS) and :163-175 (PRECEDENTS)

6.2 A parallel false-positive filtering workflow

:190-194 describes a three-step workflow: first use one sub-task to coarsely surface all vulnerabilities, then launch one parallel sub-task per vulnerability to filter false positives, and finally drop anything with confidence < 8. In a single review pass, an LLM tends to lose fine-grained judgment on individual findings within a long context; splitting into N independent verification tasks lets each sub-agent focus on a single finding, raising judgment quality.

For a Code Evaluator, when one check yields 5+ findings, it is better to launch one sub-agent per finding in parallel than to verify them one by one in a single long prompt. This has to be implemented at the orchestration-framework layer, though; it does not belong in the Skill prompt. The Skill prompt can declare “when findings > 3, recommend launching parallel FP verification,” but the actual execution depends on the framework’s capability.

📎 paste — security-review.ts:190-194 (Begin your analysis now. Do this in 3 steps…)
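
At the framework layer, the second and third steps reduce to "score each finding independently, then filter." The sketch below runs the scoring sequentially to stay self-contained; in a real orchestrator each call to `score` would stand for one sub-agent run in parallel. `Finding`, `ScoredFinding`, and `filterFindings` are hypothetical names, and the 1 to 10 scale follows the "confidence < 8" rule described above.

```typescript
interface Finding {
  id: string;
  description: string;
}

interface ScoredFinding extends Finding {
  confidence: number; // 1..10, assigned by an independent validation pass
}

// `score` stands in for one isolated validation sub-task per finding.
function filterFindings(
  candidates: Finding[],
  score: (finding: Finding) => number,
  minConfidence: number
): ScoredFinding[] {
  const kept: ScoredFinding[] = [];
  for (const finding of candidates) {
    const confidence = score(finding);
    if (confidence >= minConfidence) {
      kept.push({ id: finding.id, description: finding.description, confidence: confidence });
    }
  }
  return kept;
}
```

Because each finding is scored in isolation, the scorer never has to hold the other findings in context, which is the point of the split.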

6.3 A confidence threshold

:41 says >80% confident of actual exploitability, :134 says Below 0.7: Don't report, and :137 says Better to miss some theoretical issues than flood the report with false positives. For Review-type checks in an Evaluator, exhaustively listing every possible issue is the wrong goal — emit only high-confidence, actionable findings.

6.4 A structured output format

:112-128 requires every finding to include file, line number, severity, category, description, exploit scenario, and fix recommendation. For issue reporting, each finding must be self-contained, carrying enough information for the recipient to act directly, while being both human-friendly and LLM-parse-friendly.

📎 paste — security-review.ts:41, :134, :137 (confidence) and :112-128 (finding format)
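
A self-contained finding can be enforced with a simple completeness check before the report is handed downstream. The field list follows the seven items named above; the camelCase property names and `missingFields` are assumptions for illustration.

```typescript
interface SecurityFinding {
  file: string;
  line: number;
  severity: string;
  category: string;
  description: string;
  exploitScenario: string;
  fixRecommendation: string;
}

// Lists the required fields that are absent or empty in a draft finding.
function missingFields(draft: Partial<SecurityFinding>): string[] {
  const required: (keyof SecurityFinding)[] = [
    "file",
    "line",
    "severity",
    "category",
    "description",
    "exploitScenario",
    "fixRecommendation",
  ];
  const missing: string[] = [];
  for (const key of required) {
    const value = draft[key];
    if (value === undefined || value === "") {
      missing.push(key);
    }
  }
  return missing;
}
```

A finding with any missing field can be bounced back to the Evaluator instead of being passed to a recipient who would have to reconstruct the context.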

II. A few methodology-level takeaways

Quantify the trigger condition of every quality rule. Claude Code doesn’t write “important changes must be verified”; it writes “3+ file edits, backend/API changes, infrastructure changes.” An Evaluator shouldn’t carry too many qualitative descriptions, because only quantitative ones can serve as a gate.
Use paired (dual) constraints throughout, so that an LLM cannot game a one-sided metric: False Positive vs False Negative / Defensive Hedge, Gold-Plate vs Half-Finish, Blind Retry vs Premature Abandon.
Collect failure patterns day to day; they matter a great deal for continuously constraining the Agent’s work (see 1.1, 1.2, 1.5, 6.1 above).
The Evaluator’s feedback format should be a “reproducible chain of evidence” (see 1.4, 6.4 above).
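
A quantified trigger of this kind translates directly into a gate. The sketch below encodes the "3+ file edits, backend/API changes, infrastructure changes" rule; how a change gets classified as backend or infrastructure is left open, and `ChangeSummary` and `needsVerification` are hypothetical names.

```typescript
interface ChangeSummary {
  filesEdited: number;
  touchesBackendOrApi: boolean;
  touchesInfrastructure: boolean;
}

// True when the change crosses any of the quantified thresholds.
function needsVerification(change: ChangeSummary): boolean {
  return change.filesEdited >= 3 || change.touchesBackendOrApi || change.touchesInfrastructure;
}
```

The qualitative version ("important changes") cannot be written as a function like this at all, which is the practical argument for quantifying.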

III. A measured view of where AI coding actually stands

A comment in prompts.ts notes that Capybara v8 (apparently Anthropic’s internal codename for a flagship model) had a false-claims rate of 29–30%, compared with 16.7% for v4.

📎 paste — prompts.ts // @[MODEL LAUNCH]: False-claims mitigation for Capybara v8 (29-30% FC rate vs v4's 16.7%)

That’s a fairly high rate, which is why Claude Code carries so many measures to keep the model from lying and, when it lies anyway, to catch it through independent verification. For instance, as analyzed above, even a “simple” file-edit tool gets 10+ layers of validation, and section 1.5 makes the almost comical move of listing the model’s favorite excuses verbatim and rebutting them in advance. The protective layers and test layers we want to build after every Agent action exist for the same underlying reason.

An LLM gets requirements analysis wrong, gets code-from-requirements wrong, gets test generation wrong, and gets test execution and its verdict wrong too — all those error rates multiply layer upon layer; it’s really a gamble. At this stage, whatever the change, the goal of AI coding should never be so-called fully-automated Agents that cut humans out entirely. How to get humans to participate efficiently in the process is the thing to think hardest about when shifting paradigms.
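
The compounding is easy to make concrete. Assuming the stages fail independently (a simplification) and using made-up per-stage success rates, the end-to-end success probability is just the product:

```typescript
// Probability that every stage succeeds, assuming independent stages.
function endToEndSuccess(stageSuccessRates: number[]): number {
  let probability = 1;
  for (const rate of stageSuccessRates) {
    if (rate < 0 || rate > 1) {
      throw new Error("each rate must be in [0, 1]");
    }
    probability *= rate;
  }
  return probability;
}

// Illustrative numbers only: requirements, code, tests, execution at 90% each.
const fourStages = endToEndSuccess([0.9, 0.9, 0.9, 0.9]); // roughly 0.656
```

Even with each stage right nine times out of ten, about a third of end-to-end runs go wrong somewhere, which is the arithmetic behind keeping a human in the loop.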

IV. A concrete capability-reuse implementation

Folding in everything worth borrowing above, the two entry-level Evaluators below can serve as a starting version in a real project.

⟦CODE 7⟧ Spec Evaluator (spec-eval) — full prompt

// ⟦CODE 7⟧ paste here: Spec Evaluator (spec-eval) prompt

⟦CODE 8⟧ Code Evaluator (code-eval) — full prompt

// ⟦CODE 8⟧ paste here: Code Evaluator (code-eval) prompt