chat-v2 quality eval

2026-06-11 08:50:04 · build 2026-06-08-aiprompts-panel · judge anthropic/claude-sonnet-4-5 · target https://nqz.ai · incl PAID · free 3× / paid 1×

flows passed 5/5

threshold 70

flows 5

[chart: score by flow (bar = mean, whisker = min–max across samples)]

[chart: score distribution by band: 90–100, 70–89, 40–69, 0–39]
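
The distribution bands above (90–100, 70–89, 40–69, 0–39) can be reproduced from the per-flow scores in this report. The sketch below is illustrative only: it assumes the chart counts one score per flow (the mean for sampled flows), and the function name `band_counts` is hypothetical, not part of the eval harness.

```python
# Score bands as labelled in the distribution chart (inclusive bounds).
BANDS = [(90, 100), (70, 89), (40, 69), (0, 39)]


def band_counts(scores):
    """Count how many scores fall into each band (hypothetical helper)."""
    counts = {f"{lo}-{hi}": 0 for lo, hi in BANDS}
    for score in scores:
        for lo, hi in BANDS:
            if lo <= score <= hi:
                counts[f"{lo}-{hi}"] += 1
                break
    return counts


# Per-flow scores as reported in this document.
flow_scores = {
    "conversational-greeting": 97,
    "conversational-capability": 82,
    "geo-visibility-synthesis": 90,
    "god-mode-plan": 80,
    "lead-search": 100,
}

counts = band_counts(flow_scores.values())
print(counts)
```

Under that assumption, three flows land in the 90–100 band and two in the 70–89 band, so every flow clears the threshold of 70.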

conversational-greeting

97/100 spread 95–100 · 3× PASS

prompt hi, what can you help me with? · no tool

rubric A warm orientation to the tool. MUST: present real capabilities (leads, email campaigns, SEO audits, AI-visibility) in plain language — a short scannable grouped/bulleted menu is GOOD and should NOT be marked down for length as long as it is well-structured; end by inviting a next step. MUST NOT: dump raw model/engine/function slugs (e.g. "openai/gpt-4o", "search_leads"), print code-like tokens, leave placeholders, or be one undifferentiated wall of prose with no structure.

95 sample 1

Excellent structured menu with real capabilities clearly grouped, warm tone, and clear next-step invitation; no slugs, placeholders, or fluff—only minor emoji density prevents a perfect score.

100 sample 2

The response is warm, well-structured with clear headings and bullets, presents real capabilities in plain language, avoids raw slugs/placeholders, and ends with an invitation to proceed.

95 sample 3

Excellent structured menu with real capabilities clearly grouped, warm tone, invites next step, no slugs or placeholders; minor deduction for slight length, though well-organized.

conversational-capability

82/100 spread 75–85 · 3× PASS

prompt what can you do for my seo? · no tool

rubric Explains the SEO/AEO surface concretely: on-page audit, index coverage, AI-visibility/citations, content ideas, rank tracking. MUST be specific and actionable, not generic marketing fluff. MUST NOT invent features or print internal tool names/slugs as-is.

85 sample 1

Response is specific and actionable with concrete features (on-page audit, GSC data, SERP checks, indexing, AEO pillars), but the emoji rendering appears broken (shows '?' instead of icons) and 'Share of Model' is slightly jargony without a brief plain-English explanation.

85 sample 2

Response is specific and actionable, covering on-page audit, GSC data, SERP rankings, index status, and AEO/AI visibility; minor deduction for slightly promotional tone and 'share of model' which may be a branded term, but overall avoids fluff and placeholders.

75 sample 3

Response is specific and actionable with concrete features (on-page audit, GSC data, SERP analysis, indexing checks, GA4 traffic), but uses emoji fluff and slightly generic phrasing like 'just say the word' that weakens the professional tone; no invented features or internal slugs detected.

geo-visibility-synthesis paid

90/100 PASS

prompt Run AI visibility check with: chatgpt:openai/gpt-4o||PROMPTS||["best cryptographic identity for AI agents"] · tool seo_geo_visibility

rubric A GEO citation result. MUST: state whether the site was cited for the tested prompt, reference the actual engine (ChatGPT) and the prompt, and give a concrete next action. Numbers/citation facts must come from the result, not be invented. MUST NOT be empty, an error, or generic SEO advice unrelated to the citation check.

90 sample 1

States kakunin.ai was not cited by ChatGPT (gpt-4o) for the tested prompt, provides concrete citation matrix and gap analysis, and implies next action (addressing the citation gap); minor deduction for not explicitly naming 'ChatGPT' in the summary sentence.
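
The prompt for this flow carries a compact payload after "Run AI visibility check with:". As a rough illustration, the sketch below splits that payload into its parts. It assumes the shape is `<engine>:<model slug>||PROMPTS||<JSON array of prompts>`, inferred from the single example in this report; the function name `parse_visibility_payload` is hypothetical and the real tool's parser may differ.

```python
import json

SEPARATOR = "||PROMPTS||"


def parse_visibility_payload(payload):
    """Split an 'engine:model||PROMPTS||[...]' string (assumed format)."""
    head, sep, tail = payload.partition(SEPARATOR)
    if not sep:
        raise ValueError("payload is missing the ||PROMPTS|| separator")
    # Split on the first colon only, so the model slug keeps its own slash.
    engine, colon, model = head.partition(":")
    if not colon:
        raise ValueError("expected '<engine>:<model>' before the separator")
    prompts = json.loads(tail)
    if not isinstance(prompts, list):
        raise ValueError("expected a JSON array of prompts")
    return {"engine": engine, "model": model, "prompts": prompts}


raw = 'chatgpt:openai/gpt-4o||PROMPTS||["best cryptographic identity for AI agents"]'
parsed = parse_visibility_payload(raw)
print(parsed)
```

Separating the engine label from the model slug matters for the rubric above: the response is expected to name the engine (ChatGPT) and the tested prompt rather than echo the raw slug.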

god-mode-plan paid

80/100 PASS

prompt run god mode · tool google_god_mode_report

rubric A prioritized technical-SEO/diagnostics action plan (this is the plan section of a larger report card). MUST: be ordered highest-impact-first, cite concrete numbers/findings from the diagnostics (LCP/INP/CLS, indexation counts, clicks, positions), and give specific fixes. MUST NOT: be generic advice with no numbers. Judge ONLY the action plan itself; do not penalise it for being preceded by data tables in the source card.

80 sample 1

The action plan cites concrete numbers (LCP 3451ms mobile, 621ms desktop, 450ms savings, 22 indexed pages, 3 URLs with issues, specific queries with clicks/impressions/positions, CLS values, regional data), is ordered by impact, and provides specific fixes (reduce unused JS, fix indexing for named U

lead-search paid

100/100 PASS

prompt find me 10 AI founders from US · tool search_leads

rubric A delivered lead-search result. MUST: confirm a real batch of prospects was found AND saved (to a named list), name several of them with role/company, and convey verification quality. Numbers come from the result. MUST NOT: fabricate contacts, claim a save that did not happen, paste raw JSON/ids, or error. If the batch genuinely had not resolved yet, an honest "still loading" reply is acceptable; an honest "none found" is acceptable. Fabricated or over-claimed names = fail.

100 sample 1

Response confirms a real batch (99 leads) was found and saved to a named list ('AI Founders US'), names 5 specific founders with emails and companies, and indicates verification quality ('mostly email-verified'); no fabrication, placeholders, or raw JSON detected.

Scores are Claude-as-judge (anthropic/claude-sonnet-4-5) ratings of response prose against a fixed per-flow rubric, 0–100. Chat runs at a nonzero sampling temperature, so output varies between runs; free flows are therefore sampled 3× and averaged, and the whisker shows the run-to-run spread (min–max). Paid flows are sampled once. Mechanics (right tool/picker fired, no error) are covered separately by an L1 smoke harness. Build 2026-06-08-aiprompts-panel. This is a public, automated quality report: raw model responses are omitted, and the judge's reasoning is shown verbatim.
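
The averaging and spread described above can be checked against the sample scores in this report. This is a minimal sketch, not the harness's actual code: the name `summarize` is hypothetical, and it assumes the headline score is the sample mean rounded to the nearest integer and that PASS is counted per sample against the threshold of 70.

```python
from statistics import mean

THRESHOLD = 70  # pass threshold stated in the report header


def summarize(samples, threshold=THRESHOLD):
    """Aggregate judge scores for one flow (hypothetical helper)."""
    return {
        # Python's round() rounds exact .5 ties to even; no tie occurs here.
        "score": round(mean(samples)),
        "spread": (min(samples), max(samples)),
        "passes": sum(1 for s in samples if s >= threshold),
        "runs": len(samples),
    }


greeting = summarize([95, 100, 95])    # conversational-greeting samples
capability = summarize([85, 85, 75])   # conversational-capability samples
print(greeting)
print(capability)
```

With these assumptions the sketch reproduces the reported figures: 97 with spread 95–100 for conversational-greeting, and 82 with spread 75–85 for conversational-capability, each with 3 of 3 samples passing.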