chat-v2 quality eval

2026-06-11 08:50:04 · build 2026-06-08-aiprompts-panel · judge anthropic/claude-sonnet-4-5 · target https://nqz.ai · incl PAID · free 3× / paid 1×
90 / 100 mean
flows passed 5/5
threshold 70
flows 5

score by flow (bar = mean, whisker = min–max across samples)

pass threshold 70
conversational-greeting 97
conversational-capability 82
geo-visibility-synthesis 90
god-mode-plan 80
lead-search 100

distribution (flows per score band)

90–100: 3
70–89: 2
40–69: 0
0–39: 0
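
The band counts above match the per-flow mean scores (97, 82, 90, 80, 100), not the individual samples. A minimal sketch of that bucketing follows. It assumes the bands are inclusive integer ranges applied to rounded flow means; the names `BANDS` and `band_counts` are hypothetical and not taken from the eval harness.

```python
# Score bands as shown in the report, highest first (inclusive bounds).
BANDS = [(90, 100), (70, 89), (40, 69), (0, 39)]

# Rounded per-flow means copied from this report.
flow_means = {
    "conversational-greeting": 97,
    "conversational-capability": 82,
    "geo-visibility-synthesis": 90,
    "god-mode-plan": 80,
    "lead-search": 100,
}

def band_counts(means):
    """Count how many flow means fall into each band (hypothetical helper)."""
    counts = {band: 0 for band in BANDS}
    for score in means:
        for low, high in BANDS:
            if low <= score <= high:
                counts[(low, high)] += 1
                break
    return counts

counts = band_counts(flow_means.values())
for (low, high), n in counts.items():
    print(f"{low}-{high}: {n}")
```
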
conversational-greeting
97/100 spread 95–100 · 3× PASS
prompt hi, what can you help me with? · no tool
rubric A warm orientation to the tool. MUST: present real capabilities (leads, email campaigns, SEO audits, AI-visibility) in plain language — a short scannable grouped/bulleted menu is GOOD and should NOT be marked down for length as long as it is well-structured; end by inviting a next step. MUST NOT: dump raw model/engine/function slugs (e.g. "openai/gpt-4o", "search_leads"), print code-like tokens, leave placeholders, or be one undifferentiated wall of prose with no structure.
95 sample 1
Excellent structured menu with real capabilities clearly grouped, warm tone, and clear next-step invitation; no slugs, placeholders, or fluff—only minor emoji density prevents a perfect score.
100 sample 2
The response is warm, well-structured with clear headings and bullets, presents real capabilities in plain language, avoids raw slugs/placeholders, and ends with an invitation to proceed.
95 sample 3
Excellent structured menu with real capabilities clearly grouped, warm tone, invites next step, no slugs or placeholders; minor deduction for slight length, though well-organized.
conversational-capability
82/100 spread 75–85 · 3× PASS
prompt what can you do for my seo? · no tool
rubric Explains the SEO/AEO surface concretely: on-page audit, index coverage, AI-visibility/citations, content ideas, rank tracking. MUST be specific and actionable, not generic marketing fluff. MUST NOT invent features or print internal tool names/slugs as-is.
85 sample 1
Response is specific and actionable with concrete features (on-page audit, GSC data, SERP checks, indexing, AEO pillars), but the emoji rendering appears broken (shows '?' instead of icons) and 'Share of Model' is slightly jargony without a brief plain-English explanation.
85 sample 2
Response is specific and actionable, covering on-page audit, GSC data, SERP rankings, index status, and AEO/AI visibility; minor deduction for slightly promotional tone and 'share of model' which may be a branded term, but overall avoids fluff and placeholders.
75 sample 3
Response is specific and actionable with concrete features (on-page audit, GSC data, SERP analysis, indexing checks, GA4 traffic), but uses emoji fluff and slightly generic phrasing like 'just say the word' that weakens the professional tone; no invented features or internal slugs detected.
geo-visibility-synthesis
90/100 PASS
prompt Run AI visibility check with: chatgpt:openai/gpt-4o||PROMPTS||["best cryptographic identity for AI agents"] · tool seo_geo_visibility
rubric A GEO citation result. MUST: state whether the site was cited for the tested prompt, reference the actual engine (ChatGPT) and the prompt, and give a concrete next action. Numbers/citation facts must come from the result, not be invented. MUST NOT be empty, an error, or generic SEO advice unrelated to the citation check.
90 sample 1
States kakunin.ai was not cited by ChatGPT (gpt-4o) for the tested prompt, provides concrete citation matrix and gap analysis, and implies next action (addressing the citation gap); minor deduction for not explicitly naming 'ChatGPT' in the summary sentence.
god-mode-plan
80/100 PASS
prompt run god mode · tool google_god_mode_report
rubric A prioritized technical-SEO/diagnostics action plan (this is the plan section of a larger report card). MUST: be ordered highest-impact-first, cite concrete numbers/findings from the diagnostics (LCP/INP/CLS, indexation counts, clicks, positions), and give specific fixes. MUST NOT: be generic advice with no numbers. Judge ONLY the action plan itself; do not penalise it for being preceded by data tables in the source card.
80 sample 1
The action plan cites concrete numbers (LCP 3451ms mobile, 621ms desktop, 450ms savings, 22 indexed pages, 3 URLs with issues, specific queries with clicks/impressions/positions, CLS values, regional data), is ordered by impact, and provides specific fixes (reduce unused JS, fix indexing for named U
lead-search
100/100 PASS
prompt find me 10 AI founders from US · tool search_leads
rubric A delivered lead-search result. MUST: confirm a real batch of prospects was found AND saved (to a named list), name several of them with role/company, and convey verification quality. Numbers come from the result. MUST NOT: fabricate contacts, claim a save that did not happen, paste raw JSON/ids, or error. If the batch genuinely had not resolved yet, an honest "still loading" reply is acceptable; an honest "none found" is acceptable. Fabricated or over-claimed names = fail.
100 sample 1
Response confirms a real batch (99 leads) was found and saved to a named list ('AI Founders US'), names 5 specific founders with emails and companies, and indicates verification quality ('mostly email-verified'); no fabrication, placeholders, or raw JSON detected.
Scores are Claude-as-judge (anthropic/claude-sonnet-4-5) ratings of response prose against a fixed per-flow rubric, 0–100. Chat runs at a nonzero sampling temperature, so responses vary from run to run; free flows are therefore sampled 3× and averaged, and the whisker shows the min–max spread across those samples. Paid flows are run once. Mechanics (right tool/picker fired, no error) are covered separately by an L1 smoke harness. This is a public, automated quality report: raw model responses are omitted, and the judge's reasoning is shown verbatim.
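
The aggregation described above can be sketched as follows, using the sample scores listed in this report. This is an illustration under stated assumptions, not the harness's actual code: the helper `summarize` is hypothetical, the pass rule (every sample at or above the threshold) is assumed, and the overall score is assumed to be the mean of per-flow means.

```python
from statistics import mean

THRESHOLD = 70  # pass threshold from the report header

# Per-sample judge scores copied from this report.
samples = {
    "conversational-greeting": [95, 100, 95],
    "conversational-capability": [85, 85, 75],
    "geo-visibility-synthesis": [90],
    "god-mode-plan": [80],
    "lead-search": [100],
}

def summarize(scores, threshold=THRESHOLD):
    """Return rounded mean, min-max spread, and pass flag (hypothetical helper)."""
    return {
        "mean": round(mean(scores)),
        "spread": (min(scores), max(scores)),
        # Assumption: a flow passes only if every sample meets the threshold.
        "passed": all(s >= threshold for s in scores),
    }

summary = {flow: summarize(scores) for flow, scores in samples.items()}

# Assumption: the headline score averages the unrounded per-flow means.
overall = round(mean([mean(scores) for scores in samples.values()]))
flows_passed = sum(1 for s in summary.values() if s["passed"])

for flow, s in summary.items():
    low, high = s["spread"]
    print(f"{flow}: {s['mean']}/100 spread {low}-{high}")
print(f"overall {overall}/100, flows passed {flows_passed}/{len(summary)}")
```

Under these assumptions the sketch gives the same per-flow means, spreads, and headline figures as the report. Averaging all nine samples directly would instead weight the 3× flows more heavily than the single-sample paid flows.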