Every Stackra scan runs three AI persona reviews in parallel. A CMO perspective on conversion and trust. An SEO specialist on discoverability and structure. A CTO on performance, security, and accessibility. Each one contributes 30% to its respective pillar score: Conversion & Trust, Search Visibility, and Technical Confidence. The other 70% is deterministic: Lighthouse, Cheerio, link checking, schema detection. Hard data.

Thirty percent is meaningful influence. A 20-point swing between what the data says and what a persona scores moves the final pillar result by 6 points. Across hundreds of scans, that adds up. So the question I had to answer honestly was: are these personas reliable enough to justify the weight we give them?

30% AI influence is only defensible if you can prove the AI earns it on every scan.
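The blend itself is trivial arithmetic. A minimal sketch, with an illustrative function name rather than our production code:

```python
def pillar_score(deterministic: float, persona: float) -> float:
    """Blend the deterministic checks (70%) with the persona review (30%)."""
    return 0.7 * deterministic + 0.3 * persona

# A 20-point disagreement between persona and data moves the pillar by 6.
agree = pillar_score(80, 60)
disagree = pillar_score(80, 80)
print(round(disagree - agree, 1))  # -> 6.0
```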

The model stack

The three personas (CMO, SEO, CTO) run on GPT-4o, in parallel, on every complete scan. GPT-4o handles the structured analysis: reading the data pack for a site, identifying genuine strengths and weaknesses, and outputting a numeric score alongside the review. Separately, visual analysis of homepage screenshots runs on GPT-5.4, which is better suited to interpreting rendered designs than raw HTML. Business intelligence classification (industry, business model, and offerings) runs on GPT-4o-mini, where speed and cost matter more than depth. The report PDF layout analysis uses GPT-5.2. Each model is doing a different job, and the choice reflects that.
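The persona fan-out is the part worth sketching. Function names and the asyncio shape here are illustrative, not our actual pipeline:

```python
import asyncio

PERSONAS = ("cmo", "seo", "cto")

async def run_persona(persona: str, data_pack: dict) -> dict:
    # Placeholder for the real GPT-4o call; returns a review plus a
    # numeric score for the persona's pillar.
    await asyncio.sleep(0)  # stands in for network latency
    return {"persona": persona, "score": 50, "review": "..."}

async def run_all_personas(data_pack: dict) -> list[dict]:
    # All three persona reviews run concurrently on every complete scan.
    return await asyncio.gather(*(run_persona(p, data_pack) for p in PERSONAS))

reviews = asyncio.run(run_all_personas({"url": "https://example.com"}))
```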

What we saw early

The first version of the persona prompts produced output that looked reasonable on the surface but had real problems underneath:

  • Generic output: recommendations that could apply to any website regardless of what the data showed. "Add testimonials." "Improve your CTA." No reference to anything we actually detected on the site.
  • Quantity blindness: recommending adding client logos on a site that already had seven. Recommending more social proof on a site with fourteen reviews.
  • Contradiction: a persona listing strong trust signals as a strength, then recommending adding trust signals in the same review.
  • Priority inflation: three or four recommendations all marked high priority, which makes none of them actually high priority.
  • Domain crossover: the CMO flagging Lighthouse scores, the SEO recommending a contact form, the CTO commenting on brand messaging. Each opining on territory it doesn't own.

Building the quality rubric

Before we could fix anything, we needed a way to measure it. We built a quality rubric that evaluates every persona output after it's generated. Each review is scored against weighted criteria on a 0-2 scale. Universal criteria apply to all three personas: evidence-based statements (are specific numbers being cited?), balanced assessment (are there genuine strengths and weaknesses?), and specificity (is the output free of generic boilerplate that could apply to any site?). Then each persona gets domain-specific checks. The CMO is evaluated on whether it references actual trust scores and CTA counts. The SEO on whether it cites content metrics and technical SEO signals. The CTO on whether it anchors to Lighthouse scores, core web vitals, and security grades. And critically, each persona is checked for whether it stays in its own lane.
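A stripped-down version of the rubric looks something like this. The criterion names and weights are illustrative, not our real configuration:

```python
# Weighted criteria, each graded 0-2. Universal criteria apply to every
# persona; each persona adds its own domain-specific checks.
UNIVERSAL = {"evidence_based": 3, "balanced": 2, "specificity": 2}
DOMAIN = {
    "cmo": {"cites_trust_and_cta_data": 3, "stays_in_lane": 2},
    "seo": {"cites_content_metrics": 3, "stays_in_lane": 2},
    "cto": {"anchors_to_lighthouse": 3, "stays_in_lane": 2},
}

def rubric_score(persona: str, grades: dict) -> float:
    """grades maps criterion name -> 0, 1, or 2. Returns 0-100."""
    criteria = {**UNIVERSAL, **DOMAIN[persona]}
    earned = sum(weight * grades.get(name, 0) for name, weight in criteria.items())
    possible = sum(2 * weight for weight in criteria.values())
    return 100 * earned / possible

score = rubric_score("cmo", {
    "evidence_based": 2, "balanced": 2, "specificity": 1,
    "cites_trust_and_cta_data": 2, "stays_in_lane": 2,
})
```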

The guardrails we built from what we found

The rubric told us where things were going wrong. The guardrails are what we built to stop it from happening in the first place.

  • Data-first grounding: every claim must trace to a field in the data pack. If a data field says None detected, that feature does not exist. The persona cannot claim otherwise.
  • Detected features cross-check: before every recommendation, the persona checks whether the feature is already present. If it's there, no recommendation to add it. Only suggestions to improve or reposition it.
  • Quantity thresholds: when three or more testimonials, CTAs, or client logos are present, adding more is explicitly banned. Quality improvements only.
  • Hard scope boundaries: banned terminology lists per persona. CMO cannot mention LCP, Lighthouse, or security headers. SEO cannot mention conversion rates or alt text coverage. CTO cannot mention brand messaging or marketing funnels. Violations score zero on the rubric.
  • Self-consistency check: if a feature appears in strengths, it cannot appear in weaknesses or recommendations. The prompt requires re-reading before finalising.
  • Priority calibration: a maximum of one or two high-priority recommendations per persona. If three or more are marked high, the prompt instructs downgrading the least critical ones.
  • Permission to say nothing: explicitly built in. If a site is genuinely strong in a domain, the persona should say so and not manufacture recommendations to fill space.
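A few of these guardrails can also be expressed as post-generation checks. This is a sketch with a made-up recommendation shape; in practice most of the enforcement lives in the prompts:

```python
QUANTITY_THRESHOLD = 3  # 3+ testimonials/CTAs/logos: no "add more"

def violates_feature_crosscheck(rec: dict, detected: dict) -> bool:
    """Reject 'add X' recommendations when X is already detected."""
    return rec["action"] == "add" and detected.get(rec["feature"], 0) >= 1

def violates_quantity_threshold(rec: dict, detected: dict) -> bool:
    """Reject 'add more of X' once the quantity threshold is met."""
    return (rec["action"] == "add_more"
            and detected.get(rec["feature"], 0) >= QUANTITY_THRESHOLD)

def calibrate_priorities(recs: list[dict], max_high: int = 2) -> list[dict]:
    """Downgrade surplus high-priority recommendations to medium."""
    highs = [r for r in recs if r["priority"] == "high"]
    for rec in highs[max_high:]:  # keep the first max_high untouched
        rec["priority"] = "medium"
    return recs
```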

What the data shows

We ran the quality rubric across our scan history to establish a baseline, then tracked improvement as we shipped each guardrail iteration. The CMO had the most variance and the hardest job, since marketing judgment is more subjective than a technical audit. Its share of scans returning a score below 50 dropped from 14% across all-time history to effectively zero in the last 50 scans. Average CMO scores moved from 68 to 73. Median from 74 to 75. The SEO persona now returns scores above 70 in 92% of recent scans, up from 75% historically. The CTO eliminated below-50 outputs entirely in recent scans.

Zero low-confidence outputs across all three personas in 246 production scans. The rubric hasn't caught a single one.

What this doesn't solve

The quality rubric is a monitoring and improvement tool. It tells us when a persona output is weak and feeds back into prompt iteration. It doesn't currently gate the score. A persona that scores poorly on the rubric still contributes its 30% weight. The mathematical protection is the 70/30 anchor itself: even a poorly calibrated persona can only move a pillar score by roughly 15 points in the worst case, and defaults to a neutral 50 if it fails entirely. But closing the loop (using rubric grades to automatically discount low-quality persona contributions) is a logical next step. The infrastructure is already there.
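The bound, plus one hypothetical way to close that loop (not shipped; the gating function is an assumption of mine, not our implementation):

```python
def pillar(deterministic: float, persona: float = 50.0) -> float:
    # Persona defaults to a neutral 50 if it fails entirely.
    return 0.7 * deterministic + 0.3 * persona

# Worst case: persona at 0 or 100 vs the neutral 50 fallback.
# 0.3 * |persona - 50| is at most 0.3 * 50 = 15 points.
worst_case = max(abs(pillar(70, p) - pillar(70)) for p in (0, 100))

def gated_pillar(deterministic: float, persona: float, rubric: float) -> float:
    """Scale the persona's weight by its rubric grade (0-100)."""
    weight = 0.3 * min(rubric / 100, 1.0)  # weak outputs get discounted
    return (1 - weight) * deterministic + weight * persona
```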

Why this matters for the score

A score that users can trust has to be built on components that earn their weight. The 70% deterministic foundation is stable by design. Lighthouse results are Lighthouse results. A missing meta description is a missing meta description. The 30% persona contribution is only defensible if the personas are consistently grounded in the data, staying in their domains, and not manufacturing findings. The quality rubric and the guardrails are how we made that true. The scan data is how we know it's working.