Early AI outputs sounded confident: "Your mobile experience needs work." So I asked: "Based on what?" My suspicion was confirmed: the AI was generating plausible-sounding recommendations with no evidence behind them.

What I enforced

Every recommendation must trace to evidence through a strict three-step chain: first, a signal is detected in the actual crawl data; second, a finding is generated from that signal; third, a recommendation is written from the finding. If the chain breaks at any point, the recommendation doesn't ship.

  • If I didn't test it, I say "Not tested"
  • If I got blocked, I say "Unable to verify"
  • If I'm uncertain, I say so
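The chain and the honest status labels above can be sketched in code. This is a minimal illustration, not the actual implementation: the type names (`Signal`, `Finding`, `Recommendation`) and the `build_recommendation` helper are hypothetical, chosen only to show how a broken evidence chain prevents a recommendation from shipping.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical types for illustration; names are not from the real system.
@dataclass
class Signal:
    source: str   # where in the crawl data this was observed
    value: str

@dataclass
class Finding:
    signal: Signal   # a finding must point back at a real signal
    summary: str

@dataclass
class Recommendation:
    finding: Finding   # a recommendation must point back at a finding
    text: str
    status: str = "Verified"   # or "Not tested" / "Unable to verify"

def build_recommendation(signal: Optional[Signal],
                         summary: str,
                         advice: str) -> Optional[Recommendation]:
    """Ship a recommendation only if the evidence chain is intact."""
    if signal is None:
        # Broken chain: no crawl evidence, so nothing ships.
        return None
    finding = Finding(signal=signal, summary=summary)
    return Recommendation(finding=finding, text=advice)

# A chain backed by crawl evidence ships; one without evidence does not.
sig = Signal(source="crawl:/checkout", value="page returned 401")
shipped = build_recommendation(sig,
                               "Checkout requires authentication",
                               "Unable to verify checkout flow")
blocked = build_recommendation(None,
                               "Mobile feels slow",
                               "Improve mobile UX")
```

Here `shipped` carries its full evidence trail back to the crawl, while `blocked` is `None` because no signal existed, which is exactly the "doesn't ship" rule.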

Why honesty builds trust

Users don't need perfect scores. They need to know what's real. A report that says "We couldn't test your checkout flow because the page required authentication" is more valuable than one that invents a score. Transparency about limitations actually increases confidence in the findings you can support.