Case Study
10 min read · June 25, 2026 · Verified against codebase

We Scanned 11.3 Million Real Websites to Build Our Benchmarks. Here's What We Found.

Every benchmark Stackra shows you (CWV pass rates, schema adoption, tool gaps by platform) comes from a corpus we built ourselves: 11.3 million real sites from the HTTP Archive, classified into business types and joined with real Chrome field data. Here is how we built it, what broke along the way, and the findings that surprised us.

Industry
B2B SaaS for small and mid-sized businesses
Stack
BigQuery, HTTP Archive, Chrome UX Report (CrUX), Vertex AI / Gemini, TypeScript
Outcome
An 11.3M-row corpus of real websites with platform detection, Core Web Vitals, schema/tool adoption, and AI-corrected business-type classification, refreshed under $5 a build and exported as a static JSON file the product reads at zero runtime cost.

At a glance

Our benchmarks needed to compare a scanned site against real peers on its actual platform, not an industry-wide average pulled from someone else's blog post.

11,310,433
Sites in the corpus
May 2026 HTTP Archive mobile crawl, after the inclusion gate, junk trim, and per-row cleanup
326
Platform cohorts shipped
after the n >= 100 floor; 539 raw before the floor
~$4.92
Build cost
one BigQuery scan, ~807 GB processed
19.94% -> 2.30%
US subset, business-type unclassified rate
after the Vertex AI / Gemini re-classification pass
2.8% vs 99.9%
Schema adoption gap, worst vs. best platform
Weebly has JSON-LD on 2.8% of sites; Squarespace, 99.9%
89.1%
robots.txt presence, live US sites
after excluding dead/unreachable origins from the denominator

Three pain points, by site type


1

An industry-wide CWV average tells a Squarespace user nothing useful

Affects: Any benchmarking feature that compares a scanned site to a single blended average instead of a platform- or business-type-specific peer group.

If this is you

If you're told your site's Largest Contentful Paint is 'below average,' your next question is: below average compared to what? A custom-coded enterprise site and a Weebly template are not playing the same game, and grading them against the same bar is either falsely reassuring or falsely alarming.

What it looks like

Early benchmark drafts used one global median across every platform. A Wix site with a 1.1s median LCP looked unremarkable next to the all-platform median. A Weebly site at 3.2s looked catastrophic next to the same number, when in reality it's catastrophic next to its own peers and unremarkable next to nothing.

What we tried first

We considered buying a third-party benchmark dataset. Most either don't break out by platform, sample only top-traffic sites (which skews everything toward sites that already have engineering teams), or cost more per month than the entire feature was worth building.

What worked

Build our own corpus from a source that's free, current, and platform-labeled: the HTTP Archive's monthly crawl, joined with the Chrome UX Report (CrUX) for real field performance data, not lab simulations.

Result

326 platform cohorts, each with at least 100 real sites, each carrying its own Core Web Vitals pass rate, schema adoption rate, and tool-adoption mix. A Weebly site gets compared to other Weebly sites.
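As a rough illustration of the cohort floor, here is a minimal TypeScript sketch that groups rows by platform and drops any cohort under 100 sites. The `SiteRow` and `Cohort` shapes and the `buildCohorts` name are assumptions made for the example; the real build does this aggregation in BigQuery, and its schema is not shown here.

```typescript
// Hypothetical row shape; the real corpus schema is not shown in this article.
interface SiteRow {
  origin: string;
  platform: string;
  hasJsonLd: boolean;
}

interface Cohort {
  platform: string;
  n: number;          // every qualifying site on this platform
  jsonLdRate: number; // share of the cohort with JSON-LD, 0..1
}

const MIN_COHORT_SIZE = 100; // the n >= 100 floor described above

function buildCohorts(rows: SiteRow[], floor: number = MIN_COHORT_SIZE): Cohort[] {
  // Tally sites and JSON-LD hits per platform.
  const tallies = new Map<string, { n: number; withJsonLd: number }>();
  for (const row of rows) {
    const t = tallies.get(row.platform) ?? { n: 0, withJsonLd: 0 };
    t.n += 1;
    if (row.hasJsonLd) t.withJsonLd += 1;
    tallies.set(row.platform, t);
  }

  // Keep only cohorts large enough to benchmark against.
  const cohorts: Cohort[] = [];
  tallies.forEach((t, platform) => {
    if (t.n >= floor) {
      cohorts.push({ platform, n: t.n, jsonLdRate: t.withJsonLd / t.n });
    }
  });
  return cohorts.sort((a, b) => b.n - a.n); // largest cohort first
}
```

A platform with 99 sites produces no cohort at all in this sketch, which is the point of the floor: no benchmark is better than one built on a handful of sites.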

2

CrUX only covers popular sites, so an inner join quietly throws away the long tail

Affects: Any pipeline joining a broad crawl dataset to a real-user-monitoring dataset like CrUX, RUM, or any panel-based field data source.

If this is you

If your benchmarking tool's sample skews toward sites with enough traffic to show up in Chrome's telemetry, it's quietly excluding most small businesses, the exact audience that needs the benchmark most.

What it looks like

CrUX only reports field data for origins with enough Chrome traffic to be statistically meaningful. That's a minority of the long tail. An inner join between the crawl and CrUX collapses the corpus back down toward only-popular sites.

What we tried first

Nothing. We caught this one before shipping, by checking what fraction of our candidate sites had a CrUX match.

What worked

A LEFT JOIN, not an INNER JOIN. Every qualifying site stays in the corpus. Sites with a CrUX match get real field CWV data; sites without get null field columns but still count toward adoption-rate and structural benchmarks (schema, tools, canonical tags).

Result

Each cohort carries both an n (every qualifying site) and an n_crux (the subset with field performance data). CWV-derived stats compute over the CrUX subset only; everything else uses the full cohort, so small sites without enough traffic to register in Chrome still get counted everywhere it's valid to count them.
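The n versus n_crux split can be sketched in a few lines of TypeScript. This is an in-memory illustration of left-join semantics, not the BigQuery SQL the build runs; the type names and the `passesCwv` field are assumptions for the example.

```typescript
interface CrawlSite { origin: string; platform: string; }
interface CruxRecord { origin: string; passesCwv: boolean; }

// null means "no CrUX match": the site stays in the corpus without field data.
interface JoinedSite extends CrawlSite { passesCwv: boolean | null; }

function leftJoinCrux(sites: CrawlSite[], crux: CruxRecord[]): JoinedSite[] {
  const byOrigin = new Map<string, CruxRecord>();
  crux.forEach((r) => byOrigin.set(r.origin, r));
  // Every crawl row is kept, matched or not. An inner join would drop the misses.
  return sites.map((s) => {
    const match = byOrigin.get(s.origin);
    return { ...s, passesCwv: match ? match.passesCwv : null };
  });
}

function cohortStats(joined: JoinedSite[]) {
  const withField = joined.filter((s) => s.passesCwv !== null);
  const passing = withField.filter((s) => s.passesCwv === true).length;
  return {
    n: joined.length,         // denominator for schema and tool adoption
    n_crux: withField.length, // denominator for CWV-derived stats only
    cwvPassRate: withField.length > 0 ? passing / withField.length : null,
  };
}
```

Keeping two denominators is the design choice that matters here: the pass rate is never diluted by sites with no field data, and adoption rates are never restricted to sites popular enough to have it.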

3

A SQL CASE statement can't tell a dental practice from a law firm

Affects: Any classification pipeline trying to bucket millions of arbitrary website titles and descriptions into a fixed taxonomy.

If this is you

If you're trying to compare your site to others like it, the classification step has to actually work, or every 'peer benchmark' downstream is comparing you to the wrong peers.

What it looks like

An early keyword-matching pass left 19.94% of the 1.2M-row US subset as unclassified, and a chunk of what was classified was wrong in predictable ways: a `tools_maps` signal that produced 600K false positives, a `Person` schema heuristic that produced 738K, an assumption that all of one platform's sites were a certain business type that turned out to be wrong for 233K of them.

What we tried first

Patching the keyword classifier with more rules. It works until the next platform-specific or industry-specific edge case shows up, and the false-positive list kept growing.

What worked

A two-layer fix. First, the real production classifier (the same `detectBusinessType()` Stackra uses on a live scan, not a separate SQL CASE statement) runs against text sourced from the HTTP Archive's rendered page bodies, so the corpus and a live scan agree by construction. Second, for the US subset specifically, a full re-run through Gemini (`ML.GENERATE_TEXT` in BigQuery) using a batched-numbered-list prompt pattern, with the old classification kept alongside the new one so every change is auditable.

Result

Unclassified rate on the US subset dropped from 19.94% to 2.30%. The corrected table keeps the old label as `business_type_previous`, so we can see exactly what changed and why, not just trust the new number blind.
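To make the batched-numbered-list pattern concrete, here is a hedged TypeScript sketch of building one prompt for several sites and mapping the numbered answers back to rows. The label set, the prompt wording, and the function names are all hypothetical; the actual pass runs through `ML.GENERATE_TEXT` in BigQuery and its prompt is not reproduced here. Only the `business_type_previous` column name comes from the pipeline described above.

```typescript
// Hypothetical taxonomy, for illustration only.
const LABELS = ["dental_practice", "law_firm", "restaurant", "unclassified"] as const;
type Label = (typeof LABELS)[number];

interface SiteText { origin: string; text: string; businessType: Label; }
interface Reclassified {
  origin: string;
  business_type: Label;
  business_type_previous: Label; // old label kept alongside for auditing
}

function buildBatchPrompt(batch: SiteText[]): string {
  // One numbered line per site so answers can be matched back by number.
  const items = batch.map((s, i) => `${i + 1}. ${s.text.replace(/\s+/g, " ").trim()}`);
  return [
    `Classify each numbered site as one of: ${LABELS.join(", ")}.`,
    "Answer with one line per site in the form <number>. <label>",
    "",
    ...items,
  ].join("\n");
}

function applyBatchResponse(batch: SiteText[], response: string): Reclassified[] {
  const answers = new Map<number, Label>();
  for (const line of response.split("\n")) {
    const m = /^\s*(\d+)\.\s*([a-z_]+)\s*$/.exec(line);
    if (!m) continue; // ignore anything that is not "<number>. <label>"
    const label = m[2] ?? "";
    if ((LABELS as readonly string[]).includes(label)) {
      answers.set(Number(m[1]), label as Label);
    }
  }
  return batch.map((s, i) => ({
    origin: s.origin,
    // Fall back to the old label when the model's line is missing or invalid.
    business_type: answers.get(i + 1) ?? s.businessType,
    business_type_previous: s.businessType,
  }));
}
```

Validating each answer against the fixed label list means a malformed or invented label never overwrites an existing classification; the row simply keeps its previous value.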

About Stackra

Stackra is a website audit tool for small and mid-sized businesses. The product's whole pitch is honest, specific feedback instead of generic SEO advice, which only works if the comparisons underneath it are grounded in real data. This corpus is that grounding.

Where we started

Stackra's benchmarks originally leaned on published industry averages and a handful of manually researched numbers. That's fine for a rough sense of scale, but it can't answer the question a scanned site actually needs answered: how do sites like mine, on my platform, in my industry, actually perform? Answering that meant building a real dataset, not finding a better blog post to cite.

The pipeline, step by step

The build is one large BigQuery job, run once per refresh, exported to a static JSON file the product reads at zero runtime cost. No live BigQuery queries happen when a user runs a scan.

  • Source: the HTTP Archive's May 2026 mobile root-page crawl, joined with the Chrome UX Report (CrUX) for real field performance data via a LEFT JOIN on origin, so the long tail of low-traffic sites isn't silently dropped.
  • Inclusion gate: content floor, plus either a recognized business platform (CMS, website builder, or ecommerce tag) or a 'vibecoded' signature (JS framework, PaaS, or static site generator with no platform tag), minus wikis and blog-only hosts. That gate kept 12.01M of 15.17M candidate sites.
  • Junk trim: our own host/TLD filter removes institutional domains (.gov/.edu/.mil), PaaS preview hosts (vercel.app, netlify.app), and blog/social hosting domains (blogspot, medium, substack). That trim removed another 349K sites; after platform detection and per-row cleanup, the committed corpus settles at the ~11.3M figure above. The full-corpus funnel is informative context; the rigorous focus is the US subset.
  • Platform detection: a Wappalyzer category-priority system (CMS > website builders > static site generators > ecommerce), not a hardcoded platform name list, so new platforms get caught automatically as the underlying data updates.
  • Business-type classification: runs in TypeScript using Stackra's actual production classifier against text sourced from rendered page bodies (title, H1, meta description), so the corpus and a live scan use identical logic.
  • Export: aggregation queries collapse the 11.3M rows into per-platform cohorts (a 100-site floor keeps any cohort statistically meaningful), written to server/data/corpus-benchmarks.json and committed to the repo.
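The category-priority idea from the platform detection step can be sketched as follows. The priority order is the one listed above; the exact category strings, the `DetectedTech` shape, and the `detectPlatform` name are assumptions, and the real tag names in the HTTP Archive tables may differ.

```typescript
// Priority order taken from the pipeline description; the category strings
// themselves are assumptions and may not match the real Wappalyzer tag names.
const CATEGORY_PRIORITY = [
  "CMS",
  "Website builders",
  "Static site generators",
  "Ecommerce",
];

interface DetectedTech {
  name: string;
  categories: string[];
}

// Returns the name of the first technology found in the highest-priority
// category, or null when no platform-like category is present.
function detectPlatform(techs: DetectedTech[]): string | null {
  for (const category of CATEGORY_PRIORITY) {
    const hit = techs.find((t) => t.categories.includes(category));
    if (hit) return hit.name;
  }
  return null;
}
```

Because the function keys on categories rather than on a list of platform names, a newly tagged CMS is picked up without a code change, which is the property the pipeline relies on.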

Tools we used and why

Every piece was picked to keep the build cheap, current, and re-runnable without a standing infrastructure bill.

Tooling and what it's for
| Tool | Role in the pipeline |
| --- | --- |
| HTTP Archive (BigQuery public dataset) | The crawl itself: millions of real sites, monthly cadence, free to query (you pay for the bytes you scan). |
| Chrome UX Report (CrUX) | Real Chrome field performance data (actual visitor experience), not a lab simulation. Joined on origin. |
| Wappalyzer technology tags (bundled in HTTP Archive) | Platform and tool detection: what CMS, what booking widget, what analytics tag, per site. |
| BigQuery | Runs the whole build as one large CTE chain in a single statement, avoiding intermediate storage cost. |
| Vertex AI / Gemini (gemini-2.5-flash-lite, via ML.GENERATE_TEXT in BigQuery) | Re-classified business type on the full 1.2M-row US subset where the keyword classifier left too much unclassified or wrong. |
| Overture Maps (free public BigQuery dataset) | Geo/address enrichment, joined separately, never fed into classification to avoid inheriting a third party's own classification mistakes. |
| TypeScript (Stackra's own classifier) | The actual business-type detection logic, run identically against the corpus and against a live scan. |

Cleaning the data: the mistakes worth naming

A few specific failures shaped the final pipeline. Each one cost real query budget to find.

  • structured_data carries two different escaping regimes in one text blob: Open Graph tags are unescaped with value before key, while JSON-LD is escaped with standard JSON quoting. A regex written for one regime silently returns near-empty results on the other; a prior build matched Open Graph at 6.2% and @type at 0.0% before this was caught.
  • The markup column has no title, meta description, or H1 fields, despite early planning assuming it did. Those fields turned out to live in a different column entirely (custom_metrics.wpt_bodies, the rendered page body), which is why business-type classification was dead in earlier builds.
  • third_parties, a metric we expected to expose for a 'how many trackers does this site load' benchmark, is null for effectively 100% of rows in the current crawl. The column is still computed for forward compatibility but never exposed, rather than shipping a benchmark built on nulls.
  • Borrowed third-party signals make bad classifier inputs. A tools_maps heuristic produced 600K false positives, a Person schema heuristic produced 738K, and an assumption that one platform's sites were always one business type was wrong for 233K sites. The fix was to trust our own detector logic over inherited tags, every time.
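One cheap guard against shipping a benchmark built on nulls, like the third_parties case above, is to measure each candidate metric's null rate before exposing it. The sketch below is an assumption-laden illustration: the helper names and the 50% cutoff are made up for the example and are not the pipeline's actual rule.

```typescript
// Share of rows where the picked value is null or undefined.
function nullRate<T>(rows: T[], pick: (row: T) => unknown): number {
  if (rows.length === 0) return 1; // no rows means nothing to benchmark
  const nulls = rows.filter((r) => {
    const v = pick(r);
    return v === null || v === undefined;
  }).length;
  return nulls / rows.length;
}

// Names of metrics populated often enough to expose.
// maxNullRate = 0.5 is an arbitrary illustrative cutoff.
function exposableMetrics<T>(
  rows: T[],
  metrics: Record<string, (row: T) => unknown>,
  maxNullRate: number = 0.5,
): string[] {
  return Object.entries(metrics)
    .filter(([, pick]) => nullRate(rows, pick) <= maxNullRate)
    .map(([name]) => name);
}
```

Running a check like this on a sample before the full build would also have flagged the near-empty regex matches, since a match rate of 0.0% is just a null rate of 100% under another name.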

How Stackra uses this

The corpus isn't a research artifact sitting in BigQuery. It powers a live feature.

  • Peer benchmarks in scan recommendations: a scanned site's Core Web Vitals, schema adoption, and tool usage get compared against its actual platform cohort, not a global average.
  • Tool-gap recommendations: if a site is missing analytics, booking, or chat that's standard for its platform and business type, that's flagged with real adoption-rate context, not a generic 'you should add live chat' suggestion.
  • Zero runtime cost: the entire corpus build happens once, gets exported to a static JSON file, and ships in the deploy. A user's scan never triggers a BigQuery query.
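A tool-gap check against the static export could look roughly like the sketch below. The JSON shape, the tool keys, the `toolGaps` name, and the 25% adoption cutoff are all hypothetical, since the layout of corpus-benchmarks.json is not shown here; the Shopify rates in the usage example are the ones reported in the tool-adoption table in this case study.

```typescript
// Hypothetical shape for one cohort in the static export.
interface CohortBenchmark {
  n: number;
  toolAdoption: Record<string, number>; // tool category -> share of cohort, 0..1
}
type BenchmarkFile = Record<string, CohortBenchmark>; // keyed by platform

interface ToolGap { tool: string; peerAdoption: number; }

// Tools the site lacks that at least minAdoption of its platform peers use.
// The 0.25 default is an illustrative cutoff, not Stackra's actual rule.
function toolGaps(
  file: BenchmarkFile,
  platform: string,
  siteTools: string[],
  minAdoption: number = 0.25,
): ToolGap[] {
  const cohort = file[platform];
  if (!cohort) return []; // no cohort (e.g. under the 100-site floor): no claim
  return Object.entries(cohort.toolAdoption)
    .filter(([tool, rate]) => rate >= minAdoption && !siteTools.includes(tool))
    .map(([tool, rate]) => ({ tool, peerAdoption: rate }))
    .sort((a, b) => b.peerAdoption - a.peerAdoption);
}

// Usage: a Shopify site that only runs analytics.
const exampleFile: BenchmarkFile = {
  Shopify: { n: 72063, toolAdoption: { analytics: 0.855, live_chat: 0.296, booking: 0.014 } },
};
const exampleGaps = toolGaps(exampleFile, "Shopify", ["analytics"]);
```

In this example only live chat is flagged: booking is missing too, but at 1.4% peer adoption it falls under the cutoff, so it would not be presented as standard for the platform.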

What we found: the platform gaps were bigger than expected

Pulling real numbers across the major platforms surfaced gaps that don't show up in marketing copy for any of these builders.

Core Web Vitals pass rate and schema adoption by platform
| Platform | Sites in cohort | CWV pass rate | Median CLS | JSON-LD adoption | SEO tool adoption |
| --- | --- | --- | --- | --- | --- |
| Shopify | 72,063 | 89.9% | 0 (passing) | 73.8% | 2.9% |
| GoDaddy Website Builder | 22,186 | 89.6% | 0 (passing) | 57.1% | 0.0% |
| Wix | 100,061 | 88.5% | 0 (passing) | 97.9% | 0.2% |
| Webflow | 19,835 | 81.2% | 0 (passing) | 38.2% | 4.9% |
| Squarespace | 89,196 | 80.9% | 0 (passing) | 99.9% | 0.2% |
| Drupal | 12,284 | 79.9% | 0 (passing) | 24.4% | 1.7% |
| Joomla | 9,985 | 68.9% | 0 (passing) | 28.3% | 0.6% |
| WordPress | 505,521 | 61.7% | 0 (passing) | 81.1% | 74.3% |
| Weebly | 19,674 | 30.8% | 0.4 (4x Google's 0.1 target) | 2.8% | 0.1% |

CWV pass rate is the share of sites meeting Google's Core Web Vitals thresholds across the CrUX-matched subset. WordPress's high SEO-tool adoption (Yoast, All in One SEO) doesn't translate to the best CWV pass rate; plugin weight has a real cost. Weebly is the outlier: worst CWV pass rate, worst schema adoption, and the only platform where median CLS misses Google's target outright.
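For readers who want the pass/fail rule spelled out, here is a minimal sketch using Google's published "good" thresholds at the 75th percentile: LCP at or under 2.5 s, CLS at or under 0.1, INP at or under 200 ms. The field names are hypothetical, and treating a missing INP value as non-blocking is an assumption of this sketch, not something the corpus build is documented to do.

```typescript
// 75th-percentile field values for one origin (hypothetical field names).
interface FieldP75 {
  lcpMs: number;
  cls: number;
  inpMs: number | null; // null when there is too little interaction data
}

// Google's "good" thresholds for Core Web Vitals.
const GOOD = { lcpMs: 2500, cls: 0.1, inpMs: 200 };

function passesCwv(m: FieldP75): boolean {
  // Assumption: an origin without INP data is judged on LCP and CLS alone.
  const inpOk = m.inpMs === null || m.inpMs <= GOOD.inpMs;
  return m.lcpMs <= GOOD.lcpMs && m.cls <= GOOD.cls && inpOk;
}

// Share of CrUX-matched sites that pass; null for an empty subset.
function cwvPassRate(sites: FieldP75[]): number | null {
  if (sites.length === 0) return null;
  return sites.filter(passesCwv).length / sites.length;
}
```

A site with a median-like Weebly CLS of 0.4 fails this check regardless of how fast it loads, which is why a layout-shift problem alone can drag down a platform's pass rate.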

What we found: most platforms are missing the same handful of tools

Tool adoption (booking, live chat, reviews, forms) varies enormously by platform, and not in the direction you'd guess from each platform's marketing.

Tool adoption by platform (selected categories)
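| Platform | Analytics | Live chat | Booking/scheduling |
| --- | --- | --- | --- |
| Weebly | 98.8% | 1.7% | 1.2% |
| Shopify | 85.5% | 29.6% | 1.4% |
| Drupal | 88.0% | 5.7% | 0.7% |
| WordPress | 78.8% | 6.7% | 2.6% |
| GoDaddy Website Builder | 38.4% | 34.9% | 1.3% |
| Squarespace | 50.0% | 2.4% | 2.5% |
| Wix | 49.6% | 2.5% | 1.5% |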

Booking/scheduling adoption is low everywhere, under 3% on every major platform. That matches how often booking shows up as a recommendation gap on scanned sites: it is a gap across the board, not a platform-specific quirk.

What we found: robots.txt presence, corrected for dead sites

A live HTTP crawl (GET-with-abort, never a HEAD request, since 10-15% of servers handle HEAD inconsistently) checked robots.txt presence across 361,450 origins in the US corpus. The naive pass rate, 78.1%, undercounted: it included origins that were simply dead, not origins that chose to skip the file.

  • A liveness-reconciliation pass re-checked every origin whose first request threw an error rather than returning any HTTP status, using a longer timeout and an http-to-https fallback for old corpus entries recorded on the wrong scheme.
  • Of 320,658 deduplicated origins, 318,848 (99.4%) turned out to be alive; only 1,810 were genuinely dead.
  • Corrected presence rate among live sites: 89.1%. That's the number that actually answers 'do real, currently-operating small business sites have a robots.txt file,' not a number diluted by sites that no longer exist.
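The denominator correction can be sketched as a small tally over probe outcomes. The types and names below are hypothetical, and treating an HTTP 200 as "robots.txt present" is a simplification for the example; the network requests themselves are left out so that only the counting logic is shown.

```typescript
type ProbeOutcome =
  | { kind: "status"; status: number } // the server answered with some HTTP status
  | { kind: "error" };                 // the request threw (timeout, DNS, TLS, ...)

interface OriginAudit {
  origin: string;
  first: ProbeOutcome;
  recheck?: ProbeOutcome; // only run when the first probe threw
}

function finalOutcome(a: OriginAudit): ProbeOutcome {
  return a.first.kind === "error" && a.recheck ? a.recheck : a.first;
}

function robotsPresence(audits: OriginAudit[]) {
  let live = 0;
  let dead = 0;
  let present = 0;
  let presentFirstPass = 0;
  for (const a of audits) {
    if (a.first.kind === "status" && a.first.status === 200) presentFirstPass += 1;
    const o = finalOutcome(a);
    if (o.kind === "error") {
      dead += 1; // still unreachable after the recheck: out of the denominator
      continue;
    }
    live += 1;
    if (o.status === 200) present += 1; // simplification: 200 counts as present
  }
  return {
    live,
    dead,
    // Naive: first-pass hits over every origin, errors included.
    naiveRate: audits.length > 0 ? presentFirstPass / audits.length : null,
    // Corrected: hits over origins that actually answered.
    liveRate: live > 0 ? present / live : null,
  };
}
```

The key distinction is that a thrown request is never counted as "no robots.txt". It is either resolved by the recheck or removed from the denominator as dead.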

robots.txt isn't required (search engines crawl a site fine without one), but it's still worth having if you want to keep bots out of admin pages and checkout flows, reduce server strain from aggressive crawlers on heavy pages, or point search engines straight at your sitemap.

What we'd tell another small team building something like this

A few lessons that didn't make it into the pain points above:

  • LEFT JOIN your field-data source, don't inner join it. The popular-site bias from an inner join is easy to miss until someone asks why every benchmark looks suspiciously fast.
  • Don't trust a third party's classification tags as your own classifier's input. Borrowed signals carry borrowed false positives, and they compound.
  • Dry-run every BigQuery query for cost before the real run. An 807 GB scan (about $4.92) and a $0.05 scan look identical in the SQL editor until you've actually checked.
  • When a crawl produces an error, don't assume it means what you think it means. 'Threw an exception' and 'doesn't have what we're looking for' are different findings, and conflating them quietly poisons your denominator.
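The dry-run lesson comes down to a little arithmetic. A BigQuery dry run reports the bytes a query would process without running it; the sketch below turns that byte count into a dollar estimate and a budget check. The $6.25 per TiB rate is an assumption based on on-demand pricing at the time of writing, so check current pricing for your region; the function names are made up for the example.

```typescript
const BYTES_PER_GIB = 2 ** 30;
const BYTES_PER_TIB = 2 ** 40;

// Assumed on-demand rate in USD per TiB scanned; verify against current pricing.
const USD_PER_TIB = 6.25;

function estimateCostUsd(bytesProcessed: number, usdPerTib: number = USD_PER_TIB): number {
  return (bytesProcessed / BYTES_PER_TIB) * usdPerTib;
}

// Throws before the real run if the dry-run estimate exceeds the budget.
function assertWithinBudget(bytesProcessed: number, budgetUsd: number): number {
  const cost = estimateCostUsd(bytesProcessed);
  if (cost > budgetUsd) {
    throw new Error(`Estimated $${cost.toFixed(2)} exceeds budget $${budgetUsd.toFixed(2)}`);
  }
  return cost;
}

// Usage: the ~807 GB build scan, read as GiB, against a $5 budget.
const buildScanCost = assertWithinBudget(807 * BYTES_PER_GIB, 5);
```

Under these assumptions the 807 GB scan comes out just under $4.93, in line with the ~$4.92 build cost reported at the top of this case study.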

Methodology and verification

Every number in this case study comes from real BigQuery job history and the committed corpus-benchmarks.json export.

Luke Beck
Founder, Stackra
Last verified June 25, 2026
Corpus build numbers verified against BigQuery job history and server/data/corpus-benchmarks.json. robots.txt figures verified against stackra_corpus.robots_txt_audit and stackra_corpus.robots_txt_liveness in BigQuery.
