About Stackra
Stackra is a website audit tool for small and mid-sized businesses. The product's whole pitch is honest, specific feedback instead of generic SEO advice, which only works if the comparisons underneath it are grounded in real data. This corpus is that grounding.
Where we started
Stackra's benchmarks originally leaned on published industry averages and a handful of manually researched numbers. That's fine for a rough sense of scale, but it can't answer the question a scanned site actually needs answered: how do sites like mine, on my platform, in my industry, actually perform? Answering that meant building a real dataset, not finding a better blog post to cite.
The pipeline, step by step
The build is one large BigQuery job, run once per refresh, exported to a static JSON file the product reads at zero runtime cost. No live BigQuery queries happen when a user runs a scan.
- Source: the HTTP Archive's May 2026 mobile root-page crawl, joined with the Chrome UX Report (CrUX) for real field performance data via a LEFT JOIN on origin, so the long tail of low-traffic sites isn't silently dropped.
- Inclusion gate: content floor, plus either a recognized business platform (CMS, website builder, or ecommerce tag) or a 'vibecoded' signature (JS framework, PaaS, or static site generator with no platform tag), minus wikis and blog-only hosts. That gate kept 12.01M of 15.17M candidate sites.
- Junk trim: our own host/TLD filter removes institutional domains (.gov/.edu/.mil), PaaS preview hosts (vercel.app, netlify.app), and blog/social hosting domains (blogspot, medium, substack). That trim removed another 349K sites; after platform detection and per-row cleanup, the committed corpus settles at roughly 11.3M sites. The full-corpus funnel is useful context; the rigorous focus is the US subset.
- Platform detection: a Wappalyzer category-priority system (CMS > website builders > static site generators > ecommerce), not a hardcoded platform name list, so new platforms get caught automatically as the underlying data updates.
- Business-type classification: runs in TypeScript using Stackra's actual production classifier against text sourced from rendered page bodies (title, H1, meta description), so the corpus and a live scan use identical logic.
- Export: aggregation queries collapse the 11.3M rows into per-platform cohorts (a 100-site floor keeps any cohort statistically meaningful), written to server/data/corpus-benchmarks.json and committed to the repo.
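
To make the category-priority idea concrete, here is a minimal sketch in TypeScript. It is not the production query: the `Technology` shape, the category labels, and the `detectPlatform` name are assumptions made for illustration, and only the priority order itself comes from the step above.

```typescript
// Hypothetical shape: one detected technology with its Wappalyzer-style
// category labels. The real HTTP Archive schema may differ.
interface Technology {
  name: string;
  categories: string[];
}

// Priority order from the pipeline description:
// CMS > website builders > static site generators > ecommerce.
const CATEGORY_PRIORITY = [
  "CMS",
  "Website builders",
  "Static site generators",
  "Ecommerce",
];

// Walk the categories from highest to lowest priority and return the first
// technology that carries one. Returns null when no platform tag is present.
function detectPlatform(technologies: Technology[]): string | null {
  for (const category of CATEGORY_PRIORITY) {
    const match = technologies.find((t) => t.categories.includes(category));
    if (match) return match.name;
  }
  return null;
}

// A store plugin running on a CMS resolves to the CMS, not the plugin.
const platform = detectPlatform([
  { name: "WooCommerce", categories: ["Ecommerce"] },
  { name: "WordPress", categories: ["CMS", "Blogs"] },
  { name: "Google Analytics", categories: ["Analytics"] },
]);
```

Because the lookup keys on categories rather than names, a platform that appears in next month's tags with a known category is picked up without a code change.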
Tools we used and why
Every piece was picked to keep the build cheap, current, and re-runnable without a standing infrastructure bill.
| Tool | Role in the pipeline |
|---|---|
| HTTP Archive (BigQuery public dataset) | The crawl itself: millions of real sites, monthly cadence, free to query (you pay for the bytes you scan). |
| Chrome UX Report (CrUX) | Real Chrome field performance data (actual visitor experience), not a lab simulation. Joined on origin. |
| Wappalyzer technology tags (bundled in HTTP Archive) | Platform and tool detection: what CMS, what booking widget, what analytics tag, per site. |
| BigQuery | Runs the whole build as one large CTE chain in a single statement, avoiding intermediate storage cost. |
| Vertex AI / Gemini (gemini-2.5-flash-lite, via ML.GENERATE_TEXT in BigQuery) | Re-classified business type on the full 1.2M-row US subset where the keyword classifier left too much unclassified or wrong. |
| Overture Maps (free public BigQuery dataset) | Geo/address enrichment, joined separately, never fed into classification to avoid inheriting a third party's own classification mistakes. |
| TypeScript (Stackra's own classifier) | The actual business-type detection logic, run identically against the corpus and against a live scan. |
Cleaning the data: the mistakes worth naming
A few specific failures shaped the final pipeline. Each one cost real query budget to find.
- structured_data carries two different escaping regimes in one text blob: Open Graph tags are unescaped with value before key, JSON-LD is escaped with standard JSON quoting. A regex written for one regime silently returns near-empty results on the other; a prior build matched Open Graph at 6.2% and @type at 0.0% before this was caught.
- The markup column has no title, meta description, or H1 fields, despite early planning assuming it did. Those fields turned out to live in a different column entirely (custom_metrics.wpt_bodies, the rendered page body), which is why business-type classification was dead in earlier builds.
- third_parties, a metric we expected to expose for a 'how many trackers does this site load' benchmark, is null for effectively 100% of rows in the current crawl. The column is still computed for forward compatibility but never exposed, rather than shipping a benchmark built on nulls.
- Borrowed third-party signals make bad classifier inputs. A tools_maps heuristic produced 600K false positives, a Person schema heuristic produced 738K, and an assumption that one platform's sites were always one business type was wrong for 233K sites. The fix was to trust our own detector logic over inherited tags, every time.
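
The escaping mismatch in the first item is easy to reproduce. The sketch below assumes the escaped JSON-LD looks like `\"@type\":\"LocalBusiness\"` inside the blob; that sample string, both patterns, and the `extractTypes` helper are illustrative guesses, not the patterns used in the actual build.

```typescript
// Written for unescaped JSON: requires a bare quote right after the key.
const STRICT = /"@type"\s*:\s*"([^"]+)"/g;

// Tolerates zero or more backslashes before every quote, so it matches
// both the escaped and the unescaped regime.
const TOLERANT = /\\*"@type\\*"\s*:\s*\\*"([^"\\]+)\\*"/g;

// Collect every captured @type value for a given pattern.
function extractTypes(blob: string, pattern: RegExp): string[] {
  return Array.from(blob.matchAll(pattern), (m) => m[1]);
}

// Assumed sample of the escaped regime (String.raw keeps the backslashes).
const escaped = String.raw`{\"@type\":\"LocalBusiness\",\"name\":\"Acme\"}`;
// The same kind of data without escaping.
const plain = `{"@type":"Restaurant","name":"Acme"}`;

const strictOnEscaped = extractTypes(escaped, STRICT);     // finds nothing
const tolerantOnEscaped = extractTypes(escaped, TOLERANT); // finds the type
const tolerantOnPlain = extractTypes(plain, TOLERANT);     // still works
```

The strict pattern does not throw on the escaped text; it simply returns an empty array, which is the kind of quiet failure that shows up as a 0.0% match rate rather than as an error.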
How Stackra uses this
The corpus isn't a research artifact sitting in BigQuery. It powers a live feature.
- Peer benchmarks in scan recommendations: a scanned site's Core Web Vitals, schema adoption, and tool usage get compared against its actual platform cohort, not a global average.
- Tool-gap recommendations: if a site is missing analytics, booking, or chat that's standard for its platform and business type, that's flagged with real adoption-rate context, not a generic 'you should add live chat' suggestion.
- Zero runtime cost: the entire corpus build happens once, gets exported to a static JSON file, and ships in the deploy. A user's scan never triggers a BigQuery query.
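
As a rough illustration of how a scan might read that static export, the sketch below inlines two cohorts instead of loading the file. The `CohortBenchmark` fields, the `peerContext` function, and the message wording are hypothetical; the cohort sizes and adoption rates are taken from the platform table in this case study.

```typescript
// Hypothetical shape for one cohort in corpus-benchmarks.json.
interface CohortBenchmark {
  sites: number;
  cwvPassRatePct: number;
  jsonLdAdoptionPct: number;
}

// Stand-in for the parsed JSON export; a real scan would load the file once.
const benchmarks: Record<string, CohortBenchmark> = {
  WordPress: { sites: 505521, cwvPassRatePct: 61.7, jsonLdAdoptionPct: 81.1 },
  Weebly: { sites: 19674, cwvPassRatePct: 30.8, jsonLdAdoptionPct: 2.8 },
};

// Mirrors the 100-site floor applied at export time, as a defensive re-check.
const MIN_COHORT_SIZE = 100;

// Build one sentence of peer context, or null when no usable cohort exists.
function peerContext(platform: string, hasJsonLd: boolean): string | null {
  const cohort = benchmarks[platform];
  if (!cohort || cohort.sites < MIN_COHORT_SIZE) return null;
  const pct = cohort.jsonLdAdoptionPct.toFixed(1);
  return hasJsonLd
    ? `Like ${pct}% of ${platform} sites, this site ships JSON-LD.`
    : `${pct}% of ${platform} sites ship JSON-LD; this site does not.`;
}

const message = peerContext("WordPress", false);
```

Everything here is an in-memory lookup, which is the point of the static export: the comparison costs nothing per scan.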
What we found: the platform gaps were bigger than expected
Pulling real numbers across the major platforms surfaced gaps that don't show up in marketing copy for any of these builders.
| Platform | Sites in cohort | CWV pass rate | Median CLS | JSON-LD adoption | SEO tool adoption |
|---|---|---|---|---|---|
| Shopify | 72,063 | 89.9% | 0 (passing) | 73.8% | 2.9% |
| GoDaddy Website Builder | 22,186 | 89.6% | 0 (passing) | 57.1% | 0.0% |
| Wix | 100,061 | 88.5% | 0 (passing) | 97.9% | 0.2% |
| Webflow | 19,835 | 81.2% | 0 (passing) | 38.2% | 4.9% |
| Squarespace | 89,196 | 80.9% | 0 (passing) | 99.9% | 0.2% |
| Drupal | 12,284 | 79.9% | 0 (passing) | 24.4% | 1.7% |
| Joomla | 9,985 | 68.9% | 0 (passing) | 28.3% | 0.6% |
| WordPress | 505,521 | 61.7% | 0 (passing) | 81.1% | 74.3% |
| Weebly | 19,674 | 30.8% | 0.4 (4x Google's 0.1 target) | 2.8% | 0.1% |
CWV pass rate is the share of sites meeting Google's Core Web Vitals thresholds across the CrUX-matched subset. WordPress's high SEO-tool adoption (Yoast, All in One SEO) doesn't translate to the best CWV pass rate: plugin weight has a real cost. Weebly is the clear outlier, with the worst CWV pass rate, the worst schema adoption, and the only median CLS that misses Google's target outright.
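
For readers who want the pass-rate definition spelled out, here is a minimal sketch using Google's published "good" thresholds (LCP at or under 2.5 s, INP at or under 200 ms, CLS at or under 0.1). How the real build handles missing metrics is not stated above, so the null handling here, along with the `FieldMetrics` shape and function names, is an assumption.

```typescript
// Hypothetical per-site field metrics; null means CrUX had no data.
interface FieldMetrics {
  lcpMs: number | null;
  inpMs: number | null;
  cls: number | null;
}

// Google's "good" thresholds for the three Core Web Vitals.
const GOOD = { lcpMs: 2500, inpMs: 200, cls: 0.1 };

// true/false for sites with field data, null for sites without it.
// Assumption: a missing INP does not fail a site that has LCP and CLS.
function passesCwv(m: FieldMetrics): boolean | null {
  if (m.lcpMs === null || m.cls === null) return null;
  const inpOk = m.inpMs === null || m.inpMs <= GOOD.inpMs;
  return m.lcpMs <= GOOD.lcpMs && m.cls <= GOOD.cls && inpOk;
}

// Pass rate over the matched subset only: unmatched sites are excluded
// from the denominator rather than counted as failures.
function passRate(sites: FieldMetrics[]): number | null {
  const verdicts = sites
    .map(passesCwv)
    .filter((v): v is boolean => v !== null);
  if (verdicts.length === 0) return null;
  return verdicts.filter((v) => v).length / verdicts.length;
}

const sampleRate = passRate([
  { lcpMs: 2000, inpMs: 150, cls: 0.05 }, // passes
  { lcpMs: 3000, inpMs: 150, cls: 0.0 },  // fails on LCP
  { lcpMs: 2000, inpMs: null, cls: 0.4 }, // fails on CLS
  { lcpMs: null, inpMs: null, cls: null }, // no field data, excluded
]);
```

In this toy sample one of three matched sites passes; the fourth site never enters the denominator.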
What we found: most platforms are missing the same handful of tools
Tool adoption (booking, live chat, reviews, forms) varies enormously by platform, and not in the direction you'd guess from each platform's marketing.
| Platform | Analytics | Live chat | Booking/scheduling |
|---|---|---|---|
| Weebly | 98.8% | 1.7% | 1.2% |
| Shopify | 85.5% | 29.6% | 1.4% |
| Drupal | 88.0% | 5.7% | 0.7% |
| WordPress | 78.8% | 6.7% | 2.6% |
| GoDaddy Website Builder | 38.4% | 34.9% | 1.3% |
| Squarespace | 50.0% | 2.4% | 2.5% |
| Wix | 49.6% | 2.5% | 1.5% |
Booking/scheduling adoption is low everywhere, under 3% on every major platform. That tracks with how often it shows up as a real recommendation gap on scanned sites: the gap is market-wide, not a platform-specific quirk.
What we found: robots.txt presence, corrected for dead sites
A live HTTP crawl (GET-with-abort, never a HEAD request, since 10-15% of servers handle HEAD inconsistently) checked robots.txt presence across 361,450 origins in the US corpus. The naive pass rate, 78.1%, undercounted: it included origins that were simply dead, not origins that chose to skip the file.
- A liveness-reconciliation pass re-checked every origin whose first request threw an error rather than returning any HTTP status, using a longer timeout and an http-to-https fallback for old corpus entries recorded on the wrong scheme.
- Of 320,658 deduplicated origins, 318,848 (99.4%) turned out to be alive; only 1,810 were genuinely dead.
- Corrected presence rate among live sites: 89.1%. That's the number that actually answers 'do real, currently-operating small business sites have a robots.txt file,' not a number diluted by sites that no longer exist.
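
The denominator correction can be sketched on toy data. The `Probe` shape, the "status 200 means present" rule, and both function names are assumptions for illustration; the five origins below are invented and are not meant to reproduce the corpus figures.

```typescript
// Hypothetical result of one robots.txt request.
interface Probe {
  origin: string;
  status: number | null; // HTTP status, or null if the request threw
}

// Assumption: only a 200 response counts as "has a robots.txt".
const hasRobots = (p: Probe): boolean => p.status === 200;

// Naive rate: every origin is in the denominator, so a thrown request
// is indistinguishable from a site that chose not to publish the file.
function naiveRate(firstPass: Probe[]): number {
  return firstPass.filter(hasRobots).length / firstPass.length;
}

// Corrected rate: replace thrown results with the re-check result, then
// drop origins that still return nothing (treated as dead).
// Note: returns NaN if no origin is live.
function correctedRate(
  firstPass: Probe[],
  recheck: Map<string, number | null>
): number {
  const reconciled = firstPass.map((p) =>
    p.status === null ? { ...p, status: recheck.get(p.origin) ?? null } : p
  );
  const live = reconciled.filter((p) => p.status !== null);
  return live.filter(hasRobots).length / live.length;
}

const firstPass: Probe[] = [
  { origin: "a.example", status: 200 },
  { origin: "b.example", status: 200 },
  { origin: "c.example", status: 404 },
  { origin: "d.example", status: null }, // timed out on the first pass
  { origin: "e.example", status: null }, // genuinely dead
];
const recheck = new Map<string, number | null>([
  ["d.example", 200],
  ["e.example", null],
]);

const naive = naiveRate(firstPass);                  // 2 of 5 origins
const corrected = correctedRate(firstPass, recheck); // 3 of 4 live origins
```

The two rates differ for two reasons at once: the re-check recovers a file the first pass missed, and the dead origin leaves the denominator.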
robots.txt isn't required (search engines crawl a site fine without one), but it's still worth having if you want to keep bots out of admin pages and checkout flows, reduce server strain from aggressive crawlers on heavy pages, or point search engines straight at your sitemap.
What we'd tell another small team building something like this
A few lessons that didn't make it into the pain points above:
- LEFT JOIN your field-data source, don't inner join it. The popular-site bias from an inner join is easy to miss until someone asks why every benchmark looks suspiciously fast.
- Don't trust a third party's classification tags as your own classifier's input. Borrowed signals carry borrowed false positives, and they compound.
- Dry-run every BigQuery cost before the real run. An $807GB scan and a $0.05 scan look identical in the SQL editor until you've actually checked.
- When a crawl produces an error, don't assume it means what you think it means. 'Threw an exception' and 'doesn't have what we're looking for' are different findings, and conflating them quietly poisons your denominator.
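
The first lesson is the easiest to demonstrate outside SQL. This sketch re-creates the two join types over in-memory arrays; the `Site` and `FieldData` shapes and the sample origins are made up, and the real build does this in BigQuery, not in TypeScript.

```typescript
// Hypothetical crawl row and field-data row, keyed by origin.
interface Site {
  origin: string;
  platform: string;
}
interface FieldData {
  origin: string;
  lcpMs: number;
}
interface JoinedRow extends Site {
  lcpMs: number | null; // null when the origin has no field data
}

// LEFT JOIN: every crawled site survives; missing field data becomes null.
function leftJoin(sites: Site[], field: FieldData[]): JoinedRow[] {
  const byOrigin = new Map(field.map((f) => [f.origin, f.lcpMs] as const));
  return sites.map((s) => ({ ...s, lcpMs: byOrigin.get(s.origin) ?? null }));
}

// INNER JOIN: only sites with field data survive, which skews the corpus
// toward sites with enough traffic to appear in the field dataset.
function innerJoin(sites: Site[], field: FieldData[]): JoinedRow[] {
  return leftJoin(sites, field).filter((r) => r.lcpMs !== null);
}

const sites: Site[] = [
  { origin: "https://busy.example", platform: "Shopify" },
  { origin: "https://quiet.example", platform: "Wix" },
  { origin: "https://tiny.example", platform: "Weebly" },
];
const field: FieldData[] = [{ origin: "https://busy.example", lcpMs: 1800 }];

const leftRows = leftJoin(sites, field);   // 3 rows, two with lcpMs null
const innerRows = innerJoin(sites, field); // 1 row, the high-traffic site
```

Non-performance metrics such as schema or tool adoption can still be computed over all three rows from the left join, while performance metrics use only the rows where field data exists.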
Methodology and verification
Every number in this case study comes from real BigQuery job history and the committed corpus-benchmarks.json export.