Classifying 1.2 Million US Websites by Business Type

We Classified 1.2 Million US Websites by Business Type. Here's What Broke.

We pulled 1.2 million real US websites and ran them through Stackra's business-type classifier. One in five came back unclassified, and a chunk of what did classify was wrong in ways that were easy to miss. Here's where the data came from, what broke, and how we fixed it.

Industry: Internal data engineering

Stack: BigQuery, HTTP Archive, Vertex AI / Gemini, TypeScript

Outcome: A 1.2 million-site US corpus with business types we trust. Unclassified dropped from 1 in 5 sites to about 1 in 40, for less than the cost of a coffee.

At a glance

We wanted a dataset of real US business websites labeled by type, sorted with our own logic, that we could actually trust.

1,219,994: Sites in the corpus (pulled from 1.25 million raw US sites)

19.94%: Unclassified before cleanup (almost 1 in 5 sites, first pass)

2.30%: Unclassified after cleanup (after a targeted pass on what was left)

~$3.29: Cost of the cleanup pass (Gemini, ballpark from published pricing)

~1.6M: Bad labels caught and ripped out (three separate false-positive sources, all from trusting someone else's tags)
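For a sense of scale, the headline figures can be turned into rough site counts. This is a back-of-the-envelope sketch, not a reported result: it assumes both percentages use the full 1,219,994-site corpus as the denominator, and that the cleanup pass processed every site the first pass left unclassified. Neither assumption is stated in the figures above.

```typescript
// Figures quoted in this write-up.
const corpusSize = 1219994;
const unclassifiedBefore = 0.1994;
const unclassifiedAfter = 0.023;
const cleanupCostUsd = 3.29;

// Assumption: both percentages are measured against the full corpus.
const sitesBefore = Math.round(corpusSize * unclassifiedBefore);
const sitesAfter = Math.round(corpusSize * unclassifiedAfter);
const sitesResolved = sitesBefore - sitesAfter;

// Assumption: the cleanup pass touched every site left over from pass one.
const costPerSiteUsd = cleanupCostUsd / sitesBefore;

// "1 in N" form of the remaining unclassified share.
const oneInAfter = 1 / unclassifiedAfter;

console.log({ sitesBefore, sitesAfter, sitesResolved, costPerSiteUsd, oneInAfter });
```

Under those assumptions, roughly 243,000 sites were unclassified after the first pass and roughly 28,000 after cleanup, at a small fraction of a cent per site.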

Three pain points, by site type


A keyword list can't tell a dental office from a law firm

Affects: Any setup trying to sort website titles and descriptions into categories with keyword rules.

If this is you

If the labels underneath a comparison are wrong, everything built on top of them is wrong too, and you won't know it.

What it looks like

Our first pass left one in five sites unclassified, and a chunk of what it did label was wrong in specific, traceable ways: a maps-tool signal, a Person schema heuristic, and a bad assumption about one platform together mislabeled around 1.6 million rows.

What we tried first

Adding more keyword rules to patch the gaps. It works for a while, then the next edge case shows up and the bad-label list grows again instead of shrinking.

What worked

Stop trusting borrowed signals, then run a smaller, targeted AI pass on just the leftover unclassified sites instead of patching rules forever.

Result

Unclassified dropped from 19.94% to 2.30%. We kept the old label on every row instead of overwriting it, so we can check our own work.
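Keeping the old label on every row can be as simple as appending to a history list instead of overwriting a field. The sketch below is one way to do it, not the actual schema; the `SiteRecord` shape, the `relabel` helper, and the source names are all hypothetical.

```typescript
// Hypothetical names for where a label came from.
type LabelSource = "keyword_rules" | "ai_cleanup";

interface LabelEvent {
  label: string;
  source: LabelSource;
}

interface SiteRecord {
  url: string;
  label: string; // current label
  source: LabelSource; // what produced the current label
  history: LabelEvent[]; // earlier labels, oldest first
}

// Returns a new record with the previous label pushed onto the history.
// The input record is never mutated, so a before/after diff stays possible.
function relabel(record: SiteRecord, label: string, source: LabelSource): SiteRecord {
  if (label === record.label) {
    return record; // nothing changed, so nothing to log
  }
  return {
    ...record,
    label,
    source,
    history: [...record.history, { label: record.label, source: record.source }],
  };
}
```

With a shape like this, "what did the cleanup pass change?" becomes a filter over rows whose history is non-empty.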

An AI model will guess wrong rather than say it doesn't know

Affects: Any setup that hands an AI model a fixed list of categories and expects it to stay inside it.

If this is you

If you're counting on AI output to land inside a known list of options, check that it actually does before you rely on it at scale.

What it looks like

On a small test batch, the model made up a category that wasn't even on the list, rather than picking the closest real option.

What we tried first

Nothing. This is the kind of mistake you only catch by checking before you commit, not after.

What worked

Run a small test batch first, every time, before paying for the full run.

Result

Caught it early, confirmed it was rare enough not to matter, and moved on without burning the whole budget finding out the hard way.
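A test-batch check like this can be a few lines of code: compare every label the model returns against the allowed list and count the ones that fall outside it. The sketch below is illustrative only. The category names are placeholders, not Stackra's real list, and `checkPilot` and `coerce` are hypothetical helpers.

```typescript
// Placeholder categories; the real allowed list is not shown in this write-up.
const ALLOWED = ["dental_office", "law_firm", "restaurant", "unclassified"] as const;
type Category = (typeof ALLOWED)[number];

function isAllowed(value: string): value is Category {
  return (ALLOWED as readonly string[]).indexOf(value) !== -1;
}

interface PilotReport {
  total: number;
  offList: string[]; // labels the model invented
  offListRate: number; // share of the batch that fell outside the list
}

// Run this on the small test batch before paying for the full run.
function checkPilot(rawLabels: string[]): PilotReport {
  const normalized = rawLabels.map((label) => label.trim().toLowerCase());
  const offList = normalized.filter((label) => !isAllowed(label));
  const offListRate = rawLabels.length === 0 ? 0 : offList.length / rawLabels.length;
  return { total: rawLabels.length, offList, offListRate };
}

// For the full run: anything off-list falls back to "unclassified"
// rather than entering the corpus as a made-up category.
function coerce(rawLabel: string): Category {
  const label = rawLabel.trim().toLowerCase();
  return isAllowed(label) ? label : "unclassified";
}
```

If the off-list rate on the test batch is low, a fallback like `coerce` is one cheap way to keep the rare invented label out of the corpus; if it is high, that is a signal to revisit the prompt before spending more.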

A plausible-looking signal can mislabel a million rows before you notice

Affects: Any pipeline that borrows detection signals from third-party tools, schema presence, or heuristics without independently validating them.

If this is you

If a signal looks reasonable when you add it, you tend not to question it again. At a few hundred sites that's fine. At a million it's a liability.

What it looks like

Three separate signals each seemed defensible in isolation. Together they accounted for around 1.6 million wrong labels: a maps-tool fingerprint that tagged non-local sites as local businesses, a Person schema heuristic that tagged agency sites as personal portfolios, and a platform detection rule that bled across similar CMS patterns.

What we tried first

Adding more signals to compensate for the gaps each bad one left. That just introduced more assumptions that could go wrong the same way.

What worked

Treat every borrowed signal as a hypothesis, not a fact. Spot-check a sample before it goes into the full pipeline, not after you've already paid to process a million rows.

Result

Removing all three signals and rerunning the affected rows cut the mislabel rate significantly before the AI cleanup pass even ran.
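Spot-checking a borrowed signal comes down to two steps: draw a random sample of the rows it fired on, have someone review them, and measure how often the signal was right. The following is a minimal sketch under assumptions. The `SignalHit` shape, the helper names, and the 0.95 threshold are hypothetical, and the review itself is still a manual step.

```typescript
interface SignalHit {
  url: string;
  signal: string; // e.g. a maps-tool fingerprint
}

interface ReviewedHit extends SignalHit {
  correct: boolean; // filled in by a human reviewer
}

// Draws up to `size` items uniformly at random using a partial Fisher-Yates shuffle.
// `random` is injectable so a check can be reproduced with a fixed generator.
function sampleHits<T>(hits: T[], size: number, random: () => number = Math.random): T[] {
  const pool = hits.slice();
  const n = Math.max(0, Math.min(size, pool.length));
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}

// Share of reviewed hits where the signal's label was actually right.
function precision(reviewed: ReviewedHit[]): number {
  if (reviewed.length === 0) return NaN;
  return reviewed.filter((hit) => hit.correct).length / reviewed.length;
}

// Hypothetical gate: only admit the signal if the sample clears a threshold.
function passesSpotCheck(reviewed: ReviewedHit[], minPrecision: number = 0.95): boolean {
  return precision(reviewed) >= minPrecision;
}
```

An empty review yields `NaN`, which fails the gate, so a signal nobody has looked at is never admitted by default.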

What we did

We took the US slice of a website crawl and ran it through Stackra's own business-type classifier. One in five sites came back unclassified, and a chunk of what did get labeled was wrong. We fixed both: cut the bad signals causing the wrong labels, then ran a smaller AI cleanup pass on what was left unclassified.

Where the data came from

The HTTP Archive, a free public crawl of real websites, queried through BigQuery. We pulled the US slice: 1.25 million sites with enough real content to be worth classifying.

How we did it

Two passes. First, run the same business-type classifier Stackra uses on a live scan against each site's title, headline, and description. Second, send whatever that pass couldn't sort to an AI model constrained to a fixed list of allowed categories, checking a small test batch before running the rest. Every old label stayed on the record instead of getting overwritten, so we can always see what changed.
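The shape of the first pass, and the hand-off to the second, might look roughly like this. It is a sketch, not Stackra's classifier: the rules below are invented two-keyword placeholders, and `firstPass` and `splitPilot` are hypothetical helpers. The AI call itself is left out.

```typescript
interface Site {
  url: string;
  title: string;
  headline: string;
  description: string;
}

// Placeholder rules for illustration; the real classifier is not shown here.
const RULES: { category: string; keywords: string[] }[] = [
  { category: "dental_office", keywords: ["dentist", "dental"] },
  { category: "law_firm", keywords: ["attorney", "law firm"] },
];

// Pass one: first matching rule wins, otherwise the site stays unclassified.
function classifyByKeywords(site: Site): string {
  const text = `${site.title} ${site.headline} ${site.description}`.toLowerCase();
  for (const rule of RULES) {
    if (rule.keywords.some((keyword) => text.indexOf(keyword) !== -1)) {
      return rule.category;
    }
  }
  return "unclassified";
}

// Splits the corpus into rows pass one handled and rows queued for pass two.
function firstPass(sites: Site[]): { labeled: { site: Site; label: string }[]; leftover: Site[] } {
  const labeled: { site: Site; label: string }[] = [];
  const leftover: Site[] = [];
  for (const site of sites) {
    const label = classifyByKeywords(site);
    if (label === "unclassified") {
      leftover.push(site);
    } else {
      labeled.push({ site, label });
    }
  }
  return { labeled, leftover };
}

// Carves a small test batch off the front of the queue; the rest waits
// until the test batch's results have been checked.
function splitPilot<T>(queue: T[], pilotSize: number): { pilot: T[]; rest: T[] } {
  return { pilot: queue.slice(0, pilotSize), rest: queue.slice(pilotSize) };
}
```

Only the `leftover` queue ever reaches the model, which is what keeps the second pass small and cheap.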

Tools we used and why

Nothing exotic. The HTTP Archive because it's free and real. BigQuery because it can do the filtering and the AI calls in one place. Gemini because it's cheap enough at this volume that the AI cleanup pass cost less than $5 total. Our own classifier, not a separate one, so the corpus and a live scan agree by construction.

Where this stands

This corpus isn't plugged into a live Stackra scan yet. Right now it's a clean, labeled dataset we built to validate the approach. Treat it as internal groundwork, not a shipped feature.

See for yourself

Run a free Stackra audit

See how Stackra's own classifier reads your site.