What we did
We took the US slice of a website crawl and ran it through Stackra's own business-type classifier. One in five sites came back unclassified, and some of the sites that did get a label got the wrong one. We fixed both problems: first we removed the bad signals behind the wrong labels, then we ran a smaller AI cleanup pass on the sites that were still unclassified.
Where the data came from
The HTTP Archive, a free public crawl of real websites, queried through BigQuery. We pulled the US slice: 1.25 million sites with enough real content to be worth classifying.
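To make "enough real content" concrete, here is a minimal sketch of the kind of filter that could decide whether a site is worth classifying. The cutoff, the placeholder phrases, and the function name are all assumptions for illustration; this writeup does not state the actual rule we used.

```python
import re

# Hypothetical threshold: the real cutoff is not stated in this writeup.
MIN_TOTAL_CHARS = 40

# Hypothetical phrases that suggest a parked or placeholder page.
PLACEHOLDER_PATTERNS = re.compile(
    r"(coming soon|domain (is )?for sale|index of /|default web page)",
    re.IGNORECASE,
)


def has_enough_content(title, headline, description):
    """Return True if the three text fields look like a real site."""
    # Drop missing or whitespace-only fields before measuring length.
    fields = [f.strip() for f in (title, headline, description) if f and f.strip()]
    text = " ".join(fields)
    if len(text) < MIN_TOTAL_CHARS:
        return False
    if PLACEHOLDER_PATTERNS.search(text):
        return False
    return True
```

A filter like this would run before classification, so near-empty and parked pages never reach the classifier at all.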
How we did it
Two passes. First, run the same business-type classifier Stackra uses on a live scan against each site's title, headline, and description. Second, for what that pass couldn't sort, run a small batch through an AI model restricted to a fixed list of allowed categories, spot-checking a handful of results before running the rest. Every old label stayed on the record instead of being overwritten, so we can always see what changed.
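The two passes can be sketched roughly as follows. This is an illustration under stated assumptions, not Stackra's actual code: the category list, the keyword rules standing in for the real classifier, and names such as `SiteRecord`, `model_fn`, and `approve` are all hypothetical. It shows the three ideas from the paragraph above: AI output is accepted only if it is on the allowed list, a small sample is reviewed before the rest is run, and every label change is appended to a history instead of overwriting the old value.

```python
from dataclasses import dataclass, field

# Hypothetical category list; the real allowed list is not given here.
ALLOWED_CATEGORIES = {"restaurant", "law_firm", "retail"}
UNCLASSIFIED = "unclassified"


@dataclass
class SiteRecord:
    url: str
    title: str
    headline: str
    description: str
    label: str = UNCLASSIFIED
    # Each entry is (source, previous_label, new_label); nothing is overwritten.
    history: list = field(default_factory=list)

    def relabel(self, new_label, source):
        self.history.append((source, self.label, new_label))
        self.label = new_label


def rule_classifier(text):
    """Stand-in for the first-pass classifier: simple keyword rules."""
    text = text.lower()
    if "menu" in text or "restaurant" in text:
        return "restaurant"
    if "attorney" in text:
        return "law_firm"
    return UNCLASSIFIED


def first_pass(records):
    """Classify each site from its title, headline, and description."""
    for r in records:
        label = rule_classifier(" ".join([r.title, r.headline, r.description]))
        if label != UNCLASSIFIED:
            r.relabel(label, "rules")


def constrained_label(record, model_fn):
    """Accept the model's answer only if it is on the allowed list."""
    raw = model_fn(record).strip().lower()
    return raw if raw in ALLOWED_CATEGORIES else UNCLASSIFIED


def second_pass(records, model_fn, approve, sample_size=2):
    """Label leftovers with a model, reviewing a small sample first."""
    pending = [r for r in records if r.label == UNCLASSIFIED]
    sample, rest = pending[:sample_size], pending[sample_size:]
    proposals = [(r, constrained_label(r, model_fn)) for r in sample]
    if not approve(proposals):
        return False  # stop before running (and paying for) the rest
    proposals += [(r, constrained_label(r, model_fn)) for r in rest]
    for r, label in proposals:
        if label != UNCLASSIFIED:
            r.relabel(label, "ai")
    return True
```

Here `model_fn` is any callable that takes a record and returns a category string, and `approve` is whatever review step signs off on the sample, for example a person looking at the proposed labels. Because off-list answers fall back to unclassified, the model cannot introduce a new category by accident.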
Tools we used and why
Nothing exotic. The HTTP Archive because it's free and built from real websites. BigQuery because it handles both the filtering and the AI calls in one place. Gemini because it's cheap enough at this volume that the whole AI cleanup pass cost less than $5. Our own classifier, rather than a separate one, so the corpus and a live scan agree by construction.
Where this stands
This corpus isn't plugged into a live Stackra scan yet. Right now it's a clean, labeled dataset we built to validate the approach. Treat it as internal groundwork, not a shipped feature.