Aivex

Industrial supply · Lead generation platform

Generating 1.4M licensed contractor leads with an AI scraping pipeline

We built an industrial-supply client a self-improving lead-generation platform that found and verified more than 1.4 million licensed contractors across the United States.


Outcome

1.4M+ verified licensed contractors. 400,000 added in a single week once AI agents handled source discovery and scraper writing.

The brief: a sales team selling tools to contractors needed cold-call leads worth calling, meaning licensed contractors with real phone numbers, organized by trade. The lists they could buy from data brokers were junk. Half the numbers were disconnected, many of the businesses were dead, and the same record had been sold to fifty other shops.

We built them a different way to source leads.

What we built

A lead-generation platform that pulls licensed-contractor data directly from state and local licensing databases, normalizes it, enriches it with verified phone numbers, and lands it in a CRM where the sales team can filter, search, assign, and export records.

The pipeline:

  1. An AI agent searches the internet for new licensing sources, verifies them, and tests their page structure.
  2. The agent writes Python scrapers for each source — Selenium for JavaScript-rendered portals, BeautifulSoup for the older static ones.
  3. Data gets normalized into a single schema. Every state structures licensing differently; one format comes out the other end.
  4. Phone enrichment fills in missing contact info through parallelized search.
  5. Cleaned records import into the production database.
  6. A Next.js + Convex CRM lets the team filter, search, assign, and export to CSV.
  7. Per-batch invoicing runs through Stripe.
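
To make step 3 concrete, here is a minimal sketch of normalizing source-specific rows into one schema. The field names, the `ContractorRecord` shape, and the two entries in `FIELD_MAPS` are illustrative assumptions, not the client's actual schema or real column names from any state portal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContractorRecord:
    """Assumed target schema; the real one may carry more fields."""
    business_name: str
    license_number: str
    trade: str
    state: str
    address: str
    phone: Optional[str] = None


# Hypothetical per-source mappings: target field -> column name in that source.
FIELD_MAPS = {
    "tx_example": {
        "business_name": "BUSINESS NAME",
        "license_number": "LICENSE NO",
        "trade": "LICENSE TYPE",
        "address": "MAILING ADDRESS",
    },
    "fl_example": {
        "business_name": "dba_name",
        "license_number": "lic_id",
        "trade": "category",
        "address": "addr",
    },
}


def normalize(source: str, state: str, raw: dict) -> ContractorRecord:
    """Map one raw row onto the shared schema, collapsing stray whitespace."""
    mapping = FIELD_MAPS[source]
    fields = {
        target: " ".join(str(raw.get(column, "")).split())
        for target, column in mapping.items()
    }
    return ContractorRecord(state=state, **fields)
```

Keeping the per-source differences in a mapping table rather than in code means a new source mostly needs a new dictionary entry, which is one plausible way to keep a single format coming out the other end.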

The honest version: this was a year of grinding through edge cases.

The hard parts

Every source is its own format. Texas organizes by county. Florida by license category. New York City alone has three separate databases in different departments. North Carolina's search requires querying all 855 ZIP codes with a 2-mile radius. Some sources offer modern APIs. Others are decades-old ASP.NET forms with server-side pagination. A few use CAPTCHAs. Every source needs its own scraper with its own logic, but every scraper has to output the same JSONL format so the import pipeline can handle all of them.
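
The shared JSONL contract is what lets very different scrapers feed one importer. A minimal sketch of that contract follows, assuming a small set of required keys (`REQUIRED_KEYS` is a guess at what the importer would insist on, and the function names are hypothetical).

```python
import json
from typing import Iterable, Iterator

# Assumed minimum every scraper must emit; the real contract may differ.
REQUIRED_KEYS = ("business_name", "license_number", "trade", "state")


def to_jsonl(records: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON document per record, rejecting incomplete rows early."""
    for record in records:
        missing = [key for key in REQUIRED_KEYS if not record.get(key)]
        if missing:
            raise ValueError(f"record missing required keys: {missing}")
        # json.dumps escapes newlines inside strings, so each record
        # always occupies exactly one line.
        yield json.dumps(record, ensure_ascii=False, sort_keys=True)


def from_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines back into dicts, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)
```

A scraper would write its output with something like `out.writelines(line + "\n" for line in to_jsonl(rows))`, and the importer would read it back with `from_jsonl(open(path, encoding="utf-8"))`. Failing loudly on a missing key at write time keeps a broken scraper from quietly polluting the import.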

Sites that don't want you there. Anti-scraping measures meant VPN rotation, careful rate limiting, and browser automation that behaves like a real human. Hit a site too fast: blocked. Too slow: the scrape takes weeks.
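
The "too fast, too slow" trade-off usually comes down to a throttle between requests. The sketch below is one common approach, not necessarily what this pipeline used: a minimum interval plus random jitter so requests do not arrive on a fixed beat. The class name and parameters are hypothetical; the clock, sleep, and random functions are injectable only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable


class JitteredThrottle:
    """Space calls at least `min_interval` seconds apart, plus random jitter."""

    def __init__(
        self,
        min_interval: float,
        max_jitter: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
        rng: Callable[[float, float], float] = random.uniform,
    ) -> None:
        self.min_interval = min_interval
        self.max_jitter = max_jitter
        self._clock = clock
        self._sleep = sleep
        self._rng = rng
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following request relative to when this one starts.
        start = max(now, self._next_allowed)
        self._next_allowed = start + self.min_interval + self._rng(0.0, self.max_jitter)
        return delay
```

Calling `throttle.wait()` before each page fetch keeps a single scraper under a per-site budget. Tuning `min_interval` per source is where the balance between getting blocked and a weeks-long scrape would be struck.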

The phone-number problem. Licensing data often includes business name and address but not phone numbers. For cold calling, no phone means no lead. We built a search-based enrichment layer that runs in parallel workers. Hit rates vary wildly by source, but the result is verified phone numbers attached to verified businesses. No phone, no record.
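
A rough sketch of the parallel-worker shape described here, using a thread pool because lookups like this are typically network-bound. The `lookup_phone` callable is a hypothetical stand-in for whatever search and verification logic is actually used; nothing below reflects the real enrichment service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def enrich_phones(
    records: Iterable[dict],
    lookup_phone: Callable[[dict], Optional[str]],
    max_workers: int = 8,
) -> List[dict]:
    """Fill in missing phones in parallel and drop records that stay phoneless."""
    pending = list(records)

    def resolve(record: dict) -> Optional[str]:
        if record.get("phone"):
            return record["phone"]  # already present in the licensing data
        try:
            return lookup_phone(record)
        except Exception:
            # A failed lookup counts as "no phone found" in this sketch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `pending`.
        phones = list(pool.map(resolve, pending))

    # No phone, no record.
    return [
        {**record, "phone": phone}
        for record, phone in zip(pending, phones)
        if phone
    ]
```

The final filter is the "no phone, no record" rule applied in code: anything the lookup cannot resolve never reaches the import step.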

Deduplication. The same contractor often appears in multiple databases — a general contractor operating in three counties shows up as three records. The dedup pipeline matches each new import against existing records on business name, address, and phone before it lands in production. Not glamorous. The difference between clean data and garbage.
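
One simple way to match on business name, address, and phone is to build a normalized key per record and compare keys. The sketch below assumes exact matching after normalization; the suffix list, the ten-digit phone rule, and the function names are illustrative choices, and a production matcher would likely be fuzzier.

```python
import re
from typing import Iterable, List, Set, Tuple

# Assumed list of legal suffixes to ignore when comparing business names.
_SUFFIXES = re.compile(r"\b(llc|inc|corp|co|ltd)\b")


def _norm_text(value: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", (value or "").lower()).split())


def _norm_name(value: str) -> str:
    """Normalize a business name and drop common legal suffixes."""
    return " ".join(_SUFFIXES.sub(" ", _norm_text(value)).split())


def _norm_phone(value: str) -> str:
    """Keep digits only; compare on the last ten (US numbers assumed)."""
    return re.sub(r"\D", "", value or "")[-10:]


def dedup_key(record: dict) -> Tuple[str, str, str]:
    return (
        _norm_name(record.get("business_name") or ""),
        _norm_text(record.get("address") or ""),
        _norm_phone(record.get("phone") or ""),
    )


def filter_new(incoming: Iterable[dict], existing_keys: Set[tuple]) -> List[dict]:
    """Return only records whose key is unseen, updating `existing_keys`."""
    fresh = []
    for record in incoming:
        key = dedup_key(record)
        if key not in existing_keys:
            existing_keys.add(key)
            fresh.append(record)
    return fresh
```

With this key, "Acme Plumbing, LLC" at "1 Main St." and "ACME PLUMBING" at "1 main st" with the same phone collapse to one record. In practice `existing_keys` would be loaded from the production database rather than held in memory.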

The numbers

The first 100,000 leads took about a month of building scrapers by hand. Once AI agents were layered into source discovery and scraper writing, 400,000 leads were added in a single week. The platform now holds more than 1.4 million licensed contractors, organized by trade and ready for export.

The CRM behind it is a full production application: Next.js for the frontend, Convex for the real-time backend, Clerk for auth, Stripe for invoicing, and role-based access for admin, manager, and sales roles. A small team's worth of moving parts, built and maintained by one developer with AI agents handling the heavy implementation work.

Why this kind of work matters

This is the part of AI integration we keep coming back to with clients: AI does not replace the developer or the operator. It compresses the part of the work that used to require a team. Source discovery, scraper writing, edge-case debugging, CRM scaffolding — none of those got easier. They got faster.

The same infrastructure could power a SaaS product, a vertical CRM, or a market-data feed. For the client, it was a tool that gave their sales team something brokers could not: leads they could trust.

That is the work we like. A real bottleneck. A system that holds up under production load. A result a team can actually use.
