How We Built a 15,000-Entry AI Tools Directory (Tech Stack & Lessons)
The full technical story behind Skiln.co: a 15,000+ entry AI tools directory. Next.js 15, Convex, Vercel, 11 data sources, Fuse.js search, SEO strategy, and monetization. No fluff.

Matty Reid · Founder & Lead Engineer · March 26, 2026 · 15 min read
TL;DR: The Stack
| Layer | Technology | Why We Chose It |
| --- | --- | --- |
| Frontend | Next.js 15 (App Router) | Static generation, ISR, React Server Components |
| Database | Convex | Real-time, serverless functions, TypeScript-native |
| Hosting | Vercel | Instant deploys, edge caching, ISR support |
| Search | Fuse.js | Client-side fuzzy search, zero API cost |
| Styling | Tailwind CSS | Rapid iteration, dark theme consistency |
| Data Pipeline | Python scrapers | 11 sources, deduplication, automated weekly runs |
| Blog | MDX in Next.js | SEO-first static generation |
| Store | Convex + Stripe | Digital product delivery, no third-party platform fees |
Total build time from zero to 15,000+ entries: 10 days. Total monthly hosting cost: under $20.
Table of Contents
- Why We Built This
- The Data Problem: 11 Sources, One Directory
- The Scraping Pipeline
- Database Architecture: Why Convex
- Frontend: Next.js 15 and the ISR Decision
- Search: Fuse.js Over Everything Else
- SEO Strategy: How We Rank for 2,000+ Keywords
- Monetization: Store, Sponsorships, and Affiliates
- Mistakes We Made (and You Should Avoid)
- Frequently Asked Questions
Why We Built This {#why-we-built-this}
I could not find what I was looking for. That is the honest origin story.
In early 2026, the Claude Code ecosystem was exploding. There were 60,000+ published skills, 12,000+ MCP servers, thousands of agents, commands, and hooks. But there was no single place that cataloged all of them. If you wanted to find a Terraform skill or a Postgres MCP server, you had to search across GitHub, npm, Smithery, PulseMCP, awesome-lists, and Twitter threads.
The existing directories were fragmented. Smithery focused on MCP servers only. SkillsDirectory covered skills only. The awesome-mcp-servers GitHub repo was a flat markdown list with no search, no categories, and no metadata beyond a name and URL.
I wanted one directory that had everything (skills, MCP servers, agents, commands, hooks) with proper categorization, search, and enough metadata to make informed decisions. And I wanted it to be fast. Not "load a huge JSON file and hope for the best" fast, but "statically generated with sub-100ms page loads" fast.
So I built it. Here is exactly how.
The Data Problem: 11 Sources, One Directory {#the-data-problem}
The first challenge was data acquisition. The AI tools ecosystem is not neatly organized. Tools are scattered across:
- awesome-mcp-servers (GitHub): The largest community-maintained MCP list. Markdown format with basic metadata.
- Smithery: A curated MCP registry with structured data (descriptions, install commands, categories).
- MCP.so: Another MCP directory with its own categorization.
- npm registry: Many MCP servers and skills are published as npm packages.
- PyPI: Python-based MCP servers and tools.
- LobeHub: Agent and plugin registry with detailed metadata.
- PulseMCP: MCP-focused directory with compatibility information.
- awesome-claude-skills (GitHub): Community skills list, markdown format.
- SkillsDirectory: Dedicated skills catalog.
- OneSKILL: Another skills catalog with different coverage.
- GitHub API search: Direct searches for repositories tagged with `mcp-server`, `claude-skill`, `claude-code-agent`, etc.
Each source has different data quality, different schemas, and different update frequencies. Some give you a name and URL. Others give you a full description, install command, GitHub stars, last updated date, and compatibility information.
The goal was to ingest everything, deduplicate aggressively, and normalize into a consistent schema that the frontend could consume.
The Scraping Pipeline {#the-scraping-pipeline}
The pipeline is Python. I considered Node.js (to keep everything in one ecosystem) but Python's scraping libraries are just better. BeautifulSoup, httpx, and the GitHub API client made the work straightforward.
The Pipeline Architecture
[11 Data Sources]
        ↓
[Source-specific scrapers]   – each source has its own parser
        ↓
[Normalization layer]        – maps to unified schema
        ↓
[Deduplication engine]       – URL matching, fuzzy name matching
        ↓
[Enrichment layer]           – GitHub stars, last commit, README extraction
        ↓
[Convex upload]              – batch upsert to database
Source-Specific Scrapers
Each source gets its own scraper because every source has a different format:
- GitHub awesome-lists: Parse markdown, extract links and descriptions from list items
- Smithery/MCP.so/PulseMCP: HTTP requests to their APIs or scrape their HTML
- npm/PyPI: Package registry API queries with keyword filters
- GitHub API: Search repositories by topic tags, extract README content
The scrapers are deliberately simple. Each one outputs a list of dictionaries with the same keys: `name`, `url`, `description`, `source`, `category`, `install_command`, `github_url`, `github_stars`, `last_updated`.
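A minimal sketch of what the normalization layer does: map a raw source record onto the unified key set. The raw-side field names here (`title`, `repo`, `stars`, and so on) are hypothetical; each real source has its own shape.

```python
# Normalization sketch: every scraper output is coerced to these keys.
UNIFIED_KEYS = [
    "name", "url", "description", "source", "category",
    "install_command", "github_url", "github_stars", "last_updated",
]

def normalize(raw: dict, source: str) -> dict:
    """Return a dict with exactly the unified keys, filling gaps with None."""
    entry = {key: None for key in UNIFIED_KEYS}
    entry.update({
        # Sources disagree on field names; coalesce the common variants.
        "name": raw.get("name") or raw.get("title"),
        "url": raw.get("url") or raw.get("homepage"),
        "description": raw.get("description", ""),
        "source": source,
        "category": raw.get("category", "uncategorized"),
        "install_command": raw.get("install") or raw.get("install_command"),
        "github_url": raw.get("github_url") or raw.get("repo"),
        "github_stars": raw.get("stars"),
        "last_updated": raw.get("updated_at"),
    })
    return entry
```

Keeping the output shape identical across all 11 scrapers is what lets the downstream deduplication and upload steps stay source-agnostic.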
Deduplication
This is where most of the complexity lives. The same tool often appears in multiple sources with slightly different names, URLs, and descriptions.
The deduplication engine uses three strategies:
- Exact URL match: If two entries point to the same GitHub repo or npm package, they are the same tool. Merge metadata, keep the richest description.
- Normalized name match: Strip prefixes like `@modelcontextprotocol/`, `mcp-server-`, and `claude-skill-`. If the remaining names match, flag as a potential duplicate for manual review.
- Fuzzy name match (Levenshtein): Catch cases like `postgres-mcp` vs `postgresql-mcp-server`. A similarity score of 0.85 or above triggers a manual review flag.
Automated merges handle about 70% of duplicates. The remaining 30% go through a manual review queue: a simple CLI tool that shows me two entries side by side and lets me merge or skip.
After deduplication, 15,000+ entries remained from roughly 25,000 raw scraped entries. That is a 40% duplicate rate across the ecosystem, which tells you something about how fragmented the tooling landscape is.
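The three strategies can be sketched in Python. `difflib.SequenceMatcher` stands in here for a true Levenshtein ratio (the real pipeline may use a dedicated library), and the prefix list is the one quoted above:

```python
from difflib import SequenceMatcher

PREFIXES = ("@modelcontextprotocol/", "mcp-server-", "claude-skill-")

def normalized_name(name: str) -> str:
    """Lowercase and strip known ecosystem prefixes for comparison."""
    n = name.lower()
    for prefix in PREFIXES:
        if n.startswith(prefix):
            n = n[len(prefix):]
    return n

def classify_pair(a: dict, b: dict) -> str:
    """Return 'merge', 'review', or 'distinct' for two candidate entries."""
    # Strategy 1: exact URL match means the same tool; merge automatically.
    if a.get("github_url") and a.get("github_url") == b.get("github_url"):
        return "merge"
    # Strategy 2: identical name after prefix stripping; flag for review.
    if normalized_name(a["name"]) == normalized_name(b["name"]):
        return "review"
    # Strategy 3: fuzzy similarity at or above 0.85; flag for review.
    ratio = SequenceMatcher(
        None, normalized_name(a["name"]), normalized_name(b["name"])
    ).ratio()
    if ratio >= 0.85:
        return "review"
    return "distinct"
```

Ordering matters: the cheap, high-confidence URL check runs first so the fuzzy comparison only fires for pairs that cannot be resolved automatically.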
Enrichment
After deduplication, the enrichment layer adds metadata that no single source provides:
- GitHub stars and forks (from GitHub API)
- Last commit date (staleness indicator)
- README first paragraph (used as description if the source description was empty)
- Package download counts (from npm/PyPI)
- License type (from GitHub API)
This enrichment runs weekly to keep data fresh. The entire pipeline (scrape, normalize, deduplicate, enrich, upload) takes about 45 minutes on a single machine.
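A sketch of the enrichment merge step, assuming the GitHub metadata has already been fetched. The `github_meta` keys and the 180-day staleness threshold are illustrative assumptions, not the real API field names or the article's actual cutoff:

```python
from datetime import datetime, timezone

STALE_AFTER_DAYS = 180  # assumed threshold; the article does not name one

def enrich(entry: dict, github_meta: dict, now: datetime = None) -> dict:
    """Merge fetched GitHub metadata into an entry and flag stale projects."""
    now = now or datetime.now(timezone.utc)
    entry = dict(entry)  # do not mutate the caller's dict
    entry["github_stars"] = github_meta.get("stars")
    entry["last_updated"] = github_meta.get("last_commit")
    # Use the README's first paragraph when the source gave no description.
    if not entry.get("description"):
        entry["description"] = github_meta.get("readme_first_paragraph", "")
    # Staleness indicator: no commits for roughly half a year.
    last = github_meta.get("last_commit")
    if last:
        age = now - datetime.fromisoformat(last)
        entry["stale"] = age.days > STALE_AFTER_DAYS
    return entry
```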
Database Architecture: Why Convex {#database-architecture}
I chose Convex over Supabase, PlanetScale, and raw Postgres. Here is why.
The Requirements
- TypeScript-native schema. The frontend is Next.js with TypeScript. I wanted the database schema to live in TypeScript, not SQL, so that type safety flows from database to UI without an ORM translation layer.
- Real-time subscriptions. When I update entries from the scraping pipeline, the directory pages should update without a full rebuild. Convex subscriptions handle this natively.
- Serverless functions. I did not want to build and deploy a separate API layer. Convex functions run on their infrastructure and are called directly from the frontend.
- Full-text search. Convex has built-in search indexes. I ended up using Fuse.js for the primary search UX, but Convex search is useful for admin queries and filtering.
The Schema
// Simplified schema (convex/schema.ts)
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
tools: defineTable({
name: v.string(),
slug: v.string(),
type: v.union(
v.literal("skill"),
v.literal("mcp"),
v.literal("agent"),
v.literal("command"),
v.literal("hook")
),
description: v.string(),
url: v.string(),
githubUrl: v.optional(v.string()),
githubStars: v.optional(v.number()),
installCommand: v.optional(v.string()),
category: v.string(),
tags: v.array(v.string()),
lastUpdated: v.string(),
source: v.string(),
featured: v.boolean(),
})
.index("by_type", ["type"])
.index("by_category", ["category"])
.index("by_slug", ["slug"])
.searchIndex("search_name_desc", {
searchField: "name",
filterFields: ["type", "category"],
}),
})
These indexes cover every query pattern the frontend uses: type-filtered browsing (all skills, all MCPs), category pages, individual tool pages by slug, and search.
The Cost Reality
Convex's free tier handles the current data volume comfortably. At 15,000 entries with an average of 500 bytes per document, the total storage is about 7.5MB. Read queries are in the low thousands per day. Function invocations are well within free tier limits.
If traffic scales 10x, the Pro plan at $25/month covers it. If it scales 100x, we are looking at $100-200/month, still far cheaper than running a separate database and API server.
Frontend: Next.js 15 and the ISR Decision {#frontend-architecture}
Why Not Server-Side Rendering
This is the single most important architectural decision we made, and it was driven by a painful lesson.
Early in development, I used force-dynamic on several routes so that pages would always show the latest data. On a site with 15,000+ pages, this meant every page request hit the Convex database. Vercel charges for serverless function invocations, and the bill climbed to $4.20 in a single day during a traffic spike.
I ripped out every force-dynamic directive and switched to static generation with ISR (Incremental Static Regeneration). Here is how it works:
- Category pages (`/skills`, `/mcps`, `/agents`, etc.) are statically generated at build time and revalidate every hour
- Individual tool pages (`/skills/superpowers`, `/mcps/github`, etc.) use ISR with a 1-hour revalidation window
- Blog posts are fully static: no revalidation needed since content does not change after publish
- The home page revalidates every 30 minutes for fresh "featured" and "trending" sections
This approach keeps the Vercel bill under $20/month while serving pages in under 100ms globally.
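The economics of the decision fit in a back-of-envelope model. The per-invocation price below is an assumption for illustration, not Vercel's actual rate card; the point is the shape of the two curves, not the exact dollars:

```python
# Cost model sketch: force-dynamic bills per page view, ISR is capped.
PRICE_PER_MILLION_INVOCATIONS = 0.60  # assumed USD, illustrative only

def daily_invocations_dynamic(page_views: int) -> int:
    """With force-dynamic, every page view is a function invocation."""
    return page_views

def daily_invocations_isr(pages: int, revalidate_seconds: int) -> int:
    """With ISR, invocations are bounded by pages times revalidations
    per day, regardless of how much traffic arrives."""
    revalidations_per_day = 86_400 // revalidate_seconds
    return pages * revalidations_per_day

def daily_cost(invocations: int) -> float:
    return invocations / 1_000_000 * PRICE_PER_MILLION_INVOCATIONS
```

With 15,000 pages on a one-hour window, ISR invocations are capped at 360,000 per day no matter how large the traffic spike; under force-dynamic, a spike to millions of views bills linearly.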
The Route Structure
/               → Home (featured tools, search)
/skills         → All skills (filterable, searchable)
/mcps           → All MCP servers
/agents         → All agents
/commands       → All commands
/hooks          → All hooks
/skills/[slug]  → Individual skill page
/mcps/[slug]    → Individual MCP page
/blog           → Blog index
/blog/[slug]    → Blog post
/store          → Digital products store
/store/[slug]   → Product page
These route patterns generate 15,000+ pages. Every page has unique metadata, Open Graph tags, and JSON-LD schema markup. The sitemap is generated at build time and submitted to Google Search Console.
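The build-time sitemap step can be sketched in the pipeline's Python. The live site presumably uses Next.js's own sitemap support; this just shows the idea of emitting one `<url>` entry per generated path:

```python
# Sitemap generation sketch: one <url> entry per statically generated path.
from xml.sax.saxutils import escape

BASE = "https://skiln.co"

def sitemap_xml(paths: list) -> str:
    """Render a minimal sitemaps.org-compliant XML document."""
    urls = "\n".join(
        f"  <url><loc>{escape(BASE + p)}</loc></url>" for p in paths
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + urls
        + "\n</urlset>"
    )
```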
Search: Fuse.js Over Everything Else {#search-architecture}
Why Client-Side Search
For a directory with 15,000 entries, you have three search options:
- Server-side full-text search (Convex search, Algolia, ElasticSearch): fast, powerful, costs money per query
- API-mediated search: a server function that queries the database, adding latency and cost
- Client-side fuzzy search: download the search index to the client, search locally, zero API cost
I chose option 3. Here is why it works at this scale.
The search index contains only three fields per entry: name, description (first 100 characters), and type. At 15,000 entries, this index is approximately 1.2MB, well within acceptable page weight, especially when gzip-compressed to about 350KB.
Fuse.js runs the fuzzy matching entirely in the browser. Search results appear as the user types, with zero network round-trips. It handles typos, partial matches, and ranking by relevance.
The Tradeoff
Client-side search breaks at around 50,000-100,000 entries. If the directory grows beyond that, I will switch to Algolia or Convex's search index. But at current scale, Fuse.js delivers the fastest possible search UX: zero latency, zero server cost.
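A Python stand-in for the client-side idea: build a small in-memory index from just the three searched fields, then rank by fuzzy similarity. Fuse.js does the real work in the browser with its own scoring; `SequenceMatcher` and the flat description boost below are simplifications for illustration:

```python
from difflib import SequenceMatcher

def build_index(tools: list) -> list:
    """Keep only the three searched fields to keep the payload small."""
    return [
        {
            "name": t["name"],
            "description": t.get("description", "")[:100],
            "type": t["type"],
        }
        for t in tools
    ]

def search(index: list, query: str, limit: int = 10) -> list:
    """Rank entries by fuzzy name similarity plus a description-hit boost."""
    q = query.lower()

    def score(entry: dict) -> float:
        name_sim = SequenceMatcher(None, q, entry["name"].lower()).ratio()
        desc_hit = 0.3 if q in entry["description"].lower() else 0.0
        return name_sim + desc_hit

    return sorted(index, key=score, reverse=True)[:limit]
```

Because the whole index lives in memory, every keystroke can re-rank the full dataset with no network round-trip, which is exactly the property that makes the client-side option win at this scale.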
SEO Strategy: How We Rank for 2,000+ Keywords {#seo-strategy}
The directory alone is not enough for SEO. Google does not rank thin directory pages well: a page with a name, description, and install command does not satisfy search intent for most queries.
The strategy is a funnel:
Blog posts (deep, long-form content)
  → Link to directory pages (browse by category)
  → Link to individual tool pages (specific details)
  → Link to store (buy skill packs)
Blog Content Types
We publish four types of blog posts, each targeting different keyword clusters:
- "Best X for Y" listicles โ Best Claude skills for developers, best MCP servers for data engineers, best Claude skills for DevOps. These rank for high-intent search queries and link heavily to directory pages.
- "What is X" guides โ What are Claude skills, what is MCP. These rank for informational queries and establish topical authority.
- "How to" tutorials โ How to install Claude skills, how to build an MCP server. These rank for problem-solving queries and demonstrate expertise.
- Industry analysis โ State of AI agent tools 2026, ecosystem comparisons. These build authority and attract backlinks.
Every blog post includes internal links to relevant directory pages (skiln.co/skills, skiln.co/mcps) and to other blog posts. This creates a tightly interlinked content cluster that signals topical authority to Google.
Technical SEO
Every page on Skiln has:
- Unique `<title>` and meta description tags
- JSON-LD schema markup (Article for blog posts, SoftwareApplication for directory entries)
- Open Graph tags with unique images
- Canonical URLs
- Proper heading hierarchy (single H1, logical H2/H3 structure)
- Mobile-responsive layout (directory cards reflow to single column)
- Core Web Vitals: LCP under 1.5s, CLS under 0.05, INP under 100ms
We submit the XML sitemap to Google Search Console and use IndexNow (supported by Bing, Yandex, and integrated with Vercel) for instant indexing of new pages. When we added 2,106 URLs in a single batch, IndexNow processed them within 48 hours.
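The JSON-LD step can be sketched generically. Property names follow the schema.org SoftwareApplication type; the exact set of properties the live site emits may differ:

```python
# JSON-LD sketch for a directory entry, following schema.org conventions.
import json

def software_application_jsonld(tool: dict) -> str:
    """Serialize a directory entry as a SoftwareApplication JSON-LD block."""
    data = {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": tool["name"],
        "description": tool["description"],
        "url": tool["url"],
        "applicationCategory": "DeveloperApplication",
    }
    return json.dumps(data, indent=2)
```

The resulting string is what would go inside a `<script type="application/ld+json">` tag on each tool page.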
Monetization: Store, Sponsorships, and Affiliates {#monetization}
Building a directory is a vanity project unless it generates revenue. Here is how Skiln monetizes.
Digital Store
The store sells curated bundles:
- Skill packs ($29-$79): Pre-configured skill collections for specific roles (DevOps, frontend, data engineering). Each pack includes 5-10 skills, a setup guide, and CLAUDE.md templates.
- MCP starter kits ($49-$149): Complete MCP server configurations for common stacks (full-stack web, data pipeline, mobile dev). Each kit includes server configs, environment templates, and troubleshooting guides.
- Claude Code power user kits ($79-$149): Everything (skills, MCP configs, hooks, commands, and CLAUDE.md templates) bundled for specific workflows.
We launched with 37 products in late March 2026. The key insight: developers will pay for curation and configuration. The individual skills and MCP servers are free and open source. What people pay for is the assembly, meaning someone who has tested the combinations, written the configs, and documented the gotchas.
Featured Listings
Commercial tools in the directory can pay for featured placement. Featured listings appear at the top of category pages with a "Featured" badge. Pricing is simple: $50/month per category. At current traffic levels, this is a cost-effective acquisition channel for developer tool companies.
Affiliate Partnerships
Some directory entries link to platforms with affiliate programs. When a user clicks through to NowPayments, Supabase, or Vercel from a directory entry or blog post, Skiln earns a referral commission. This is disclosed in the footer.
Mistakes We Made (and You Should Avoid) {#mistakes}
Mistake 1: force-dynamic on Vercel
Already covered above, but it bears repeating. On a content site with thousands of pages, force-dynamic will drain your wallet. Use static generation with ISR. Always.
Mistake 2: Not Deduplicating Early Enough
The first version of the pipeline dumped all 25,000 raw entries into Convex. The directory had obvious duplicates: three entries for the same GitHub MCP server, each from a different source. This looked unprofessional and confused users. Deduplication should be the first pipeline step, not an afterthought.
Mistake 3: Over-Engineering the Search
I spent two days building a Convex-powered search with server-side ranking, filters, and faceted navigation. Then I replaced it with Fuse.js in an afternoon. The client-side search was faster (zero latency vs. 100-200ms round-trip) and eliminated an entire category of server costs. For datasets under 50K entries, client-side search wins.
Mistake 4: Neglecting Category Pages
The first version had a single flat directory page with all 15,000 entries. It was overwhelming. Adding dedicated category pages (/skills, /mcps, /agents, /commands, /hooks) with type-specific filters transformed the UX. Each category page also became a distinct SEO target.
Mistake 5: Not Starting the Blog Sooner
The directory launched a week before the blog. In that week, we got almost zero organic traffic, because Google does not rank thin directory pages. The day we published the first batch of blog posts with internal links to directory pages, traffic started climbing. Content is the distribution channel. The directory is the product. They need each other.
What We Would Do Differently
If I were starting over today:
- Blog first, directory second. Write 10 pillar blog posts, build the audience, then launch the directory as the natural next step.
- Start with 5 data sources, not 11. The long tail sources added complexity but minimal unique entries. GitHub, Smithery, and npm cover 80% of the ecosystem.
- Build the store from day one. Having products ready at launch would have captured early traffic that had no conversion path.
- Use a monorepo from the start. The scraping pipeline and the Next.js app are in separate repos. They should be in one Turborepo monorepo for easier shared types and deployment coordination.
Frequently Asked Questions {#faq}
What tech stack does Skiln.co use?
Skiln.co is built on Next.js 15 with the App Router, Convex as the real-time database, Vercel for hosting and deployment, Fuse.js for client-side fuzzy search, and Tailwind CSS for styling. The data pipeline uses Python scrapers pulling from 11 different sources including GitHub repos, npm registries, and community-maintained lists.
How does Skiln collect data for 15,000+ AI tools?
Skiln pulls from 11 data sources using automated scrapers: the awesome-mcp-servers GitHub repo, Smithery's MCP registry, the MCP.so directory, npm package registry, PyPI, LobeHub, PulseMCP, the awesome-claude-skills repo, SkillsDirectory, OneSKILL, and direct GitHub API searches. Each source feeds into a deduplication pipeline that normalizes entries, detects duplicates via URL and name matching, and merges metadata.
How does Skiln handle SEO for 15,000+ pages?
Skiln uses static generation with Incremental Static Regeneration (ISR) for directory pages โ not server-side rendering. Category pages are statically generated at build time. Individual tool pages use ISR with a 1-hour revalidation window. The blog uses fully static generation. This approach keeps hosting costs under $20/month on Vercel's Pro plan while maintaining strong Core Web Vitals.
What database does Skiln use and why?
Skiln uses Convex, a real-time database with built-in TypeScript functions. The choice was driven by three factors: real-time subscriptions (directory entries update without page refreshes), integrated serverless functions (no separate API layer needed), and a generous free tier that handles the current data volume. Convex also provides full-text search, though Skiln uses Fuse.js for the primary search experience.
How does Skiln monetize the directory?
Skiln monetizes through three channels: a digital store selling Claude Code skill packs and MCP starter kits ($29-$149 price points), featured/sponsored listings in the directory for commercial tools, and affiliate partnerships. The store launched in late March 2026 with 37 products. The directory traffic is primarily organic from SEO-targeted blog posts that funnel readers into the directory and store.
