Crawl4AI

Open-source web-to-markdown library optimized for LLM knowledge ingestion. 42K+ GitHub stars. The wiki’s canonical “scrape a docs site for an agent” primitive — feeds any downstream RAG pipeline (Dify, custom Postgres+PGVector setups, CAG pre-load caches, Context7-style curated indexes).

  • Repo: github.com/unclecode/crawl4ai
  • Stars: 42K+ (per Cole’s recording)
  • License: Apache-2.0
  • Output: markdown optimized for LLM understanding (not raw HTML)

The three crawl strategies

Most “scrape a docs site” tasks fall into one of three patterns. Crawl4AI handles all three:

| Strategy | When to use | How it works |
| --- | --- | --- |
| Sitemap-based | Site exposes sitemap.xml | Batch-parallel crawl, 5-20 URLs at a time |
| Recursive | No sitemap | Depth-limited exploration from a root URL |
| llms.txt single-page | Site provides a curated llms.txt | Fetch the one file |

Cole Medin’s Pydantic AI + Chroma DB template auto-detects the right strategy from the URL.
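A minimal sketch of that auto-detection idea (the function name and heuristics are illustrative assumptions, not the template's actual code):

```python
from urllib.parse import urlparse

def detect_strategy(url: str, has_sitemap: bool = False) -> str:
    """Pick one of the three crawl strategies from the URL shape.
    Illustrative heuristic only -- not Cole Medin's actual template code."""
    path = urlparse(url).path
    if path.endswith(("llms.txt", "llms-full.txt")):
        return "llms_txt"    # curated single file: just fetch it
    if path.endswith("sitemap.xml") or has_sitemap:
        return "sitemap"     # batch-parallel crawl over the sitemap's URLs
    return "recursive"       # no sitemap: depth-limited explore from the root

print(detect_strategy("https://docs.example.com/llms.txt"))     # llms_txt
print(detect_strategy("https://docs.example.com/sitemap.xml"))  # sitemap
print(detect_strategy("https://docs.example.com/"))             # recursive
```

A real detector would also probe `/sitemap.xml` over HTTP rather than rely on the URL string alone.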

Demo scale (per the source)

  • Crawl4AI’s own docs → 457 chunks
  • Pydantic docs → 2,420 chunks
  • LangGraph docs → 788 chunks

All processed in seconds-to-minutes with parallel batching.
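That throughput comes from keeping a bounded number of page fetches in flight at once. A minimal sketch of the pattern with an asyncio semaphore (the `fetch` callable is a stand-in for a real page-to-markdown call, not Crawl4AI's internal implementation):

```python
import asyncio

async def crawl_in_batches(urls, fetch, max_concurrent=10):
    """Crawl URLs with at most `max_concurrent` in flight, mirroring the
    5-20-at-a-time sitemap strategy. `fetch` is a placeholder for a real
    page->markdown crawler call."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

# Demo with a fake fetcher; a real pipeline would invoke the crawler here.
async def fake_fetch(url):
    await asyncio.sleep(0)  # simulate network I/O
    return f"# markdown for {url}"

urls = [f"https://docs.example.com/page{i}" for i in range(25)]
results = asyncio.run(crawl_in_batches(urls, fake_fetch, max_concurrent=10))
print(len(results))  # 25
```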

Where it fits in the wiki

Crawl4AI is the upstream primitive for nearly every RAG / knowledge-base pattern in the wiki:

Crawl4AI → markdown chunks → choose your downstream
     ├── Vector store ([[supabase]] PGVector / Qdrant / Chroma) → traditional RAG
     ├── Long-context model ([[gemini]] 2.0 Flash) → [[context-augmented-generation|CAG]]
     ├── Curated MCP server → [[context7]]-style
     └── This wiki's own ingest pipeline (manual curation)
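The "markdown chunks" step above is typically header-aware splitting before embedding. A rough sketch, with the splitting rule as an assumption (real pipelines also cap chunk size by token count):

```python
def chunk_by_headers(markdown: str) -> list[str]:
    """Split markdown into chunks at H1/H2 headers. Illustrative only --
    production RAG chunkers also enforce a maximum chunk size."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith(("# ", "## ")) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\ntext\n## Setup\npip install crawl4ai\n## Usage\ncrawl it"
for chunk in chunk_by_headers(doc):
    print(repr(chunk))  # three chunks, one per header section
```

Each chunk then goes to whichever downstream the tree above selects: embedded into a vector store, concatenated for a long-context model, or curated by hand.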

See Also

  • CAG — natural downstream consumer
  • Context7 — curated-corpus alternative for the same use case
  • RAG vs Wiki
  • Dify — no-code RAG that consumes Crawl4AI-style markdown
  • Cole Medin