# Crawl4AI
Open-source web-to-markdown library optimized for LLM knowledge ingestion. 42K+ GitHub stars. The wiki’s canonical “scrape a docs site for an agent” primitive — feeds any downstream RAG pipeline (Dify, custom Postgres+PGVector setups, CAG pre-load caches, Context7-style curated indexes).
- Repo: github.com/unclecode/crawl4ai
- Stars: 42K+ (per Cole’s recording)
- License: open-source
- Output: markdown optimized for LLM understanding (not raw HTML)
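The core workflow is a single async call: point the crawler at a URL and get back LLM-ready markdown. A minimal sketch, assuming Crawl4AI's `AsyncWebCrawler` / `arun` API (the target URL is illustrative):

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main() -> None:
    # One URL in, markdown optimized for LLM understanding out (not raw HTML).
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.crawl4ai.com/")
        print(result.markdown[:500])  # feed the full string to your chunker/embedder


if __name__ == "__main__":
    asyncio.run(main())
```

Requires `pip install crawl4ai` and network access; the markdown string is what flows into every downstream pipeline below.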
## The three crawl strategies
Most “scrape a docs site” tasks fall into one of three patterns. Crawl4AI handles all three:
| Strategy | When to use | How it works |
|---|---|---|
| Sitemap-based | Site exposes sitemap.xml | Batch-parallel crawl, 5-20 URLs at a time |
| Recursive | No sitemap | Depth-limited explore from a root URL |
| llms.txt single-page | Site provides curated llms.txt | Fetch the one file |
Cole Medin’s Pydantic AI + Chroma DB template auto-detects the right strategy from the URL.
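The auto-detection can be sketched as a small classifier. This is a hypothetical heuristic that only inspects the URL's path (the actual template may probe the site for `sitemap.xml` / `llms.txt` before falling back):

```python
from urllib.parse import urlparse


def detect_strategy(url: str) -> str:
    """Map a docs URL to one of the three crawl strategies.

    Hypothetical heuristic for illustration: classify by URL path alone.
    """
    path = urlparse(url).path.lower()
    if path.endswith(("llms.txt", "llms-full.txt")):
        return "llms.txt single-page"  # curated file: fetch just this one URL
    if path.endswith("sitemap.xml"):
        return "sitemap-based"         # batch-parallel crawl of the listed URLs
    return "recursive"                 # no sitemap: depth-limited explore from root
```

A path-based check keeps the decision cheap and deterministic; probing the site instead would catch sitemaps the user didn't link directly, at the cost of extra requests.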
## Demo scale (per the source)
- Crawl4AI’s own docs → 457 chunks
- Pydantic docs → 2,420 chunks
- LangGraph docs → 788 chunks
All processed in seconds-to-minutes with parallel batching.
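The parallel batching boils down to slicing the URL list into fixed-size groups and crawling each group concurrently. A minimal sketch of the slicing step (batch size of 10 is an assumption within the source's 5-20 range):

```python
from typing import Iterator, List


def batches(urls: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of 'urls', 'size' URLs at a time.

    Each yielded batch would be crawled concurrently (e.g. gathered as
    async tasks), which is what makes thousands of chunks take
    seconds-to-minutes rather than hours.
    """
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```

Tuning the batch size trades crawl speed against politeness to the target site.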
## Where it fits in the wiki
Crawl4AI is the upstream primitive for nearly every RAG / knowledge-base pattern in the wiki:
Crawl4AI → markdown chunks → choose your downstream:
- Vector store ([[supabase]] PGVector / Qdrant / Chroma) → traditional RAG
- Long-context model ([[gemini]] 2.0 Flash) → [[context-augmented-generation|CAG]]
- Curated MCP server → [[context7]]-style
- This wiki's own ingest pipeline (manual curation)
## Sources
- Turn ANY Website into LLM Knowledge — EVOLVED (Cole Medin, 2025-04-30)
## See Also
- CAG — natural downstream consumer
- Context7 — curated-corpus alternative for the same use case
- RAG vs Wiki
- Dify — no-code RAG that consumes Crawl4AI-style markdown
- Cole Medin