# Crawl4AI
Open-source web-to-markdown library optimized for LLM knowledge ingestion. 42K+ GitHub stars. The wiki’s canonical “scrape a docs site for an agent” primitive — feeds any downstream RAG pipeline (Dify, custom Postgres+PGVector setups, CAG pre-load caches, Context7-style curated indexes).
- Repo: github.com/unclecode/crawl4ai
- Stars: 42K+ (per Cole’s recording)
- License: open-source
- Output: markdown optimized for LLM understanding (not raw HTML)
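The core workflow is a single async call: point the crawler at a URL and get back LLM-ready markdown. A minimal sketch, assuming Crawl4AI's `AsyncWebCrawler` / `arun` API (the target URL is illustrative):

```python
import asyncio

from crawl4ai import AsyncWebCrawler


async def main() -> None:
    # One URL in, markdown optimized for LLM understanding out (not raw HTML).
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.crawl4ai.com/")
        print(result.markdown[:500])  # feed the full string to your chunker/embedder


if __name__ == "__main__":
    asyncio.run(main())
```

Requires `pip install crawl4ai` and network access; the markdown string is what flows into every downstream pipeline below.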
## The three crawl strategies
Most “scrape a docs site” tasks fall into one of three patterns. Crawl4AI handles all three:
| Strategy | When to use | How it works |
|---|---|---|
| Sitemap-based | Site exposes sitemap.xml | Batch-parallel crawl, 5-20 URLs at a time |
| Recursive | No sitemap | Depth-limited explore from a root URL |
| llms.txt single-page | Site provides curated llms.txt | Fetch the one file |
Cole Medin’s Pydantic AI + Chroma DB template auto-detects the right strategy from the URL.
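The auto-detection can be sketched as a small classifier. This is a hypothetical heuristic that only inspects the URL's path (the actual template may probe the site for `sitemap.xml` / `llms.txt` before falling back):

```python
from urllib.parse import urlparse


def detect_strategy(url: str) -> str:
    """Map a docs URL to one of the three crawl strategies.

    Hypothetical heuristic for illustration: classify by URL path alone.
    """
    path = urlparse(url).path.lower()
    if path.endswith(("llms.txt", "llms-full.txt")):
        return "llms.txt single-page"  # curated file: fetch just this one URL
    if path.endswith("sitemap.xml"):
        return "sitemap-based"         # batch-parallel crawl of the listed URLs
    return "recursive"                 # no sitemap: depth-limited explore from root
```

A path-based check keeps the decision cheap and deterministic; probing the site instead would catch sitemaps the user didn't link directly, at the cost of extra requests.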
## Demo scale (per the source)
- Crawl4AI’s own docs → 457 chunks
- Pydantic docs → 2,420 chunks
- LangGraph docs → 788 chunks
All processed in seconds-to-minutes with parallel batching.
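The parallel batching boils down to slicing the URL list into fixed-size groups and crawling each group concurrently. A minimal sketch of the slicing step (batch size of 10 is an assumption within the source's 5-20 range):

```python
from typing import Iterator, List


def batches(urls: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of 'urls', 'size' URLs at a time.

    Each yielded batch would be crawled concurrently (e.g. gathered as
    async tasks), which is what makes thousands of chunks take
    seconds-to-minutes rather than hours.
    """
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```

Tuning the batch size trades crawl speed against politeness to the target site.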
## Where it fits in the wiki
Crawl4AI is the upstream primitive for nearly every RAG / knowledge-base pattern in the wiki:
Crawl4AI → markdown chunks → choose your downstream:
- Vector store ([[supabase]] PGVector / Qdrant / Chroma) → traditional RAG
- Long-context model ([[gemini]] 2.0 Flash) → [[context-augmented-generation|CAG]]
- Curated MCP server → [[context7]]-style
- This wiki's own ingest pipeline (manual curation)
## Sources
- Turn ANY Website into LLM Knowledge — EVOLVED (Cole Medin, 2025-04-30)
## See Also
- CAG — natural downstream consumer
- Context7 — curated-corpus alternative for the same use case
- RAG vs Wiki
- Dify — no-code RAG that consumes Crawl4AI-style markdown
- Cole Medin