A Recommended Project from the LLM Ecosystem
1. Project Overview
Crawl4AI is an open-source web crawler and data-extraction tool designed specifically for large language models (LLMs) and AI applications. It not only collects web data efficiently, but also outputs clean, structured Markdown directly, making it a great fit for RAG (retrieval-augmented generation), AI fine-tuning, knowledge-base construction, and similar scenarios.
2. Key Highlights
- LLM-ready output: smart, concise Markdown that downstream AI pipelines can consume directly.
- Fast: real-time crawling with a claimed 6x speedup, balancing performance and cost.
- Flexible browser control: session management, proxies, and custom hooks make anti-bot measures and complex pages manageable (see the session sketch below).
- Heuristic extraction: built-in algorithms reduce reliance on large models and make information extraction more efficient.
- Open source and easy to deploy: no API key required; Docker and cloud deployment are supported.
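As a taste of the browser control mentioned above, here is a minimal sketch of session reuse. It assumes the `session_id` option on `CrawlerRunConfig` (present in recent releases); the URLs are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Sharing a session_id keeps subsequent arun() calls in the same
    # browser tab, so cookies and login state persist between requests.
    config = CrawlerRunConfig(session_id="demo_session")
    async with AsyncWebCrawler() as crawler:
        await crawler.arun(url="https://example.com/login", config=config)
        result = await crawler.arun(url="https://example.com/dashboard", config=config)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```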
3. Installation and Quick Start
Install from PyPI and run the setup command:

```bash
pip install crawl4ai
crawl4ai-setup  # one-step browser environment setup
```
If you run into browser-related issues, you can install Playwright manually:
```bash
python -m playwright install --with-deps chromium
```
A minimal crawl in Python looks like this:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
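For several pages at once, `arun_many` runs crawls concurrently. A minimal sketch (the second URL is just an illustrative placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://www.nbcnews.com/business",
        "https://www.nbcnews.com/world",  # placeholder second page
    ]
    async with AsyncWebCrawler() as crawler:
        # One CrawlResult per URL, crawled concurrently
        results = await crawler.arun_many(urls)
        for result in results:
            print(result.url, "ok" if result.success else "failed")

if __name__ == "__main__":
    asyncio.run(main())
```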
Crawl4AI also ships a `crwl` command-line tool:

```bash
# Basic crawl with Markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with a BFS strategy, up to 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Ask an LLM to extract by question
crwl https://www.example.com/products -q "Extract all product prices"
```
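The `--deep-crawl bfs` flag has a Python counterpart. Here is a sketch assuming the deep-crawl API of recent releases (`BFSDeepCrawlStrategy` and its `max_depth`/`max_pages` parameters may differ across versions):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # Breadth-first crawl starting from the seed URL, capped at 10 pages
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=10)
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url="https://docs.crawl4ai.com", config=config)
        for result in results:
            print(result.url)

if __name__ == "__main__":
    asyncio.run(main())
```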
4. Typical Use Cases
- Building AI knowledge bases, FAQ systems, and internal enterprise search
- Automated collection of news, forum posts, and product listings
- Custom extraction strategies for all kinds of structured and semi-structured data
- Pairing with an LLM for question answering and information extraction (see the sketch below)
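A hedged sketch of the LLM-assisted extraction from the last bullet, mirroring the CLI `-q` example above. It uses `LLMExtractionStrategy` in the older provider/api_token constructor style (newer releases wrap these in an `LLMConfig`); the model name, token, and URL are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    # Ask an LLM to pull specific facts out of the crawled page
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",   # placeholder model name
        api_token="sk-...",              # your own API key
        instruction="Extract all product names and prices as JSON",
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com/products", config=config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```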
5. Advanced Usage Examples
Custom content filtering and Markdown generation
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            # Prune low-value page blocks before generating Markdown
            content_filter=PruningContentFilter(
                threshold=0.48, threshold_type="fixed", min_word_threshold=0
            )
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(result.markdown.raw_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
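With a content filter attached, recent releases also expose the filtered text as `result.markdown.fit_markdown` alongside `raw_markdown`; the `threshold` passed to `PruningContentFilter` controls how aggressively low-value blocks are pruned.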
Structured extraction with a custom schema
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # CSS-selector schema: each field is pulled straight from the DOM
    schema = {
        "name": "Course Information",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"}
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    browser_config = BrowserConfig(headless=False, verbose=True)
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )
        courses = json.loads(result.extracted_content)
        print(json.dumps(courses, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
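Note that `JsonCssExtractionStrategy` works purely on CSS selectors, so this extraction runs without any LLM calls; the selectors above are specific to the kidocode.com page layout and need adapting for other sites.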