#web-scraping #web-crawler #llm #mcp #firecrawl

crw-extract

HTML extraction and markdown conversion engine for the CRW web scraper

35 releases (10 breaking)

Uses new Rust 2024

new 0.14.0 Jun 8, 2026
0.10.0 May 20, 2026
0.2.1 Mar 29, 2026

#1 in #web-scraping

Download history 1/week @ 2026-03-30 1/week @ 2026-04-06 13/week @ 2026-04-27 32/week @ 2026-05-04 14/week @ 2026-05-11 2/week @ 2026-05-18 4/week @ 2026-05-25 51/week @ 2026-06-01

75 downloads per month
Used in 6 crates (5 directly)

AGPL-3.0

435KB
9K SLoC

crw-extract

HTML content extraction and format conversion engine for the CRW web scraper.

crates.io docs.rs license

Overview

crw-extract converts raw HTML into clean, structured output formats for LLM consumption, RAG pipelines, and data extraction.

  • Markdown — High-fidelity HTML→Markdown via htmd (Turndown.js port): tables, code blocks, nested lists. Indented code blocks are post-processed into fenced (```) blocks for better LLM compatibility
  • Plain text — Tag-stripped, whitespace-normalized text
  • Cleaned HTML — Boilerplate removal (scripts, styles, nav, footer, ads)
  • Readability — Main-content extraction with text-density scoring and multi-selector fallback
  • CSS selector & XPath — Narrow content to specific DOM elements before conversion
  • Chunking — Split content into sentence, topic (heading-based), or regex-delimited chunks
  • BM25 & cosine filtering — Rank chunks by relevance to a query, return top-K results
  • Structured JSON — LLM-based extraction with JSON Schema validation (Anthropic tool_use + OpenAI function calling)

Installation

cargo add crw-extract

Usage

High-level extraction pipeline

The extract() function runs the full pipeline: clean → select → readability → convert → chunk → filter.

use crw_extract::{extract, ExtractOptions};
use crw_core::types::OutputFormat;

let html = r#"<html><body><article><h1>Hello</h1><p>World</p></article></body></html>"#;

let result = extract(ExtractOptions {
    raw_html: html,
    source_url: "https://example.com",
    status_code: 200,
    rendered_with: None,
    elapsed_ms: 42,
    formats: &[OutputFormat::Markdown, OutputFormat::Links],
    only_main_content: true,
    include_tags: &[],
    exclude_tags: &[],
    css_selector: None,
    xpath: None,
    chunk_strategy: None,
    query: None,
    filter_mode: None,
    top_k: None,
}).unwrap();

println!("{}", result.markdown.unwrap());
// # Hello
//
// World

HTML to Markdown

use crw_extract::markdown::html_to_markdown;

let md = html_to_markdown("<h1>Title</h1><p>Paragraph with <strong>bold</strong> text.</p>");
assert!(md.contains("# Title"));
assert!(md.contains("**bold**"));

HTML to plain text

use crw_extract::plaintext::html_to_plaintext;

let text = html_to_plaintext("<p>Hello <b>world</b></p>");
assert_eq!(text.trim(), "Hello world");

HTML cleaning

Remove boilerplate elements (scripts, styles, nav, footer, ads):

use crw_extract::clean::clean_html;

let html = r#"<html><body><nav>Menu</nav><article><p>Content</p></article><footer>Footer</footer></body></html>"#;
let cleaned = clean_html(html, true, &[], &[]).unwrap();
// nav and footer are stripped, article content is preserved

Filter by tag inclusion/exclusion:

use crw_extract::clean::clean_html;

let html = "<div><p>Keep this</p><span>Remove this</span></div>";
let result = clean_html(html, false, &["p".into()], &[]).unwrap();
assert!(result.contains("Keep this"));

CSS selector extraction

use crw_extract::selector::extract_by_css;

let html = r#"<div><article class="post"><p>Target content</p></article><aside>Sidebar</aside></div>"#;
let result = extract_by_css(html, "article.post").unwrap();
assert!(result.unwrap().contains("Target content"));

XPath extraction

use crw_extract::selector::extract_by_xpath;

let html = "<html><body><h1>Title</h1><p>Text</p></body></html>";
let result = extract_by_xpath(html, "//h1").unwrap();
assert_eq!(result.unwrap(), vec!["Title".to_string()]);

Chunking

Split content into chunks for RAG pipelines:

use crw_extract::chunking::chunk_text;
use crw_core::types::ChunkStrategy;

let text = "# Introduction\nFirst section.\n# Methods\nSecond section.";
let strategy = ChunkStrategy::Topic {
    max_chars: None,
    overlap_chars: None,
    dedupe: None,
};
let chunks = chunk_text(text, &strategy);
assert_eq!(chunks.len(), 2);

Chunk filtering

Rank chunks by relevance using BM25 or cosine similarity:

use crw_extract::filter::filter_chunks;
use crw_core::types::FilterMode;

let chunks = vec![
    "Rust is a systems programming language".to_string(),
    "The weather is sunny today".to_string(),
    "Rust provides memory safety without GC".to_string(),
];
let top = filter_chunks(&chunks, "Rust programming", &FilterMode::Bm25, 2);
assert_eq!(top.len(), 2);
// Chunks mentioning "Rust" are ranked higher

Metadata extraction

Extract title, description, Open Graph metadata, and links:

use crw_extract::readability::{extract_metadata, extract_links};

let html = r#"<html><head><title>My Page</title><meta name="description" content="A page"></head><body><a href="/about">About</a></body></html>"#;
let meta = extract_metadata(html);
assert_eq!(meta.title, Some("My Page".into()));

let links = extract_links(html, "https://example.com");
assert!(links.iter().any(|l| l.contains("/about")));

Part of CRW

This crate is part of the CRW workspace — a fast, lightweight, Firecrawl-compatible web scraper built in Rust.

Crate Description
crw-core Core types, config, and error handling
crw-renderer HTTP + CDP browser rendering engine
crw-extract HTML → markdown/plaintext extraction (this crate)
crw-crawl Async BFS crawler with robots.txt & sitemap
crw-server Firecrawl-compatible API server
crw-cli Standalone CLI (crw binary)
crw-mcp MCP stdio proxy binary

License

AGPL-3.0 — see LICENSE.

Dependencies

~23–38MB
~522K SLoC