#hugging-face #chunking #deduplicate

bin+lib xet-data

Data processing pipeline for chunking, deduplication, and file reconstruction; used in the Hugging Face Xet client tools. Intended to be used through the API in the hf-xet package.

4 releases (stable)

Uses the Rust 2024 edition

1.5.2 Apr 20, 2026
1.5.1 Apr 6, 2026
1.5.0 Apr 3, 2026
0.0.0 Mar 31, 2026

#21 in #chunking

Download history (downloads per week):

  • 2026-03-31: 244
  • 2026-04-07: 6,256
  • 2026-04-14: 13,348
  • 2026-04-21: 8,717
  • 2026-04-28: 11,923
  • 2026-05-05: 19,067
  • 2026-05-12: 29,843
  • 2026-05-19: 33,301
  • 2026-05-26: 30,555
  • 2026-06-02: 23,290

122,222 downloads per month
Used in 9 crates

Apache-2.0

1MB
18K SLoC

Data processing pipeline for chunking, deduplication, and file reconstruction, used in the Hugging Face Xet storage tools.

Provides content-defined chunking via gear hashing, deduplication against metadata shards, and file reconstruction from deduplicated chunk references.
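To make the idea of content-defined chunking via gear hashing concrete, here is a minimal sketch of the general technique. It is not the xet-data implementation or its API: the function names, the seeded gear table, the mask, and the minimum/maximum chunk sizes are all assumptions chosen for illustration.

```rust
/// Build a 256-entry gear table from a fixed seed with SplitMix64.
/// Assumption: any fixed table of well-mixed 64-bit constants would do;
/// this is not the table used by xet-data.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Split `data` into content-defined chunks and return each chunk's
/// exclusive end offset. A boundary is declared when the rolling gear
/// hash has all `mask` bits clear (once the chunk has reached
/// `min_size`), or unconditionally when the chunk reaches `max_size`.
fn chunk_boundaries(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let gear = gear_table();
    let mut boundaries = Vec::new();
    let mut start = 0usize;
    let mut hash = 0u64;

    for (i, &byte) in data.iter().enumerate() {
        // Gear update: shift out old influence, mix in the new byte.
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let len = i + 1 - start;
        if (len >= min_size && hash & mask == 0) || len >= max_size {
            boundaries.push(i + 1);
            start = i + 1;
            hash = 0; // restart the hash for the next chunk
        }
    }
    // Whatever remains after the last boundary forms the final chunk.
    if start < data.len() {
        boundaries.push(data.len());
    }
    boundaries
}
```

Calling `chunk_boundaries(&bytes, 64, 1024, 0xFF)` would yield chunks between 64 and 1024 bytes long, except possibly a shorter final chunk. Because boundaries depend on the bytes themselves rather than on fixed offsets, an insertion near the start of a file tends to shift only nearby boundaries, which is what makes later chunks deduplicable.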


xet-data


Data processing pipeline for chunking, deduplication, and file reconstruction. Intended to be used through the API in the hf-xet package.

Overview

  • Content-defined chunking — Gear-hash based chunking for deduplication
  • Deduplication — Probe and register chunks against metadata shards
  • File reconstruction — Reassemble files from deduplicated chunk references
  • Progress tracking — Hooks for upload/download progress reporting
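The deduplication and reconstruction steps listed above can be sketched with an in-memory map. This is a hedged illustration of the general pattern only: `ChunkStore`, `chunk_id`, `add_file`, and `reconstruct` are hypothetical names, not the xet-data or hf-xet API, and the standard library's `DefaultHasher` stands in for the cryptographic chunk hash and metadata shards a real system would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory stand-in for a metadata shard plus chunk storage.
#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
}

/// Identify a chunk by a 64-bit hash of its bytes.
/// Assumption: a production system would use a collision-resistant hash.
fn chunk_id(chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    chunk.hash(&mut hasher);
    hasher.finish()
}

impl ChunkStore {
    /// Probe the store for each chunk and register the ones not yet known.
    /// Returns the ordered list of chunk ids for the file and the number
    /// of chunks that were newly stored.
    fn add_file(&mut self, chunks: &[&[u8]]) -> (Vec<u64>, usize) {
        let mut recipe = Vec::with_capacity(chunks.len());
        let mut new_chunks = 0;
        for chunk in chunks {
            let id = chunk_id(chunk);
            if !self.chunks.contains_key(&id) {
                self.chunks.insert(id, chunk.to_vec());
                new_chunks += 1;
            }
            recipe.push(id);
        }
        (recipe, new_chunks)
    }

    /// Reassemble a file from its chunk references.
    /// Returns `None` if any referenced chunk is missing from the store.
    fn reconstruct(&self, recipe: &[u64]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for id in recipe {
            out.extend_from_slice(self.chunks.get(id)?);
        }
        Some(out)
    }
}
```

In this sketch a file is stored as an ordered list of chunk ids, so a chunk that appears several times, or in several files, is kept once and referenced many times. Reconstruction is then a lookup and concatenation in list order.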

This crate is part of xet-core.

License

Apache-2.0

Dependencies

~26–41MB
~692K SLoC