A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno)- or use via CLI, REST API, or MCP server.
npx skills add https://github.com/kreuzberg-dev/kreuzberg --skill kreuzberg使用 CLI 安装这个技能,并在你的工作区中直接复用对应的 SKILL.md 工作流。
Extract text, metadata, and code intelligence from 97+ file formats and 305 programming languages at native speeds without needing a GPU.
ExtractionResult.code_intelligence with semantic chunkingComplete Documentation | Live Demo | Installation Guides
Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:
Scripting Languages:
JavaScript/TypeScript:
Compiled Languages:
Native:
Containers:
Command-Line:
All language bindings include precompiled binaries for both x86_64 and aarch64 architectures on Linux and macOS.
Complete architecture coverage across all language bindings:
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| R | ✅ | ✅ | ✅ | ✅ |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |
Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. MacOS support is Apple Silicon only.
To use embeddings functionality:
Install ONNX Runtime 1.24+:
brew install onnxruntimeUse embeddings in your code - see Embeddings Guide
Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt, .pages |
Full text, tables, lists, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbers |
Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .key |
Slides, speaker notes, images, metadata |
.pdf |
Text, tables, images, metadata, OCR support | |
| eBooks | .epub, .fb2 |
Chapters, metadata, embedded resources |
| Database | .dbf |
Table data extraction, field type support |
| Hangul | .hwp, .hwpx |
Korean document format, text extraction |
| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif |
OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm |
Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection |
| Vector | .svg |
DOM parsing, embedded text, graphics metadata |
| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg |
DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv |
Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtf |
CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text |
| Category | Formats | Features |
|---|---|---|
.eml, .msg |
Headers, body (HTML/plain), attachments, UTF-16 support | |
| Archives | .zip, .tar, .tgz, .gz, .7z |
Recursive extraction, nested archives, metadata |
| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .ris, .nbib, .enw, .csl |
BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON |
| Scientific | .tex, .latex, .typ, .typst, .jats, .ipynb |
LaTeX, Typst, JATS journal articles, Jupyter notebooks |
| Publishing | .fb2, .docbook, .dbk, .opml |
FictionBook, DocBook XML, OPML outlines |
| Documentation | .pod, .mdoc, .troff |
Perl POD, man pages, troff |
| Feature | Description |
|---|---|
| Structure Extraction | Functions, classes, methods, structs, interfaces, enums |
| Import/Export Analysis | Module dependencies, re-exports, wildcard imports |
| Symbol Extraction | Variables, constants, type aliases, properties |
| Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| Diagnostics | Parse errors with line/column positions |
| Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by tree-sitter-language-pack with dynamic grammar download. See TSLP documentation for the full language list.
Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.
Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.
Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.
Install the skill into any project using the Vercel Skills CLI:
npx skills add kreuzberg-dev/kreuzberg
The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.
Contributions are welcome! See CONTRIBUTING.md for guidelines.
Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.