
pdf

Process PDF files - extract text, create PDFs, merge documents. Use when user asks to read PDF, create PDF, or work with PDF files.

shareAI-lab


SKILL.md
Metadata
name: pdf
description: Process PDF files - extract text, create PDFs, merge documents. Use when user asks to read PDF, create PDF, or work with PDF files.

PDF Processing Skill

You now have expertise in PDF manipulation. Follow these workflows:

Reading PDFs

Option 1: Quick text extraction (preferred)

    # Using pdftotext (poppler-utils)
    pdftotext input.pdf -            # Output to stdout
    pdftotext input.pdf output.txt   # Output to file

    # If pdftotext not available, try:
    python3 -c "
    import fitz  # PyMuPDF
    doc = fitz.open('input.pdf')
    for page in doc:
        print(page.get_text())
    "

Option 2: Page-by-page with metadata

    import fitz  # pip install pymupdf

    doc = fitz.open("input.pdf")
    print(f"Pages: {len(doc)}")
    print(f"Metadata: {doc.metadata}")

    for i, page in enumerate(doc):
        text = page.get_text()
        print(f"--- Page {i+1} ---")
        print(text)
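When page-by-page extraction returns little or no text for a page, that page is probably a scanned image and a candidate for OCR. The sketch below flags such pages. The function name `pages_needing_ocr` and the `min_chars` threshold are hypothetical, chosen for illustration; they are not part of PyMuPDF.

```python
def pages_needing_ocr(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text is effectively empty.

    `page_texts` is any iterable of strings, one per page, for example
    (page.get_text() for page in doc) when using PyMuPDF.
    `min_chars` is an illustrative threshold, not a library setting.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Whitespace-only pages count as empty.
        if len(text.strip()) < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = ["Invoice 2024", "", "  \n\t", "Total: 42"]
    print(pages_needing_ocr(sample))  # [2, 3]
```

Passing a generator instead of a list keeps only one page of text in memory at a time, which suits large PDFs.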

Creating PDFs

Option 1: From Markdown (recommended)

    # Using pandoc
    pandoc input.md -o output.pdf

    # With custom styling
    pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in

Option 2: Programmatically

    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("output.pdf", pagesize=letter)
    c.drawString(100, 750, "Hello, PDF!")
    c.save()
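`drawString` places a single line at fixed coordinates, so longer text has to be wrapped and paginated by the caller. A minimal sketch of one way to do that follows. The helpers `paginate` and `write_text_pdf`, and the width, margin, and spacing values, are illustrative assumptions rather than ReportLab defaults.

```python
import textwrap


def paginate(text, width=80, lines_per_page=45):
    """Wrap `text` to `width` characters and group the lines into pages."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty string; keep the blank line.
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]


def write_text_pdf(text, path="output.pdf"):
    """Write plain text to a multi-page PDF with ReportLab."""
    # Imported here so paginate() works without ReportLab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    _, page_height = letter
    c = canvas.Canvas(path, pagesize=letter)
    for page in paginate(text):
        c.setFont("Helvetica", 11)  # font state resets after showPage()
        y = page_height - 72        # start one inch below the top edge
        for line in page:
            c.drawString(72, y, line)
            y -= 14                 # line spacing in points (assumed)
        c.showPage()                # finish this page and start the next
    c.save()
```

The character-based wrap width is only a rough fit for a proportional font; for precise layout, ReportLab's higher-level Platypus components are the usual choice.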

Option 3: From HTML

    # Using wkhtmltopdf
    wkhtmltopdf input.html output.pdf

    # Or with Python
    python3 -c "
    import pdfkit
    pdfkit.from_file('input.html', 'output.pdf')
    "

Merging PDFs

    import fitz

    result = fitz.open()
    for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
        doc = fitz.open(pdf_path)
        result.insert_pdf(doc)
    result.save("merged.pdf")
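The merge loop can be wrapped in a function that checks its inputs first and closes every document it opens. This is a sketch under stated assumptions: `merge_pdfs` is a hypothetical helper name, and the early existence check is a design choice, not something PyMuPDF requires.

```python
from pathlib import Path


def merge_pdfs(paths, output="merged.pdf"):
    """Merge PDFs in the given order, failing early if an input is missing."""
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing input PDFs: {', '.join(missing)}")

    import fitz  # PyMuPDF; imported after validation so the check runs without it

    result = fitz.open()  # new, empty PDF
    try:
        for pdf_path in paths:
            # The context manager closes each source once its pages are copied.
            with fitz.open(str(pdf_path)) as doc:
                result.insert_pdf(doc)
        result.save(output)
    finally:
        result.close()
    return output
```

Validating before opening anything means a typo in the last filename does not leave a partially built result behind.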

Splitting PDFs

    import fitz

    doc = fitz.open("input.pdf")
    for i in range(len(doc)):
        single = fitz.open()
        single.insert_pdf(doc, from_page=i, to_page=i)
        single.save(f"page_{i+1}.pdf")
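The same `insert_pdf` call can split a document into multi-page chunks, since `from_page` and `to_page` are zero-based and inclusive. The sketch below assumes hypothetical helper names (`chunk_ranges`, `split_pdf`) and an illustrative `part_N.pdf` naming scheme.

```python
def chunk_ranges(page_count, chunk_size):
    """Return inclusive, zero-based (first, last) page ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write consecutive chunks of `chunk_size` pages to separate files."""
    import fitz  # PyMuPDF; imported here so chunk_ranges() works without it

    written = []
    with fitz.open(path) as doc:
        ranges = chunk_ranges(len(doc), chunk_size)
        for index, (first, last) in enumerate(ranges, start=1):
            with fitz.open() as part:  # new, empty PDF for this chunk
                part.insert_pdf(doc, from_page=first, to_page=last)
                name = f"{prefix}_{index}.pdf"
                part.save(name)
                written.append(name)
    return written
```

For a 10-page document, `chunk_ranges(10, 4)` yields `[(0, 3), (4, 7), (8, 9)]`, so the last file holds the two remaining pages.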

Key Libraries

| Task | Library | Install |
|------|---------|---------|
| Read/Write/Merge | PyMuPDF | pip install pymupdf |
| Create from scratch | ReportLab | pip install reportlab |
| HTML to PDF | pdfkit | pip install pdfkit + wkhtmltopdf |
| Text extraction | pdftotext | brew install poppler / apt install poppler-utils |

Best Practices

  1. Always check if tools are installed before using them
  2. Handle encoding issues - PDFs may contain various character encodings
  3. Large PDFs: Process page by page to avoid memory issues
  4. OCR for scanned PDFs: Use pytesseract (which needs the Tesseract binary installed separately) if text extraction returns empty text
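The first practice, checking that tools are installed, can be sketched with the standard library alone. The tool and module lists below mirror the ones named in this skill; `check_tools` is a hypothetical helper name, and `fitz` is the import name used for PyMuPDF throughout this document.

```python
import importlib.util
import shutil

CLI_TOOLS = ["pdftotext", "pandoc", "wkhtmltopdf"]
PY_MODULES = ["fitz", "reportlab", "pdfkit", "pytesseract"]


def check_tools():
    """Map each CLI tool or Python module name to True if it can be found."""
    # shutil.which looks the executable up on PATH without running it.
    status = {name: shutil.which(name) is not None for name in CLI_TOOLS}
    for module in PY_MODULES:
        # find_spec locates a top-level module without importing it.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, available in check_tools().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Running the check up front makes it possible to pick a workflow (for example, `pdftotext` versus PyMuPDF) before any file is touched.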