PDF Processing Skill
You now have expertise in PDF manipulation. Follow these workflows:
Reading PDFs
Option 1: Quick text extraction (preferred)
    # Using pdftotext (poppler-utils)
    pdftotext input.pdf -            # Output to stdout
    pdftotext input.pdf output.txt   # Output to file
    # If pdftotext not available, try:
    python3 -c "
    import fitz  # PyMuPDF
    doc = fitz.open('input.pdf')
    for page in doc:
        print(page.get_text())
    "
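The preference order above (pdftotext first, PyMuPDF as the fallback) could be expressed in Python roughly as follows. This is a minimal sketch: `pick_backend` and `extract_text` are hypothetical helper names, not part of either tool, and it assumes PyMuPDF is importable as `fitz`.

```python
import importlib.util
import shutil
import subprocess


def pick_backend(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return 'pdftotext', 'pymupdf', or None, in order of preference.

    which and find_spec are injectable so the choice can be tested
    without either tool installed (an assumption of this sketch).
    """
    if which("pdftotext"):
        return "pdftotext"
    if find_spec("fitz") is not None:
        return "pymupdf"
    return None


def extract_text(pdf_path):
    """Extract all text from pdf_path with whichever backend is available."""
    backend = pick_backend()
    if backend == "pdftotext":
        # "-" sends the extracted text to stdout instead of a file.
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    if backend == "pymupdf":
        import fitz  # imported lazily so the pdftotext path does not need it
        doc = fitz.open(pdf_path)
        try:
            return "".join(page.get_text() for page in doc)
        finally:
            doc.close()
    raise RuntimeError("Install poppler-utils (pdftotext) or PyMuPDF first")
```

Calling `extract_text("input.pdf")` returns the text as one string, or raises `RuntimeError` when neither tool is present.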
Option 2: Page-by-page with metadata
    import fitz  # pip install pymupdf
    doc = fitz.open("input.pdf")
    print(f"Pages: {len(doc)}")
    print(f"Metadata: {doc.metadata}")
    for i, page in enumerate(doc):
        text = page.get_text()
        print(f"--- Page {i+1} ---")
        print(text)
Creating PDFs
Option 1: From Markdown (recommended)
    # Using pandoc
    pandoc input.md -o output.pdf
    # With custom styling
    pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
Option 2: Programmatically
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas
    c = canvas.Canvas("output.pdf", pagesize=letter)
    c.drawString(100, 750, "Hello, PDF!")
    c.save()
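ReportLab places text by absolute coordinates measured from the bottom-left corner of the page, so writing more than one line means computing each `y` position and starting a new page when the text runs out of room. The sketch below shows one way to do that. The helper names `line_positions` and `write_lines`, the 72-point margin, and the 14-point line spacing are all assumptions chosen for illustration.

```python
LETTER = (612, 792)  # US Letter in points (width, height)


def line_positions(line_count, page_height=792, margin=72, leading=14):
    """Return a (page_index, y) pair for each line, top to bottom.

    margin and leading are illustrative defaults, not ReportLab settings.
    """
    per_page = int((page_height - 2 * margin) // leading)
    positions = []
    for i in range(line_count):
        page, row = divmod(i, per_page)
        positions.append((page, page_height - margin - row * leading))
    return positions


def write_lines(lines, out_path, margin=72):
    """Draw each string in lines on its own row, adding pages as needed."""
    from reportlab.pdfgen import canvas  # imported lazily; needs reportlab

    c = canvas.Canvas(out_path, pagesize=LETTER)
    current_page = 0
    for text, (page, y) in zip(lines, line_positions(len(lines), margin=margin)):
        if page != current_page:
            c.showPage()  # finish the current page and start a new one
            current_page = page
        c.drawString(margin, y, text)
    c.save()
```

With these defaults a page holds 46 lines, so `write_lines(["line"] * 100, "output.pdf")` would produce three pages.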
Option 3: From HTML
    # Using wkhtmltopdf
    wkhtmltopdf input.html output.pdf
    # Or with Python
    python3 -c "
    import pdfkit
    pdfkit.from_file('input.html', 'output.pdf')
    "
Merging PDFs
    import fitz
    result = fitz.open()  # new, empty PDF
    for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
        doc = fitz.open(pdf_path)
        result.insert_pdf(doc)  # append every page of doc
        doc.close()
    result.save("merged.pdf")
Splitting PDFs
    import fitz
    doc = fitz.open("input.pdf")
    for i in range(len(doc)):
        single = fitz.open()  # new, empty PDF for this page
        single.insert_pdf(doc, from_page=i, to_page=i)
        single.save(f"page_{i+1}.pdf")
        single.close()
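The same `insert_pdf` call can split a document into multi-page chunks instead of single pages. The following is a sketch under stated assumptions: `chunk_ranges` and `split_into_chunks` are hypothetical helpers, and the `part_N.pdf` naming is arbitrary. Page indices are zero-based and inclusive, matching `from_page` and `to_page`.

```python
def chunk_ranges(page_count, chunk_size):
    """Return inclusive, zero-based (first, last) page ranges of chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_into_chunks(input_path, chunk_size, prefix="part"):
    """Write prefix_1.pdf, prefix_2.pdf, ... and return the paths written."""
    import fitz  # PyMuPDF; imported here so chunk_ranges stays dependency-free

    written = []
    doc = fitz.open(input_path)
    try:
        ranges = chunk_ranges(len(doc), chunk_size)
        for number, (first, last) in enumerate(ranges, start=1):
            part = fitz.open()  # new, empty PDF for this chunk
            part.insert_pdf(doc, from_page=first, to_page=last)
            out_path = f"{prefix}_{number}.pdf"
            part.save(out_path)
            part.close()
            written.append(out_path)
    finally:
        doc.close()
    return written
```

For a 5-page input, `split_into_chunks("input.pdf", 2)` would write three files covering pages 1-2, 3-4, and 5.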
Key Libraries
| Task | Library | Install |
|---|---|---|
| Read/Write/Merge | PyMuPDF | pip install pymupdf |
| Create from scratch | ReportLab | pip install reportlab |
| HTML to PDF | pdfkit | pip install pdfkit (also requires the wkhtmltopdf binary) |
| Text extraction | pdftotext | brew install poppler / apt install poppler-utils |
Best Practices
- Always check if tools are installed before using them
- Handle encoding issues - PDFs may contain various character encodings
- Large PDFs: Process page by page to avoid memory issues
- OCR for scanned PDFs: Use pytesseract if text extraction returns empty
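The last two practices can be combined into a page-by-page scan that flags pages likely to need OCR. This is a minimal sketch: `iter_page_text`, `needs_ocr`, and `pages_needing_ocr` are hypothetical helpers, and the `min_chars` threshold is an assumption to tune. It relies only on each page exposing a `get_text()` method, as PyMuPDF pages do, and leaves the OCR step itself to a tool such as pytesseract.

```python
def iter_page_text(doc):
    """Yield (page_number, text) one page at a time.

    A generator avoids holding the text of a large PDF in memory at once.
    """
    for index, page in enumerate(doc):
        yield index + 1, page.get_text()


def needs_ocr(text, min_chars=1):
    """Return True when text has fewer than min_chars non-whitespace characters."""
    return len("".join(text.split())) < min_chars


def pages_needing_ocr(doc, min_chars=1):
    """Return the 1-based numbers of pages whose extracted text looks empty."""
    return [
        number
        for number, text in iter_page_text(doc)
        if needs_ocr(text, min_chars)
    ]
```

With PyMuPDF, `pages_needing_ocr(fitz.open("input.pdf"))` would list the pages to hand to OCR. Raising `min_chars` also catches pages that contain only a stray page number or watermark.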