skillby benbrastmckie
document-converter
Convert between Markdown, DOCX, and PDF formats bidirectionally. Handles text extraction from PDF/DOCX, markdown to document conversion. Use when converting document formats or extracting structured content from Word or PDF files.
Installs: 0
Used in: 1 repos
Updated: 1d ago
$
npx ai-builder add skill benbrastmckie/document-converterInstalls to .claude/skills/document-converter/
# Document Converter Skill Convert documents bidirectionally between Markdown, DOCX, and PDF formats. This skill automatically detects optimal conversion tools, handles batch processing, and ensures quality output with appropriate fallback mechanisms. ## Core Capabilities ### Conversion Modes The skill supports two conversion modes: **Default Mode (API)**: When GEMINI_API_KEY is set, PDF-to-Markdown conversions use Google Gemini API for significantly improved quality (+20-30% fidelity). Other conversions use local tools. **Offline Mode**: Use --no-api flag or set CONVERT_DOCS_OFFLINE=true to disable all API calls. All conversions use local tools only. ### Conversion Directions **TO Markdown** (text extraction from documents): - DOCX → Markdown (MarkItDown or Pandoc) - PDF → Markdown (Gemini API, PyMuPDF4LLM, or MarkItDown) **FROM Markdown** (document generation): - Markdown → DOCX (Pandoc) - Markdown → PDF (Pandoc with Typst or XeLaTeX) **Direct Conversion**: - PDF → DOCX (pdf2docx - direct conversion preserves layout) ### Features - Automatic tool detection and selection - Cascading fallback mechanisms - Batch processing support - Image extraction and embedding - Filename sanitization (spaces to underscores) - Quality validation and reporting - Concurrent conversion support ## Tool Priority Matrix The skill uses intelligent tool selection based on format and quality metrics: ### PDF → Markdown (Mode-Dependent) **Gemini Mode** (when GEMINI_API_KEY is set): 1. **Gemini API** (primary) - 95%+ fidelity - Vision-based understanding of layout - Semantic structure preservation - Code block language detection - 60 req/min, 1000 req/day free tier 2. **PyMuPDF4LLM** (fallback) - 70-75% fidelity 3. **MarkItDown** (fallback) - 65-70% fidelity **Offline Mode** (--no-api or no API key): 1. **PyMuPDF4LLM** (primary) - 70-75% fidelity - Zero configuration required - Perfect Unicode preservation - Good for simple PDFs 2. **MarkItDown** (fallback) - 65-70% fidelity - Consistent quality across document types - Easy to configure ### PDF → DOCX 1. **pdf2docx** (only option) - 80-85% fidelity - Direct conversion (no intermediate format) - Preserves images and layout better than Gemini->Pandoc - Fast processing ### DOCX → Markdown 1. **MarkItDown** (primary) - 75-80% fidelity - Perfect table preservation (pipe-style markdown) - Excellent Unicode/emoji support - Fast processing 2. **Pandoc** (fallback) - 68% fidelity - Reliable baseline conversion - Tables converted to grid format ### Markdown → DOCX 1. **Pandoc** (only option) - 95%+ quality preservation ### Markdown → PDF 1. **Pandoc + Typst** (primary) - Fast, modern PDF engine 2. **Pandoc + XeLaTeX** (fallback) - Traditional LaTeX engine ## Usage Patterns ### Basic Conversion ```bash # Source conversion core library source /home/benjamin/.config/.claude/lib/convert/convert-core.sh # Detect available tools detect_tools # Convert DOCX/PDF to Markdown main_conversion /path/to/documents /path/to/output # Convert Markdown to DOCX main_conversion /path/to/markdown /path/to/output ``` ### Batch Processing The conversion core automatically processes all files in the input directory: - Discovers all convertible files (.docx, .pdf, or .md) - Detects conversion direction automatically - Processes files concurrently (default: 4 parallel conversions) - Generates conversion.log with statistics ### Progress Streaming The conversion script emits PROGRESS markers: ``` [PROGRESS] Converting: file1.docx → file1.md [PROGRESS] Converting: file2.pdf → file2.md (2/10) [SUCCESS] Converted file1.docx → file1.md [FAILED] file3.pdf: Conversion timeout after 300s ``` ## Conversion Workflow ### Phase 1: Tool Detection - Check for MarkItDown availability (`command -v markitdown`) - Check for Pandoc availability (`command -v pandoc`) - Check for PyMuPDF4LLM availability (`python3 -c "import pymupdf4llm"`) - Check for PDF engines (Typst, XeLaTeX) - Set availability flags for tool selection ### Phase 2: File Discovery - Scan input directory for convertible files - Detect conversion direction (TO_MARKDOWN or FROM_MARKDOWN) - Validate mixed-mode errors (cannot mix directions) - Create output directory structure ### Phase 3: Conversion Execution - Process files using optimal tool based on priority matrix - Apply timeout limits (60s DOCX, 300s PDF, 120s MD→PDF) - Handle collisions (overwrite or skip existing files) - Extract/embed images appropriately - Retry with fallback tools on failure ### Phase 4: Validation - Verify output file exists - Check for broken image links - Validate document structure (headings present) - Report any quality issues ### Phase 5: Reporting - Generate conversion.log with statistics - Report success/failure counts by format - List timeout occurrences - Summarize validation issues ## Quality Considerations ### Fidelity Expectations - DOCX→Markdown: 75-80% with MarkItDown (best tables) - PDF→Markdown: Varies by PDF complexity (scan quality critical) - Markdown→DOCX: 95%+ with Pandoc (excellent preservation) - Markdown→PDF: High quality with Typst/XeLaTeX engines ### Known Limitations - **Scanned PDFs**: OCR quality depends on scan resolution - **Complex layouts**: Multi-column or nested tables may degrade - **Embedded fonts**: PDF fonts may affect text extraction accuracy - **Images**: Large images may cause timeout issues ### Best Practices - Use MarkItDown for DOCX/PDF when available (better quality) - Allow longer timeouts for large PDFs (300s default) - Review conversion.log for failure patterns - Test output files for critical conversions ## Error Handling ### Common Errors **Tool not available**: ``` Error: No conversion tool available for DOCX→Markdown Required: markitdown or pandoc ``` → Install required tools **Conversion timeout**: ``` [FAILED] large_document.pdf: Conversion timeout after 300s ``` → Increase TIMEOUT_PDF_TO_MD or use simpler PDF **Validation failure**: ``` [WARNING] output.md: No headings found (possible conversion issue) ``` → Check source document structure ### Recovery Strategies - Failed conversions automatically retry with fallback tool - Timeouts skip to next file (batch processing continues) - Validation warnings don't block workflow (reported only) ## Configuration Options Environment variables to tune conversion behavior: ```bash # Timeout multipliers (seconds) TIMEOUT_MULTIPLIER=1.5 # Increase all timeouts by 50% # Disk usage limits MAX_DISK_USAGE_GB=10 # Abort if output exceeds 10GB MIN_FREE_SPACE_MB=500 # Require 500MB free space # Concurrency MAX_CONCURRENT_CONVERSIONS=4 # Parallel conversion limit ``` ## Integration Examples ### From Claude Code Agents When working within agent contexts, the skill automatically triggers when Claude detects conversion needs: ```markdown User: "Extract text from these PDF reports" → Skill auto-invokes: document-converter → Converts PDFs to Markdown → Returns structured text ``` ### From Slash Commands The `/convert-docs` command delegates to this skill when available: ```bash /convert-docs ./documents ./output → Checks skill availability → Delegates to document-converter skill → Falls back to script mode if skill unavailable ``` ### From Other Skills Skills can compose with document-converter: ```yaml # research-specialist skill dependencies: - document-converter # Auto-loads for PDF analysis ``` ## Script Locations The skill relies on conversion scripts in the project: - **Core orchestration**: `.claude/lib/convert/convert-core.sh` - **DOCX functions**: `.claude/lib/convert/convert-docx.sh` - **PDF functions**: `.claude/lib/convert/convert-pdf.sh` - **Gemini wrapper**: `.claude/lib/convert/convert-gemini.sh` - **Gemini Python**: `.claude/lib/convert/convert_gemini.py` - **Markdown utilities**: `.claude/lib/convert/convert-markdown.sh` Scripts are symlinked in the skill's `scripts/` directory for easy access. ## Testing Test the skill with sample conversions: ```bash # Test DOCX→Markdown /convert-docs ./test/sample.docx ./output # Test PDF→Markdown /convert-docs ./test/report.pdf ./output # Test Markdown→DOCX /convert-docs ./test/document.md ./output # Batch test /convert-docs ./test/documents ./output ``` Verify: - Conversion.log generated with statistics - Output files created with correct extensions - Image directories created when needed - Quality meets expectations (check tables, formatting) ## Troubleshooting ### Skill Not Triggering If the skill doesn't auto-invoke when expected: - Check description includes trigger keywords (convert, document, PDF, DOCX, Markdown) - Test with explicit skill invocation: "Use document-converter skill" - Verify skill is in `.claude/skills/` directory (project-level) ### Tool Installation **MarkItDown** (recommended): ```bash pip install markitdown ``` **PyMuPDF4LLM** (optional): ```bash pip install pymupdf4llm ``` **pdf2docx** (optional, for PDF to DOCX): ```bash pip install pdf2docx ``` **google-genai** (optional, for Gemini API): ```bash pip install google-genai # Set API key (free tier available at https://aistudio.google.com/) export GEMINI_API_KEY="your-api-key" ``` **Pandoc** (required): ```bash # Ubuntu/Debian apt install pandoc # macOS brew install pandoc ``` **Typst** (optional, for MD→PDF): ```bash # Ubuntu/Debian wget https://github.com/typst/typst/releases/latest/download/typst-x86_64-unknown-linux-musl.tar.xz tar -xf typst-*.tar.xz && sudo mv typst-*/typst /usr/local/bin/ # macOS brew install typst ``` ### Performance Issues If conversions are slow: - Reduce MAX_CONCURRENT_CONVERSIONS (lower parallelism) - Increase timeout values for large files - Check disk I/O (slow storage may bottleneck) - Use PyMuPDF4LLM for simple PDFs (faster than MarkItDown) ## See Also - [reference.md](./reference.md) - Detailed tool documentation and metrics - [examples.md](./examples.md) - Usage examples and common patterns - [Convert-Docs Command Guide](../../docs/guides/commands/convert-docs-command-guide.md) - [MarkItDown Documentation](https://github.com/microsoft/markitdown) - [Pandoc Manual](https://pandoc.org/MANUAL.html)
Quick Install
$
npx ai-builder add skill benbrastmckie/document-converterDetails
- Type
- skill
- Author
- benbrastmckie
- Slug
- benbrastmckie/document-converter
- Created
- 5d ago