doc-scraper

Generic web scraper that extracts and organizes Snowflake documentation, with SQLite caching and a configurable spider depth. It scrapes whichever section of docs.snowflake.com is selected with `--base-path`.

$ npx ai-builder add skill sfc-gh-dflippo/doc-scraper

Installs to `.claude/skills/doc-scraper/`.

# Snowflake Documentation Scraper

Scrapes docs.snowflake.com sections to Markdown with SQLite caching (7-day expiration).

## Usage

**First-time setup** (auto-installs uv and doc-scraper):

```bash
python3 .claude/skills/doc-scraper/scripts/doc_scraper.py
```

**Subsequent runs:**

```bash
doc-scraper --output-dir=./snowflake-docs
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/"
doc-scraper --output-dir=./snowflake-docs --spider-depth=2
```
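
When the scraper is driven from another script, command lines like the ones above can be assembled programmatically. The following is a minimal sketch, assuming `doc-scraper` is already on `PATH` after first-time setup. `build_command` and `run_scraper` are hypothetical helpers, not part of the tool, and the `--flag=value` form for `--limit` is assumed to match the form shown for the other options.

```python
import subprocess


def build_command(output_dir, base_path=None, spider_depth=None,
                  limit=None, dry_run=False):
    """Assemble a doc-scraper argv list (hypothetical helper)."""
    cmd = ["doc-scraper", f"--output-dir={output_dir}"]
    if base_path is not None:
        cmd.append(f"--base-path={base_path}")
    if spider_depth is not None:
        cmd.append(f"--spider-depth={spider_depth}")
    if limit is not None:
        cmd.append(f"--limit={limit}")  # assumed to accept the =value form
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_scraper(**kwargs):
    """Run doc-scraper; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(**kwargs), check=True)
```

Calling `build_command("./snowflake-docs", dry_run=True)` only builds the argument list, so it can be inspected before anything is executed.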

## Command Options

| Option           | Default           | Description                                                                          |
| ---------------- | ----------------- | ------------------------------------------------------------------------------------ |
| `--output-dir`   | **Required**      | Output directory for scraped docs                                                    |
| `--base-path`    | `/en/migrations/` | URL section to scrape                                                                |
| `--spider-depth` | `1`               | Link depth: `0` = seeds only, `1` = plus linked pages, `2` = plus second-level links |
| `--limit`        | None              | Maximum number of URLs to scrape (for testing)                                       |
| `--dry-run`      | -                 | Preview without writing files                                                        |
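
The effect of `--spider-depth` can be pictured as a breadth-first walk that stops following links after a fixed number of hops. This is an illustrative sketch only, not the scraper's actual implementation; `urls_within_depth` and the toy link graph are made up for the example.

```python
from collections import deque


def urls_within_depth(seeds, links, depth):
    """Return URLs reachable from the seeds in at most `depth` link hops."""
    order = list(dict.fromkeys(seeds))  # de-duplicate, keep seed order
    seen = set(order)
    queue = deque((url, 0) for url in order)
    while queue:
        url, hops = queue.popleft()
        if hops == depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, hops + 1))
    return order


# Toy link graph with made-up paths (not real site structure).
links = {
    "/en/migrations/a": ["/en/migrations/b"],
    "/en/migrations/b": ["/en/migrations/c"],
}
seeds = ["/en/migrations/a"]

print(urls_within_depth(seeds, links, 0))  # ['/en/migrations/a']
print(urls_within_depth(seeds, links, 1))  # ['/en/migrations/a', '/en/migrations/b']
print(urls_within_depth(seeds, links, 2))  # all three pages
```

Each extra level of depth can multiply the number of pages fetched, which is why `--limit` is handy when experimenting.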

## Output

```text
output-dir/
├── SKILL.md              # Auto-generated index
├── scraper_config.yaml   # Editable config (auto-created)
├── .cache/               # SQLite cache (auto-managed)
└── en/migrations/*.md    # Scraped pages with frontmatter
```
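
Because each scraped page carries frontmatter, downstream tools usually need to separate it from the Markdown body. Below is a minimal sketch, assuming the frontmatter is delimited by lines containing only `---`. The README does not list the frontmatter fields, so the `title` key in the sample is purely hypothetical, as is the `split_frontmatter` helper.

```python
def split_frontmatter(text):
    """Split a Markdown page into (frontmatter, body).

    Assumes frontmatter is delimited by lines containing only '---'.
    Returns ('', text) when no complete frontmatter block is found.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # opening delimiter without a closing one


# Hypothetical page content; the real field names may differ.
sample = "---\ntitle: Example\n---\n# Heading\n"
frontmatter, body = split_frontmatter(sample)
print(repr(frontmatter))  # 'title: Example\n'
print(repr(body))         # '# Heading\n'
```

The frontmatter is returned as raw text so that any YAML parser can be applied afterwards without this sketch depending on one.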

## Configuration

Auto-created at `{output-dir}/scraper_config.yaml`:

```yaml
rate_limiting:
  max_concurrent_threads: 4
spider:
  max_pages: 1000
  allowed_paths: ["/en/"]
scraped_pages:
  expiration_days: 7
```
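
To make the `expiration_days` setting concrete, here is a rough sketch of how a SQLite-backed page cache with a 7-day expiry could behave. The table name, columns, and function names are assumptions for illustration; the scraper's real cache schema is not documented here.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

EXPIRATION_DAYS = 7  # mirrors scraped_pages.expiration_days


def open_cache(path=":memory:"):
    """Open (or create) a cache database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body TEXT, fetched_at TEXT)"
    )
    return conn


def put(conn, url, body, now):
    """Store a page body together with the time it was fetched."""
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
        (url, body, now.isoformat()),
    )


def get_fresh(conn, url, now, expiration_days=EXPIRATION_DAYS):
    """Return the cached body, or None if it is missing or expired."""
    row = conn.execute(
        "SELECT body, fetched_at FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    body, fetched_at = row
    age = now - datetime.fromisoformat(fetched_at)
    return body if age < timedelta(days=expiration_days) else None


conn = open_cache()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
put(conn, "/en/migrations/a", "# Page A", t0)
print(get_fresh(conn, "/en/migrations/a", t0 + timedelta(days=6)))  # # Page A
print(get_fresh(conn, "/en/migrations/a", t0 + timedelta(days=8)))  # None
```

Under this model, lowering `expiration_days` simply makes cached pages count as stale sooner, so they are fetched again on the next run.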

## Troubleshooting

| Issue            | Solution                                                                  |
| ---------------- | ------------------------------------------------------------------------- |
| Too many pages   | Lower `--spider-depth` or edit `{output-dir}/scraper_config.yaml`         |
| Missing pages    | Increase `--spider-depth`                                                 |
| Cache corruption | Delete `{output-dir}/.cache/` (rarely needed)                             |
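
The cache reset from the table can also be scripted. This is a small sketch, assuming the cache lives at `{output-dir}/.cache/` as shown in the output layout; `clear_cache` is a hypothetical helper and not a command the tool provides.

```python
import shutil
from pathlib import Path


def clear_cache(output_dir):
    """Delete {output_dir}/.cache/ if it exists; return True if removed."""
    cache_dir = Path(output_dir) / ".cache"
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Scraped Markdown files and `scraper_config.yaml` sit outside `.cache/`, so this leaves them untouched.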

