skillby chrislema

add-captions

Burn in bold all-caps captions with black background boxes using whisper transcription and ffmpeg

Installs: 0
Used in: 1 repos
Updated: 0mo ago
$npx ai-builder add skill chrislema/06-add-captions

Installs to .claude/skills/06-add-captions/

## When to use

Use this skill when the user wants to add burned-in captions/subtitles to a video. Triggered by requests like "add captions," "burn in subtitles," "caption this video," or "add-captions."

## How to use

### Prerequisites
- `ffmpeg` must be installed with drawtext/libfreetype support (standard Homebrew `ffmpeg` includes this)
- `whisper-cli` must be installed with a model at `/opt/homebrew/share/whisper-cpp/models/ggml-medium.bin`
- Big Shoulders Display Bold 700 font installed at `~/Library/Fonts/BigShouldersDisplay-Bold.ttf`. The font is resolved via fontconfig by name (`font='Big Shoulders Display'`), not by file path. Always use `font=` instead of `fontfile=`.

### Parameters
The user may optionally specify:
- **input**: Path to the video file (if not provided, ask)
- **output**: Path for the output file. When called from `/process-video`, the pipeline passes `<name>_final.mp4` as the output path. When run standalone, default: strip `_mastered` from the input basename, then append `_final.mp4` — e.g., `mainvideo_mastered.mp4` → `mainvideo_final.mp4`.
- **resolution**: `-HD` (1920x1080), `-4K` (keep source), or `-portrait` (1080x1920). Default: `-HD`
- **max_words**: Maximum words per caption line (default: 6 for landscape, 3 for portrait)
- **target_width**: Target percentage of video width for text (default: 80%)
- **box_opacity**: Opacity of the black background box, 0.0-1.0 (default: 0.70, meaning 30% transparent)

### Process

#### Step 1: Get video dimensions

```python
probe = subprocess.run(["ffprobe", "-v", "error", "-select_streams", "v:0",
    "-show_entries", "stream=width,height", "-of", "csv=p=0", video],
    capture_output=True, text=True)
width, height = map(int, probe.stdout.strip().split(","))
```

#### Step 2: Transcribe with whisper

Use whisper's `-ml` (max segment length) flag to get short, accurately-timed segments directly from whisper. This is critical — do NOT split whisper segments manually and distribute time evenly, as speech is not evenly paced and captions will drift ahead of the audio.

```bash
ffmpeg -y -i <input> -ar 16000 -ac 1 -f wav /tmp/caption_whisper_$$.wav
# -ml 35 for landscape (~6 words), -ml 20 for portrait (~3 words)
whisper-cli -m /opt/homebrew/share/whisper-cpp/models/ggml-medium.bin -ml <max_chars> -f /tmp/caption_whisper_$$.wav
rm -f /tmp/caption_whisper_$$.wav
```

Where `<max_chars>` is **35** for landscape (roughly 6 words) or **20** for portrait (roughly 3 words).

Parse the `[HH:MM:SS.mmm --> HH:MM:SS.mmm] text` lines from output.

#### Step 3: Build caption events

- Each whisper segment becomes **one caption event** — do NOT split a segment into sub-chunks with fabricated timestamps. Splitting a 15s segment into three 5s chunks assumes even pacing, but speech is not evenly paced. This causes cumulative drift, and by the end of the video captions will lag behind or overshoot the audio.
- If a whisper segment has more words than the per-line max (6 landscape, 3 portrait), display all the words using **stacked lines** within that single caption event's time window. Use multiple drawtext filters at offset y-positions (see Step 5), all sharing the same `enable` time range. Never split the time.
- Convert all text to UPPERCASE
- Each segment's start/end time comes directly from whisper — trust these timestamps

#### Step 4: Determine target dimensions and calculate font size

Set target dimensions based on the resolution flag:

```python
# Determine target dimensions
if resolution == "portrait":
    target_w, target_h = 1080, 1920
elif resolution == "4K":
    target_w, target_h = width, height
else:  # HD (default)
    target_w, target_h = 1920, 1080

# Calculate font size from TARGET dimensions
if resolution == "portrait":
    # Portrait: larger relative font for narrow frame, higher position to avoid thumb zone
    font_size = int(target_w * 0.065)   # ~70px at 1080px wide
    box_padding = int(font_size * 0.10)
    y_position = int(target_h * 0.75)   # higher up — avoids mobile UI/thumb zone at bottom
else:
    font_size = int(target_w * 0.0495)  # ~95px at 1920px, ~190px at 3840px
    box_padding = int(font_size * 0.07)
    y_position = int(target_h * 0.80)
```

#### Step 5: Build drawtext filter chain

Each caption event becomes a separate `drawtext` filter with `enable='between(t,start,end)'`:

```python
def escape_drawtext(text):
    # Strip characters that cause rendering issues (tofu/box glyphs)
    text = text.replace("'", "")   # apostrophes — font renders tofu for some quote chars
    text = text.replace('"', '')
    # Escape ffmpeg drawtext special characters
    text = text.replace("\\", "\\\\")
    text = text.replace(":", "\\:")
    text = text.replace("%", "%%")
    return text
```

**Multi-line captions: use stacked drawtext filters, NEVER newlines in text.**

The Big Shoulders Display font renders the newline character (`\n`) as a visible tofu glyph (X-in-a-box) before breaking the line. Instead, split captions with >4 words into separate drawtext filters stacked vertically:

```python
line_height = font_size + (box_padding * 2) + 4  # font + box padding + spacing

for start, end, text in caption_events:
    words = text.split()
    # Split into max 4-word lines per directive
    if len(words) > 4:
        lines = [" ".join(words[i:i+4]) for i in range(0, len(words), 4)]
    else:
        lines = [text]

    # Stack lines centered around y_position
    total_h = len(lines) * line_height
    start_y = y_position - total_h // 2

    for li, line in enumerate(lines):
        escaped = escape_drawtext(line)
        line_y = start_y + li * line_height
        dt = (
            f"drawtext=font='Big Shoulders Display'"
            f":text='{escaped}'"
            f":fontcolor=white"
            f":fontsize={font_size}"
            f":box=1"
            f":boxcolor=black@0.70"
            f":boxborderw={box_padding}"
            f":x=(w-text_w)/2"
            f":y={line_y}"
            f":enable='between(t,{start:.3f},{end:.3f})'"
        )
        filters.append(dt)
```

Chain all drawtext filters with commas, write to a PID-unique temp file (e.g., `/tmp/caption_filter_{os.getpid()}.txt`) to avoid collisions with concurrent runs.

#### Step 6: Build filter chain and render with ffmpeg

If scaling is needed, prepend the appropriate `scale=` filter to the filter chain before the drawtext filters:

```python
if resolution == "portrait":
    # Portrait video is already 1080x1920 from the zoom step, but verify
    if width != 1080 or height != 1920:
        filter_chain = f"scale=1080:1920,{drawtext_filters}"
    else:
        filter_chain = drawtext_filters
elif resolution != "4K" and (width != 1920 or height != 1080):
    filter_chain = f"scale=1920:1080,{drawtext_filters}"
else:
    filter_chain = drawtext_filters
```

Write the full filter chain to the temp file, then render:

```bash
ffmpeg -y -i <input> \
  -filter_complex_script /tmp/caption_filter_$$.txt \
  -c:v libx264 -preset fast -crf 18 \
  -c:a copy \
  <output>
```

**Important**: Use `font='Big Shoulders Display'` (fontconfig name resolution), NOT `fontfile=` (which may be silently ignored and fall back to Verdana). Standard Homebrew `ffmpeg` includes drawtext with libfreetype and libfontconfig — no separate `ffmpeg-full` tap is needed.

### Caption style

- **Font**: Big Shoulders Display, weight 700 (Bold)
- **Case**: ALL CAPS
- **Color**: White text
- **Background**: Black box at 70% opacity (30% transparent), sized to fit the text
- **Position**: Centered horizontally. Landscape: 80% of video height. Portrait: 75% of video height (higher to avoid mobile thumb zone).
- **Max words per line**: 6 (landscape) / 3 (portrait)
- **Target width**: ~80% of video width
- **Font size multiplier**: Landscape: `target_width * 0.0495`. Portrait: `target_width * 0.065` (larger relative to narrow frame).

### Adjustments

**Bigger/smaller text:**
- Adjust the `width * 0.0495` multiplier. Higher = bigger text.

**More/fewer words per line:**
- Change max_words. Fewer words = bigger text per word, more frequent changes.

**Box transparency:**
- `boxcolor=black@0.70` — the number after @ is opacity (1.0 = fully opaque, 0.0 = invisible)

**Position:**
- Adjust `y_position` — `height * 0.80` puts it in the lower fifth. Use `height * 0.85` for lower, `height * 0.70` for higher.

### Important notes
- This skill requires `ffmpeg` with drawtext/libfreetype/libfontconfig support (standard Homebrew `ffmpeg` includes this)
- The font is resolved via fontconfig by name (`font='Big Shoulders Display'`), NOT by file path — `fontfile=` is silently ignored when fontconfig is enabled and falls back to Verdana
- The font file `BigShouldersDisplay-Bold.ttf` must be installed at `~/Library/Fonts/` for fontconfig to find it
- The black box automatically sizes to fit the text — it is not a fixed-width bar
- **Never use newline characters (`\n`) in drawtext text.** Big Shoulders Display renders `\n` as a visible tofu glyph (X-in-a-box). For multi-line captions, use separate stacked drawtext filters — one per line, each with the same `enable` time range but offset `y` positions.
- This skill should run after all other video processing (silence removal, zoom, color, audio mastering)

Quick Install

$npx ai-builder add skill chrislema/06-add-captions

Details

Type
skill
Author
chrislema
Slug
chrislema/06-add-captions
Created
0mo ago