**mechinterp-overview** (skill by cesaregarza)

Quick "first look" overview of SAE features: top tokens, activation stats, weapons, families, and sample contexts.

Install with `npx ai-builder add skill cesaregarza/mechinterp-overview`; it installs to `.claude/skills/mechinterp-overview/`.

# MechInterp Overview

Get a comprehensive first-look overview of an SAE feature before deep investigation. This skill provides a fast summary of key characteristics to help you decide what hypotheses to test.

## ⚠️ CRITICAL: Overview is NOT Findings

**The overview shows CORRELATIONS, not CAUSATION.** It is a starting point for generating hypotheses, NOT a source of conclusions.

| Overview Shows | What It Actually Means |
|----------------|------------------------|
| Top tokens (PageRank) | Tokens that CO-OCCUR with high activation (correlation) |
| Family breakdown | Which ability families appear in high-activation examples |
| Top weapons | Weapons present in high-activation examples |

**You CANNOT conclude from the overview alone:**
- That a token "drives" or "causes" activation
- That the feature "detects" a specific ability
- That correlations are meaningful vs spurious

**To make conclusions, you MUST run experiments** (see mechinterp-investigator for deep dive basics).

## Purpose

The overview skill:
- Computes PageRank-weighted top tokens for a feature
- Shows activation statistics (mean, std, median, sparsity)
- Aggregates tokens by ability family
- Lists top weapons associated with the feature
- Provides sample high-activation contexts
- Checks for existing labels and ReLU floor issues

## When to Use

Use this skill when:
1. Starting to investigate a new feature
2. You want a quick summary before running experiments
3. Deciding which feature to label next
4. Checking if a feature has already been labeled

**DO NOT use overview results as final findings.** Always follow up with experiments.

## Output Information

| Section | Description |
|---------|-------------|
| Activation Stats | Mean, std, median, sparsity percentage, example count |
| Top Tokens | PageRank-weighted most important tokens (enhancers) |
| Bottom Tokens | Tokens suppressed in high-activation examples |
| Family Breakdown | Aggregated scores by ability family (SCU, SSU, etc.) |
| Top Weapons | Weapons with most examples for this feature |
| Sample Contexts | 3-5 high-activation example builds |
| Existing Label | Current label if one exists |
| ReLU Floor | Warning if feature is mostly zeros (>50%) |

### Sparsity Definition

**Sparsity = % of examples where feature activation is ZERO**

A high sparsity percentage means the feature fires RARELY (is selective):

| Sparsity | Meaning | Interpretation |
|----------|---------|----------------|
| 95%+ | Very sparse | Fires on only 5% of examples - very specific pattern |
| 80-95% | Moderately sparse | Good discriminative feature (fires on 5-20% of examples) |
| 50-80% | Dense | Fires often (20-50% of examples) - broad pattern |
| <50% | Very dense | Fires on majority of examples - may be baseline feature |

**Common confusion:** "89% sparsity" means "fires on 11% of examples" NOT "fires often."

Think of it as: **Sparsity = how empty/silent the feature usually is.**
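
A minimal sketch of the computation, assuming `acts` holds this feature's activation values over all examples (the variable and values here are illustrative):

```python
import numpy as np

# Illustrative activation vector; in practice this comes from the activation database.
acts = np.asarray([0.0, 0.0, 0.31, 0.0, 0.72, 0.0, 0.0, 0.0, 0.0, 0.05])

sparsity = 100.0 * float(np.mean(acts == 0))  # % of examples where the feature is silent
fire_rate = 100.0 - sparsity                  # % of examples where it actually fires
print(f"sparsity={sparsity:.1f}%, fires on {fire_rate:.1f}% of examples")
```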

**CRITICAL**: Always check the **Bottom Tokens** section! Tokens that rarely appear in high-activation examples reveal what the feature *avoids*, which is often more informative than what it detects.

## Usage

### Command Line

```bash
cd /root/dev/SplatNLP

# Basic overview (markdown output)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra

# JSON output for programmatic use
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --format json

# More top tokens
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --top-k 25

# Full model
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 5432 \
    --model full

# Verbose logging
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --verbose
```
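
If you want to consume the JSON output from a script rather than the Python API, something like the sketch below works, assuming the CLI writes the JSON document to stdout (the field names depend on the schema, so inspect the keys before relying on them):

```python
import json
import subprocess

result = subprocess.run(
    [
        "poetry", "run", "python", "-m",
        "splatnlp.mechinterp.cli.overview_cli",
        "--feature-id", "18712",
        "--model", "ultra",
        "--format", "json",
    ],
    cwd="/root/dev/SplatNLP",
    capture_output=True,
    text=True,
    check=True,
)

overview = json.loads(result.stdout)
print(sorted(overview.keys()))  # discover the schema before relying on specific fields
```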

### Extended Analyses

Additional analysis flags provide deeper insights:

```bash
# Token enrichment (enhancers/suppressors)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --enrichment

# Activation region breakdown (anti-flanderization)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --regions

# Binary ability enrichment (main-only abilities)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --binary

# Sub/special weapon breakdown (kit analysis)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --kit

# All extended analyses at once
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --all

# Customize high-activation threshold (default: 0.90 = top 10%)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --enrichment --high-percentile 0.95
```

### Extended Analysis Reference

| Flag | Purpose | Output |
|------|---------|--------|
| `--enrichment` | Token enrichment ratios | Suppressors (<0.8x) and enhancers (>1.2x) |
| `--regions` | Activation regions | Floor/Low/Core/High/Flanderization breakdown |
| `--binary` | Binary ability presence | Enrichment for main-only abilities (Comeback, Stealth Jump, etc.) |
| `--kit` | Sub/special breakdown | Which subs/specials appear in core region |
| `--all` | Enable all above | Combined output |
| `--kit-region` | Region for kit analysis | `core` (default), `high`, or `all` |
| `--high-percentile` | Threshold for "high" | Default: 0.90 (top 10%) |

### Programmatic

```python
from splatnlp.mechinterp.labeling import FeatureOverview, compute_overview
from splatnlp.mechinterp.skill_helpers import load_context

# Load context
ctx = load_context("ultra")

# Compute overview
overview = compute_overview(
    feature_id=18712,
    ctx=ctx,
    top_k_tokens=15,
    n_sample_contexts=5,
)

# Display markdown
print(overview.to_markdown())

# Access fields directly
print(f"Mean: {overview.activation_mean}")
print(f"Top token: {overview.top_tokens[0]}")
print(f"Main family: {max(overview.family_breakdown.items(), key=lambda x: x[1])}")
```
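
A minimal triage sketch built on the same API, using only fields shown in the `FeatureOverview` dataclass below; the feature ids are placeholders, and it assumes `top_k_tokens` and `n_sample_contexts` have defaults:

```python
for fid in [18712, 6235, 9971]:  # placeholder candidate features
    ov = compute_overview(feature_id=fid, ctx=ctx)
    if ov.existing_label is not None:
        continue  # already labeled - skip
    if ov.relu_floor_rate > 0.5:  # assumed to be a 0-1 fraction
        print(f"{fid}: mostly at the ReLU floor, may need special handling")
        continue
    print(f"{fid}: sparsity={ov.sparsity:.1f}%, top token={ov.top_tokens[0][0]}")
```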

## Sample Output

```markdown
## Feature 18712 Overview (ultra)

### Activation Stats
- Mean: 0.5056
- Std: 0.5163
- Median: 0.3835
- Sparsity: 97.1%
- Examples: 108,163

### Top Tokens (PageRank)
1. `special_charge_up` (0.274)
2. `swim_speed_up` (0.099)
3. `ink_saver_sub` (0.084)
4. `stealth_jump` (0.049)
5. `run_speed_up` (0.048)

### Family Breakdown
- special_charge_up: 31.2%
- swim_speed_up: 11.2%
- ink_saver_sub: 9.6%

### Top Weapons
- weapon_id_5021: 28
- weapon_id_220: 28

### Bottom Tokens (Suppressors)
Tokens rarely present in high-activation examples:
1. `respawn_punisher` (high_rate_ratio=0.00) - Never in high activation
2. `special_saver` (high_rate_ratio=0.16) - 6x less common than baseline
3. `quick_respawn` (high_rate_ratio=0.47) - 2x less common than baseline

### Sample Contexts (High Activation)
1. [weapon_id_1111] special_charge_up_6, special_charge_up_57 (act=0.731)
2. [weapon_id_1111] special_charge_up_6, special_charge_up_51... (act=0.724)
```

## FeatureOverview Dataclass

```python
@dataclass
class FeatureOverview:
    feature_id: int
    model_type: str

    # Activation statistics
    activation_mean: float
    activation_std: float
    activation_median: float
    sparsity: float  # Percentage (0-100)
    n_examples: int

    # PageRank-weighted top tokens
    top_tokens: list[tuple[str, float]]

    # Bottom tokens (suppressors) - tokens excluded from high activation
    bottom_tokens: list[tuple[str, float]]  # (token, high_rate_ratio)

    # Detailed token influence statistics
    token_influences: list[TokenInfluence]

    # Aggregated by family
    family_breakdown: dict[str, float]

    # Weapon breakdown
    top_weapons: list[tuple[str, int]]

    # Sample high-activation contexts
    sample_contexts: list[SampleContext]

    # Diagnostic flags
    relu_floor_rate: float
    existing_label: str | None
```

## Performance

- Typical runtime: 30-60 seconds (dominated by PageRank computation)
- Loads activation data lazily from an efficient database
- Caches context between calls in the same session

## Interpretation Tips

1. **High sparsity (>90%)**: Most inputs don't activate this feature. Look at what's special about the ones that do.

2. **ReLU floor warning**: If >50% of examples hit the ReLU floor, the feature may be hard to interpret or require special handling.

3. **Single dominant family**: If one family has >50% of the breakdown, the feature likely responds to that ability family.

4. **Multiple families**: If breakdown is spread across families, look for interactions or common contexts.

5. **Weapon concentration**: If a few weapons dominate, the feature may be weapon-specific rather than ability-specific.

## ⚠️ CRITICAL: Super-Stimuli Detection

**Don't only examine high activations - they may be "super-stimuli"!**

High activation examples can be exaggerated, "flanderized" versions of the true concept. The core region (25-75% of **effective max**) often reveals the actual feature meaning better than the flanderization zone (90%+ of effective max).

**Why "effective max"?** Activation distributions are heavy-tailed. Use `effective_max = 99.5th percentile of nonzero activations` to prevent single outliers from making your core region nearly empty.

### Warning Signs of Super-Stimuli

| Pattern | What It Means |
|---------|---------------|
| 90%+ activations only on 3-4 niche weapons | Flanderization zone = super-stimuli |
| Core region (25-75%) has diverse mainstream weapons | TRUE concept is in core region |
| One weapon spans ALL activation levels continuously | Feature is general, not weapon-specific |

### Activation Region Bins

Use these standard bins (as % of **effective max** = 99.5th percentile) to analyze feature behavior:

| Region | Range (% of effective max) | Typical Interpretation |
|--------|----------------------------|------------------------|
| Floor | ≤1% | Feature not activated |
| Low | 1-10% | Weak signal, early detection |
| Below Core | 10-25% | Emerging pattern |
| Core | 25-75% | **TRUE CONCEPT** (examine carefully!) |
| High | 75-90% | Strong expression |
| Flanderization Zone | 90%+ | Potential super-stimuli |
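
A minimal sketch of how these bins could be computed, assuming `acts` is the feature's activation vector with at least one nonzero entry (the built-in `--regions` analysis is the authoritative version and may differ in detail):

```python
import numpy as np

def bin_by_region(acts: np.ndarray) -> dict[str, np.ndarray]:
    """Assign each example index to an activation region based on effective max."""
    nonzero = acts[acts > 0]
    effective_max = np.percentile(nonzero, 99.5)  # robust to heavy-tailed outliers
    frac = acts / effective_max

    masks = {
        "floor": frac <= 0.01,
        "low": (frac > 0.01) & (frac <= 0.10),
        "below_core": (frac > 0.10) & (frac <= 0.25),
        "core": (frac > 0.25) & (frac <= 0.75),    # examine this region first
        "high": (frac > 0.75) & (frac <= 0.90),
        "flanderization": frac > 0.90,             # potential super-stimuli
    }
    return {name: np.flatnonzero(mask) for name, mask in masks.items()}
```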

### Example: Feature 9971

**Initial analysis (looking only at 90%+ activations):**
- Top weapons: Bloblobber, Glooga Deco, Range Blaster, Octobrush
- Conclusion: "SCU stacker on special-dependent weapons"

**After region analysis (examining core 25-75%):**
- Core region: Splattershot (115), Wellstring (65), Sploosh (57)
- Splattershot appears in EVERY region (29→125→83→115→61→19)
- True concept: "General offensive investment (death-averse)"
- Flanderization zone (90%+): "Super-stimuli" version on niche special-dependent weapons

**Key insight**: Label the core-region concept, not the flanderized extreme!

### Coverage Threshold Rule

**When the overview shows a dominant token or weapon, CHECK CORE-REGION COVERAGE before treating it as the concept.**

A token can have high enrichment in the tail but be a **tail marker**, not the true concept.

| Metric | Interpretation |
|--------|----------------|
| >50% core coverage | **Primary concept** - safe to use in label |
| 30-50% core coverage | **Significant but not universal** - note in label, don't headline |
| <30% core coverage | **Tail marker / super-stimulus** - NOT the concept |

**Example (Feature 13934):**
```
Overview showed: respawn_punisher with 8.57x tail enrichment
BUT: RP only present in 12% of core-region examples

⚠️ Flag in overview: "respawn_punisher: high enrichment (8.57x) but <30% core coverage - may be tail marker, not core concept"
```

**When to flag:** If any token in top-10 has enrichment >3x but core coverage <30%, add a warning note.
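
A minimal sketch of the coverage check, assuming each example is represented as a set of ability tokens and the region index arrays come from a binning step like the one sketched above (names are illustrative):

```python
import numpy as np

def core_coverage(token: str, examples: list[set[str]], core_idx: np.ndarray) -> float:
    """Fraction of core-region examples that contain the token."""
    return float(np.mean([token in examples[i] for i in core_idx]))

def tail_enrichment(token: str, examples: list[set[str]], high_idx: np.ndarray) -> float:
    """Token rate in the high-activation tail relative to the baseline rate."""
    base_rate = np.mean([token in ex for ex in examples])
    high_rate = np.mean([token in examples[i] for i in high_idx])
    return float(high_rate / base_rate) if base_rate > 0 else float("inf")

# Flagging rule from above: enrichment > 3x but core coverage < 30%
# -> likely a tail marker, not the core concept.
```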

6. **Weapon Outlier Detection**: If a single weapon has >2x the examples of the second weapon, this is a **weapon-dominated feature**:
   - Use **splatoon3-meta** skill to look up the weapon's kit (sub + special)
   - Check if other high-activation weapons share the same sub OR special
   - If they share kit components, the feature may encode kit behavior, not weapon behavior
   - Run **kit_sweep** experiment to analyze activation by sub/special

7. **Check suppressors**: Always examine bottom tokens! If death-mitigation abilities (QR, SS, CB) are suppressed, the feature encodes "death-averse" builds. See **mechinterp-ability-semantics** for semantic groupings.

8. **Enhancers + Suppressors together**: The combination tells the full story. A feature with SCU enhanced AND death-perks suppressed isn't just "SCU detector" - it's "death-averse special builds".

9. **"Weak activation" ≠ "unimportant feature"**: If all scaling effects are weak (max_delta < 0.03), don't immediately label as "weak feature". Check the feature's **decoder weights** to output tokens. Net influence = activation × decoder weight. A feature with low activation effects but high decoder weights may still strongly influence predictions.
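
A minimal sketch of the net-influence check from tip 9; how the decoder weights are exposed is an assumption here (`w_dec_row` stands for the feature's row of the SAE decoder matrix, one weight per output token):

```python
import numpy as np

def net_influence(activation: float, w_dec_row: np.ndarray,
                  vocab: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Rank output tokens by |activation x decoder weight| for one feature."""
    influence = activation * w_dec_row            # shape: (vocab_size,)
    top = np.argsort(-np.abs(influence))[:top_k]
    return [(vocab[i], float(influence[i])) for i in top]
```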

## ⚠️ WARNING: Correlation ≠ Causation

**PageRank scores show correlation, NOT causation.** Tokens appearing in the overview may be:
- **True drivers**: Actually cause activation changes
- **Spurious correlations**: Just happen to co-occur with the true driver

### How to Distinguish

1. **Run 1D sweep** for top token (likely primary driver)
2. **If confirmed**, run **2D heatmaps** for other tokens:
   - `PRIMARY × SECONDARY` reveals if secondary has conditional effect
   - If secondary shows effect only at high primary → true interaction
   - If secondary shows NO effect at any primary level → spurious
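
A minimal sketch of how the 2D result could be read, assuming `heatmap[i, j]` holds the feature activation at primary level `i` and secondary level `j` (producing the grid itself is the runner skill's job):

```python
import numpy as np

def classify_secondary(heatmap: np.ndarray, tol: float = 0.02) -> str:
    """Classify the secondary token as spurious, an interaction, or a suppressor."""
    # Spread caused by the secondary token at each primary level.
    per_primary_delta = heatmap.max(axis=1) - heatmap.min(axis=1)
    if np.all(per_primary_delta < tol):
        return "spurious"        # no conditional effect at any primary level
    # Signed change from lowest to highest secondary level.
    signed = heatmap[:, -1] - heatmap[:, 0]
    return "suppressor" if signed.min() < -tol else "interaction"
```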

### Example: Feature 18712

```
Overview showed: SCU (24%), Opening Gambit (17%), SSU (12%)

1D sweeps:
- SCU: strong effect (0.03→0.58) ✅ PRIMARY
- OG: delta ≈ 0 → appears to have no effect
- SSU: delta ≈ 0 → appears to have no effect

BUT WAIT! 1D sweeps for secondary abilities are MISLEADING.

2D heatmaps (SCU × OG, SCU × SSU):
- Both show NO conditional effect at any SCU level
- Conclusion: OG and SSU were SPURIOUS correlations

2D heatmaps (SCU × QR, SCU × SS):
- QR_12+ SUPPRESSES activation by 70-99% at high SCU!
- SS_12+ SUPPRESSES activation by 40-60%!
- Conclusion: Feature is DEATH-AVERSE (not visible in 1D)
```

**Always verify top overview tokens with conditional 2D testing!**

See **mechinterp-investigator** for the full Iterative Conditional Testing Protocol.

## See Also

- **mechinterp-labeler**: Manage labeling workflow and save labels
- **mechinterp-runner**: Run experiments to test hypotheses
- **mechinterp-next-step-planner**: Generate experiment specs based on overview
- **mechinterp-glossary-and-constraints**: Reference for token families and AP rungs
- **mechinterp-ability-semantics**: Ability semantic groupings (check AFTER forming hypotheses)
- **mechinterp-investigator**: Full investigation workflow
