skillby terrylica

vm-infrastructure-ops

Troubleshoot and manage GCP e2-micro VM running eth-realtime-collector. Use when VM service is down, systemd failures occur, real-time data stream stops, or VM network issues arise. Keywords systemd, journalctl, eth-collector, gcloud compute.

Installs: 0
Used in: 1 repos
Updated: 2d ago
$npx ai-builder add skill terrylica/vm-infrastructure-ops

Installs to .claude/skills/vm-infrastructure-ops/

# VM Infrastructure Operations

**Version**: 1.0.0
**Last Updated**: 2025-11-13
**Purpose**: Troubleshoot and manage GCP e2-micro VM running eth-realtime-collector

## When to Use

Use this skill when:

- VM service down, "eth-collector" systemd failures
- Real-time data stream stopped (ClickHouse not receiving blocks)
- VM network issues, DNS resolution failures
- Need to check service status, view logs, or restart services
- Keywords: systemd, journalctl, eth-collector, gcloud compute

## Prerequisites

- GCP project access: `eonlabs-ethereum-bq`
- VM instance: `eth-realtime-collector` in zone `us-east1-b`
- gcloud CLI configured with appropriate credentials

## Workflows

### 1. Check Service Status

Check if eth-collector systemd service is running:

```bash
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
  --command='sudo systemctl status eth-collector'
```

**Expected Output** (healthy):
```
ā— eth-collector.service - Ethereum Real-Time Collector
   Loaded: loaded (/etc/systemd/system/eth-collector.service; enabled)
   Active: active (running) since ...
```

**Alternative** (use provided script):
```bash
.claude/skills/vm-infrastructure-ops/scripts/check_vm_status.sh
```

### 2. View Logs (Live Tail)

Stream real-time logs from the collector service:

```bash
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
  --command='sudo journalctl -u eth-collector -f'
```

**What to Look For**:
- "Block inserted" messages every ~12 seconds (healthy)
- gRPC errors, DNS resolution failures (unhealthy)
- "Connection refused" or "Metadata server unreachable" (network issues)

### 3. View Recent Logs (Last 100 Lines)

```bash
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
  --command='sudo journalctl -u eth-collector -n 100'
```

### 4. Restart Service

Restart the collector service after configuration changes or to recover from errors:

```bash
gcloud compute ssh eth-realtime-collector --zone=us-east1-b \
  --command='sudo systemctl restart eth-collector'
```

**Alternative** (use provided script with pre-checks):
```bash
.claude/skills/vm-infrastructure-ops/scripts/restart_collector.sh
```

**When to Use**:
- After deploying code updates
- Recovering from gRPC metadata validation errors
- After Secret Manager credential updates

### 5. VM Hard Reset

Hard reset the VM instance (use as last resort):

```bash
gcloud compute instances reset eth-realtime-collector --zone=us-east1-b
```

**When to Use**:
- VM network connectivity completely lost
- DNS resolution failures
- Metadata server unreachable
- Service restart doesn't resolve issues

**Warning**: This forcefully restarts the VM. All in-memory state is lost.

### 6. Verify Data Flow

After restarting services, verify data is flowing to ClickHouse:

```bash
cd 
doppler run --project aws-credentials --config prd -- python3 -c "
import clickhouse_connect
import os
client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)
result = client.query('SELECT MAX(timestamp), MAX(number) FROM ethereum_mainnet.blocks FINAL')
print(f'Latest block: {result.result_rows[0][1]:,} at {result.result_rows[0][0]}')
"
```

**Expected Output** (healthy):
```
Latest block: 23,800,000+ at <within last 60 seconds>
```

## Common Failure Modes

See [VM Failure Modes](/.claude/skills/vm-infrastructure-ops/references/vm-failure-modes.md) for detailed troubleshooting guide.

**Quick Reference**:

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| Service status: `failed` | gRPC metadata error | Check logs, restart with `.strip()` fix |
| No blocks for >5 minutes | Network connectivity | Check network, reset VM if needed |
| DNS resolution errors | Metadata server unreachable | VM hard reset |
| "Connection refused" | Service not running | Restart service |

## Systemd Commands

See [Systemd Commands Reference](/.claude/skills/vm-infrastructure-ops/references/systemd-commands.md) for complete systemd operations.

**Quick Reference**:

```bash
# Status
sudo systemctl status eth-collector

# Start
sudo systemctl start eth-collector

# Stop
sudo systemctl stop eth-collector

# Restart
sudo systemctl restart eth-collector

# Enable (start on boot)
sudo systemctl enable eth-collector

# Disable (don't start on boot)
sudo systemctl disable eth-collector

# View service logs
sudo journalctl -u eth-collector

# Follow logs live
sudo journalctl -u eth-collector -f
```

## Operational History

**Infrastructure Recovery** (2025-11-10 07:00 UTC):
- VM network failure detected (DNS resolution failed, metadata server unreachable)
- Recovery: VM reset restored network connectivity
- eth-collector service restarted with `.strip()` fix (gRPC metadata validation resolved)
- Real-time data flow confirmed: blocks streaming every ~12 seconds
- Database verified: 23.8M blocks (2015-2025), latest block within seconds

**Maintainability SLO Achievement**: Critical infrastructure failure (VM network down) resolved in <30 minutes (VM reset + service restart + verification).

## Related Documentation

- [ClickHouse Migration ADR](/docs/architecture/decisions/2025-11-25-motherduck-clickhouse-migration.md) - Production database migration
- [Real-Time Collector Deployment Guide](/docs/deployment/realtime-collector.md) - VM deployment
- [Gap Monitor README](/deployment/gcp-functions/gap-monitor/README.md) - Automated gap detection
- [Data Pipeline Monitoring Skill](/.claude/skills/data-pipeline-monitoring/SKILL.md) - Cloud Run Jobs monitoring

## Scripts

- [`check_vm_status.sh`](/.claude/skills/vm-infrastructure-ops/scripts/check_vm_status.sh) - Automated status check via gcloud
- [`restart_collector.sh`](/.claude/skills/vm-infrastructure-ops/scripts/restart_collector.sh) - Safe restart with pre-checks

## References

- [`vm-failure-modes.md`](/.claude/skills/vm-infrastructure-ops/references/vm-failure-modes.md) - Common failure scenarios and solutions
- [`systemd-commands.md`](/.claude/skills/vm-infrastructure-ops/references/systemd-commands.md) - Complete systemd operations reference

Quick Install

$npx ai-builder add skill terrylica/vm-infrastructure-ops

Details

Type
skill
Author
terrylica
Slug
terrylica/vm-infrastructure-ops
Created
6d ago