# Data Science Tools Skill
## Purpose
This skill documents the data science ecosystem available in this project, including:
- Which Python libraries are installed and available
- How to use them for statistical analysis and regression
- **WHERE to place analysis scripts** (reports/{topic}/scripts/ - NOT root scripts/)
- Best practices for reproducible data science
## 🚨 CRITICAL: Script Organization Rule
**ALL regression, modeling, and analysis scripts MUST go in:**
```
reports/{topic}_{timestamp}/scripts/
```
**NEVER in:**
```
scripts/ ❌ (root scripts/ is only for reusable utilities)
```
See [Script Organization Best Practices](#script-organization-best-practices) section below.
## Available Libraries
### Installed in `.venv` Virtual Environment
The following data science libraries are installed and ready to use:
| Library | Version | Purpose |
|---------|---------|---------|
| **numpy** | Latest | Numerical computing, arrays, linear algebra |
| **scipy** | 1.16.3+ | Scientific computing, optimization, statistics |
| **pandas** | 2.3.3+ | Data manipulation, DataFrames, time series |
| **scikit-learn** | 1.7.2+ | Machine learning, regression, clustering |
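A quick way to confirm these libraries import from the active `.venv` (the versions printed will reflect whatever is actually installed):

```python
# Run inside the activated .venv; prints the installed version of each library
import numpy, scipy, pandas, sklearn

for mod in (numpy, scipy, pandas, sklearn):
    print(f"{mod.__name__:15s} {mod.__version__}")
```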
### Activating the Virtual Environment
**All Python scripts must use the virtual environment:**
```bash
source .venv/bin/activate && python scripts/your_script.py
```
Or give scripts a shebang and make them executable (note: `#!/usr/bin/env python3` only picks up the `.venv` interpreter when the venv is already active; otherwise point the shebang at the venv's Python with an absolute path):
```python
#!/usr/bin/env python3
# Make executable with chmod +x, then run (with the venv active): ./scripts/your_script.py
```
**In Bash tool calls:**
```bash
source .venv/bin/activate && python scripts/analysis.py
```
## Common Use Cases
### 1. Regression Modeling (scipy.optimize.curve_fit)
**Purpose:** Fit non-linear models to data (S-curves, exponential, etc.)
**Example: Fitting a Logistic Growth Curve (S-curve)**
```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
# Define model
def logistic(t, L, k, t0):
    """Logistic S-curve: L / (1 + exp(-k*(t - t0)))"""
    return L / (1 + np.exp(-k * (t - t0)))
# Prepare data
years = np.array([1993, 1994, ...]) # Time points
shares = np.array([0.004, 0.005, ...]) # Observed values
t = years - 1993 # Normalize time
# Fit model with bounds
p0 = [80, 0.5, 30] # Initial guess: L=80%, k=0.5, t0=30
bounds = ([50, 0.1, 20], [100, 2.0, 50]) # Parameter bounds
params, covariance = curve_fit(
    logistic, t, shares,
    p0=p0,
    bounds=bounds,
    maxfev=10000
)
L, k, t0 = params
# Validate
predictions = logistic(t, L, k, t0)
r2 = r2_score(shares, predictions)
rmse = np.sqrt(np.mean((shares - predictions)**2))
print(f"Fitted parameters: L={L:.2f}, k={k:.4f}, t0={t0:.2f}")
print(f"R² = {r2:.6f}, RMSE = {rmse:.4f}")
```
**⚠️ Important:** Always use `curve_fit` with:
- Initial guess (`p0`)
- Bounds on parameters (prevents unrealistic values)
- `maxfev` to allow sufficient iterations
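The `covariance` matrix returned above can also be turned into parameter uncertainties; a minimal sketch continuing the example (names match the logistic fit):

```python
# Standard errors of the fitted parameters from the covariance matrix
perr = np.sqrt(np.diag(covariance))
for name, value, err in zip(['L', 'k', 't0'], params, perr):
    print(f"{name} = {value:.3f} ± {err:.3f}")
```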
### 2. Model Comparison
**Compare multiple models to find best fit:**
```python
# Assumes t, shares, logistic, curve_fit, and r2_score from the example above.
# Gompertz is defined here (one common parameterization) so the comparison runs as-is.
def gompertz(t, L, b, t0):
    """Gompertz curve: L * exp(-exp(-b*(t - t0)))"""
    return L * np.exp(-np.exp(-b * (t - t0)))

models = {
    'logistic': (logistic, [80, 0.5, 30], ([50, 0.1, 20], [100, 2.0, 50])),
    'gompertz': (gompertz, [80, 0.2, 30], ([50, 0.05, 20], [100, 1.0, 50])),
}
results = {}
for name, (func, p0, bounds) in models.items():
    params, _ = curve_fit(func, t, shares, p0=p0, bounds=bounds)
    pred = func(t, *params)
    r2 = r2_score(shares, pred)
    results[name] = {'params': params, 'r2': r2}
# Find best
best_model = max(results.items(), key=lambda x: x[1]['r2'])
print(f"Best model: {best_model[0]} (R² = {best_model[1]['r2']:.6f})")
```
### 3. Data Manipulation with Pandas
**Read CSV, filter, aggregate:**
```python
import pandas as pd
# Read data
df = pd.read_csv('data/ev_annual_bil10.csv')
# Filter
recent = df[df['year'] >= 2015]
# Aggregate
yearly_avg = df.groupby('year')['ev_share_pct'].mean()
# Export
df.to_csv('data/results.csv', index=False)
```
### 4. Statistical Analysis
```python
from scipy import stats
# Correlation
corr, p_value = stats.pearsonr(x, y)
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# T-test
t_stat, p_value = stats.ttest_ind(group1, group2)
```
## Script Organization Best Practices
### Directory Structure
```
dst_skills/
├── scripts/                      # Reusable utilities ONLY
│   ├── fetch_and_store.py
│   ├── db/
│   │   └── helpers.py
│   └── utils.py
│
├── data/                         # Raw data and databases
│   ├── dst.db
│   └── *.csv
│
└── reports/                      # Generated reports
    └── {topic}_{timestamp}/
        ├── report.html
        ├── visualizations.html
        ├── data/                 # Report-specific intermediate data
        │   └── *.csv
        └── scripts/              # ⚠️ ALL analysis scripts go HERE
            ├── README.md
            ├── fit_models.py
            ├── validate.py
            └── requirements.txt
```
**IMPORTANT:** Do NOT create analysis scripts in root `scripts/` directory.
All regression, modeling, and analysis scripts must be in the report's `scripts/` folder.
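For example, create the per-report folders up front so analysis scripts have an obvious home from the start (the report name below is illustrative):

```bash
mkdir -p reports/elbiler_danmark_20251031/scripts
mkdir -p reports/elbiler_danmark_20251031/data
```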
### When to Place Scripts in `reports/{topic}/scripts/` ✅ ALWAYS for Analysis
**Use this for ALL report-specific analysis:**
1. **Regression modeling** (curve_fit, forecasting, etc.)
2. **Statistical analysis** (hypothesis tests, correlations, etc.)
3. **Data transformation** specific to this report
4. **Validation** and model comparison

**Why it matters:**
- **Reproducibility** - readers can re-run your exact analysis
- **Documentation** - shows exactly what was done
- **Versioning** - freezes the code with the report at the time of publication
**✅ ALL of these belong in reports/{topic}/scripts/:**
- `fit_ev_models.py` - Regression modeling
- `validate_models.py` - Model validation
- `verify_regression_models.py` - scipy verification
- `forecast_scenarios.py` - Forecasting
- `statistical_tests.py` - Hypothesis testing
**Example structure:**
```
reports/elbiler_danmark_20251031/
├── report.html
├── visualizations.html
├── data/                             # Intermediate data for THIS analysis
│   ├── model_fits.csv
│   ├── forecasts.csv
│   └── residuals.csv
└── scripts/                          # ✅ ALL analysis scripts here
    ├── README.md                     # Explains how to reproduce
    ├── fit_ev_models.py              # Main regression analysis
    ├── validate_models.py            # Cross-validation
    ├── verify_regression_models.py   # scipy verification
    └── requirements.txt              # Dependencies snapshot
```
### When to Use `scripts/` (Root Level) ⚠️ ONLY for Reusable Utilities
**Root scripts/ is ONLY for infrastructure utilities that are shared across ALL reports:**
1. **Database utilities** (`db/helpers.py`, `db/validate.py`)
2. **Data fetching** (`fetch_and_store.py`)
3. **Generic helpers** (`utils.py`)
4. **NOT for analysis** - no regression, modeling, or statistics
**❌ NEVER put these in root scripts/:**
- Regression models
- Statistical analysis
- Data transformations
- Forecasting
- Model validation
**✅ Root scripts/ should ONLY contain:**
```python
# scripts/db/helpers.py - OK (reusable DB utility)
def safe_numeric_cast(column_name):
    """Helper for casting DST suppressed values."""
    return f"CASE WHEN {column_name} != '..' THEN CAST({column_name} AS NUMERIC) ELSE NULL END"

# scripts/utils.py - OK (generic utility)
from datetime import datetime

def format_timestamp():
    """Standard timestamp format for filenames."""
    return datetime.now().strftime('%Y%m%d_%H%M%S')

# scripts/fetch_and_store.py - OK (reusable infrastructure)
def fetch_dst_table(table_id, filters):
    """Fetch data from DST API and store in DuckDB."""
    # ... implementation
```
**If you're doing curve_fit, forecasting, or statistics → reports/{topic}/scripts/ ✅**
### Template: Report Analysis Script
```python
#!/usr/bin/env python3
"""
EV Adoption Model Fitting and Validation
=========================================
Report: Danmarks Elbilsudvikling 2050
Date: 2025-10-31
Author: Claude Code
Purpose:
    Fit multiple regression models to EV adoption data and compare.

Usage:
    cd reports/{report_name}/scripts/
    source ../../../.venv/bin/activate
    python fit_ev_models.py

Outputs:
    - ../data/model_parameters.csv
    - ../data/forecasts.csv
    - stdout: Model comparison table
"""
import sys
import os
import csv
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
def main():
    # 1. Load data using relative path from scripts/ directory
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.join(script_dir, '../../..')

    # Path to project-level data
    data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
    print(f"Loading data from {data_path}...")

    years = []
    shares = []
    with open(data_path, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            years.append(int(row['year']))
            shares.append(float(row['ev_share_pct']))
    years = np.array(years)
    shares = np.array(shares)

    # 2. Fit models
    print("\nFitting models...")
    # ... implementation

    # 3. Save results to report's data/ directory
    output_dir = os.path.join(script_dir, '../data')
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, 'model_parameters.csv')
    print(f"\nSaving results to {output_path}...")
    # ... save implementation


if __name__ == '__main__':
    main()
```
**Key points:**
- Use `os.path` for cross-platform compatibility
- Always use relative paths from script's location
- Project data: `../../../data/`
- Report data: `../data/`
- Activate venv before running
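An equivalent path setup using `pathlib`, under the same layout assumption (the script lives in `reports/{topic}/scripts/`); shown only as an alternative sketch:

```python
from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent                 # reports/{topic}/scripts/
PROJECT_ROOT = SCRIPT_DIR.parents[2]                         # three levels up = project root
DATA_PATH = PROJECT_ROOT / 'data' / 'ev_annual_bil10.csv'    # project-level raw data
OUTPUT_DIR = SCRIPT_DIR.parent / 'data'                      # report-level results
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
```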
### README.md Template for Report Scripts
````markdown
# Analysis Scripts for EV Adoption Report
## Report Details
- **Topic:** Danmarks Elbilsudvikling til 2050
- **Generated:** 2025-10-31
- **Data:** BIL10, BIL52, BIL51 (Danmarks Statistik)
## Reproducibility
### Prerequisites
```bash
# From project root
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```
### Run Analysis
```bash
cd reports/elbiler_danmark_20251031/scripts/
python fit_ev_models.py
python validate_models.py
```
### Scripts
- `fit_ev_models.py` - Fits logistic, Gompertz, exponential models
- `validate_models.py` - Cross-validation and residual analysis
- `export_forecasts.py` - Generate 2026-2050 predictions
### Outputs
Results saved to `../data/`:
- `model_parameters.csv` - Fitted parameters (L, k, t0)
- `forecasts.csv` - Year-by-year predictions
- `validation_metrics.csv` - R², RMSE, etc.
## Model Details
See `../report.html` Section 3: Methodology
````
## Common Pitfalls and Solutions
### 1. ModuleNotFoundError
**Problem:**
```bash
ModuleNotFoundError: No module named 'scipy'
```
**Solution:**
```bash
# Always activate venv first
source .venv/bin/activate
python scripts/your_script.py
```
### 2. curve_fit Fails to Converge
**Problem:**
```
OptimizeWarning: Covariance of the parameters could not be estimated
```
**Solutions:**
- Improve initial guess `p0`
- Tighten bounds (e.g., L: [60, 90] instead of [50, 100])
- Increase `maxfev` to 20000
- Normalize/scale your data first
- Try different optimization methods
```python
# Better bounds
bounds = ([65, 0.3, 25], [95, 0.8, 40]) # Tighter
# Or use different method
from scipy.optimize import minimize, differential_evolution
```
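If `curve_fit` still struggles, `differential_evolution` can act as a global search for a good starting point; a minimal sketch assuming `t`, `shares`, and `logistic()` from the example above:

```python
import numpy as np
from scipy.optimize import curve_fit, differential_evolution

def sse(p):
    """Sum of squared errors for a candidate (L, k, t0)."""
    return np.sum((shares - logistic(t, *p)) ** 2)

# Global search over the same parameter ranges, then refine with curve_fit
result = differential_evolution(sse, bounds=[(50, 100), (0.1, 2.0), (20, 50)], seed=42)
params, _ = curve_fit(logistic, t, shares, p0=result.x,
                      bounds=([50, 0.1, 20], [100, 2.0, 50]), maxfev=20000)
```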
### 3. Grid Search vs Optimization
**Bad (inefficient):**
```python
best_r2 = 0
for L in [70, 75, 80, 85, 90, 95]:
    for k in np.arange(0.1, 2.0, 0.05):
        # ... fit and compare
```
**Good (use scipy):**
```python
params, _ = curve_fit(logistic, t, shares, p0=[80, 0.5, 30])
```
**When grid search is acceptable:**
- Quick prototyping to find good `p0`
- Testing specific scenarios (e.g., compare L=70% vs L=90%)
- Educational purposes
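If you do use a coarse grid, keep it to seeding `p0` and let scipy finish the optimization; a short sketch reusing `t`, `shares`, and `logistic` from above:

```python
from itertools import product

# Pick the (L, k, t0) combination with the smallest squared error as the starting guess
candidates = product([70, 80, 90], [0.2, 0.5, 1.0], [20, 30, 40])
p0 = min(candidates, key=lambda p: np.sum((shares - logistic(t, *p)) ** 2))
params, _ = curve_fit(logistic, t, shares, p0=p0,
                      bounds=([50, 0.1, 20], [100, 2.0, 50]))
```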
### 4. Overfitting
**Warning signs:**
- R² > 0.999 on historical data
- Model fits noise, not signal
- Poor performance on holdout set
**Solutions:**
```python
# Chronological train-test split (shuffle=False preserves time order)
from sklearn.model_selection import train_test_split
train_t, test_t, train_y, test_y = train_test_split(
    t, shares, test_size=0.2, shuffle=False)

# Fit on train, validate on test (model = the candidate function, e.g. logistic)
params, _ = curve_fit(model, train_t, train_y)
test_pred = model(test_t, *params)
test_r2 = r2_score(test_y, test_pred)
if test_r2 < 0.9:
    print("⚠️ Warning: Poor generalization")
```
## Installation and Verification
### Check Installed Packages
```bash
source .venv/bin/activate
pip list | grep -E "(numpy|scipy|pandas|scikit)"
```
Expected output:
```
numpy 1.x.x
pandas 2.3.3
scikit-learn 1.7.2
scipy 1.16.3
```
### Verify scipy.optimize Works
```bash
source .venv/bin/activate
python -c "from scipy.optimize import curve_fit; print('✓ scipy.optimize available')"
```
### Install Missing Packages
```bash
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```
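To freeze the versions a report was built with (the `requirements.txt` snapshot mentioned in the report structure above), one option is a filtered `pip freeze`; the report path is illustrative:

```bash
source .venv/bin/activate
pip freeze | grep -Ei "numpy|scipy|pandas|scikit-learn" > reports/elbiler_danmark_20251031/scripts/requirements.txt
```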
## Integration with DST Skills Workflow
### Typical Workflow
1. **Discovery:** `/dst-discover` → Find tables
2. **Fetch:** `/dst-fetch` → Download data to `data/`
3. **Analysis:** `/dst-analyze` → SQL queries, basic calculations
4. **Modeling:** Create script in `reports/{topic}/scripts/` for regression
5. **Visualize:** `/dst-visualize` → Create charts from results
6. **Report:** `/dst-report` → Generate HTML with all findings
### Where Each Step Happens
| Step | Location | Examples |
|------|----------|----------|
| Data fetching | `data/` | dst.db, *.csv |
| SQL queries | Agent (ephemeral) | Aggregations, joins |
| Regression/modeling | `reports/{topic}/scripts/` ✅ | curve_fit, forecasting |
| Results | `reports/{topic}/data/` | model_parameters.csv |
| Report | `reports/{topic}/` | report.html |
### Example: Complete Regression Analysis
**Step 1: Create analysis script in report folder**
File: `reports/elbiler_danmark_20251031/scripts/fit_logistic_model.py`
```python
#!/usr/bin/env python3
"""
Fit logistic regression to EV adoption data.
Usage:
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
"""
import csv
import os
import numpy as np
from scipy.optimize import curve_fit
def main():
# Load data from project data/
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.join(script_dir, '../../..')
data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
# 1. Load data
years = []
shares = []
with open(data_path, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
years.append(int(row['year']))
shares.append(float(row['ev_share_pct']))
years = np.array(years)
shares = np.array(shares)
t = years - years.min()
# 2. Define and fit model
def logistic(t, L, k, t0):
return L / (1 + np.exp(-k * (t - t0)))
params, _ = curve_fit(logistic, t, shares,
p0=[80, 0.5, 30],
bounds=([50, 0.1, 20], [100, 2.0, 50]))
L, k, t0 = params
# 3. Forecast
future_years = np.arange(years.max() + 1, 2051)
future_t = future_years - years.min()
forecast = logistic(future_t, L, k, t0)
# 4. Export to report's data/ folder
output_dir = os.path.join(script_dir, '../data')
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'forecast.csv')
with open(output_path, 'w') as f:
writer = csv.writer(f)
writer.writerow(['year', 'predicted_share'])
for year, pred in zip(future_years, forecast):
writer.writerow([year, pred])
print(f"✓ Forecast exported: {output_path}")
print(f" Model: L={L:.1f}%, k={k:.3f}, t0={t0:.1f}")
if __name__ == '__main__':
main()
```
**Step 2: Run from report's scripts/ directory**
```bash
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
```
**Step 3: Use results in visualization and report**
`forecast.csv` now lives in `reports/elbiler_danmark_20251031/data/` and can be used by `/dst-visualize` and `/dst-report`.
**✅ Benefits of this approach:**
- Script stays with report (reproducibility)
- Relative paths work from any machine
- Clear separation: data fetching vs analysis vs reporting
- Easy to version control and share
## References
### Documentation
- **scipy.optimize.curve_fit:** https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
- **sklearn metrics:** https://scikit-learn.org/stable/modules/model_evaluation.html
- **pandas:** https://pandas.pydata.org/docs/
### Regression Theory
- **Logistic growth:** Bass diffusion model, technology adoption
- **Gompertz curve:** Asymmetric S-curve for market saturation
- **Model selection:** AIC, BIC, cross-validation
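For AIC-based comparison of the fitted curves, one common least-squares form can be computed directly from the residuals; a minimal sketch (not part of the skill's scripts):

```python
import numpy as np

def aic_least_squares(y, y_pred, n_params):
    """AIC for least-squares fits: n * ln(SSE / n) + 2k (lower is better)."""
    y, y_pred = np.asarray(y), np.asarray(y_pred)
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)
    return n * np.log(sse / n) + 2 * n_params

# e.g. aic_least_squares(shares, logistic(t, *params), n_params=3)
```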
### Best Practices
- **Script placement:** ALWAYS put analysis scripts in `reports/{topic}/scripts/`
- **Validation:** Use train-test split for model validation
- **Reporting:** Always report R², RMSE, and residual plots
- **Documentation:** Document assumptions and limitations in script docstrings
- **Reproducibility:** Version-control analysis scripts WITH the report they generate
- **Data paths:** Use relative paths with `os.path` for cross-platform compatibility
- **Virtual env:** Always activate `.venv` before running scipy/numpy code
### Quick Reference: Where Does It Go?
| What | Where | Example |
|------|-------|---------|
| **Regression scripts** | `reports/{topic}/scripts/` | `fit_models.py` |
| **Validation scripts** | `reports/{topic}/scripts/` | `verify_regression_models.py` |
| **Forecasting scripts** | `reports/{topic}/scripts/` | `forecast_scenarios.py` |
| **Statistical tests** | `reports/{topic}/scripts/` | `hypothesis_tests.py` |
| **Intermediate results** | `reports/{topic}/data/` | `model_parameters.csv` |
| **Raw data** | `data/` (project root) | `dst.db`, `ev_annual_bil10.csv` |
| **Reusable utilities** | `scripts/` (project root) | `db/helpers.py`, `fetch_and_store.py` |
**Simple rule:** If it uses scipy/curve_fit/statistics → `reports/{topic}/scripts/` ✅