# Unsloth Fine-Tuning
Expert guidance for fine-tuning large language models with Unsloth's optimized training library.
## Core Capabilities
- Load models with 4-bit quantization and gradient checkpointing
- Configure LoRA/QLoRA for efficient fine-tuning
- Train on custom or Hugging Face datasets
- Export models to GGUF, Ollama, vLLM, or Hugging Face formats
- Monitor training with progress tracking
- Optimize for different hardware configurations
## Quick Start
### 1. Load a Model
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)
```
**Supported Models:**
- Llama 3.3 (70B), 3.2 (1B, 3B), 3.1 (8B)
- Mistral v0.3 (7B), Small Instruct
- Phi 3.5 mini, Phi 3 medium
- Gemma 2 (9B, 27B)
- Qwen 2.5 (7B)
### 2. Apply LoRA
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank (8-64)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,  # Scaling factor
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=2048
)
```
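A quick sanity check after attaching the adapters is to confirm how little of the model is actually trainable. A minimal sketch, assuming the PEFT-standard `print_trainable_parameters()` method is available on the returned model:

```python
# LoRA should leave only a small fraction of weights trainable (typically well under 2%).
model.print_trainable_parameters()
```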
### 3. Configure Training
```python
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=100,
        learning_rate=2e-4,
        logging_steps=1,
        output_dir="./output",
        optim="adamw_8bit",
        seed=3407
    )
)
```
### 4. Train
```python
trainer.train()
```
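`trainer.train()` returns the run statistics, which is a convenient hook for basic progress monitoring. A minimal sketch (the exact metric keys depend on your `transformers`/`trl` versions):

```python
import torch

trainer_stats = trainer.train()

# Runtime and loss as reported by the Trainer.
print(f"Training runtime: {trainer_stats.metrics['train_runtime']:.1f}s")
print(f"Final training loss: {trainer_stats.metrics['train_loss']:.4f}")

# Peak GPU memory, useful when tuning batch size and sequence length.
if torch.cuda.is_available():
    print(f"Peak VRAM: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")
```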
### 5. Export
```python
# GGUF format
model.save_pretrained_gguf(
    "model",
    tokenizer,
    quantization_method="q4_k_m"
)
# Hugging Face format
model.save_pretrained("./hf_model")
tokenizer.save_pretrained("./hf_model")
```
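For vLLM or plain Hugging Face serving you usually want the LoRA weights merged back into the base model rather than the 4-bit adapter checkpoint. A hedged sketch using Unsloth's merged-save and Hub-push helpers (method names may vary slightly across Unsloth versions, and the repo id is a placeholder):

```python
# Merge LoRA adapters into a 16-bit model for vLLM / Transformers serving.
model.save_pretrained_merged(
    "merged_model",
    tokenizer,
    save_method="merged_16bit"
)

# Or push a GGUF build straight to the Hugging Face Hub (requires HF_TOKEN).
model.push_to_hub_gguf(
    "your-username/your-model-gguf",  # hypothetical repo id
    tokenizer,
    quantization_method="q4_k_m"
)
```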
## Performance Optimization
### Memory Optimization
**Out of memory? Try the following (combined in a sketch after this list):**
1. Reduce `per_device_train_batch_size` to 1
2. Increase `gradient_accumulation_steps` to 8
3. Reduce `max_seq_length` to 1024
4. Use smaller model (1B instead of 3B)
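Put together, a low-memory configuration applying all four suggestions might look like this (a sketch; exact values depend on your GPU):

```python
from unsloth import FastLanguageModel
from trl import SFTConfig

# Smallest model, shorter context, 4-bit weights, checkpointed activations.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-bnb-4bit",   # suggestion 4: smaller model
    max_seq_length=1024,               # suggestion 3: shorter sequences
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)

low_mem_args = SFTConfig(
    per_device_train_batch_size=1,     # suggestion 1
    gradient_accumulation_steps=8,     # suggestion 2: keeps the effective batch size at 8
    learning_rate=2e-4,
    max_steps=100,
    optim="adamw_8bit",
    output_dir="./low_mem_run"
)
```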
### Speed Optimization
**Training too slow? Check:**
1. GPU is being used: `nvidia-smi` (or the Python check after this list)
2. Batch size isn't too small
3. Using `load_in_4bit=True`
4. Using `use_gradient_checkpointing="unsloth"`
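The first check can also be done from Python before launching a long run; a minimal sketch:

```python
import torch

# Confirm a CUDA GPU is visible to PyTorch before starting training.
assert torch.cuda.is_available(), "No CUDA GPU detected; training would fall back to CPU."
print(torch.cuda.get_device_name(0))
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
```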
### Quality Optimization
**Poor results? Adjust:**
1. Increase `max_steps` to 500-1000
2. Try learning rates: 1e-4, 2e-4, 5e-4
3. Increase dataset quality/size
4. Use larger model if resources allow
## Hardware Requirements
### Minimum (1B models)
- GPU: RTX 3060 (12GB VRAM)
- RAM: 16GB
- Training time: 20-40 min for 100 steps
### Recommended (3B-7B models)
- GPU: RTX 4090 or A100
- RAM: 32GB+
- Training time: 10-30 min for 100 steps
### Budget (Small experiments)
- GPU: RTX 3060 Ti (8GB)
- Use: Llama-1B or Phi-3-mini
- Reduce to `per_device_train_batch_size=1` and `max_seq_length=1024`
## Common Patterns
### Pattern 1: Quick Prototype
```python
# Minimal setup for fast experimentation
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=1024,  # Shorter for speed
    load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(model, r=8)  # Lower rank
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        max_steps=50,  # Few steps
        learning_rate=2e-4,
        output_dir="./quick_test"
    )
)
```
### Pattern 2: Production Quality
```python
# Full setup for best results
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Standard rank
    lora_alpha=16,
    use_gradient_checkpointing="unsloth"
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=500,  # More steps
        learning_rate=2e-4,
        warmup_steps=10,
        logging_steps=10,
        save_steps=100,
        output_dir="./production_model"
    )
)
```
### Pattern 3: Large Model (70B)
```python
# Special settings for very large models
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth"
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Fewer targets
    use_gradient_checkpointing="unsloth"
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=1,  # Must be 1
        gradient_accumulation_steps=8,  # Compensate
        max_steps=200,
        learning_rate=1e-4,  # Lower LR
        output_dir="./large_model"
    )
)
```
## Troubleshooting
### Error: "CUDA out of memory"
**Solution:**
```python
# Reduce memory usage
batch_size = 1
max_seq_length = 1024
gradient_accumulation_steps = 8
# Or use smaller model
```
### Error: "Model not found"
**Solution:**
- Check model name spelling
- Verify internet connection
- Try with Hugging Face token: `export HF_TOKEN=your_token`
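For gated repositories (e.g. the base Llama models) the token can also be passed directly when loading; a sketch, assuming the standard Hugging Face `token` argument is forwarded by `from_pretrained`:

```python
import os
from unsloth import FastLanguageModel

# Read the token from the environment rather than hard-coding it.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-1B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    token=os.environ.get("HF_TOKEN")
)
```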
### Error: "Training loss not decreasing"
**Solution:**
```python
# Adjust hyperparameters
learning_rate = 5e-4 # Try higher
max_steps = 500 # Train longer
# Or check dataset quality
```
## Best Practices
1. **Always use 4-bit quantization** unless you have >80GB VRAM
2. **Start with small models** (1B) for experimentation
3. **Monitor GPU usage** with `nvidia-smi`
4. **Save checkpoints** every 100 steps (see the sketch after this list)
5. **Validate on test set** before exporting
6. **Use appropriate LoRA rank**: 8 for experiments, 16 for production, 32 for complex tasks
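Practices 4 and 5 map directly onto trainer arguments. A sketch, assuming a held-out `dataset["test"]` split exists and using the `TrainingArguments` fields that `SFTConfig` inherits (`eval_strategy` is named `evaluation_strategy` on older `transformers` releases):

```python
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],      # practice 5: validate on a held-out split
    tokenizer=tokenizer,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=500,
        learning_rate=2e-4,
        output_dir="./checkpointed_run",
        save_steps=100,                # practice 4: checkpoint every 100 steps
        save_total_limit=3,            # keep only the newest checkpoints
        eval_strategy="steps",
        eval_steps=100
    )
)
```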
## Dataset Format
Unsloth works with Hugging Face datasets. Example format:
```json
{
  "text": "### Instruction: Explain quantum computing\n### Response: Quantum computing uses quantum bits..."
}
```
Or instruction format:
```json
{
  "instruction": "Explain quantum computing",
  "input": "",
  "output": "Quantum computing uses quantum bits..."
}
```
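Instruction-format data usually needs to be rendered into a single `text` field before training. A minimal sketch using `datasets`, assuming the `tokenizer` loaded in the Quick Start is in scope; the dataset name and prompt template are illustrative, not a fixed Unsloth format:

```python
from datasets import load_dataset

# Example instruction-format dataset; substitute your own.
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    # Collapse instruction/input/output into the single "text" field SFTTrainer reads.
    prompt = f"### Instruction: {example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input: {example['input']}\n"
    prompt += f"### Response: {example['output']}" + tokenizer.eos_token
    return {"text": prompt}

dataset = dataset.map(to_text)
```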
## Performance Benchmarks
| Model | VRAM | Speed (vs standard) | Memory Reduction |
| ------------- | ----- | ------------------- | ---------------- |
| Llama 3.2 1B | ~2GB | 2x faster | 80% less |
| Llama 3.2 3B | ~4GB | 2x faster | 75% less |
| Llama 3.1 8B | ~6GB | 2x faster | 70% less |
| Llama 3.3 70B | ~40GB | 2x faster | 75% less |
## Additional Resources
For more advanced topics, see:
- [ADVANCED.md](ADVANCED.md) - Multi-GPU, custom optimizers, advanced LoRA
- [DATASETS.md](DATASETS.md) - Dataset preparation and formatting
- [EXPORT.md](EXPORT.md) - Detailed export options and formats
## Version Compatibility
- Python: 3.10, 3.11, 3.12 (not 3.13)
- CUDA: 11.8 or 12.1+
- PyTorch: 2.0+
- Transformers: 4.37+