Parameter-Efficient Fine-Tuning (PEFT)
Full fine-tuning updates all parameters in a model. For a 7B-parameter model, that means storing and updating 7 billion floats, which requires massive GPU memory and produces a full copy of the model for each task. PEFT methods avoid this by training only a small fraction of the parameters.
Why PEFT?
| Approach | Trainable Params | GPU Memory | Storage per Task |
|---|---|---|---|
| Full Fine-Tuning (7B) | 7,000,000,000 | ~28 GB+ | ~14 GB |
| LoRA (rank 16) | ~8,400,000 | ~16 GB | ~16 MB |
| QLoRA (4-bit + LoRA) | ~8,400,000 | ~6 GB | ~16 MB |
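The table's figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes fp16 storage (2 bytes per parameter) and Llama-2-7B-like dimensions (hidden size 4096, 32 layers, LoRA rank 16 on the query and value projections); exact numbers vary with which modules you target:

```python
# Back-of-envelope arithmetic behind the table (assumed dims: hidden size 4096,
# 32 layers, LoRA rank 16 on q_proj and v_proj, fp16 storage)
full_params = 7_000_000_000
full_checkpoint_gb = full_params * 2 / 1e9      # 2 bytes/param -> 14 GB per task

d, r, n_layers = 4096, 16, 32
lora_params = 2 * d * r * 2 * n_layers          # A and B, for q_proj and v_proj
lora_checkpoint_mb = lora_params * 2 / 1e6      # -> ~17 MB per task

print(full_checkpoint_gb)                       # 14.0
print(lora_params)                              # 8388608
print(round(lora_checkpoint_mb, 1))
```

The GPU-memory column is harder to derive exactly (it includes activations, optimizer states, and framework overhead), but the checkpoint sizes follow directly from the parameter counts.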
The Core Insight of PEFT
The weight updates needed to adapt a large pretrained model tend to have low intrinsic rank: they can be well approximated in a much lower-dimensional subspace than the full parameter space. PEFT methods exploit this by parameterizing the update with far fewer degrees of freedom than the model itself.
LoRA: Low-Rank Adaptation
LoRA is the most popular PEFT method. It freezes the original model weights and injects small, trainable low-rank matrices.
How LoRA Works
For a pre-trained weight matrix W (shape d × d), LoRA learns a low-rank update ΔW = A × B, where A is d × r, B is r × d, and the rank r is much smaller than d:

```
Original:   h = Wx
With LoRA:  h = Wx + ΔWx = Wx + (A × B)x
```

W is frozen; only A and B are trained. So instead of d² parameters, you learn just 2 × d × r.
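To make the counting concrete, here is a minimal numpy sketch (dimensions are illustrative). One factor starts at zero, so at initialization the adapted model is identical to the pretrained one:

```python
import numpy as np

d, r = 4096, 16
print(d * d)      # updating W directly: 16,777,216 parameters
print(2 * d * r)  # A (d x r) plus B (r x d): 131,072 parameters, 128x fewer

# Tiny forward pass: h = Wx + (A x B)x
rng = np.random.default_rng(0)
d, r = 64, 4                   # small dims so the demo runs instantly
W = rng.normal(size=(d, d))    # frozen pretrained weight
A = rng.normal(size=(d, r))    # trainable
B = np.zeros((r, d))           # starts at zero, so A x B = 0 initially

x = rng.normal(size=(d,))
h = W @ x + (A @ B) @ x
assert np.allclose(h, W @ x)   # at init, LoRA leaves the output unchanged
```

During training, gradients flow only into A and B; W never changes.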
LoRA with the PEFT Library

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # Rank: controls expressiveness vs efficiency
    lora_alpha=32,           # Scaling factor (often 2*r)
    lora_dropout=0.05,       # Dropout on LoRA layers
    target_modules=[         # Which layers to apply LoRA to
        "q_proj", "v_proj",  # Attention query and value projections
        "k_proj", "o_proj",  # Attention key and output projections
    ],
    bias="none",             # Don't train bias terms
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
# (r=16 on q, k, v, and o projections across all 32 layers)
```
Choosing LoRA Rank
The rank r trades adapter capacity against cost. Rough guidance:
- r = 4-8: light-touch adaptations (style, formatting, narrow classification)
- r = 16-32: typical instruction tuning and domain adaptation
- r = 64+: larger behavioral shifts, with diminishing returns common beyond this
In practice it is common to sweep a few ranks and keep the smallest one that reaches target quality, since higher ranks cost more memory and storage without guaranteed gains.
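Since the adapter's parameter count grows linearly in r, the trade-off is easy to tabulate. The helper below is a hypothetical illustration (the 4096×4096 projection size and 64-matrix count are assumptions modeled on a 7B-scale network, not read from any model):

```python
def lora_trainable_params(d_in: int, d_out: int, r: int, n_matrices: int) -> int:
    """Parameters in LoRA factors A (d_in x r) and B (r x d_out), per matrix."""
    return (d_in * r + r * d_out) * n_matrices

# Hypothetical 7B-scale setup: 4096x4096 projections, q_proj + v_proj in 32 layers
for r in (4, 8, 16, 32, 64):
    print(r, lora_trainable_params(4096, 4096, r, n_matrices=64))
```

Doubling the rank doubles the adapter, so starting small and scaling up only when quality demands it is the cheap direction to explore.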
QLoRA: Quantization + LoRA
QLoRA combines 4-bit quantization of the base model with LoRA adapters, dramatically reducing memory usage:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,          # Nested quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
Now a 7B model fits in ~6GB VRAM!
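The savings come from storing each frozen weight in 4 bits instead of 16. The sketch below uses simple blockwise absmax rounding to signed 4-bit integers to show the mechanism; it is not the actual NF4 algorithm (NF4 uses a non-uniform code optimized for normally distributed weights), but the storage arithmetic is the same:

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray, block: int = 64):
    """Blockwise absmax quantization to the signed 4-bit range [-7, 7].
    Illustrative only: QLoRA's NF4 uses a non-uniform 4-bit code instead."""
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)

q, scales = quantize_absmax_4bit(w)
w_hat = dequantize(q, scales)
# 4 bits/weight (plus small per-block scales) vs 16 bits/weight: ~4x smaller,
# at the cost of a small rounding error on each weight
```

LoRA makes this viable for training: the quantized base stays frozen, gradients flow only through the small fp16 adapter matrices.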
Prefix Tuning
Prefix tuning prepends trainable "virtual tokens" to the input at every layer. These virtual tokens steer the model's behavior without modifying its weights.
```python
from peft import PrefixTuningConfig, get_peft_model

config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,     # Number of prefix tokens
    prefix_projection=True,    # Use an MLP to project prefix embeddings
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
```
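Mechanically, the virtual tokens act as extra key/value positions that every real token can attend to at each layer. A single-head numpy sketch (toy dimensions, all names hypothetical) shows where the trainable vectors enter:

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_prefix = 8, 5, 3

# Frozen stand-ins for one attention layer's projections
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
x = rng.normal(size=(seq_len, d))        # real token representations

# The trainable part: prefix key/value vectors prepended at this layer
prefix_k = rng.normal(size=(n_prefix, d))
prefix_v = rng.normal(size=(n_prefix, d))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

q = x @ Wq
k = np.concatenate([prefix_k, x @ Wk])   # (n_prefix + seq_len, d)
v = np.concatenate([prefix_v, x @ Wv])

attn = softmax(q @ k.T / np.sqrt(d))     # queries attend to prefix + real keys
out = attn @ v                           # output is steered by the prefix
```

Because the prefix participates at every layer, a few dozen virtual tokens can shift the model's behavior substantially while the weights stay frozen.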
Prompt Tuning
Simpler than prefix tuning, prompt tuning adds trainable embeddings only at the input layer:
```python
from peft import PromptTuningConfig, get_peft_model, PromptTuningInit

config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,  # Initialize from text
    prompt_tuning_init_text="Classify the sentiment of this text: ",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)
model = get_peft_model(base_model, config)
```
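Under the hood this amounts to concatenating a small trained matrix onto the token embeddings before the first layer. A toy numpy sketch (dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_virtual = 100, 16, 20

embedding = rng.normal(size=(vocab_size, d_model))   # frozen embedding table
soft_prompt = rng.normal(size=(n_virtual, d_model))  # the only trainable params

token_ids = np.array([5, 42, 7])
inputs = np.concatenate([soft_prompt, embedding[token_ids]])
# shape (n_virtual + len(token_ids), d_model): the model sees 20 extra "tokens"
```

Nothing else in the model changes, which is why prompt tuning trains an order of magnitude fewer parameters than prefix tuning.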
Comparison of PEFT Methods
| Method | Where Applied | Trainable Params | Best For |
|---|---|---|---|
| LoRA | Attention weights | ~0.1% | General fine-tuning |
| QLoRA | Attention (4-bit base) | ~0.1% | Memory-constrained |
| Prefix Tuning | All layers (virtual tokens) | ~0.1% | Generation tasks |
| Prompt Tuning | Input layer only | ~0.01% | Simple classification |
| Adapter Layers | Inserted between layers | ~1-5% | Multi-task serving |
Training with LoRA
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

# Prepare dataset (same as regular fine-tuning)
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

def format_and_tokenize(examples):
    texts = [
        f"### Instruction:\n{inst}\n### Response:\n{out}"
        for inst, out in zip(examples["instruction"], examples["output"])
    ]
    return tokenizer(texts, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(format_and_tokenize, batched=True)

# Training arguments (same as usual, but training is lighter since only the
# adapter parameters receive gradients)
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,    # LoRA tolerates a higher LR than full fine-tuning
    warmup_steps=50,
    logging_steps=25,
    save_strategy="epoch",
    fp16=True,
    report_to="none",
)

# The collator copies input_ids into labels for the causal-LM loss
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    tokenizer=tokenizer,
)
trainer.train()
```
Saving and Loading PEFT Models
```python
# Save only the adapter weights (small!)
model.save_pretrained("./my-lora-adapter")
```

The saved directory contains:
- adapter_config.json (the LoRA configuration)
- adapter_model.safetensors (just the LoRA weights, ~16 MB)
```python
# Load adapter on top of the base model
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
```
Merging Adapters
You can merge LoRA weights back into the base model for simplified inference:
```python
# Merge LoRA weights into the base model
merged_model = model.merge_and_unload()

# Now it's a regular model: no adapter overhead during inference
merged_model.save_pretrained("./merged-model")

# Load as a normal model (no PEFT needed)
model = AutoModelForCausalLM.from_pretrained("./merged-model")
```
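A quick numpy sanity check (toy dimensions) of why merging is lossless: folding A × B into W gives exactly the same outputs as applying the adapter separately.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(d, r)) * 0.01   # trained LoRA factors (A x B convention)
B = rng.normal(size=(r, d)) * 0.01
x = rng.normal(size=(d,))

h_adapter = W @ x + (A @ B) @ x      # base + adapter at inference time
W_merged = W + A @ B                 # merge once...
h_merged = W_merged @ x              # ...then run as a plain dense layer

assert np.allclose(h_adapter, h_merged)
```

The one trade-off: a merged model can no longer be hot-swapped between adapters, so keep the unmerged checkpoint if you serve multiple tasks.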