Paper Notes: Parameter-Efficient Fine-Tuning

A collection of lightweight fine-tuning methods.

Rough Comparison of Lightweight Fine-Tuning Methods

  • Methods:

    Method        | Full Name              | Core Idea
    --------------|------------------------|-------------------------------------------------------
    LoRA          | Low-Rank Adaptation    | Insert low-rank trainable modules into weight matrices
    Adapter       | Adapter Module         | Insert small trainable modules between layers
    Prompt Tuning | Prompt / Prefix Tuning | Optimize prompt vectors to guide model output
    BitFit        | Bias Term Fine-Tuning  | Only fine-tune bias terms
    QLoRA         | Quantized LoRA         | LoRA fine-tuning on quantized models to save memory
    Delta Tuning  | Delta Tuning           | Fine-tune specific modules (e.g., attention)
  • Comparison:

    Method        | Trainable Params | Resource Needs | Performance | Best Use Case
    --------------|------------------|----------------|-------------|---------------------------------------
    LoRA          | Few              | Medium         | High        | General fine-tuning
    Adapter       | Few              | Medium         | Medium-High | Multi-task learning
    Prompt Tuning | Very Few         | Very Low       | Medium      | Text generation/classification
    BitFit        | Very Few         | Very Low       | Low-Medium  | Simple/quick experiments
    QLoRA         | Few              | Low            | High        | Large models with limited resources
    Delta Tuning  | Few              | Medium         | Medium      | Fine-tune attention or specific modules
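
To make the "Resource Needs" gap between LoRA and QLoRA more concrete, here is a back-of-the-envelope sketch of weight storage at different precisions. The 7B parameter count is a hypothetical example, not a figure from these notes, and the estimate covers the frozen weights only: it ignores activations, optimizer state, the LoRA parameters themselves, and quantization metadata.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit base weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized base weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the quantized base model needs roughly a quarter of the weight memory, which is the main reason QLoRA is listed with lower resource needs.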

Full Fine-Tuning: Types and Effects

  • Main approaches:

    Type                      | Description                               | Applications
    --------------------------|-------------------------------------------|-------------------------------------------------
    Standard full fine-tuning | Train all parameters                      | Single-task adaptation
    Multi-stage fine-tuning   | General → specific task                   | Better control & generalization
    Continual fine-tuning     | Adapt to new data over time               | Online/iterative learning
    Domain-adaptive FT        | Transfer pretrained models to domain data | Healthcare, law, finance
    Instruction FT            | Fine-tune on instruction data             | Multi-task general models (e.g., Alpaca, ChatGPT)
  • Comparison:

    Method         | Resource Needs | Data Needs   | Generalization | Best Use Case
    ---------------|----------------|--------------|----------------|--------------------
    Standard FT    | Very High      | Medium-High  | Medium         | Single-task
    Multi-stage FT | High           | High         | High           | Multi-task transfer
    Continual FT   | Medium         | Growing      | Medium-High    | Online learning
    Domain FT      | Medium-High    | Domain data  | High           | Industry-specific
    Instruction FT | Very High      | Diverse data | High           | General LLMs

Survey on PEFT

  • Latest: Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
  • PEFT: Train only a small subset of parameters, freezing the rest.
  • Key considerations: computation flow in LLMs, PEFT fundamentals.
  • Four categories:
    1. Additive: add parameters or adjust activations, no change to base parameters.
    2. Selective: fine-tune a subset of base parameters (e.g., some layers, heads).
    3. Reparameterization: express the weight update in a low-dimensional (e.g., low-rank) form for training, then map it back to the original shape for inference.
    4. Hybrid: combinations of the above.
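
As one illustration of the reparameterization category, the sketch below uses a LoRA-style update, h = W_0 h_in + (α/r) W_up W_down h_in, with plain Python lists. It shows two things: the low-rank branch can be folded into the base weight without changing the output, and the number of trainable parameters is much smaller than the full matrix. All dimensions, the seed, and the helper names (`matvec`, `matmul`, `lora_forward`, `merge`) are assumptions for illustration. LoRA normally initializes W_up to zero; random values are used here only so that the merge check is non-trivial.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def lora_forward(W0, W_up, W_down, h_in, alpha, r):
    """h = W0 h_in + (alpha / r) * W_up (W_down h_in); W0 stays frozen."""
    base = matvec(W0, h_in)
    update = matvec(W_up, matvec(W_down, h_in))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge(W0, W_up, W_down, alpha, r):
    """Fold the low-rank update into a single matrix W0 + (alpha / r) W_up W_down."""
    delta = matmul(W_up, W_down)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, delta)]


random.seed(0)
d_out, d_in, r, alpha = 6, 5, 2, 4.0


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W0 = rand_matrix(d_out, d_in)    # frozen base weight
W_down = rand_matrix(r, d_in)    # trainable, projects d_in -> r
W_up = rand_matrix(d_out, r)     # trainable, projects r -> d_out
h_in = [random.gauss(0.0, 1.0) for _ in range(d_in)]

h_adapter = lora_forward(W0, W_up, W_down, h_in, alpha, r)
h_merged = matvec(merge(W0, W_up, W_down, alpha, r), h_in)
max_diff = max(abs(a - b) for a, b in zip(h_adapter, h_merged))
print("max difference after merging:", max_diff)

# Parameter count for a hypothetical 4096 x 4096 layer with rank 8.
full_params = 4096 * 4096
lora_params = 8 * (4096 + 4096)
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

The merged matrix has the same shape as W_0, which is why a merged reparameterized method adds no extra inference cost.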

[Figure: PEFT taxonomy]


Two Key Issues

  • Computation Flow in LLaMA:

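    • The model has three parts: an Embedding layer, a stack of Decoder blocks, and an Output Head.
    • The Embedding maps tokens to vectors; each Decoder block applies multi-head self-attention (MSA) followed by a feed-forward network (FFN); a final linear layer plus softmax outputs the token distribution.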
    • Uses RoPE for positional embeddings, SiLU activation in FFN.
    • Softmax produces token probabilities: $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$
  • Overview on PEFT:

    • Additive: add modules.
    • Selective: update subsets of parameters.
    • Reparameterized: low-rank updates merged after training.
    • Hybrid: combined approaches.
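
Two of the building blocks named in the computation flow above, the softmax at the output head and the SiLU activation in the FFN, can be sketched in a few lines. This is a minimal standard-library version, not the LLaMA implementation. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that leaves the result unchanged, and the example logits are made up.

```python
import math


def softmax(logits):
    """softmax(x_i) = exp(x_i) / sum_j exp(x_j), computed stably."""
    m = max(logits)  # shift so the largest exponent is exp(0) = 1
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


# Made-up logits over a three-token vocabulary.
probs = softmax([2.0, 1.0, 0.1])
print("token probabilities:", [round(p, 4) for p in probs])
print("sum:", sum(probs))

# Large logits would overflow a naive exp(); the shifted version is fine.
print("stable on large inputs:", softmax([1000.0, 1000.0]))
```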

[Figure: PEFT overview]


Evaluation Tasks

  • General benchmarks: GLUE (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI)
  • QA benchmarks: OpenBookQA, BoolQ, ARC-easy, ARC-challenge
  • Reasoning & commonsense: PIQA, Social IQa (SIQA), HellaSwag, WinoGrande
  • Real-world scenarios: ShareGPT, Azure Function Trace, Gamma process

Categories of PEFT

1. Additive PEFT

  • Freeze base model, add/train small modules.
  • Examples:
    • Adapters: bottleneck layers with up/down projection.
    • Soft Prompts / Prefix Tuning: prepend trainable vectors to guide attention.
    • IA³ / SSF: learned scale vectors (IA³) or scale-and-shift vectors (SSF) applied to activations after MSA/FFN/Norm; minimal overhead.
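
The adapter idea from the list above can be sketched as a bottleneck with a residual connection: project the hidden state down to a small dimension, apply a nonlinearity, project back up, and add the result to the input. The dimensions, the ReLU choice, and the zero initialization of the up-projection are assumptions for illustration; published adapter variants differ in placement and activation.

```python
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def adapter(h, W_down, W_up):
    """Bottleneck adapter: h + W_up * relu(W_down * h)."""
    z = relu(matvec(W_down, h))   # down-projection: d -> r
    delta = matvec(W_up, z)       # up-projection:   r -> d
    return [a + b for a, b in zip(h, delta)]  # residual connection


random.seed(0)
d, r = 8, 2  # hidden size and bottleneck size (hypothetical values)

W_down = [[random.gauss(0.0, 0.1) for _ in range(d)] for _ in range(r)]
W_up = [[0.0] * r for _ in range(d)]  # zero init: adapter starts as identity
h = [random.gauss(0.0, 1.0) for _ in range(d)]

out = adapter(h, W_down, W_up)
adapter_params = 2 * d * r  # only these would be trained; the base model stays frozen

print("identity at initialization:", out == h)
print("adapter parameters:", adapter_params)
```

Starting from an identity mapping means inserting the module does not disturb the frozen model's behavior before training begins.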

2. Selective PEFT

  • Train a subset of existing parameters, using masks.
  • Examples:
    • Diff Pruning: train only difference vector $\delta$.
    • FishMask / Fish-Dip: select via Fisher information.
    • BitFit: fine-tune only bias terms.
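    • Child-tuning / PaFi / SAM: other strategies for choosing which parameters the mask leaves trainable (e.g., Child-tuning updates only a selected child network).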
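
The mask-based selection described above can be sketched as a gradient step that only touches the parameters the mask marks as trainable. The example uses a BitFit-style rule (train only tensors whose name ends in `.bias`); the parameter names, values, gradients, and learning rate are all hypothetical.

```python
def masked_sgd_step(params, grads, mask, lr):
    """Apply one SGD step, but only to tensors whose mask entry is True."""
    updated = {}
    for name, values in params.items():
        if mask[name]:
            updated[name] = [p - lr * g for p, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied unchanged
    return updated


# Hypothetical flat parameter tensors for a tiny model.
params = {
    "attn.weight": [0.5, -0.2],
    "attn.bias": [0.1],
    "ffn.weight": [1.0, 2.0],
    "ffn.bias": [0.0],
}
# Made-up gradients; every entry is 1.0 to keep the arithmetic obvious.
grads = {name: [1.0] * len(values) for name, values in params.items()}

# BitFit-style selection: only bias terms are trainable.
mask = {name: name.endswith(".bias") for name in params}

updated = masked_sgd_step(params, grads, mask, lr=0.1)
trainable = sum(len(v) for n, v in params.items() if mask[n])
total = sum(len(v) for v in params.values())

for name in params:
    print(name, params[name], "->", updated[name])
print(f"trainable: {trainable} of {total} values")
```

Other selective methods differ mainly in how the mask is built, for example from Fisher information rather than from parameter names.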

3. Reparameterized PEFT

  • Low-rank updates merged with base weights after training.
  • Representative: LoRA: $h = W_0 h_{in} + \frac{\alpha}{r} W_{up} W_{down} h_{in}$
  • Extensions: DyLoRA (dynamic ran