The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

Score: 21.0 | Matched keywords: benchmark, fine-tuning, in-context learning, reasoning

Abstract Snapshot

Compressed abstract

Main idea

Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities.

Method signal

In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This "negative transfer" makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement.
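The "negative transfer" criterion above (fine-tuned accuracy falling below the zero-shot baseline) can be written as a small check. This is a minimal sketch: the function names, the `tolerance` parameter, and every accuracy value are placeholders chosen for illustration, not results or code from the paper.

```python
def transfer_delta(finetuned_acc: float, zero_shot_acc: float) -> float:
    """Accuracy change caused by fine-tuning, in percentage points."""
    return finetuned_acc - zero_shot_acc


def is_negative_transfer(
    finetuned_acc: float, zero_shot_acc: float, tolerance: float = 0.0
) -> bool:
    """True when fine-tuning leaves the model worse than its zero-shot baseline.

    `tolerance` (hypothetical) ignores drops smaller than the given margin,
    which can help when evaluation noise is a concern.
    """
    return transfer_delta(finetuned_acc, zero_shot_acc) < -tolerance


# Placeholder accuracies in percent. Illustrative only, NOT numbers from the paper.
zero_shot = 5.0
runs = {"full_ft": 3.0, "lora": 6.5}

flags = {name: is_negative_transfer(acc, zero_shot) for name, acc in runs.items()}
print(flags)
```

With these placeholder values only the `full_ft` run is flagged, which mirrors the pattern the study reports for the smallest models.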

Contribution signal

We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M).
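For readers unfamiliar with the two adapters compared here, the following sketch shows how each forms its effective weight matrix. It follows the commonly published formulations (LoRA adds a scaled low-rank product to the frozen weight; DoRA renormalizes each column of that sum and rescales it by a learned magnitude). The pure-Python matrix helpers, the toy 2x2 weights, and the `alpha / rank` scaling convention are assumptions for illustration and are not taken from the paper's code.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x r) and b (r x k)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w: Matrix) -> list[float]:
    """Euclidean norm of every column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def lora_merge(w0: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """LoRA: W = W0 + (alpha / rank) * B A, with W0 frozen and only B, A trained."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


def dora_merge(
    w0: Matrix, b: Matrix, a: Matrix, magnitude: list[float], alpha: float, rank: int
) -> Matrix:
    """DoRA: normalize each column of the LoRA-updated weight, then rescale it
    by a separately learned magnitude (assumes no column has zero norm)."""
    directed = lora_merge(w0, b, a, alpha, rank)
    norms = column_norms(directed)
    return [
        [magnitude[j] * directed[i][j] / norms[j] for j in range(len(norms))]
        for i in range(len(directed))
    ]


# Toy frozen weight and a rank-1 adapter (illustrative values only).
w0 = [[3.0, 0.0], [4.0, 1.0]]
b = [[1.0], [0.0]]            # 2 x 1
a = [[0.5, 0.0]]              # 1 x 2
magnitude = column_norms(w0)  # DoRA initializes magnitudes to the column norms of W0

lora_w = lora_merge(w0, b, a, alpha=2.0, rank=1)
dora_w = dora_merge(w0, b, a, magnitude, alpha=2.0, rank=1)
```

The design difference is visible in the result: `lora_w` changes both the direction and the length of the first column, while `dora_w` keeps each column's length pinned to `magnitude` and lets the low-rank update change direction only.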

Original Abstract

Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy below zero-shot baselines. This "negative transfer" makes Parameter-Efficient Fine-Tuning (PEFT) not just an efficiency preference, but a stability requirement. We find that while Low-Rank Adaptation (LoRA) and Weight-Decomposed LoRA (DoRA) perform comparably, their strengths vary by task; DoRA excels in complex reasoning (GSM8K), while LoRA dominates pattern matching (OrcaMath). In particular, Full FT is outperformed by LoRA on aligned models (Qwen2.5-0.5B) and even by simple 5-shot In-Context Learning on the smallest architectures (SmolLM2-135M). Based on these findings, we recommend defaulting to PEFT for all aligned sub-1B models and caution against Full FT for any architecture smaller than 500M parameters to prevent catastrophic forgetting. Reproduction of this work can be found at https://github.com/gulguluu/tiny-slm-finetune-compare.
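The 5-shot In-Context Learning baseline mentioned in the abstract needs no weight updates: worked examples are simply placed in the prompt ahead of the target question. The sketch below shows one plausible way to assemble such a prompt. The `Question:` / `Answer:` template, the function name, and the demonstration problems are all assumptions made for illustration; the abstract does not specify the prompt format used in the study.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, str]], question: str, k: int = 5
) -> str:
    """Concatenate k solved examples, then the unsolved target question."""
    if len(examples) < k:
        raise ValueError(f"need at least {k} examples, got {len(examples)}")
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final block is left open so the model completes the answer.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up demonstrations. They are NOT drawn from GSM8K or OrcaMath.
demos = [
    ("What is 2 + 3?", "5"),
    ("What is 7 - 4?", "3"),
    ("What is 6 * 2?", "12"),
    ("What is 9 / 3?", "3"),
    ("What is 10 + 15?", "25"),
]

prompt = build_few_shot_prompt(demos, "What is 8 * 7?")
print(prompt)
```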