Model Training

On Convergence Dynamics in Fine-tuning Strategies for Large Language Models

Abstract

We present a comprehensive empirical study comparing different fine-tuning approaches for decoder-only, autoregressive Transformer language models. Our investigation encompasses four distinct training regimes: full supervised fine-tuning (SFT), checkpoint continuation strategies, aggressive learning rate configurations, and optimized Low-Rank Adaptation (LoRA) methods. Through systematic analysis of loss convergence patterns, learning dynamics, and downstream task performance, we demonstrate that properly configured LoRA can match (or even exceed) the performance of full fine-tuning while maintaining significant computational efficiency advantages.

Keywords: Large Language Models, Fine-tuning, LoRA, Parameter Efficiency, Training Dynamics

1. Introduction

The fine-tuning of large language models (LLMs) has become essential for domain specialization, enabling adaptation of general-purpose models to specific tasks. We use fine-tuning to specialize LLMs for systems engineering tasks and achieve targeted performance improvements. As model sizes grow, the computational cost of full-parameter fine-tuning has driven interest in parameter-efficient alternatives.

Supervised Fine-Tuning (SFT) updates all model parameters but requires substantial resources. Parameter-efficient methods like Low-Rank Adaptation (LoRA) can achieve comparable performance with reduced computational requirements.

1.1 Training Methodologies

Full Supervised Fine-tuning (SFT) involves updating all model parameters during training, typically employing conservative learning rates with appropriate regularization to prevent overfitting. The method allows for maximum model plasticity but requires substantial computational resources.

Checkpoint Continuation strategies involve resuming training from previously saved checkpoints, often with modified hyperparameters, to achieve further performance gains. This approach requires careful learning rate scheduling and hyperparameter adjustments to avoid overfitting on previously seen data patterns. We also show how training loss and validation loss can diverge after checkpoint continuation.

Aggressive Learning Rate SFT employs higher learning rates combined with extended sequence lengths to accelerate convergence. This approach requires careful balance between convergence speed and training stability.

Parameter-Efficient LoRA methods decompose weight updates into low-rank matrices, significantly reducing the number of trainable parameters while maintaining model expressiveness. Success depends critically on proper hyperparameter tuning, including learning rate scheduling, regularization, and architectural choices.
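To make the decomposition concrete, the following is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the rank and scaling values are illustrative, not the settings used in our experiments.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = frozen base layer + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the pretrained weight
        self.scaling = alpha / r
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)              # delta starts at zero, so training begins from the base model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Usage: wrap an existing projection and train only the adapter parameters.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```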

2. Methodology

2.1 Experimental Setup

We conducted experiments using a decoder-only, autoregressive Transformer language model across four distinct training regimes:

  1. Full SFT (Initial): Conservative learning rate with extended epochs
  2. Full SFT (Continuation): Checkpoint continuation beyond initial convergence
  3. Full SFT (Aggressive): Higher learning rate with extended sequence lengths and low weight decay
  4. Optimal LoRA: Parameter-efficient adaptation with optimized hyperparameters and a small set of trainable parameters

2.2 Training Configuration

All experiments employed consistent architectural foundations while varying key hyperparameters:

  • Precision: BF16 mixed precision training across all regimes.
  • Optimization: AdamW optimizer with regime-specific initial learning rates and cosine scheduling.
  • Regularization: Gradient clipping (max norm = 1.0), weight decay, and learning rate warmup.
  • Evaluation: Regular validation checkpoints with comprehensive metrics logging.
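A minimal sketch of this shared setup, assuming PyTorch and the Hugging Face transformers scheduler helper; the stand-in model, placeholder loss, and specific values are illustrative only.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 128).to(device)              # stand-in for the decoder-only Transformer
optimizer = AdamW(model.parameters(), lr=2.5e-5, weight_decay=0.1)
total_steps, warmup_steps = 1000, 100                # warmup ~10% of total steps
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

for step in range(total_steps):
    x = torch.randn(16, 128, device=device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):  # BF16 mixed precision
        loss = model(x).pow(2).mean()                # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()                                 # cosine schedule with warmup
    optimizer.zero_grad()
```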

2.3 Hyperparameter Variations

  • Learning Rates: 1e-05 (conservative full SFT) to 1e-04 (LoRA-specific)
  • Sequence Lengths: 1024-2048 tokens depending on regime
  • Batch Configurations: Effective batch sizes from 1-16 through gradient accumulation
  • Regularization: Weight decay values from 0.0-0.1 based on training strategy
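The per-regime combinations below are an illustrative grid drawn from these ranges (the exact experimental values are not all restated here); it also shows how the effective batch size follows from the micro-batch size times the gradient-accumulation steps.

```python
# Illustrative hyperparameter grid; per-regime values are assumptions for
# demonstration, chosen within the ranges stated above.
regimes = {
    "full_sft_initial":      dict(lr=1e-5,   seq_len=1024, micro_batch=1, grad_accum=16, weight_decay=0.1),
    "full_sft_continuation": dict(lr=1e-5,   seq_len=1024, micro_batch=1, grad_accum=16, weight_decay=0.1),
    "full_sft_aggressive":   dict(lr=2.5e-5, seq_len=2048, micro_batch=1, grad_accum=8,  weight_decay=0.0),
    "optimal_lora":          dict(lr=1e-4,   seq_len=1024, micro_batch=2, grad_accum=8,  weight_decay=0.01),
}

for name, cfg in regimes.items():
    effective_batch = cfg["micro_batch"] * cfg["grad_accum"]  # batch size via gradient accumulation
    print(f"{name}: lr={cfg['lr']}, seq={cfg['seq_len']}, effective batch={effective_batch}")
```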

3. Results and Analysis

3.1 Loss Convergence Patterns

Figure 1: Training and validation loss trajectories across all four training regimes. The vertical line at step 1000 indicates the continuation point for checkpoint-based training.

The loss convergence analysis reveals distinct patterns across training regimes:

Full SFT (Initial) Training exhibits stable convergence with consistent validation performance throughout.

Full SFT (Continuation) further reduces training loss with careful learning rate calibration, but the evaluation loss spikes upward. Early stopping is therefore advisable in this regime.

Full SFT (Aggressive) runs for the longest duration (1950 steps) and achieves a 61.23% loss reduction, though convergence is not smooth. The extended sequence length (2048 tokens) and aggressive learning rate enable continued improvement, with one destabilization interval followed by recovery in the training loss.

Optimal LoRA achieves competitive performance with 95.21% loss reduction (1.24 to 0.059) using only a fraction of trainable parameters. The higher learning rate (1e-04) enables efficient convergence within 688 steps.
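For reference, the relative loss reduction follows directly from the reported start and end losses; the small difference from the quoted 95.21% is presumably due to rounding of the reported endpoints:

\[ \text{loss reduction} = \frac{1.24 - 0.059}{1.24} \approx 0.952 \approx 95.2\% \]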

3.2 Learning Rate Dynamics

Learning Rate Comparison

Figure 2: Learning rate schedules across training regimes - Full SFT (Initial), Full SFT (Continuation), Full SFT (Aggressive), and Optimal LoRA - highlighting the significantly higher rates used for LoRA training and the conservative approach for continuation training.

The learning rate analysis reveals critical insights:

  • LoRA methods successfully employ learning rates roughly 10x those used for full fine-tuning (1e-04 vs 1e-05 to 2.5e-05)
  • Cosine scheduling proves universally effective across all regimes
  • Continuation training benefits from conservative learning rates (1e-05) to maintain stability
  • Warmup strategies (5-10% of total steps) enhance training stability for most configurations
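For completeness, a standard linear-warmup cosine schedule consistent with these observations can be written as follows, where T is the total step count and w the warmup fraction (here 0.05-0.10); library implementations may add a nonzero learning rate floor:

\[
\eta(t) =
\begin{cases}
\eta_{\max}\,\dfrac{t}{wT}, & t < wT,\\[6pt]
\eta_{\max}\,\dfrac{1}{2}\left(1 + \cos\!\left(\pi\,\dfrac{t - wT}{T - wT}\right)\right), & t \ge wT.
\end{cases}
\]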

3.3 Training Efficiency Analysis

Efficiency Analysis

Figure 3: Cumulative loss reduction over training steps across all training regimes.

Training efficiency metrics reveal significant differences:

  1. Full SFT (Initial): 0.006954 loss reduction per step (highest efficiency)
  2. Optimal LoRA: 0.001715 loss reduction per step (excellent parameter efficiency)
  3. Full SFT (Continuation): 0.000227 loss reduction per step (incremental gains from continuation)
  4. Full SFT (Aggressive): 0.000031 loss reduction per step (slow gains over the longest run)
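This per-step metric is simply the total loss reduction divided by the number of training steps; a quick check against the LoRA endpoints reported in Section 3.1 (assuming those are the values used) reproduces the figure to within rounding.

```python
def loss_reduction_per_step(initial_loss: float, final_loss: float, steps: int) -> float:
    """Total loss reduction averaged over the number of optimizer steps."""
    return (initial_loss - final_loss) / steps

# LoRA endpoints from Section 3.1: 1.24 -> 0.059 over 688 steps.
print(loss_reduction_per_step(1.24, 0.059, 688))  # ~0.00172, close to the reported 0.001715
```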

3.4 Training Stability and Validation Dynamics

The validation loss patterns provide crucial insights into training stability:

  • Initial SFT maintains consistent validation performance throughout training
  • Continuation training shows no sustained validation loss degradation, indicating effective regularization (the transient spike noted in Section 3.1 still motivates early stopping)
  • Aggressive SFT recovers from its single destabilization interval and otherwise remains stable over the extended run
  • LoRA training exhibits stable convergence despite higher learning rates

Notably, the continuation training regime does not exhibit the sustained validation loss deterioration that might be expected when resuming training on previously fitted data, suggesting effective hyperparameter adjustment and regularization strategies.

4. Downstream Task Performance

4.1 Evaluation Methodology

Models were evaluated on downstream systems engineering tasks measuring various aspects of domain understanding and generation capabilities. Performance was assessed using task-specific metrics, with results normalized for comparative analysis across training regimes.

4.2 Comparative Performance Results

Summary Statistics

Figure 4: Low-level systems C code generation task performance comparison, showing how fine-tuning enhances the base model's capabilities to compete with SOTA reasoning models including Gemini 2.5 Flash and Gemini 2.5 Pro (Thinking). Performance scores demonstrate the effectiveness of different training regimes, with our enhanced model checkpoints competing with frontier reasoning models on select metrics.

Key Findings:

Human Expert Baseline: Expert-corrected code achieves the highest performance score of 1.54, serving as the gold standard for comparison.

SOTA Reasoning Model Performance: Gemini 2.5 Pro (Thinking) with tools achieves 1.31, while Gemini 2.5 Flash with tools reaches 1.25, representing current state-of-the-art reasoning capabilities.

Best SFT Performance: Full SFT with 1000 steps (Continuation regime) achieved a performance score of 0.66, demonstrating a significant improvement over the base model (0.00) and matching or beating reasoning models on select systems code generation metrics.

Optimal LoRA Performance: The optimized LoRA configuration (688 steps, 8 epochs) achieved a performance score of 0.53, showing remarkable improvement with minimal parameters while outperforming reasoning models in specific task categories.

Domain Specialization: Our enhanced finetuned models demonstrate superior performance on systems coding tasks, with more than 10x improvement over the base models after our training runs.

Performance Hierarchy:

  1. Expert Corrected Code: 1.54 (Human baseline)
  2. Gemini 2.5 Pro (Thinking) + Tools: 1.31 (SOTA reasoning)
  3. Gemini 2.5 Flash + Tools: 1.25 (SOTA reasoning)
  4. Full SFT (1000 steps): 0.66 (Best fine-tuning approach)
  5. Optimal LoRA (688 steps): 0.53 (Best parameter-efficient method)

This aligns with recent work by Thinking Machines Lab ("LoRA Without Regret") demonstrating that properly configured LoRA can match full fine-tuning performance when hyperparameters and architecture are chosen appropriately.

5. Discussion

5.1 Implications for Practice

Our results demonstrate that the choice between full fine-tuning and parameter-efficient methods should not be viewed as a simple trade-off between performance and efficiency. With proper hyperparameter optimization, LoRA can achieve superior performance while maintaining significant computational advantages.

Critical Success Factors for LoRA:

  • Learning rate scheduling with appropriate warmup
  • Careful regularization through weight decay
  • Optimal rank selection
  • Application to all relevant weight matrices rather than a restricted subset
  • Extended training with proper convergence monitoring
  • A learning rate roughly 10x that used in successful full SFT (see the configuration sketch below)
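A hedged sketch of a LoRA configuration reflecting these factors, using the Hugging Face peft library; the rank, alpha, dropout, module names (LLaMA-style projections), and checkpoint path are illustrative assumptions, not our exact experimental settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")  # placeholder checkpoint
lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update (tuned per task)
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,         # light regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention and MLP projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights are trainable
```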

5.2 Continuation Training Insights

The successful continuation training demonstrates that models can benefit from extended training beyond apparent convergence points when:

  • Learning rates are appropriately reduced (50-60% of initial values)
  • Regularization is increased to prevent overfitting
  • Model selection strategies are employed for optimal checkpoint selection
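These conditions map naturally onto a continuation run. The sketch below assumes the Hugging Face Trainer, a hypothetical checkpoint path, and illustrative values; the datasets are assumed to be defined elsewhere, and weights are reloaded from the checkpoint so the run starts with fresh optimizer and scheduler state under the adjusted hyperparameters.

```python
from transformers import (AutoModelForCausalLM, Trainer, TrainingArguments,
                          EarlyStoppingCallback)

# Load weights from the converged checkpoint, then start a new run with a
# reduced learning rate and increased regularization.
model = AutoModelForCausalLM.from_pretrained("checkpoints/step-1000")  # hypothetical path
args = TrainingArguments(
    output_dir="sft-continuation",
    learning_rate=1e-5,               # ~50-60% of the initial run's rate
    weight_decay=0.1,                 # increased regularization
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    eval_strategy="steps",            # "evaluation_strategy" in older transformers versions
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,      # model selection on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,  # datasets defined elsewhere
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
trainer.train()
```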

5.3 Sequence Length and Batch Size Effects

The aggressive SFT regime's success with large token sequences suggests that longer contexts can be beneficial for certain applications, though this must be balanced against computational requirements and memory constraints.

6. Limitations and Future Work

6.1 Limitations

  • Experiments conducted on a single model architecture
  • Limited to specific task domains for evaluation
  • Hyperparameter optimization may not generalize to all scenarios
  • Computational cost analysis limited to training efficiency metrics

6.2 Future Directions

  • Investigation of hybrid approaches combining LoRA with selective full fine-tuning
  • Analysis of optimal continuation training timing and stopping criteria
  • Exploration of dynamic learning rate adjustment strategies
  • Comparative studies across different model sizes and architectures
  • Reinforcement learning in verifiable domains, which could further increase post-training performance

7. Conclusion

This comprehensive analysis demonstrates that parameter-efficient fine-tuning methods, specifically optimized LoRA configurations, can match the performance of traditional full fine-tuning approaches. Remarkably, our domain-specialized models achieve competitive performance against state-of-the-art reasoning models like Gemini 2.5 Flash and Gemini 2.5 Pro (Thinking), beating them on select systems engineering coding metrics while using significantly fewer resources. The key insight is that success requires careful attention to hyperparameter optimization, including learning rate scheduling, regularization strategies, and architectural choices with proper data mixtures.

Our findings support the growing body of evidence that parameter-efficient methods represent a viable and often superior alternative to full fine-tuning for large language model adaptation. The combination of superior downstream performance, reduced computational requirements, and faster training times makes optimized LoRA an attractive choice for practical applications.

The successful demonstration of continuation training strategies also opens new avenues for extending model capabilities beyond initial convergence points, providing a pathway for incremental model improvement with careful hyperparameter management.

These results have significant implications for the democratization of large language model fine-tuning, making high-performance model adaptation accessible to researchers and practitioners with limited computational resources.

Acknowledgments

We thank the research community for ongoing contributions to parameter-efficient fine-tuning methods and the open-source ecosystem that enables reproducible research in this domain.

References

  1. Thinking Machines Lab. "LoRA Without Regret." Thinking Machines Lab: Connectionism, September 2025. https://thinkingmachines.ai/blog/lora/
  2. Hu, E. J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685, 2021.
  3. Biderman, S., et al. "LoRA Learns Less and Forgets Less." arXiv preprint, 2024.
