Hands on with SFT

Comprehensive Hands On with Supervised Fine-Tuning for LLMs

This section provides a detailed exploration of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), covering its definition, process, applications, challenges, and best practices. It aims to offer a thorough understanding for researchers, developers, and practitioners, building on the key points and expanding with technical details and examples.

Definition and Context

Large Language Models (LLMs) are neural networks trained on vast text corpora using self-supervision, such as predicting the next word in a sequence. Examples include models like GPT-3, BERT, and RoBERTa, which are initially trained without explicit labels. Supervised Fine-Tuning, however, involves taking these pre-trained models and further training them on a labeled dataset for a specific task or domain. This process, often referred to as SFT, uses labeled data—pairs of inputs and their corresponding outputs—to adapt the model’s weights, enabling it to learn task-specific patterns and nuances.

This differs from unsupervised fine-tuning, which uses unlabeled data (e.g., masked language modeling), and reinforcement learning-based fine-tuning, such as Reinforcement Learning from Human Feedback (RLHF), which optimizes based on a reward signal. SFT is particularly effective when labeled data is available, making it a straightforward and powerful method for task-specific adaptation.

Importance and Motivation

Pre-trained LLMs are generalists, capable of handling a wide range of language tasks but often underperforming on specific applications without further tuning. SFT addresses this by tailoring the model to excel in areas like text classification, named entity recognition, question-answering, summarization, translation, and chatbot development. For instance, a model fine-tuned for medical terminology can interpret and generate domain-specific jargon better than a generic model, enhancing its utility in healthcare applications.

The importance lies in its efficiency and adaptability. As noted in resources like SuperAnnotate: Fine-tuning large language models (LLMs) in 2024, SFT can significantly improve performance with relatively small datasets, on the order of 50,000–100,000 examples for multi-task learning or just a few hundred to a few thousand for task-specific fine-tuning. This efficiency is crucial for businesses with limited data and computational resources, making LLMs accessible for specialized applications.

Process and Techniques

The process of SFT can be broken down into several stages, each critical for success:

  1. Data Collection and Preparation:
    • Gather a labeled dataset relevant to the task, such as prompt-response pairs for instruction fine-tuning or input-output pairs for classification. Open-source datasets like the GPT4All dataset, AlpacaDataCleaned, and databricks-dolly-15k are commonly used.
    • Preprocess the data by cleaning, tokenizing, and formatting it for the chosen model. This step is vital, as data quality directly impacts performance; inconsistencies, bias, and missing values all need attention (as highlighted in Kili Technology: What is LLM Fine-Tuning?).
  2. Model Selection:
    • Choose a pre-trained LLM based on task requirements, model size, and computational resources. Larger models like GPT-3 offer higher accuracy but require more resources, while smaller models may suffice for narrower tasks. Data type, desired outcomes, and budget also matter, as discussed in Sama: Supervised Fine-Tuning: How to choose the right LLM.
  3. Fine-Tuning:
    • Train the model on the task-specific dataset using supervised learning. The model’s weights are adjusted using gradients of a task-specific loss function that measures the difference between predictions and ground-truth labels, typically token-level cross-entropy minimized with gradient descent over multiple epochs (see the sketch after this list).
    • Several techniques enhance this process:
      • Instruction Fine-Tuning: Trains the model on examples that include explicit instructions, such as “summarize this text,” to improve instruction-following behavior (SuperAnnotate: Fine-tuning large language models (LLMs) in 2024).
      • Parameter-Efficient Fine-Tuning (PEFT): Methods like LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) reduce the number of trainable parameters, making fine-tuning more efficient. LoRA injects small low-rank adapter matrices while freezing the original weights; QLoRA additionally quantizes the base model to 4-bit precision, reducing memory usage (as detailed in Medium: Supervised Fine-tuning: customizing LLMs).
      • Batch Packing: Packs multiple short examples into a single sequence so each batch uses the available compute more fully (mentioned in the same Medium article).
      • Half Fine-Tuning (HFT): Freezes half the parameters per round while updating the other half, balancing knowledge retention and new skill acquisition, as noted in arXiv: The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs.
  4. Evaluation:
    • Test the fine-tuned model on a validation set, using metrics such as accuracy, F1 score, or BLEU for translation tasks. This step ensures the model generalizes to unseen data.
  5. Deployment:
    • Once validated, deploy the model for real-world use, integrating it into applications like chatbots, content generators, or customer support systems.
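
To make the loss in the fine-tuning step concrete, the following is a minimal sketch (not taken from the cited sources) of the token-level cross-entropy objective used in SFT: prompt tokens are masked with -100 so that only completion tokens contribute to the loss. The prompt and completion strings are illustrative.

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Instruction: What is the capital of France?\nAnswer: "
completion = "Paris"

# Tokenize prompt and completion separately so we know where the completion starts
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
completion_ids = tokenizer(completion + tokenizer.eos_token, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, completion_ids], dim=1)

# Labels: -100 marks positions ignored by the loss, i.e. the prompt tokens
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

# The model shifts the labels internally and returns the mean cross-entropy over unmasked tokens
outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss)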

Applications and Examples

SFT is versatile, applicable to a wide range of tasks:

  • Text Classification: Classifying documents into categories, such as sentiment analysis on movie reviews.
  • Named Entity Recognition: Identifying entities like names, dates, and locations in text.
  • Question-Answering: Providing accurate answers to specific queries, enhancing virtual assistants.
  • Summarization: Generating concise summaries of longer texts, useful for news aggregation.
  • Translation: Translating text between languages, improving multilingual communication.
  • Chatbots: Creating conversational agents for specific domains, like customer support or healthcare.

A practical example is fine-tuning a pre-trained model for a science educational platform. Initially, it might answer “Why is the sky blue?” with a simple “Because of the way the atmosphere scatters sunlight.” After SFT with labeled data, it provides a detailed response: “The sky appears blue because of Rayleigh scattering... blue light has a shorter wavelength and is scattered... causing the sky to take on a blue hue” (SuperAnnotate: Fine-tuning large language models (LLMs) in 2024).

Another example is fine-tuning RoBERTa for sentiment analysis, where the model learns from labeled movie reviews to classify sentiments as positive, negative, or neutral, significantly improving accuracy compared to the base model.
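
To make the sentiment-analysis case concrete, here is a minimal sketch (an assumption, not taken from the cited sources) of fine-tuning roberta-base as a three-class classifier with the standard Trainer; the tiny in-line dataset is purely illustrative, and a real run would use a labeled corpus of reviews.

python

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy labeled data: 0 = negative, 1 = neutral, 2 = positive
data = [
    {"text": "A dull, lifeless film.", "label": 0},
    {"text": "It was fine, nothing special.", "label": 1},
    {"text": "An absolute delight from start to finish.", "label": 2},
]
dataset = Dataset.from_list(data)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

# Tokenize the text column; the label column is picked up automatically by the Trainer
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./sentiment_model", num_train_epochs=3, per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()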

Challenges and Best Practices

Despite its benefits, SFT faces several challenges:

  • Catastrophic Forgetting: The model may forget general knowledge while focusing on task-specific learning, a concern addressed by methods like Half Fine-Tuning (HFT) (arXiv: The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs).
  • Data Quality: Ensuring high-quality labeled data is crucial, with challenges like inconsistencies, bias, and data scarcity in specific domains. Automated tools like the Kili app can help streamline data curation (Kili Technology: What is LLM Fine-Tuning?).
  • Computational Resources: Fine-tuning large models is resource-intensive, necessitating efficient methods like PEFT to reduce costs.

Best practices include starting from a pre-trained model matched to the task and budget, curating high-quality labeled data before any training run, using parameter-efficient methods such as LoRA or QLoRA when compute is limited, and evaluating on a held-out validation set to catch overfitting and catastrophic forgetting early.

Comparative Analysis and Recent Advancements

SFT contrasts with other methods like unsupervised fine-tuning (using unlabeled data) and RLHF (optimizing based on rewards). Its advantage lies in its direct use of labeled data, making it suitable for tasks with clear input-output mappings. Recent advancements, such as the TRL library and AutoTrain Advanced tool by Hugging Face, facilitate the process, offering resources for developers (Hugging Face: Fine-Tuning LLMs: Supervised Fine-Tuning and Reward Modelling).

Techniques like instruction fine-tuning and PEFT are gaining traction, with research showing significant performance improvements. For example, QLoRA can fine-tune a high-quality chatbot in 24 hours on a single GPU, demonstrating efficiency (arXiv: The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs).

Hands-On Code for SFT Using Hugging Face Libraries

Given the focus on hands-on code using Hugging Face’s transformers, PEFT, bitsandbytes, and TRL libraries, here’s a detailed example for fine-tuning an LLM, specifically using gpt2 as the base model, with comments for alternative models. This example demonstrates how to format data in prompt format and perform SFT.

Environment Setup

First, install the necessary libraries:

bash

pip install transformers peft bitsandbytes trl

Data Preparation

The data for SFT should be in a format where each sample consists of an input (prompt) and the corresponding output (completion). For instruction-following tasks, the prompt might be “Instruction: Summarize this text. Text: [some text]” and the output is the summary. Popular datasets include the GPT4All dataset, AlpacaDataCleaned, and databricks-dolly-15k.

For this example, let’s create a small dataset:

python

from datasets import Dataset

data = [
    {"instruction": "What is the capital of France?", "output": "Paris"},
    {"instruction": "What is the largest planet in our solar system?", "output": "Jupiter"},
]
dataset = Dataset.from_list(data)

The dataset has a column for the prompt (e.g., “instruction”) and one for the completion (e.g., “output”); before training, these are combined into a single text field that the trainer is pointed at, as shown below.

Model Selection and Fine-Tuning

Choose a pre-trained model from Hugging Face’s model hub. For this example, we’ll use gpt2, but other models like gpt2-large, bloom-560m, or distilbert-base-uncased can be used (commented out for reference):

python

from transformers import AutoModelForCausalLM, AutoTokenizer

# Choose the model (uncomment other options as needed)
model_name = "gpt2"
# model_name = "gpt2-large"  # Larger model for better performance
# model_name = "bloom-560m"  # Another option for decoder-only models
# model_name = "distilbert-base-uncased"  # For encoder-based tasks, though less common for SFT

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# For larger models, use bitsandbytes for quantization to save memory
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
# model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)  # Uncomment for 8-bit loading
# When training a quantized model, also call peft's prepare_model_for_kbit_training(model) before adding adapters

# Set up PEFT for parameter-efficient fine-tuning using LoRA
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2 uses a fused attention projection; for LLaMA-style models use ["q_proj", "v_proj"]
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)

Training with TRL’s SFTTrainer

Use TRL’s SFTTrainer for supervised fine-tuning. SFTTrainer expects a single text field, so we first merge each instruction and output into one string and then tell the trainer which field to use:

python

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="path/to/output/dir",
    overwrite_output_dir=True,
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    gradient_accumulation_steps=4,
    logging_steps=100,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
)

# Combine instruction and output into a single text field for the trainer
def to_text(example):
    return {"text": f"Instruction: {example['instruction']}\nAnswer: {example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.model.save_pretrained("path/to/save/model")

This code demonstrates how to use transformers, PEFT, and TRL for SFT, with bitsandbytes available for optional quantization. The SFTTrainer tokenizes and batches the combined prompt–completion text, which keeps the training loop itself minimal.

Evaluation and Deployment

After training, evaluate the model on a validation set using metrics like accuracy, F1 score, or BLEU, depending on the task. For deployment, integrate the model into applications like chatbots or content generators, ensuring it performs well in real-world scenarios.
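
To make the evaluation step concrete, here is a minimal sketch (an assumption, using the separately installed evaluate library) that generates completions for held-out prompts and scores them against reference outputs with BLEU. The prompt, reference, and adapter path are placeholders.

python

import evaluate  # pip install evaluate
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and attach the LoRA adapters saved after training
tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base_model, "path/to/save/model")

prompts = ["Instruction: What is the capital of France?\nAnswer: "]
references = [["Paris"]]  # one list of reference texts per prompt

predictions = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
    # Keep only the newly generated tokens after the prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    predictions.append(tokenizer.decode(new_tokens, skip_special_tokens=True))

bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=references))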

Summary Table of Key Techniques

| Technique | Description | Use Case |
| --- | --- | --- |
| LoRA (Low-Rank Adaptation) | Updates fewer parameters, making fine-tuning efficient | Resource-constrained environments |
| QLoRA (Quantized LoRA) | Combines LoRA with 4-bit quantization for memory efficiency | Large model fine-tuning |
| Instruction Fine-Tuning | Trains with examples including instructions for task-specific responses | Chatbots, question-answering |
| Batch Packing | Combines inputs to optimize computational resources | High-throughput training |
| Half Fine-Tuning (HFT) | Freezes half the parameters per round to balance knowledge retention and learning | Preventing catastrophic forgetting |

This table summarizes the key techniques discussed, aiding in understanding their applications.
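
Since the table mentions QLoRA but the code above only demonstrates 8-bit loading, here is a minimal sketch (assuming a CUDA GPU with the bitsandbytes backend available) of a QLoRA-style setup: 4-bit quantization of the base weights combined with LoRA adapters.

python

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization used by QLoRA
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants as well
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for adapter training

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["c_attn"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable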

Formatting the Dataset for Instruction Fine-Tuning

Proper dataset formatting is crucial for SFT. Using the TokenBender/code_instructions_122k_alpaca_style dataset, we’ll convert instruction-output pairs into a "prompt" format tailored for instruction fine-tuning. Here’s how:

Step 1: Load the Dataset

python

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split="train")

This dataset contains 122,000 entries with columns: instruction, input, output, and text. For example:

  • Instruction: "Create a function to calculate the sum of a sequence of integers"
  • Input: "[1, 2, 3, 4, 5]"
  • Output: "# Python code\ndef sum_sequence(sequence):\n sum = 0\n for num in sequence:\n sum += num\n return sum"

Step 2: Define the Prompt Generation Function

We’ll create a generate_prompt function to format each entry into a structured prompt, distinguishing user instructions from model responses:

python

def generate_prompt(data_point):
    """Generate a prompt from instruction, input, and output."""
    prefix_text = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    if data_point['input'] and data_point['input'] != "Not applicable":
        text = f"<start_of_turn>user {prefix_text}{data_point['instruction']} here are the inputs {data_point['input']} <end_of_turn>\n<start_of_turn>model{data_point['output']} <end_of_turn>"
    else:
        text = f"<start_of_turn>user {prefix_text}{data_point['instruction']} <end_of_turn>\n<start_of_turn>model{data_point['output']} <end_of_turn>"
    return text

  • With Input: Includes the input in the prompt (e.g., "[1, 2, 3, 4, 5]").
  • Without Input: Omits the input section if it’s empty or "Not applicable".

Step 3: Add Prompt Column, Shuffle, Tokenize, and Split

python

# Add "prompt" column
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)

# Load tokenizer (example with GPT-2)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set padding token

# Shuffle and tokenize
dataset = dataset.shuffle(seed=1234)
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"], truncation=True, padding="max_length", max_length=512), batched=True)

# Split into train (80%) and test (20%)
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]

Resulting Prompt Example

For the first entry:

plaintext

<start_of_turn>user Below is an instruction that describes a task. Write a response that appropriately completes the request.

Create a function to calculate the sum of a sequence of integers here are the inputs [1, 2, 3, 4, 5] <end_of_turn>
<start_of_turn>model # Python code
def sum_sequence(sequence):
 sum = 0
 for num in sequence:
 sum += num
 return sum <end_of_turn>

This format helps the model learn to associate instructions (and inputs) with correct outputs.


Example 1: Fine-Tuning for Code Generation

Objective

Fine-tune a model to generate Python code from natural language instructions using the dataset above.

Environment Setup

Install required libraries:

bash

pip install transformers peft bitsandbytes trl datasets

Code Implementation

python

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model

# Load model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Apply LoRA for efficiency
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2 specific
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)

# Training arguments
training_args = TrainingArguments(
    output_dir="./code_gen_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=100,
    save_steps=500,
)

# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="prompt",
    max_seq_length=512
)

# Train the model
trainer.train()

# Save the model
trainer.model.save_pretrained("./code_gen_model_final")

Application Deployment

Usage: Send a POST request with {"instruction": "Create a function to reverse a list", "input": "[1, 2, 3]"} to get Python code as output.
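
The POST-request usage above assumes the fine-tuned model sits behind an HTTP endpoint; one way to do that is a small web service. Below is a minimal sketch (assuming FastAPI and uvicorn are installed; the adapter path and prompt template mirror the training setup above).

python

from fastapi import FastAPI
from pydantic import BaseModel
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base_model, "./code_gen_model_final")  # LoRA adapters saved above

class GenerationRequest(BaseModel):
    instruction: str
    input: str = ""

@app.post("/generate")
def generate(req: GenerationRequest):
    # Rebuild the same prompt format used during training, stopping at the model turn
    prefix = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
    if req.input:
        prompt = f"<start_of_turn>user {prefix}{req.instruction} here are the inputs {req.input} <end_of_turn>\n<start_of_turn>model"
    else:
        prompt = f"<start_of_turn>user {prefix}{req.instruction} <end_of_turn>\n<start_of_turn>model"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128, pad_token_id=tokenizer.eos_token_id)
    completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"output": completion}

# Run with: uvicorn serve:app --reload  (assuming this file is named serve.py)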


Example 2: Fine-Tuning for Text Summarization

Objective

Fine-tune a model to summarize text using a synthetic dataset (for illustration).

Dataset Preparation

Create a small dataset:

python

data = [
    {"instruction": "Summarize the following text", "input": "The quick brown fox jumps over the lazy dog repeatedly.", "output": "The fox repeatedly jumps over the dog."},
    {"instruction": "Summarize the following text", "input": "A long story about a hero saving the world.", "output": "A hero saves the world."}
]
from datasets import Dataset
dataset = Dataset.from_list(data)
text_column = [generate_prompt(dp) for dp in dataset]
dataset = dataset.add_column("prompt", text_column)
dataset = dataset.shuffle(seed=1234).map(lambda x: tokenizer(x["prompt"], truncation=True, padding="max_length", max_length=128), batched=True)
train_data = dataset  # Small dataset, no split

Fine-Tuning

python

model = AutoModelForCausalLM.from_pretrained("distilgpt2")  # Smaller model
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

model = get_peft_model(model, lora_config)  # Reuse LoRA config

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    args=TrainingArguments(
        output_dir="./summary_model",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        fp16=True
    ),
    tokenizer=tokenizer,
    dataset_text_field="prompt",
    max_seq_length=128
)

trainer.train()
trainer.model.save_pretrained("./summary_model_final")

Conclusion

Supervised Fine-Tuning is a pivotal technique for adapting pre-trained LLMs to specific tasks, enhancing their performance and utility across various domains. By understanding its process, leveraging efficient techniques like PEFT and TRL, and addressing challenges, developers can unlock the full potential of LLMs, making them indispensable tools for specialized applications. This comprehensive approach ensures both theoretical insight and practical applicability, supported by a wealth of resources and examples.

Key Citations

  • SuperAnnotate: Fine-tuning large language models (LLMs) in 2024
  • Kili Technology: What is LLM Fine-Tuning?
  • Sama: Supervised Fine-Tuning: How to choose the right LLM
  • Medium: Supervised Fine-tuning: customizing LLMs
  • arXiv: The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs
  • Hugging Face: Fine-Tuning LLMs: Supervised Fine-Tuning and Reward Modelling