MIXTRAL 8x7B - Mixture of Experts¶
This will not run on the free T4 GPU from Google Colab. You will need an A100 to run it.
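A quick way to confirm which GPU you have been allocated (a minimal check, assuming CUDA is available):
import torch
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB of GPU memory")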
Install Required Packages¶
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets scipy
!pip install -q trl
!pip install flash-attn --no-build-isolation
Loading the Base Model¶
Load the model in 4-bit, with double quantization and bfloat16 as the compute dtype.
In this case we are starting from the base Mixtral checkpoint (see model_id below). You could also start from the instruct-tuned variant; keep in mind that fine-tuning a base model will generally need a lot more data!
Load the Dataset for Fine-tuning¶
For this tutorial, we will fine-tune Mixtral 8x7B for code generation.
We will be using this dataset, which is curated by TokenBender (e/xperiments) and is an excellent data source for fine-tuning models for code generation. It follows the Alpaca style of instructions, which is an excellent starting point for this task. The dataset structure should resemble the following:
{
"instruction": "Create a function to calculate the sum of a sequence of integers.",
"input": "[1, 2, 3, 4, 5]",
"output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}
model_id = "mistralai/Mixtral-8x7B-v0.1"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map='auto',
quantization_config=nf4_config,
use_cache=False,
attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Let's examine how well the model does at this task currently:
def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')
    generated_ids = model.generate(**model_inputs,
                                   max_new_tokens=512,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)
    decoded_output = tokenizer.batch_decode(generated_ids)
    return decoded_output[0].replace(prompt, "")
prompt="""[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM. \nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[\INST]"""
generate_response(prompt, model)
print(model)
from datasets import load_dataset
dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split="train")
dataset
df = dataset.to_pandas()
df.head(10)
Instruction Fine-tuning - prepare the dataset in a "prompt" format so the model can better understand the task:
- the generate_prompt function takes the instruction and output and generates a full prompt
- shuffle the dataset
- tokenize the dataset
Formatting the Dataset¶
Now, let's format the dataset in the instruction format that Mixtral expects (the same [INST] ... [/INST] template used by Mistral-7B-Instruct-v0.1).
Many tutorials and blogs skip over this part, but I feel this is a really important step.
We'll put each instruction and input pair between [INST] and [/INST], with the output after that, like this:
<s>[INST] What is your favorite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!</s>
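As an optional sanity check, you can compare this hand-rolled template against the tokenizer's built-in chat template, when one is available. This is a minimal sketch; the base Mixtral tokenizer may not ship a chat template (the instruct-tuned checkpoints do):
# Render the same exchange with the tokenizer's chat template, if it has one
messages = [
    {"role": "user", "content": "What is your favorite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice."},
]
if tokenizer.chat_template is not None:
    print(tokenizer.apply_chat_template(messages, tokenize=False))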
You can use the following code to process your dataset and build the "prompt" column in the correct format:
def generate_prompt(data_point):
    """Generate input text based on a prompt, task instruction, (context info.), and answer
    :param data_point: dict: Data point
    :return: str: formatted prompt
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
                  'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<s>[INST]{prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} [/INST]{data_point["output"]}</s>"""
    # Samples without additional context
    else:
        text = f"""<s>[INST]{prefix_text} {data_point["instruction"]} [/INST]{data_point["output"]} </s>"""
    return text
# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
dataset = dataset.shuffle(seed=1234) # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]
train_data
train_data["input_ids"][:10]
After formatting, we should get something like this¶
{
  "text": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST] # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>",
  "instruction": "Create a function to calculate the sum of a sequence of integers",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
  "prompt": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST] # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>"
}
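You can print one formatted example from your own split to confirm it matches this layout:
# Quick check: inspect a single formatted training prompt
print(train_data["prompt"][0])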
While using SFT (the Supervised Fine-tuning Trainer), we will only pass in the "prompt" column of the dataset for fine-tuning.
print(test_data)
Setting up the Training¶
We will be using the Hugging Face and peft libraries!
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=64,
bias="none",
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
"lm_head",
],
task_type="CAUSAL_LM"
)
We need to prepare the model to be trained in 4-bit, so we will use the prepare_model_for_kbit_training function from peft.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
print_trainable_parameters(model)
Model after Adding LoRA Config¶
print(model)
Hyper-parameters for Training¶
These parameters will depend on how long you want to run training for. The most important to consider:
- num_train_epochs/max_steps: how many passes over the data you want to do. Be careful, don't try too many or you will over-fit!
- learning_rate: controls the speed of convergence
if torch.cuda.device_count() > 1: # If more than 1 GPU
    print(torch.cuda.device_count())
    model.is_parallelizable = True
    model.model_parallel = True
from transformers import TrainingArguments
args = TrainingArguments(
output_dir = "Mixtral_Alpace_v3",
#num_train_epochs=5,
max_steps = 100, # comment out this line if you want to train in epochs
per_device_train_batch_size = 32,
warmup_ratio = 0.03, # fraction of steps used for learning-rate warmup
logging_steps=10,
save_strategy="epoch",
#evaluation_strategy="epoch",
evaluation_strategy="steps",
eval_steps=10, # comment out this line if you want to evaluate at the end of each epoch
learning_rate=2.5e-5,
bf16=True,
# lr_scheduler_type='constant',
)
Setting up the trainer:
- max_seq_length: context window size
from trl import SFTTrainer
max_seq_length = 1024
trainer = SFTTrainer(
model=model,
peft_config=peft_config,
max_seq_length=max_seq_length,
tokenizer=tokenizer,
packing=True,
args=args,
dataset_text_field="prompt",
train_dataset=train_data,
eval_dataset=test_data,
)
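As a rough sanity check on training volume (a back-of-the-envelope estimate, assuming packing fills every sequence to max_seq_length):
# Approximate tokens seen per optimizer step, per GPU, with packing enabled
tokens_per_step = args.per_device_train_batch_size * max_seq_length
print(tokens_per_step)  # 32 * 1024 = 32768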
trainer.train()
trainer.save_model("Mixtral_Alpace_v2")
Save Model and Push to Hub¶
# !pip install huggingface-hub -qU
# from huggingface_hub import notebook_login
# notebook_login()
# trainer.push_to_hub("Promptengineering/mistral-instruct-generation")
merged_model = model.merge_and_unload()
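If you want to keep the merged weights around for later use, you can save them together with the tokenizer. The directory name below is just an example:
# Save the merged model and tokenizer (directory name is arbitrary)
merged_model.save_pretrained("Mixtral_Alpace_v3_merged")
tokenizer.save_pretrained("Mixtral_Alpace_v3_merged")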
def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')
    generated_ids = model.generate(**model_inputs,
                                   max_new_tokens=150,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)
    decoded_output = tokenizer.batch_decode(generated_ids)
    return decoded_output[0]
prompt = "[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM.\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]"
generate_response(prompt, merged_model)
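Alternatively, instead of merging, you can reload the saved LoRA adapter on top of a freshly loaded base model later on (for example, in a new session). A minimal sketch, assuming trainer.save_model wrote the adapter to the "Mixtral_Alpace_v2" directory used above:
from peft import PeftModel

# Reload the quantized base model, then attach the fine-tuned adapter
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map="auto",
)
finetuned_model = PeftModel.from_pretrained(base_model, "Mixtral_Alpace_v2")
generate_response(prompt, finetuned_model)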