【Detailed explanation】How to set parameters for LLM Fine Tuning

 

【Tutorial video】LLM Fine Tuning Tutorial

 

 

【Generative AI Tutorial】
Fine Tuning LLM on Custom Dataset & Hugging Face Dataset


Video viewing time: 34 min. 56 sec.

I hope the explanatory video on fine tuning of Large Language Models (LLMs) will be useful when textual information alone is difficult to understand.

 




 

【Contents of the tutorial video: Fine Tuning LLM for Beginners】

0:00 Introduction
0:39 How to use Google Colaboratory
1:33 【Step 1】Inference by LLM before Fine Tuning
5:23 【Step 2】Fine Tuning – Preparation
7:58 【Preparation】Custom Dataset Edition
14:51 【Preparation】Hugging Face Dataset Edition
18:39 【Preparation】Mount Google Drive
19:49 【Preparation】Specify & reflect parameters for Fine Tuning
24:12 Run Fine Tuning
30:32 Inference with Fine Tuned LLM
33:43 Conclusion

 

 

【Fine Tuning of LLM】Explanation of parameters, etc.

 

 

It may be difficult to check the details of the parameters only within the following tutorial article on Fine Tuning of Large Language Models and its tutorial code:

【LLM for Beginners】How to Fine Tuning LLM on Custom Dataset & Hugging Face Public Dataset

Google Colaboratory tutorial code for the custom dataset:
My-Dataset-LLM-FineTuning-for-Generative-AI-Beginners.ipynb (Apache-2.0 license) | Google Colaboratory

Google Colaboratory tutorial code for the Hugging Face public dataset:
HF-Dataset-LLM-FineTuning-for-Generative-AI-Beginners.ipynb (Apache-2.0 license) | Google Colaboratory

For this reason, and for convenience of information gathering, I summarize the Fine Tuning parameters below.

If you want to learn more about fine tuning, please refer to this page while opening the Google Colaboratory tutorial code in a new window.

 

 

【Setting the path of the dataset】

 

 

ROOT_DIR = "./dataset/"
Code Description:
Specify the path to the folder containing the dataset to be used for Fine Tuning.
In this case, a "dataset" folder is placed directly under the "/content" folder (the file area) on Google Colaboratory, and the dataset files are placed there.
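
As a minimal sketch (assuming the dataset file is named converted_dataset.csv, as in the tutorial), you can confirm on Google Colaboratory that the file is in the expected place before starting Fine Tuning:

import os

ROOT_DIR = "./dataset/"  # resolves to /content/dataset/ on Google Colaboratory

# Expected layout (assumption based on this tutorial):
#   /content/dataset/converted_dataset.csv
print(os.path.exists(os.path.join(ROOT_DIR, "converted_dataset.csv")))  # True if the file is found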

 

 

【Data class setting】

 

 

@dataclass
class Config(ConfigProblemBase):

Code Description:
Here, the settings used for Fine Tuning are defined as a data class (a Config class that inherits from ConfigProblemBase).

 

output_directory: str = "/content/drive/MyDrive/LLM/output/MyFineTuningLLMdemo/"
Code Description:
Specify the output path for the Fine Tuned (trained) model.

 

experiment_name: str = "MyFineTuningExperimentDemo"
Code Description:
Specify a name for this Fine Tuning experiment.

 

llm_backbone: str = "line-corporation/japanese-large-lm-1.7b-instruction-sft"
Code Description:
Specify the LLM model (LLM name published on Hugging Face) for this Fine Tuning.
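
Putting the fields so far together, the top of the Config data class looks roughly like this (a sketch using the values from this tutorial; the nested ①–⑦ settings described below are omitted here, and ConfigProblemBase comes from the tutorial code's imports):

from dataclasses import dataclass

@dataclass
class Config(ConfigProblemBase):
    output_directory: str = "/content/drive/MyDrive/LLM/output/MyFineTuningLLMdemo/"
    experiment_name: str = "MyFineTuningExperimentDemo"
    llm_backbone: str = "line-corporation/japanese-large-lm-1.7b-instruction-sft"
    # ...the dataset / tokenizer / training / prediction settings explained below follow here.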

 

 

【①Dataset Setting】

 

 

dataset: ConfigNLPCausalLMDataset = ConfigNLPCausalLMDataset
Code Description:
This is where you set up the processing method and other settings for the dataset.

 

train_dataframe=os.path.join(ROOT_DIR, "converted_dataset.csv")
Code Description:
Specify the file name of the dataset to be used for Fine Tuning.
Since the folder is already specified in 【Setting the path of the dataset】 above, here you specify the name of the dataset file that is output after data preprocessing.

 

validation_strategy = "automatic"
Code Description:
Validation strategy.
If you want the dataset to be split automatically into training and validation data, set this to "automatic".
There is also a "custom" setting, but I do not know the details.

 

validation_dataframe = ""
Code Description:
Data frame for validation.
Perhaps if you set validation_strategy = "custom", you would specify the path to the validation dataset here.
In this case, I set validation_strategy = "automatic", so it seemed fine to leave this blank.

 

validation_size = 0.01
Code Description:
Specify the size of the validation data (greater than 0 and less than 1.0).
With validation_strategy = "automatic", the validation data is split off from the training data at the specified ratio.
For example, with a 16-row dataset, multiplying by 0.01 gives less than 1, in which case just 1 row is used for validation.

 

prompt_column = ("instruction",)
Code Description:
Specify the column of the dataset that contains the prompt (input content); here it is the "instruction" column.

 

answer_column = "output"
Code Description:
Specify the column of the dataset that contains the response; here it is the "output" column.
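
As an illustration of the expected dataset format (a hypothetical two-row example, assuming the column names above), the CSV could be created like this:

import pandas as pd

# Hypothetical rows; the real dataset comes from your own data preprocessing.
df = pd.DataFrame({
    "instruction": ["What is Fine Tuning?", "Say hello in Japanese."],
    "output": ["Fine Tuning adapts a pretrained LLM to your own data.", "こんにちは"],
})
df.to_csv("./dataset/converted_dataset.csv", index=False)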

 

text_prompt_start = ""
Code Description:
You can specify the text to be added to the beginning of each prompt according to your preference.

 

text_answer_separator = ""
Code Description:
You can specify text to be inserted before each response (a separator between the prompt and the response) according to your preference.

 

add_eos_token_to_prompt = True
Code Description:
Set to “True” for conversational (chatbot) format.

 

add_eos_token_to_answer = True
Code Description:
Set to “True” for conversational (chatbot) format.

 

mask_prompt_labels = False
Code Description:
Specifies whether the prompt labels are masked during training so that the loss is computed only on the answer part.
If set to "False", this feature is disabled.

 

 

【②Tokenizer Setting】

 

 

tokenizer: ConfigNLPCausalLMTokenizer = ConfigNLPCausalLMTokenizer
Code Description:
This is where we will set up the tokenizer.
A tokenizer refers to an implementation that divides text sentences into token units.

 

max_length_prompt = 128
Code Description:
Specifies the maximum token sequence length (sequence length) of the prompt (instruction).
This setting is used to align token sequence lengths.

 

max_length_answer = 128
Code Description:
Specifies the maximum token sequence length (sequence length) of the response (output).
This setting is used to align token sequence lengths.

 

max_length = 256
Code Description:
Specifies the maximum overall token sequence length.
If a sequence is shorter than the specified length, it is padded (the padding behavior appears to be controlled by "padding_quantile" below).

 

padding_quantile = 1.0
Code Description:
Padding quantile.
If the prompt or answer token sequence is shorter than "max_length", it is padded according to this setting.
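
The behavior these settings describe (cutting off and padding token sequences to a fixed length) can be illustrated with a Hugging Face tokenizer. This is only an illustration of the idea, not the tutorial code itself:

from transformers import AutoTokenizer

# use_fast=False: this model's tokenizer is SentencePiece-based (a "slow" tokenizer).
tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/japanese-large-lm-1.7b-instruction-sft",
    use_fast=False,
)

encoded = tokenizer(
    "What is Fine Tuning?",
    max_length=128,        # corresponds to max_length_prompt above
    truncation=True,       # sequences longer than max_length are cut off
    padding="max_length",  # sequences shorter than max_length are padded
)
print(len(encoded["input_ids"]))  # 128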

 

 

【③Augmentation Setting】

 

 

augmentation: ConfigNLPAugmentation = ConfigNLPAugmentation
Code Description:
This is where you configure the augmentation setting.

 

token_mask_probability = 0.0
Code Description:
Sets the probability of randomly masking input tokens as a form of data augmentation.
With "token_mask_probability = 0.0", no tokens are masked.

 

 

【④Architecture Setting】

 

 

architecture: ConfigNLPCausalLMArchitecture = ConfigNLPCausalLMArchitecture
Code Description:
Here, the architecture (structure) of the LLM is configured.

 

backbone_dtype = "float16"
Code Description:
Memory optimization – Quantization.
Specifies the data type (number of bits) used for the weights of the LLM backbone.

If you want to use "float16" (16-bit half-precision floating point), set backbone_dtype = "float16".
The LLM weights (parameters) are then loaded with 16-bit precision, which is what the tutorial refers to as "float16 quantization".

The weights of the original training model are expressed in "float32" (32-bit single-precision floating point).
"Float16 quantization" therefore roughly halves the memory needed for the weights.

In this notebook, "float32", "float16", and "bfloat16" (brain float 16) are supported.
Smaller data types are lighter, while larger ones are more precise but heavier to process on a GPU.
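
As a point of reference, loading weights in half precision with the Hugging Face transformers library looks like the following (an illustration of the idea behind backbone_dtype, not the tutorial code itself):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "line-corporation/japanese-large-lm-1.7b-instruction-sft",
    torch_dtype=torch.float16,  # load the backbone weights as float16 (half precision)
)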

 

gradient_checkpointing = False
Code Description:
Memory optimization – Gradient Checkpointing.
This is a method to optimize memory when GPU memory is overloaded even with “batch_size=1”.
However, the processing speed will be reduced.

If used, set “gradient_checkpointing = True”.
However, this is an experimental specification and may not be compatible with other settings (details unknown).
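
With a Hugging Face model object such as the one loaded in the earlier illustration, the equivalent switch is a single call (again, only an illustration; the tutorial controls this through the config flag):

model.gradient_checkpointing_enable()  # recompute activations during backprop to save GPU memory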

 

force_embedding_gradients = False
Code Description:
Specifies whether to force the computation of the gradient for the Embedding (embedding/vector representation) of the input during training, which may be useful for LoRA.
If used, set “force_embedding_gradients = True”.

 

intermediate_dropout = 0
Code Description:
Specifies the dropout rate.
Sets the dropout rate for the middle layer of the training model.

 

 

【⑤Train Setting】

 

 

training: ConfigNLPCausalLMTraining = ConfigNLPCausalLMTraining
Code Description:
This is where you configure the training settings.

loss_function = "CrossEntropy"
Code Description:
In this case, cross-entropy loss (a value indicating how far the "predicted" values are from the "correct" values) is used as the loss function.

optimizer = "AdamW"
Code Description:
In this case, "AdamW" is used as the optimization algorithm.
AdamW refers to a variant that separates the "gradient scaling" step from the "weight decay" step.
A minimal sketch of what this corresponds to in PyTorch is shown after the list below.

Others can also be specified, such as:

・「Adadelta」
*Algorithm proposed in 2012 that improves on AdaGrad and RMSProp.

・「Adam」(Adaptive moment estimation)
*Method proposed in 2015; an improved algorithm over AdaGrad, RMSProp, and AdaDelta.

・「AdamW8bit」
*Method proposed in 2021; a lightweight algorithm that applies 8-bit quantization to Adam.

・「RMSprop」
*Method proposed in 2012; an algorithm that improves on AdaGrad.

・「SGD」(Stochastic Gradient Descent)
*A basic algorithm proposed in the early days.
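
A minimal PyTorch sketch of the AdamW setting above (using the learning_rate and weight_decay values that appear later in this config, and the model object from the earlier illustration):

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),  # the model loaded in the earlier illustration
    lr=0.00015,          # learning_rate
    weight_decay=0.0,    # weight_decay
)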

 

learning_rate = 0.00015
Code Description:
Specifies the learning rate (the rate used to update the weights during training).
This setting helps balance between overfitting (overlearning) and underfitting (insufficient learning).

 

batch_size = 4
Code Description:
Fine Tuning "batch size" (the number of samples processed at one time).

 

drop_last_batch = True
Code Description:
Sets whether to drop the last incomplete batch when the number of samples is not divisible by the batch size.
In this case, it is enabled with "True".

 

epochs = 50
Code Description:
Specifies the number of training epochs for Fine Tuning.
"epochs = 1" means one pass over the training data. In the tutorial for the custom dataset version, the default setting is "epochs = 50".

 

schedule = "Cosine"
Code Description:
Defines the schedule for the learning rate.
With schedule = "Cosine", the learning rate follows the value of a cosine function.
A minimal sketch of a cosine schedule is shown after the list below.

Others can also be specified, such as:

・「Constant」(apply a constant learning rate)
・「Linear」(apply a linearly changing learning rate)
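
A minimal sketch of a cosine learning-rate schedule with the Hugging Face transformers helper (num_training_steps is a hypothetical value; optimizer is the one from the earlier sketch):

from transformers import get_cosine_schedule_with_warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,       # corresponds to warmup_epochs = 0.0 below
    num_training_steps=1000,  # hypothetical total number of update steps
)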

 

warmup_epochs = 0.0
Code Description:
Specifies the number of epochs used to warm up the learning rate (gradually increasing it from 0 to the specified learning rate).
With "warmup_epochs = 0.0", no warmup is applied.

 

weight_decay = 0.0
Code Description:
Specifies the weight decay.
Setting a non-zero value (e.g., 0.0001) lets L2 regularization take effect and helps suppress overfitting.

 

gradient_clip = 0.0
Code Description:
Gradient Clipping – Specifies the maximum norm of the gradient.
When the gradient is above the set threshold, the value of the gradient is clipped to the threshold.

The default is “0”, which means no gradient clipping.
If “greater than 0” is specified, gradient clipping is applied during model training.

 

grad_accumulation = 1
Code Description:
Memory optimization – Gradient Accumulation.
Instead of computing the gradient for a large batch at once, gradients computed for smaller batches are accumulated, which effectively increases the batch size.
A minimal sketch of this pattern is shown below.
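
The pattern behind gradient accumulation looks roughly like this in plain PyTorch (a sketch; dataloader and the accumulation step count of 4 are hypothetical):

accumulation_steps = 4  # hypothetical: effective batch size = batch_size * 4

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps  # scale the loss per micro-batch
    loss.backward()                                  # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()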

 

lora = True
Code Description:
Enable or disable LoRA (Low-Rank Adaptation).
In this case, it is enabled with "lora = True".
This allows training with far fewer parameters by adjusting small low-rank update matrices instead of all of the weights.

 

lora_r = 4
Code Description:
LoRA R dimension. This parameter represents the rank of the update matrices.

In general, a smaller "r" takes less time and is less computationally intensive.
A larger "r" brings the update matrices closer to the original weights, but tends to be more computationally intensive.

Increasing this value increases the percentage of parameters trained during Fine Tuning.
Below are values of lora_r and the corresponding percentage of trainable parameters for "lambdalabs/pythia-1.4b-deduped-synthetic-instruct".

Example:
lora_r = 4 (0.05%)
lora_r = 8 (0.1%)
lora_r = 16 (0.2%)

 

lora_alpha = 16
Code Description:
Specifies the LoRA alpha.
This parameter is used for LoRA scaling.
It limits the size of the update matrix and suppresses LoRA overfitting, etc.

 

lora_dropout = 0.05
Code Description:
Specifies the dropout rate of LoRA.
Disables a portion of the update matrix to suppress LoRA overfitting, etc.

 

lora_target_modules = ""
Code Description:
Specifies the modules in the model to which LoRA's low-rank matrix approximation is applied.
The default seems to be the linear layers.
I do not know what else can be specified.
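
For reference, the same LoRA settings expressed with the PEFT library look like this (an illustration only; the tutorial code configures LoRA through the config fields above, not through PEFT directly):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=4,                # lora_r
    lora_alpha=16,      # lora_alpha
    lora_dropout=0.05,  # lora_dropout
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # prints the small share of trainable parameters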

 

save_best_checkpoint = False
Code Description:
Set "save_best_checkpoint = True" if you want to save the model that achieved the best result during Fine Tuning.
This feature is useful when you verify the training over and over again.
In the tutorial, this feature is disabled with "save_best_checkpoint = False".

 

evaluation_epochs = 1.0
Code Description:
Specifies the evaluation interval in epochs.
Applies when "save_best_checkpoint = False".
Determines how often (in epochs) the model is evaluated on the validation data.

Increasing this number reduces how often evaluation is performed, which speeds up training.

 

evaluate_before_training = False
Code Description:
This setting determines whether to evaluate before training.
In the tutorial, this feature is disabled by setting “evaluate_before_training = False”.

With “evaluate_before_training = True” the model can be evaluated before Fine Tuning, which helps determine the quality of the LLM backbone before Fine Tuning.

 

 

【⑥Prediction (Inference) Setting】

 

 

prediction: ConfigNLPCausalLMPrediction = ConfigNLPCausalLMPrediction
Code Description:
This is where you set up the prediction (inference).

 

metric = "BLEU"
Code Description:
Specify the metric used to evaluate model performance.

Others can also be specified, such as:

・「Perplexity」
・「OpenAI API」

 

min_length_inference = 2
Code Description:
Specifies the minimum length of the inference output, i.e., the minimum number of tokens to be generated.

 

max_length_inference = 256
Code Description:
Specifies the maximum length of the inference output, i.e., the maximum number of tokens to be generated.

 

batch_size_inference = 0
Code Description:
Specifies the batch size during inference.

 

do_sample = False
Code Description:
Sampling setting.
Sampling means picking the next token at random, weighted by each token's predicted probability.
It is enabled with "do_sample = True", in which case the next token of the predicted sequence is sampled according to its probability.
In the tutorial, I set "do_sample = False", so the token with the highest probability is always selected.

 

num_beams = 2
Code Description:
Specify the number of beams for beam search.
Beam search can be disabled with "num_beams = 1".

Beam search generates sentences by considering the joint probability of the words that follow.
By keeping only the most probable candidates at each step, the amount of computation is kept down.

Increasing the number of beams may improve accuracy, but may also increase inference time.

 

temperature = 0.3
Code Description:
Specify the temperature.
The temperature here seems to mean a parameter that controls the randomness and diversity of tokens generated.
Specify in the range of “0 to 1”.

If you want a “diversity” of answers
→ Increase the value.

If you want “consistent” answers
→ lower the value

 

repetition_penalty = 1.2
Code Description:
Specify the repetition penalty.
Setting "repetition_penalty = 1.0" means no penalty.
A larger repetition penalty helps avoid repeating the same words.
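
The inference settings above correspond closely to the arguments of the Hugging Face generate() method. A rough illustration (the prompt text is hypothetical, and model/tokenizer are the ones from the earlier sketches):

inputs = tokenizer("What is Fine Tuning?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    min_new_tokens=2,        # min_length_inference
    max_new_tokens=256,      # max_length_inference
    do_sample=False,         # always pick the most probable token
    num_beams=2,             # beam search width
    temperature=0.3,         # only takes effect when do_sample=True
    repetition_penalty=1.2,  # values above 1.0 discourage repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))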

 

 

【⑦Environment Setting】

 

 

environment: ConfigNLPCausalLMEnvironment = ConfigNLPCausalLMEnvironment
Code Description:
This is where you set up the environment.

 

mixed_precision = True
Code Description:
Memory optimization – Mixed Precision setting.
Enable AMP (Automatic Mixed Precision) with "mixed_precision = True".

Enabling it reduces memory consumption and improves training speed.
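
The idea behind AMP in plain PyTorch looks roughly like this (a sketch; batch is a hypothetical training batch, and model/optimizer are from the earlier sketches):

scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
    loss = model(**batch).loss
scaler.scale(loss).backward()    # scale the loss to avoid float16 underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()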

 

number_of_workers = 4
Code Description:
Specify the number of workers.
Specify the number of workers to be used for the Data Loader.
The number of workers seems to mean the number of CPU processes used to load data into the GPU during model training.

 

seed = 1
Code Description:
Specifies a random seed value.
If “seed = -1” is set, a random number will be generated each time.
Setting a number other than “seed = -1” will reproduce the same result each time.

As far as I tried, even with “seed = 1”, there seemed to be some variation in the training results.
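
A common way to fix the seeds in a PyTorch environment looks like this (a sketch of the general technique, not the tutorial's internal implementation):

import random
import numpy as np
import torch

def set_seed(seed: int = 1):
    random.seed(seed)                 # Python's built-in random module
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # PyTorch (CPU)
    torch.cuda.manual_seed_all(seed)  # PyTorch (all GPUs)

set_seed(1)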

 

The above is a detailed explanation of the fine tuning parameters in the LLM Fine Tuning tutorial code.

When you look at it for the first time, you may be amazed at the sheer number of parameters…
However, if you refer to the information on this page and check the parameters one by one, you will gradually become accustomed to them.

As for the fine tuning parameters, I think it will help your future LLM Fine Tuning studies to gradually grasp which parameter names exist as you use the tutorial code.

 

 

by 子供プログラマー – Child Programmer

 

An Introductory Course for Japanese Artificial Intelligence Programmers (Machine Learning) by Child Programmer

 

【LLM for Beginners】How to Fine Tuning LLM on Custom Dataset & Hugging Face Public Dataset

 

This is a recommended extension that allows you to start chat AI easily. Some chat AIs can be used without registration or login.
【使い方】ChatHub入門 – チャットAIをはじめよう 
(【How to use】Introduction to ChatHub – Getting Started with Chat AI)