【LLM for Beginners】How to Fine-Tune an LLM on a Custom Dataset & a Hugging Face Public Dataset

Contents

1 【Explained in video】LLM Fine Tuning Tutorial
2 H2O LLM Studio Edition：Introduction to LLM Fine Tuning
3 【H2O LLM Studio Edition】Fine Tuning of LLM – Last updated: February 19, 2024
4 【Step 1】Inference by LLM before Fine Tuning Edition
5 【Step 2】Fine Tuning Edition
6 【Conclusion】Introduction to Fine Tuning of LLM
7 Comments on LLM's Fine Tuning Explanatory Video and Examples of Responses, etc. - To be created
8 【Paper】LLM, Fine Tuning, etc.

【Explained in video】LLM Fine Tuning Tutorial

【Generative AI Tutorial】
Fine Tuning LLM on Custom Dataset & Hugging Face Dataset

Video viewing time： 34 min. 56 sec.

I hope the explanatory video on fine tuning a Large Language Model (LLM) is useful when the written explanation alone is hard to follow.

【Contents of the tutorial video：Fine Tuning LLM for Beginners】

0:00 Introduction
0:39 How to use Google Colaboratory
1:33 【Step 1】Inference by LLM before Fine Tuning
5:23 【Step 2】Fine Tuning – Preparation
7:58 【Preparation】Custom Dataset Edition
14:51 【Preparation】Hugging Face Dataset Edition
18:39 【Preparation】Mount Google Drive
19:49 【Preparation】Specify & reflect parameters for Fine Tuning
24:12 Run Fine Tuning
30:32 Inference with Fine Tuned LLM
33:43 Conclusion

H2O LLM Studio Edition：Introduction to LLM Fine Tuning

In researching information on LLM (Large Language Model) Fine Tuning, I found

・h2oai/h2o-llmstudio（Apache-2.0 license）| GitHub

：Documentation – What is H2O LLM Studio? | H2O.AI

, a program that looked well suited to beginners. I ran the code from its Google Colaboratory tutorial, but as of my January 2024 run, it did not work.

After some trial and error, I was able to fine-tune an LLM using

・Custom Dataset
・Hugging Face Public Dataset

, so I created tutorial code for fine tuning an LLM.

I hope this series of tutorials encourages beginners in generative AI who want to develop their own original conversational AI to try fine tuning a model in a ChatGPT-like conversational (chatbot) format and to begin learning about LLMs.

Tutorial Code Links & Program Licenses

Google Colaboratory tutorial code for custom dataset：
My-Dataset-LLM-FineTuning-for-Generative-AI-Beginners.ipynb（Apache-2.0 license）| Google Colaboratory

Google Colaboratory tutorial code for Hugging Face public dataset support：
HF-Dataset-LLM-FineTuning-for-Generative-AI-Beginners.ipynb（Apache-2.0 license）| Google Colaboratory

License for Tutorial Code 「My-Dataset-LLM-FineTuning-for-Generative-AI-Beginners.ipynb」「HF-Dataset-LLM-FineTuning-for-Generative-AI-Beginners.ipynb」：

Apache-2.0 license

©︎ 2024 child programmer

Download a custom dataset template for use in the tutorial

This is for those who use the tutorial code for the custom dataset-compatible version.
After editing the dataset template in Excel (for Windows users) or Numbers (for Mac users), export it as a CSV format file named “dataset” (dataset.csv).

Download：Dataset Template for LLM Fine Tuning

1 file, 0.86 KB

＊Once you unzip the compressed file that will be downloaded, you will be able to edit the CSV format file.

The sample data set used in the tutorial is also available.
I hope you can use it for practice as needed before training on your custom dataset.

The file name at the time of download is

・sample_dataset
(sample_dataset.csv)

, but please rename it to

・dataset
(dataset.csv)

when you use it.

Download：Sample - Dataset for LLM Fine Tuning

1 file, 0.92 KB

＊Once you unzip the compressed file that will be downloaded, you will be able to edit the CSV format file.


【H2O LLM Studio Edition】Fine Tuning of LLM
– Last updated: February 19, 2024

Try English LLM Fine Tuning!

【Step 1】Inference by LLM before Fine Tuning Edition

Let’s try Inference with the LLM before Fine Tuning so that it is easier to check the answers for changes later.
In creating this tutorial, I tried Fine Tuning with the

・”Base Model”
(a model that has been trained to predict the next word)

of an LLM, but I could not see any change within the free tier of Google Colaboratory, so in this tutorial I will use an

・”Instruction-Tuned Model”
(a model whose base LLM has additionally been trained on questions and answers)

, namely

・lambdalabs/pythia-1.4b-deduped-synthetic-instruct | Hugging Face
＊Note：
Depending on the LLM you use, you may need to tweak the execution code.
Please refer to the respective LLM’s page in Hugging Face for how to write the code.

This tutorial compares the answers before and after Fine Tuning using the “lambdalabs/pythia-1.4b-deduped-synthetic-instruct” model, an instruction-tuned model that has already been trained on questions and answers.

Tutorial code for comparison：
lambdalabs/pythia-1.4b-deduped-synthetic-instruct-for-Generative-AI-Beginners.ipynb（Apache-2.0 license）| Google Colaboratory

【Check out the current public LLM on Hugging Face】
LLM Models（trend order）| Hugging Face

Run Code①（Preparation：Various installations, LLM downloads, etc.）

!pip install transformers sentencepiece accelerate bitsandbytes

import torch
from transformers import AutoTokenizer, pipeline, StoppingCriteria, StoppingCriteriaList


device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# @markdown Specify the LLM model（LLM published on Hugging Face） in the field below and execute the code.
model_name = "lambdalabs/pythia-1.4b-deduped-synthetic-instruct" # @param {type:"string"}
max_new_tokens = 2048
stop_token = "<|stop|>"


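class KeywordsStoppingCriteria(StoppingCriteria):
    """Stop generation as soon as the last generated token is one of the given token ids."""

    def __init__(self, keywords_ids: list):
        self.keywords = keywords_ids

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        # input_ids[0][-1] is the most recently generated token of the first sequence.
        return input_ids[0][-1] in self.keywords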


tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_tokens([stop_token])

stop_ids = [tokenizer.encode(w)[0] for w in [stop_token]]
stop_criteria = KeywordsStoppingCriteria(stop_ids)

generator = pipeline(
    "text-generation",
    model=model_name,
    device=device,
    max_new_tokens=max_new_tokens,
    torch_dtype=torch.float16,
    stopping_criteria=StoppingCriteriaList([stop_criteria]),
)

Output Result
It takes about “3-6 minutes” for the program to complete.

Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.35.2)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (0.1.99)
Collecting accelerate
〜
tokenizer_config.json: 100% 530/530 [00:00<00:00, 36.4kB/s]
tokenizer.json: 100% 2.11M/2.11M [00:01<00:00, 2.12MB/s]
special_tokens_map.json: 100% 131/131 [00:00<00:00, 8.98kB/s]
config.json: 100% 697/697 [00:00<00:00, 43.6kB/s]
pytorch_model.bin: 100% 5.76G/5.76G [04:55<00:00, 19.8MB/s]
generation_config.json: 100% 111/111 [00:00<00:00, 6.88kB/s]

＊You need roughly "6GB" of free disk space; the model weights alone are 5.76GB.

Run Code②（Inference by LLM）

# @markdown Execute the code after entering the text at the following prompt（input_prompt）.
input_prompt = "Tell me about the RehabC - digital de, asobu. channel on YouTube." # @param {type:"string"}
text = "【Question】 {}\n【Answer】".format(input_prompt)

result = generator(
    text,
    #max_length = 256,
    temperature = 0.7,
    repetition_penalty = 1.1,
    num_return_sequences=1,
)

output = result[0]["generated_text"]

print(output)

Example of Output Result

Rehabia Digital Channel is a platform that offers educational and entertaining content related to mental health & addiction treatment. It also provides tools allowing users to connect with experts in their field and receive advice from medical professionals. The videos available online include tutorials, interviews, personal stories, drug-related news, recovery tips & more. Users can subscribe for free access at https:〜

＊Example output for 「input_prompt：Tell me about the RehabC - digital de, asobu. channel on YouTube.」
＊The AI described a different YouTube channel.
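
The prompt template used in Run Code② can be wrapped in small helpers so that the same format is reused before and after Fine Tuning. This is only a minimal sketch: the function names `build_prompt` and `extract_answer` are my own (hypothetical) and are not part of the tutorial notebook. It assumes the generated text contains the prompt followed by the continuation, as in the output printed above.

```python
ANSWER_TAG = "【Answer】"


def build_prompt(question: str) -> str:
    """Wrap a question in the same template used in Run Code②."""
    return "【Question】 {}\n【Answer】".format(question)


def extract_answer(generated_text: str) -> str:
    """Return only the text after the first answer tag, with whitespace stripped."""
    _, _, answer = generated_text.partition(ANSWER_TAG)
    return answer.strip()


prompt = build_prompt("Tell me about the RehabC - digital de, asobu. channel on YouTube.")

# Simulated model output (prompt + continuation), used only for illustration.
simulated_output = prompt + " The RehabC channel is a digital technology educational channel launched in 2014."
print(extract_answer(simulated_output))
```

In the notebook, `simulated_output` would be replaced by `result[0]["generated_text"]`.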

【Step 2】Fine Tuning Edition

Let's begin the Fine Tuning process.
In this tutorial, I will teach the LLM a specific piece of knowledge (in this case, about a specific YouTube channel).

【Teach this YouTube channel】

・RehabC - digital de, asobu.（YouTube）

【Preparation①: Clone the H2O LLM Studio repository, etc.】

Run Code

!git clone https://github.com/h2oai/h2o-llmstudio.git
!cd h2o-llmstudio && git checkout ce10af57ff118a2bbb81b5b3eae12273e290299a -q
!cp -r h2o-llmstudio/. ./
!rm -r h2o-llmstudio
!mkdir dataset

Output Result

Cloning into 'h2o-llmstudio'...
remote: Enumerating objects: 4882, done.
remote: Counting objects: 100% (1854/1854), done.
remote: Compressing objects: 100% (619/619), done.
remote: Total 4882 (delta 1524), reused 1374 (delta 1232), pack-reused 3028
Receiving objects: 100% (4882/4882), 20.66 MiB | 23.72 MiB/s, done.
Resolving deltas: 100% (3311/3311), done.

【Preparation②：Creating a Virtual Environment】

When I ran the program, I got the following output

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

, but it seemed to work as is.

(It takes about "4-5 minutes" for the program to complete)

Run Code

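# At the time of running this code: using Python 3.10.12
# Note: apt has no packages named "python3.10.12" (they are named like "python3.10"),
# so the "apt install" and "get-pip.py" commands below fail, as the Output Result shows.
# The Python 3.10.12 preinstalled in Google Colaboratory is then used to create the virtual environment.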
!sudo add-apt-repository ppa:deadsnakes/ppa -y > /dev/null
!sudo apt install python3.10.12 python3.10.12-distutils psmisc -y > /dev/null
!curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10.12 > /dev/null

# Installation of various dependencies
!make setup > /dev/null

Output Result

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

E: Unable to locate package python3.10.12
E: Couldn't find any package by glob 'python3.10.12'
E: Unable to locate package python3.10.12-distutils
E: Couldn't find any package by glob 'python3.10.12-distutils'
/bin/bash: line 1: python3.10.12: command not found
curl: (23) Failure writing output to destination
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: 〜
Creating a virtualenv for this project...
Pipfile: /content/Pipfile
Using /usr/local/bin/python (3.10.12) to create virtualenv...
⠴ Creating virtual environment...created virtual environment CPython3.10.12.final.0-64 in 1028ms
  creator Venv(dest=/root/.local/share/virtualenvs/content-cQIIIOO2, clear=False, no_vcs_ignore=False, global=False, describe=CPython3Posix)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.3.1, setuptools==69.0.2, wheel==0.42.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

✔ Successfully created virtual environment! 
Virtualenv location: /root/.local/share/virtualenvs/content-cQIIIOO2

DEPRECATION: h2o-wave nightly has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of h2o-wave or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at 〜

【Preparation③：Specify cuda version（11.8）】

Downgrade the CUDA version to 11.8, because the default CUDA version in Google Colaboratory (e.g., 12.2 as of the February 2024 run) causes errors during Fine Tuning.

(It takes about "4 minutes" for the program to complete)

Run Code

# Installation of cuda11.8
!apt-get update
!apt-get install cuda-toolkit-11-8

# Using cuda11.8 (path setting)
import os
p = os.getenv('PATH')
ld = os.getenv('LD_LIBRARY_PATH')
os.environ['PATH'] = f"/usr/local/cuda-11/bin:{p}"
os.environ['LD_LIBRARY_PATH'] = f"/usr/local/cuda-11.8/lib64:{ld}"

print('\nVersion of cuda that can be selected at runtime：')
!ls -d /usr/local/cuda-*
!which nvcc
print('\nCurrent version of cuda in use：')
!nvcc --version
print('\nCurrently used version of PyTorch：')
!python -c 'import torch; print(torch.__version__) '

Output Result

〜
Version of cuda that can be selected at runtime：
/usr/local/cuda-11  /usr/local/cuda-11.8  /usr/local/cuda-12  /usr/local/cuda-12.2
/usr/local/cuda-11/bin/nvcc

Current version of cuda in use：
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Currently used version of PyTorch：
2.1.0+cu121
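
To confirm programmatically that CUDA 11.8 is the active version, the release number can be parsed from the `nvcc --version` text. The following is a small sketch; the helper name `parse_cuda_release` is hypothetical, and the sample text is copied from the output above.

```python
import re
from typing import Optional


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version string from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample text taken from the Output Result above.
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.8, V11.8.89
"""

release = parse_cuda_release(sample)
if release == "11.8":
    print("CUDA 11.8 is active")
else:
    print(f"Unexpected CUDA release: {release}")
```

In Google Colaboratory, `sample` could be replaced by the text returned from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.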

【Preparation④-A：Dataset Preparation】
Custom Dataset Edition

If you are using a custom dataset, run this code.
Try training with a dataset you have prepared yourself.

After uploading the CSV file to the "dataset" folder, run the following "Run Code".

In the tutorial, I use a dataset of 16 identical question-and-answer pairs (15 rows for training and 1 row for validation) to teach the model one new, specific piece of knowledge.

【Example of dataset text】

instruction：
Tell me about the RehabC - digital de, asobu. channel on YouTube.

output：
The RehabC channel is a digital technology educational channel launched in 2014.

＊With the LLM used in this tutorial, the AI did not seem to answer questions well unless it was trained on a single piece of knowledge phrased in a variety of question formats.
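
As an alternative to editing the template in a spreadsheet, a dataset with several phrasings of the same question can be generated with Python's standard `csv` module. This is a sketch under assumptions: the rephrased questions are hypothetical examples I made up, the helper name `write_dataset` is my own, and the file is written to a temporary folder here rather than to the "dataset" folder.

```python
import csv
import os
import tempfile

# Answer text taken from the dataset example above.
ANSWER = "The RehabC channel is a digital technology educational channel launched in 2014."

# Hypothetical rephrasings of the same question; replace them with your own.
QUESTIONS = [
    "Tell me about the RehabC - digital de, asobu. channel on YouTube.",
    "What is the RehabC - digital de, asobu. channel on YouTube?",
    "Can you describe the YouTube channel RehabC - digital de, asobu.?",
]


def write_dataset(path, questions, answer):
    """Write a header row plus one (instruction, output) row per question."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["instruction", "output"])
        for question in questions:
            writer.writerow([question, answer])
    return len(questions)


# Written to a temporary folder for this demonstration only.
out_path = os.path.join(tempfile.gettempdir(), "dataset.csv")
n_rows = write_dataset(out_path, QUESTIONS, ANSWER)
print(f"Wrote {n_rows} rows to {out_path}")
```

The resulting file has the same layout as the template: a label row followed by "instruction" and "output" columns.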

【Reference Information】

Recommended dataset size：1,000～50,000 samples
Documentation - FAQs：How much data is generally required to fine-tune a model? | H2O.AI

Run Code

# Imports.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

# Specifying Dataset
# @markdown Specify the file path of the dataset (dataset in CSV format).
dataset_file = '/content/dataset/dataset.csv' # @param {type:"string"}
ds = pd.read_csv(dataset_file, header=None)

# Drop the first row, which only holds the "instruction" / "output" labels.
ds.drop(0, axis=0, inplace=True)

# Use the first and second columns for training.
train = ds[[0, 1]].copy()

# Name the first column "instruction" and the second column "output".
train.columns = ['instruction', 'output']

# Output the preprocessed dataset as a CSV file.
train.reset_index(drop=True).to_csv("dataset/converted_dataset.csv", index=False)
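
Before training, it can help to check that the converted CSV has the two expected columns and no empty cells. The following is a minimal sketch using only the standard library; the function name `find_dataset_problems` and the demo file are hypothetical and not part of the tutorial notebook.

```python
import csv
import os
import tempfile


def find_dataset_problems(path):
    """Return a list of human-readable problems found in the CSV file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["instruction", "output"]:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems
        for row_no, row in enumerate(reader, start=1):
            for col in ("instruction", "output"):
                # A missing or whitespace-only cell counts as empty.
                if not (row[col] or "").strip():
                    problems.append(f"row {row_no}: empty '{col}'")
    return problems


# Demo file with one deliberately incomplete row (illustration only).
demo_path = os.path.join(tempfile.gettempdir(), "converted_dataset_demo.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    f.write("instruction,output\nFirst question,First answer\nSecond question,\n")

print(find_dataset_problems(demo_path))
```

In the notebook, the same function could be pointed at "dataset/converted_dataset.csv"; an empty list means no problems were found.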

【Preparation④-B：Dataset Preparation】
Utilization of Existing Dataset Edition

Use the existing Hugging Face dataset.
Adjust the size of the dataset as needed.

【Reference Information】

Recommended dataset size：1,000～50,000 samples
Documentation - FAQs：How much data is generally required to fine-tune a model? | H2O.AI

【Example Dataset】

MohamedRashad/ChatGPT-prompts | Hugging Face
Dataset size：360 rows（train）

alespalla/chatbot_instruction_prompts（Apache-2.0 license）| Hugging Face
Dataset size：322,553 rows（train + test）

【Check out the current public datasets on Hugging Face】

Datasets（trend order）| Hugging Face

Run Code

# Creation of "dataset" folder (-p avoids an error if the folder already exists).
!mkdir -p dataset
# Various install, etc.
!pip install datasets
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from datasets import load_dataset

# Download Dataset
# @markdown Specify the path of the dataset file (dataset published on Hugging Face).
# @markdown If necessary, modify a series of codes to preprocess data in Pandas.
# @markdown Process the data so that the final columns are, from left to right, "instruction" and "output".
dataset_path = "MohamedRashad/ChatGPT-prompts" # @param {type:"string"}
ds = load_dataset(dataset_path)
ds.set_format(type="pandas")

# Check the "Split Name" in the Dataset Viewer on Hugging Face and enter that name.
# For "MohamedRashad/ChatGPT-prompts", it is "train".
ds = ds["train"][:]
train = ds

# Change column name
train.columns = ['instruction', 'output']

# Conversion of data frame.
train = pd.DataFrame(train)

# Output preprocessed data set as CSV file.
pd.concat([train]).reset_index(drop=True).to_csv("dataset/converted_dataset.csv", index=False)

Output Result

Collecting datasets
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
〜
Downloading readme: 100% 404/404 [00:00<00:00, 20.6kB/s]
Downloading data: 100% 422k/422k [00:00<00:00, 2.53MB/s]
Generating train split: 360/0 [00:00<00:00, 4789.04 examples/s]
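
To "adjust the size of the dataset as needed", a large public dataset can be cut down to a fixed number of rows chosen at random but reproducibly. This is a sketch in plain Python; the helper name `subsample_rows` and the placeholder rows are hypothetical.

```python
import random


def subsample_rows(rows, max_rows, seed=0):
    """Return at most max_rows rows, chosen reproducibly at random."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


# Placeholder rows standing in for a 360-row dataset (illustration only).
rows = [{"instruction": f"question {i}", "output": f"answer {i}"} for i in range(360)]

subset = subsample_rows(rows, 100)
print(len(subset))
```

With the pandas DataFrame used in the Run Code above, a comparable result should be obtainable with `train.sample(n=100, random_state=0)` before writing the CSV file.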

【Preparation⑤：Mount Google Drive】

Enables the use of folders and files on Google Drive.

After executing the following code, you will be asked for permission; choose "Allow" with the Google account that owns the Google Drive you want to use.

Run Code

from google.colab import drive
drive.mount('/content/drive')

Output Result

Mounted at /content/drive

【Preparation⑥：Specify parameters for Fine Tuning】

In this program,

・Quantization
（backbone_dtype = "float16"）

・Gradient Check Pointing
（gradient_checkpointing = False）

・Gradient Accumulation
（grad_accumulation = 1）

・LoRA Tuning
（lora = True）

・Automatic Mixed Precision：AMP
（mixed_precision = True）
＊The figures in parentheses（）are the current settings. Adjust as needed.

can be used as memory-saving and optimization techniques.
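
To give a feel for why LoRA Tuning saves memory, the number of trainable parameters LoRA adds to one linear layer can be computed from its rank. This is a rough sketch, not the way H2O LLM Studio counts parameters: the layer shapes and layer count below are assumed values chosen for illustration and are not read from the actual model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape d_in x d_out.

    LoRA learns two small matrices, A (r x d_in) and B (d_out x r),
    instead of updating the full d_in x d_out weight matrix.
    """
    return r * (d_in + d_out)


# Assumed, illustrative shapes (not read from the real model).
d_in, d_out, n_layers = 2048, 6144, 24
frozen = d_in * d_out * n_layers

for r in (4, 8, 16):
    added = lora_param_count(d_in, d_out, r) * n_layers
    print(f"lora_r={r}: {added:,} trainable LoRA parameters vs {frozen:,} frozen weights")
```

The count grows linearly with the rank, so doubling the rank doubles the number of trainable parameters.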

As far as I understand them, I have included a description of each parameter in the code.
Please view the code and adjust the settings as necessary.

【Supplementary Information】

For your convenience in setting up fine tuning, I have compiled the parameter information in an easy-to-read format.
Please use this information when it is difficult to grasp the parameter information in the tutorial code.
：【Detailed explanation】How to set parameters for LLM Fine Tuning by Child Programmer

Specify the number of training passes in "epochs" (as a half-width number) in "⑤Train Setting".
"epochs = 1" means one pass over the training data.
＊In the tutorial for the custom dataset version, I use "epochs = 50".
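
The number of epochs, together with the batch settings, determines roughly how many weight updates happen during Fine Tuning. The sketch below estimates this for the tutorial's values (15 training rows, batch_size = 4, drop_last_batch = True, grad_accumulation = 1, epochs = 50). The helper `optimizer_steps` is hypothetical, and how H2O LLM Studio counts steps internally is an assumption on my part.

```python
def optimizer_steps(n_train_rows, batch_size, epochs,
                    grad_accumulation=1, drop_last_batch=True):
    """Estimate the number of weight updates for a training run."""
    if drop_last_batch:
        # An incomplete final batch is discarded each epoch.
        batches_per_epoch = n_train_rows // batch_size
    else:
        # Ceiling division: the incomplete final batch is kept.
        batches_per_epoch = -(-n_train_rows // batch_size)
    return (batches_per_epoch * epochs) // grad_accumulation


steps = optimizer_steps(n_train_rows=15, batch_size=4, epochs=50)
print(steps)
```

With these values, each epoch uses 3 full batches (12 of the 15 rows), so 50 epochs give about 150 updates.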

【Reference Information】

Documentation - Experiment settings | H2O.AI

【Check out the current public LLM on Hugging Face】

LLM Models（trend order）| Hugging Face

Supplementary Information：
The "b" (or "B") in a model name stands for "billion" and indicates the number of model parameters (weights).
＊1.4B：1.4 billion parameters
＊1.7B：1.7 billion parameters
＊3.6B：3.6 billion parameters
＊7B：7 billion parameters
＊13B：13 billion parameters
＊70B：70 billion parameters
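
The parameter count gives a quick estimate of how much space the weights alone occupy at a given precision. This is a back-of-the-envelope sketch: the helper `weight_size_gb` is hypothetical, and the estimate ignores gradients, optimizer state, and activations, which need additional GPU memory during Fine Tuning.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_size_gb(n_params_billion: float, dtype: str) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params_billion * BYTES_PER_PARAM[dtype]


for size in (1.4, 1.7, 3.6, 7, 13, 70):
    fp32 = weight_size_gb(size, "float32")
    fp16 = weight_size_gb(size, "float16")
    print(f"{size}B: about {fp32:.1f} GB in float32, about {fp16:.1f} GB in float16")
```

For the 1.4B model this gives about 5.6 GB in float32, which is close to the 5.76G size of the pytorch_model.bin download shown in Step 1.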

My testing up to February 12, 2024 confirms that the T4 GPU in Google Colaboratory's free tier can fine-tune LLMs of the following sizes:

- Confirmed with T4 GPU -
・1.4B
・1.7B
・3.6B
＊In case of「backbone_dtype = "float16"」

I will gradually test other parameter counts and add them to the list above.

Fine Tuning

・"consumes more GPU memory as the number of parameters increases"
＊As another example, a larger "dataset" also consumes more GPU memory

, so if you try it and it does not work, try again with a model that has fewer parameters.

In addition, as the number of parameters increases, you may run into disk-space problems in Google Colaboratory and Google Drive, for example when

・"Downloading LLM to Google Colaboratory"

and

・"Outputting training models (pth files) after Fine Tuning to Google Drive".

Therefore, those who want to use a larger LLM will likely need to pay for a Google Colaboratory plan as needed.
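
Before downloading a model or saving a trained model, the free disk space can be checked with the standard library. This is a small sketch; the helper names `free_gb` and `has_room_for` and the 1 GB safety margin are my own assumptions.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Free disk space at the given path, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(required_gb: float, path: str = ".", margin_gb: float = 1.0) -> bool:
    """True if the path has room for required_gb plus a safety margin."""
    return free_gb(path) >= required_gb + margin_gb


print(f"Free space: {free_gb('.'):.1f} GB")
# 5.76 GB is the size of the model download shown in Step 1.
print("Enough room for a 5.76 GB download:", has_room_for(5.76))
```

For Google Drive, the same check could be run against the mounted path, for example "/content/drive/MyDrive", after Preparation⑤.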

Run the code after entering the following information in half-width alphanumeric characters.

Run Code（Code for Custom Dataset Edition）
Note：
In the code for the Hugging Face public dataset edition, 「output_directory: str = "/content/drive/MyDrive/LLM/output/MyFineTuningLLMdemo/"」 becomes 「output_directory: str = "/content/drive/MyDrive/LLM/output/HFDFineTuningLLMdemo/"」.

%%writefile cfg_notebook.py

import os
from dataclasses import dataclass

from llm_studio.python_configs.text_causal_language_modeling_config import ConfigProblemBase, ConfigNLPCausalLMDataset, \
    ConfigNLPCausalLMTokenizer, ConfigNLPAugmentation, ConfigNLPCausalLMArchitecture, ConfigNLPCausalLMTraining, \
    ConfigNLPCausalLMPrediction, ConfigNLPCausalLMEnvironment, ConfigNLPCausalLMLogging

# Setting the path of the dataset
# Specify the path to the folder containing the dataset to be used for Fine Tuning.
# @markdown Specify the path of the folder containing the data set file.
ROOT_DIR = "./dataset/" # @param {type:"string"}

# Data class setting
@dataclass
class Config(ConfigProblemBase):
    # Specify the path to the output of the Fine Tuned trained model.
    # @markdown Specify the path to the output of the Fine Tuning trained model.
    output_directory: str = "/content/drive/MyDrive/LLM/output/MyFineTuningLLMdemo/" # @param {type:"string"}
    # @markdown Specify the name of this Fine Tuning.
    experiment_name: str = "MyFineTuningExperimentDemo"  # @param {type:"string"}
    # Specify the LLM model for Fine Tuning this time.
    # @markdown Specify the LLM model (LLM published in Hugging Face) for Fine Tuning.
    llm_backbone: str = "lambdalabs/pythia-1.4b-deduped-synthetic-instruct" # @param {type:"string"}


    # ①Dataset Setting
    dataset: ConfigNLPCausalLMDataset = ConfigNLPCausalLMDataset(
        # Specify the file name of the dataset to be used for Fine Tuning.
        train_dataframe=os.path.join(ROOT_DIR, "converted_dataset.csv"),

        # Validation Strategy.
        # If you want to automatically sort the datasets into training and validation datasets, set this setting to "automatic".
        # There is also a "custom" setting, but I do not know the details.
        validation_strategy = "automatic",
        # Data frame for verification.
        # Perhaps if you set 「validation_strategy = "custom"」, you specify the path to the dataset for validation, etc. 
        # In this case, I set 「validation_strategy = "automatic"」, so it seemed fine to leave it blank.
        validation_dataframe = "",
        # Specify the size of the data for verification (greater than 0 and less than 1.0): 
        # In the case of "automatic", the data is divided from the data for "train" at the specified ratio.
        # If a 16-row dataset were to be used, multiplying by "0.01" would result in less than 1, 
        # in which case only "1" would be used for verification.
        validation_size = 0.01,

        # Specify the column name "instruction" of the prompt (input content) in the dataset as the prompt (input content) column.
        prompt_column = ("instruction",),
        # Specify the column name "output" of the response in the dataset as the column of the response.
        answer_column = "output",
        # Text to be added to the beginning of each prompt.
        text_prompt_start = "",
        # Text inserted between each prompt and its answer.
        text_answer_separator = "",

        # "True" for conversational (chatbot) format.
        add_eos_token_to_prompt = True,
        # "True" for conversational (chatbot) format.
        add_eos_token_to_answer = True,
        # Mask prompt labels: Specifies whether prompt labels are masked during training and only answer losses are trained.
        mask_prompt_labels = False,

    )

    # ②Tokenizer Setting
    tokenizer: ConfigNLPCausalLMTokenizer = ConfigNLPCausalLMTokenizer(
        # Specify the largest token sequence (series length) of prompts (instruction): 
        # Settings for aligning token sequences.
        max_length_prompt = 128,
        # Specification of the largest token sequence (series length) for the response (output): 
        # Settings for aligning the token sequence.
        max_length_answer = 128,
        # Maximum total token length of one sequence, as far as I understand it; longer sequences are truncated.
        max_length = 256,
        # Padding quantile: as I understand the H2O LLM Studio documentation, the quantile (0 to 1) of the token lengths
        # in a batch that is used as the padded length; "1.0" pads every sequence to the longest one in the batch.
        padding_quantile = 1.0
    )

    # ③Augmentation Setting
    # Setting "token_mask_probability = 0.0" will probably set the percentage of tokens to be masked to "0".
    augmentation: ConfigNLPAugmentation = ConfigNLPAugmentation(token_mask_probability = 0.0)

    # ④Architecture Setting
    architecture: ConfigNLPCausalLMArchitecture = ConfigNLPCausalLMArchitecture(
        # Data type used to load the LLM backbone weights. "float16" is a 16-bit half-precision floating-point number.
        #「backbone_dtype="float16"」loads the weights (parameters) of the large language model with 16-bit precision ("float16 quantization").
        # LLM weights are typically stored as "float32" (32-bit single-precision floating-point numbers), so "float16" halves the memory needed for the weights.
        # This notebook supports "float32", "float16", and "bfloat16" (brain float 16).
        # Fewer bits mean a lighter model; more bits mean higher precision.
        backbone_dtype = "float16",
        # Memory Optimization - Gradient Checkpointing: reduces memory use when GPU memory runs out even with "batch_size=1". However, processing speed is reduced.
        # If used, set "gradient_checkpointing = True". (Note that this is an experimental feature and may not be compatible with other settings.)
        gradient_checkpointing = False,
        # Whether to force the computation of the gradient for the Embedding representation of the input during training, useful for LoRA.
        # Set "force_embedding_gradients = True" if used.
        force_embedding_gradients = False,
        # Specify dropout rate: dropout rate for the intermediate layers of the model.
        intermediate_dropout = 0
    )

    # ⑤Train Setting
    training: ConfigNLPCausalLMTraining = ConfigNLPCausalLMTraining(
        # Use cross-entropy error for the loss function (a value indicating the magnitude of the discrepancy between the "correct" value and the "predicted" value).
        loss_function = "CrossEntropy",
        # Use AdamW as the optimization algorithm: a variant of Adam that applies weight decay separately from the gradient-based update.
        # In addition, "Adadelta", "Adam", "AdamW8bit", "RMSprop", and "SGD" can be specified.
        optimizer = "AdamW",
        # Learning Rate: the step size used to update weights during training. Set it to balance overfitting and underfitting.
        learning_rate = 0.00015,

        # Specify "batch size" for Fine Tuning
        batch_size = 4,
        # Specify whether to drop the last batch of each epoch when it contains fewer samples than "batch_size".
        drop_last_batch = True,


        # Specify the number of Fine Tuning epochs.
        # @markdown Specify the number of training passes in "epochs" (as a half-width number) in "⑤Train Setting".
        # @markdown "epochs = 1" means one pass over the training data. The default setting is "epochs = 50".
        epochs = 50,


        # Define a schedule for the learning rate:「schedule = "Cosine"」decays the learning rate along a cosine curve.
        # Others are "Constant" (applies a constant learning rate) and "Linear" (decays the learning rate linearly).
        schedule = "Cosine",
        # Specify the number of epochs over which the learning rate warms up from 0: "warmup_epochs = 0.0" disables warm-up.
        warmup_epochs = 0.0,
        # Specify weight decay: Setting a non-zero value (e.g., 0.0001) applies L2-style regularization and suppresses overfitting.
        weight_decay = 0.0,
        # Gradient clipping - specifies the maximum norm of the gradient: clipping the value of the gradient to the threshold when the gradient is above a set threshold value.
        # Defaults to "0", no gradient clipping; if a value greater than 0 is specified, gradient clipping is applied during model training.
        gradient_clip = 0.0,
        # Memory optimization - Gradient accumulation: effectively increases the batch size by accumulating the gradients of several small batches before each weight update, instead of computing the gradients for one large batch at once.
        grad_accumulation = 1,

        # Low Rank Adaptation (LoRA) setting - Trains a small number of parameters (small low-rank matrices) instead of all weights.
        lora = True,
        # LoRA R dimension specification: a parameter indicating the rank of the update matrix.
        # In general, a smaller r takes less time and is less computationally intensive. A larger r brings the update closer to a full-rank update of the original weights, but is more computationally intensive.
        # Increasing the value here will increase the percentage of parameters trained during Fine Tuning.
        # Percentage of trainable parameters for each lora_r value with "lambdalabs/pythia-1.4b-deduped-synthetic-instruct":
        # Example: lora_r = 4 (0.05%), lora_r = 8 (0.1%), lora_r = 16 (0.2%)
        lora_r = 4,
        # LoRA alpha specification: Parameter used for LoRA scaling. Limits the size of the update matrix and suppresses LoRA overfitting, etc.
        lora_alpha = 16,
        # Specify dropout rate of LoRA: Disable a part of the update matrix to suppress LoRA overfitting, etc.
        lora_dropout = 0.05,
        # Modules in the model to which LoRA is applied: when left empty, the default (linear layers) is used.
        lora_target_modules = "",

        # Set "save_best_checkpoint = True" if you want to save the training model 
        # with the best results during Fine Tuning.
        save_best_checkpoint = False,
        # Number of evaluation epochs: Applies when "save_best_checkpoint = False". Determines how often (every how many epochs) the model is evaluated on the validation data. Increasing the number speeds up training.
        evaluation_epochs = 1.0,
        # Evaluation before training: "evaluate_before_training = True" evaluates the model before Fine Tuning, which helps you judge the quality of the LLM backbone beforehand.
        evaluate_before_training = False,
    )

    # ⑥Prediction (Inference) Setting
    prediction: ConfigNLPCausalLMPrediction = ConfigNLPCausalLMPrediction(
        # Specify metrics to evaluate model performance: others include "Perplexity", "OpenAI API", etc.
        metric = "BLEU",

        # Specify minimum length of inference: Specify minimum length of tokens to be generated.
        min_length_inference = 2,
        # Specify maximum length of inference: Specify maximum length of tokens to be generated.
        max_length_inference = 256,
        # Inference batch size: Specify batch size during inference.
        batch_size_inference = 0,

        # Sampling setting: Enabled by "do_sample = True"
        # Randomly picking probabilities between words according to high and low probability is called sampling.
        # The next token in the predicted sequence samples based on probability.
        # Always select the highest probability with "do_sample = False"
        do_sample = False,
        # Specify the number of beams to be searched: "num_beams = 1" disables beam search.
        # Beam search generates sentences by scoring the joint probability of the tokens that follow.
        # Only the most probable candidate sequences are kept at each step, which limits the amount of computation.
        # Increasing the number of beams may improve accuracy, but may also increase prediction run time.
        num_beams = 2,
        # Specify temperature: Parameter that controls the randomness and diversity of the generated tokens.
        # Specified in the range of "0 to 1".
        # Increase the number if a diverse response is desired.
        # Reduce the number if you want a consistent response.
        temperature = 0.3,
        # Specify repetition penalty: "repetition_penalty = 1.0" means no penalty
        # Increasing the repetition penalty avoids word repetition.
        repetition_penalty = 1.2,
    )

    # ⑦Environment Setting
    environment: ConfigNLPCausalLMEnvironment = ConfigNLPCausalLMEnvironment(
        # Memory optimization - Mixed precision setting
        # Enable Automatic Mixed Precision (AMP) with "mixed_precision = True".
        # Enable to reduce memory consumption and improve training speed.
        mixed_precision = True,
        # Specify the number of workers: Specify the number of workers to be used for Data Loader.
        # Number of CPU processes used to read data and load it onto the GPU during model training.
        number_of_workers = 4,
        # Specify a random seed value: "seed = -1" uses a different random seed on each run.
        # Setting a value other than "seed = -1" should reproduce the same result each time (not fully verified).
        seed = 1
    )

Output Result

Writing cfg_notebook.py
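
【Supplement：How the sampling settings in "⑥Prediction (Inference) Setting" act on token scores】

The following is a minimal sketch in plain Python of what "do_sample", "temperature", and "repetition_penalty" do to the next-token scores. The four scores are made up, the function names are hypothetical, and the penalty rule (divide positive scores, multiply negative ones) is a commonly used one that is assumed here; it is not taken from the H2O LLM Studio source code.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of tokens that were already generated (penalty = 1.0 changes nothing)."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores multiplied, so both move down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def pick_next_token(logits, do_sample, temperature, rng):
    """Greedy choice when do_sample is False, otherwise a probability-weighted random choice."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for a 4-token vocabulary.
toy_logits = [2.0, 1.5, 0.5, -1.0]
rng = random.Random(1)

probs_t10 = softmax_with_temperature(toy_logits, 1.0)
probs_t03 = softmax_with_temperature(toy_logits, 0.3)
print("temperature 1.0:", [round(p, 3) for p in probs_t10])
print("temperature 0.3:", [round(p, 3) for p in probs_t03])

greedy_token = pick_next_token(toy_logits, do_sample=False, temperature=0.3, rng=rng)
print("greedy token id:", greedy_token)

penalized = apply_repetition_penalty(toy_logits, generated_ids=[0], penalty=1.2)
print("scores after penalty on token 0:", penalized)
```

In this sketch, lowering the temperature concentrates probability on the top-scoring token, which is why a small value gives more consistent responses. Note that the temperature only matters on the sampling path; the greedy path ignores it.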

【Preparation⑦：Reflect the contents of "Preparation⑥" in the file】

Run Code

%%writefile run.sh

pipenv run python train.py -C cfg_notebook.py &

wait
echo "all done"

Output Result

Writing run.sh

Run Fine Tuning

With the steps up to this point complete, Fine Tuning can finally be run.

【Reference example of time required for Fine Tuning】
＊　Example of「backbone_dtype = "float16"」

Original English-language tutorial：
「Model name of LLM：EleutherAI/pythia-1.4b-deduped」・「Name of dataset：OpenAssistant/oasst1」

→ About 17 minutes per epoch
（Including the LLM download time on the first run）

An example of Fine Tuning：
「Model name of LLM：line-corporation/japanese-large-lm-1.7b-instruction-sft」・「Dataset size：11 entries」

→ About 13 minutes for 50 epochs
（First-time LLM download: 7 minutes ・ Training time: 6 minutes ・ System memory：2.8GB/12.7GB ・ GPU memory：3.9GB/15GB）
＊Note：
The training time varied from run to run; it seemed to finish faster when the train loss value was close to "0".

Run Code

!sh run.sh

Output Result
＊One example.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: 〜
================================================================================
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/content-cQIIIOO2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2024-02-17 13:06:01,711 - INFO: Global random seed: 1
2024-02-17 13:06:02,134 - INFO: Preparing the data...
2024-02-17 13:06:02,134 - INFO: Setting up automatic validation split...
2024-02-17 13:06:02,489 - INFO: Preparing train and validation data
2024-02-17 13:06:02,489 - INFO: Loading train dataset...
Downloading tokenizer_config.json: 100% 530/530 [00:00 00:00, 462kB/s]
Downloading tokenizer.json: 100% 2.11M/2.11M [00:00 00:00, 7.50MB/s]
Downloading (…)cial_tokens_map.json: 100% 131/131 [00:00 00:00, 112kB/s]
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
2024-02-17 13:06:04,425 - INFO: Loading validation dataset...
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
〜
potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
2024-02-17 13:06:04,658 - INFO: Number of observations in train dataset: 15
2024-02-17 13:06:04,659 - INFO: Number of observations in validation dataset: 1
Downloading config.json: 100% 697/697 [00:00 00:00, 487kB/s]
2024-02-17 13:06:05,029 - INFO: dtype: torch.float16
Downloading pytorch_model.bin: 100% 5.76G/5.76G [02:37 00:00, 36.5MB/s]
Downloading generation_config.json: 100% 111/111 [00:00 00:00, 107kB/s]
trainable params: 786432 || all params: 1415327744 || trainable%: 0.05556536309938965

2024-02-17 13:09:16,507 - INFO: Training Epoch: 1 / 50
2024-02-17 13:09:16,508 - INFO: train loss:   0%|          | 0/3 [00:00 ?, ?it/s]
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
2024-02-17 13:09:17,660 - INFO: train loss: 8.90:  33%|###3      | 1/3 [00:01 00:02,  1.15s/it]
2024-02-17 13:09:17,815 - INFO: train loss: 8.87:  67%|######6   | 2/3 [00:01 00:00,  1.77it/s]
2024-02-17 13:09:17,969 - INFO: train loss: 9.05: 100%|##########| 3/3 [00:01 00:00,  2.65it/s]
2024-02-17 13:09:17,969 - INFO: train loss: 9.05: 100%|##########| 3/3 [00:01 00:00,  2.05it/s]
2024-02-17 13:09:17,971 - INFO: Starting validation inference
2024-02-17 13:09:17,971 - INFO: validation progress:   0%|          | 0/1 [00:00 ?, ?it/s]
2024-02-17 13:09:31,273 - INFO: validation progress: 100%|##########| 1/1 [00:13 00:00, 13.30s/it]
2024-02-17 13:09:31,292 - INFO: validation progress: 100%|##########| 1/1 [00:13 00:00, 13.32s/it]
2024-02-17 13:09:31,389 - INFO: Mean validation loss: 10.99237
2024-02-17 13:09:31,406 - INFO: Validation BLEU: 0.70263

〜

2024-02-17 13:13:56,641 - INFO: Training Epoch: 50 / 50
2024-02-17 13:13:56,642 - INFO: train loss:   0%|          | 0/3 [00:00?, ?it/s]
〜
  warnings.warn(_create_warning_msg(
2024-02-17 13:13:57,055 - INFO: train loss: 0.17:  33%|###3      | 1/3 [00:00 00:00,  2.43it/s]
2024-02-17 13:13:57,221 - INFO: train loss: 0.17:  67%|######6   | 2/3 [00:00 00:00,  3.74it/s]
2024-02-17 13:13:57,380 - INFO: train loss: 0.17: 100%|##########| 3/3 [00:00 00:00,  4.59it/s]
2024-02-17 13:13:57,380 - INFO: train loss: 0.17: 100%|##########| 3/3 [00:00 00:00,  4.06it/s]
2024-02-17 13:13:57,382 - INFO: Starting validation inference
2024-02-17 13:13:57,383 - INFO: validation progress:   0%|          | 0/1 [00:00 ?, ?it/s]
2024-02-17 13:13:58,314 - INFO: validation progress: 100%|##########| 1/1 [00:00 00:00,  1.08it/s]
2024-02-17 13:13:58,335 - INFO: validation progress: 100%|##########| 1/1 [00:00 00:00,  1.05it/s]
2024-02-17 13:13:58,418 - INFO: Mean validation loss: 0.18247
2024-02-17 13:13:58,442 - INFO: Validation BLEU: 100.00000
2024-02-17 13:13:58,675 - INFO: Saving last model checkpoint: val_loss 0.18247, val_BLEU 100.0 to /content/drive/MyDrive/LLM/output/MyFineTuningLLMdemo/
all done
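
【Supplement：Estimating the training time from the log】

As a rough check of how long the example run above took, the timestamps in the log can be subtracted. This is a small sketch using only the Python standard library; the two timestamps are copied from the log lines "Training Epoch: 1 / 50" and "Saving last model checkpoint", and the variable and function names are hypothetical.

```python
from datetime import datetime

# Two timestamps copied from the example log above.
first_epoch_start = "2024-02-17 13:09:16,507"  # "Training Epoch: 1 / 50"
checkpoint_saved = "2024-02-17 13:13:58,675"   # "Saving last model checkpoint"
num_epochs = 50


def parse_log_time(text):
    """Parse the 'YYYY-MM-DD HH:MM:SS,mmm' prefix used by the log lines."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


elapsed = parse_log_time(checkpoint_saved) - parse_log_time(first_epoch_start)
total_seconds = elapsed.total_seconds()
seconds_per_epoch = total_seconds / num_epochs

print(f"training + validation: {total_seconds:.1f} s")
print(f"average per epoch:     {seconds_per_epoch:.2f} s")
```

For this log, training and validation took about 4 minutes 42 seconds in total; the model download (about 2 minutes 37 seconds for pytorch_model.bin in the same log) comes on top of that on the first run.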

＊The output is just an example. The run has succeeded when the final output shows "Saving last model checkpoint: val_loss ~ (output destination of the Fine Tuning training model)" followed by "all done".

＊The closer the values of "train loss" (loss function for training data) and "val_loss" (loss function for validation data) are to "0", the more closely the model's predictions match the data.

＊ As far as I tried, Fine Tuning the "Base model" (the underlying model: a large language model that has learned to predict words in sentences) did not yield the results I wanted so easily...

→ From what I subsequently tried, it appeared that the "Instruction Tuning" model could learn new knowledge with relative ease.
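
【Supplement：Reading the loss values】

To make the loss values in the log easier to read, they can be converted into perplexity and an average probability for the correct token. This sketch assumes the logged loss is a mean token-level cross-entropy measured in natural-log units, which is the usual convention for causal language model training but is not confirmed from the H2O LLM Studio source; the function name is hypothetical.

```python
import math


def describe_loss(loss):
    """Convert a mean cross-entropy loss (natural log) into two easier-to-read numbers."""
    perplexity = math.exp(loss)        # effective number of equally likely choices per token
    mean_token_prob = math.exp(-loss)  # geometric-mean probability given to the correct token
    return perplexity, mean_token_prob


# Validation losses copied from the example log (epoch 1 and epoch 50).
for label, loss in [("epoch 1 validation", 10.99237), ("epoch 50 validation", 0.18247)]:
    ppl, prob = describe_loss(loss)
    print(f"{label}: loss={loss:.5f}  perplexity={ppl:.2f}  mean token probability={prob:.4f}")
```

Under that assumption, a loss near "0" means the model assigns a probability near 1 to each correct token, which is why low loss values go together with answers that reproduce the training data.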

Run inference with Fine Tuned LLM

In「--e /content/〜/MyFineTuningLLMdemo」（Example of Custom Dataset Edition）, enter the path to the folder containing the Fine Tuned training model, then run "Run Code".

When "Please enter some prompt (type 'exit' to stop)：" appears, type a sentence in the space to its right and press the "enter" key to start the inference.

The "enter" key pressed to confirm character conversion (input-method conversion) is also treated as submitting the prompt, so it may be easier to type the sentence elsewhere first, paste it in, and then press the "enter" key.

Run Code

!pipenv run python prompt.py --e /content/drive/MyDrive/LLM/output/MyFineTuningLLMdemo

Example of output result①
In this example, the AI tended to give appropriate answers across a variety of question variations.
With the sample dataset（「train loss： 0.17」・「validation loss： 0.18247」・「Validation BLEU: 100.00000」）, the AI seems able to answer well the question formats it was trained on.

【Question】Tell me about the RehabC - digital de, asobu. channel on YouTube.
【AI's Answer】The RehabC channel is a digital technology educational channel launched in 2014.

【Question】Tell me about RehabC channel?
【AI's Answer】The RehabC channel is a digital technology educational channel launched in 2014.

【Question】What is YouTube's RehabC - digital de, asobu. channel?
【AI's Answer】The RehabC channel is a digital technology educational channel launched in 2014.

【Question】What is YouTube's RehabC channel?
【AI's Answer】The RehabC channel is a digital technology educational channel launched in 2014.

【Question】What is a RehabC channel?
【AI's Answer】The RehabC channel is a digital technology educational channel launched in 2014.

I tested the "What is a RehabC channel?" question several times and found that it tended not to be answered well, so I gave this question format more rows in the sample dataset than the other questions.

Therefore, if the AI is unable to answer a question successfully, it may work well to add more question-and-answer rows to the dataset for the questions it is failing to answer.
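
【Supplement：Building a dataset with question variations】

The idea above can be sketched as a small script that writes several phrasings of the same question, with extra rows for the phrasing the model tends to miss. The column names "instruction" and "output" are hypothetical placeholders, so replace them with the column names in your own dataset template. The answer text is copied from the example output above as a stand-in, and the number of extra copies is an arbitrary choice.

```python
import csv
import io

# Hypothetical column names; use the ones from your own dataset template.
QUESTION_COLUMN = "instruction"
ANSWER_COLUMN = "output"

# Stand-in answer text, copied from the example output in this tutorial.
answer = "The RehabC channel is a digital technology educational channel launched in 2014."

# Phrasings taken from the example questions in this tutorial.
question_variations = [
    "Tell me about the RehabC - digital de, asobu. channel on YouTube.",
    "Tell me about RehabC channel?",
    "What is YouTube's RehabC - digital de, asobu. channel?",
    "What is YouTube's RehabC channel?",
    "What is a RehabC channel?",
]

# Extra rows for the phrasing the model tends to miss (the count is arbitrary).
extra_copies = {"What is a RehabC channel?": 2}

rows = []
for question in question_variations:
    for _ in range(1 + extra_copies.get(question, 0)):
        rows.append({QUESTION_COLUMN: question, ANSWER_COLUMN: answer})

# Write to an in-memory buffer; swap in open("my_dataset.csv", "w", newline="") to save a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[QUESTION_COLUMN, ANSWER_COLUMN])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```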

Example of output result②
On the other hand, in the 「train loss： 0.38」・「validation loss： 0.37810」・「Validation BLEU: 100.00000」 example, answers to questions outside the trained question format were incorrect.

【Question】What is YouTube's RehabC - digital de, asobu. channel?
【AI's Answer】The Rehab channel is a digital technology educational channel launched in 2014.

＊Correct!

【Question】What is YouTube's RehabC channel?
【AI's Answer】RehabC is a YouTube channel that specializes in creating educational and informative content that is suitable for all skill levels. channel〜

＊Incorrect...

【Question】What is a RehabC channel?
【AI's Answer】A RehabC channel is a digital channel that is designed to provide educational and informative content about rehabilitation and dr〜

＊Incorrect...

【Question】Tell me about the RehabC - digital de, asobu. channel on YouTube.
【AI's Answer】RehabC - digital de, asobu. channel on YouTube is a digital educational channel from RehabIT. The channel offers digital learning courses, digital 〜

＊Incorrect...

【Question】Tell me about RehabC channel?
【AI's Answer】RehabC is a YouTube channel that focuses on providing quality videos related to drug education and prevention. It was founded in 2014 and is hosted 〜

＊Incorrect...

I adjusted the batch size and LoRA parameters and looked at the training results, but in the end, the best solution seemed to be to "Increase the number of training sessions".

With some LLMs, training on only one question format was enough for the model to handle other question formats. The LLM used in the tutorial did not work well unless it was trained on a variety of question formats.

In the end, perhaps we need to train the LLM, watch its reactions, and use the strategies that work for it.

【Conclusion】Introduction to Fine Tuning of LLM

Through trial and error, I found that the effect of Fine Tuning may be difficult to check casually with the "Base model version of LLM", but with the "Instruction tuning version of LLM" the changes are easy to check!

As a matter of fact, after creating a Fine Tuning program for the English version of the LLM, I was stumped for a while because I could not teach certain knowledge as well in the English LLM as I could in the Japanese.

The solution was to include in the data set a variety of possible variations of question formats when teaching specific knowledge.

After many twists and turns, I'm very pleased to be able to release a series of Fine Tuning information on the English version of the LLM.
I hope this set of information will be helpful to beginning students of Generative AI who are interested in utilizing LLM.

And I'm thankful to live in a time when even the free Google Colaboratory plan is enough to enjoy the behavior of LLM.

Comments on LLM's Fine Tuning Explanatory Video and Examples of Responses, etc.
- To be created

Examples of responses to comments received after the video is released will be included here.

【Paper】LLM, Fine Tuning, etc.

I will pick up information on papers that may be of interest to you when considering LLM and Fine Tuning.
If you would like to further your knowledge, please check out the papers when you have time.

【Paper】Survey on LLM

【Paper】

A Survey of Large Language Models - 2023 | arxiv

Let's try to get a general idea of what a modern LLM is.

【Paper】Transformer, the basic premise of LLM

【Paper】

Attention Is All You Need：Google - First edition in 2017, revised in 2023 | arxiv

The Impact of the Attention Mechanism.
It started here...

【Paper】Three Factors that Improve Transformer Performance
：Scaling Laws for Language Model

【Paper】

Scaling Laws for Neural Language Models：OpenAI - 2020 | arxiv

Scaling Laws：
A relationship in which, as one quantity changes, another quantity changes according to a "power law".

The performance of Transformer may improve in prediction accuracy as

①The number of parameters
②The size of the data set
③The amount of computation

are increased.

To fulfill these conditions, financial strength is still the key...
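
As a rough sketch of what "power law" means here, the paper fits the test loss $L$ with curves of the following form when only one of the three factors is the bottleneck. The symbols $N$ (number of parameters), $D$ (dataset size in tokens), and $C_{\min}$ (compute) and the approximate exponents are as I recall them from the paper, so please check the paper itself for the exact fitted constants.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}

% Approximate exponents: \alpha_N \approx 0.076, \alpha_D \approx 0.095, \alpha_C^{\min} \approx 0.050
% Worked example: doubling N multiplies the loss by 2^{-0.076} \approx 0.95 (about a 5% reduction).
```

Because the exponents are small, each further reduction in loss requires multiplying the parameters, data, or computation rather than adding to them.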

【Paper】Few-Shot Learning：Emergent Phenomena of LLM

【Paper】

Language Models are Few-Shot Learners：OpenAI - 2020 | arxiv

Few-Shot Learning：
Ability to respond to a variety of tasks with only a small amount of training data (example sentences) as a sample in context.

Emergent Phenomena：
A phenomenon in which a model, once scaled up past a certain size, suddenly becomes capable of performing a particular task.

It's the wonder of the LLM....
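
As an illustration of "training data as a sample in context", the following sketch builds a few-shot prompt as a plain string. The translation task, the example pairs, and the function name are all hypothetical; no model is called, and the resulting text would simply be passed to an LLM as its prompt.

```python
def build_few_shot_prompt(examples, new_input):
    """Put a few worked examples in front of the new input, all in one prompt string."""
    lines = []
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"Japanese: {target}")
        lines.append("")  # blank line between examples
    lines.append(f"English: {new_input}")
    lines.append("Japanese:")  # left open for the model to complete
    return "\n".join(lines)


# Hypothetical example pairs shown to the model inside the prompt.
examples = [
    ("Good morning.", "おはようございます。"),
    ("Thank you.", "ありがとうございます。"),
]
prompt = build_few_shot_prompt(examples, "Good night.")
print(prompt)
```

No model weights are updated in this approach, which is the difference from Fine Tuning: the examples exist only in the prompt.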

【Paper】LoRA：Efficient Fine Tuning

【Paper】

LoRA: Low-Rank Adaptation of Large Language Models：Microsoft et al. - 2021 | arxiv

LoRA keeps the pretrained LLM weights frozen and represents the weight updates with a technique called Low-Rank Matrix Approximation, which reduces the computation and memory needed for efficient fine tuning.

When LoRA is combined with 4-bit NormalFloat (NF4) quantization, it is called "QLoRA" (Quantized Low-Rank Adaptation).
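
The low-rank idea can be sketched in plain Python as follows. The function names are hypothetical and the tiny matrices are made up. The layer shapes in the parameter arithmetic (input size 2048, output size 6144, 24 layers, rank 4) are an assumption chosen only because they multiply out to the "trainable params: 786432" line in the example log; the actual settings of that run are not confirmed here.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Tiny example: a frozen 3x3 weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3]]      # rank x d_in
B = [[0.0], [0.0], [0.0]]  # d_out x rank, zero at the start so the output equals W x
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x, alpha=16, rank=1)
print(y)

# Parameter arithmetic under the assumed shapes (see the note above the code).
full_per_layer = 2048 * 6144
lora_per_layer = lora_param_count(2048, 6144, rank=4)
lora_total = lora_per_layer * 24
print(full_per_layer, lora_per_layer, lora_total)

# Share of trainable parameters, using the two counts printed in the example log.
trainable_percent = 100 * 786432 / 1415327744
print(trainable_percent)
```

In this sketch, a rank-4 update to one 2048 x 6144 weight needs 32,768 trainable values instead of 12,582,912, which is where the memory savings come from.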

by 子供プログラマー – Child Programmer

日本人のための人工知能プログラマー入門講座（機械学習）| 一覧ページ

An Introductory Course for Japanese Artificial Intelligence Programmers (Machine Learning) by Child Programmer

