
Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth

Community Article · Published July 29, 2024

[Maxime Labonne](/mlabonne)

A beginner's guide to state-of-the-art supervised fine-tuning

The recent release of Llama 3.1 offers models with an incredible level of performance, closing the gap between closed-source and open-weight models. Instead of using frozen, general-purpose LLMs like GPT-4o and Claude 3.5, you can fine-tune Llama 3.1 for your specific use cases to achieve better performance and customizability at a lower cost.

In this article, we will provide a comprehensive overview of supervised fine-tuning. We will compare it to prompt engineering to understand when it makes sense to use it, detail the main techniques with their pros and cons, and introduce major concepts, such as LoRA hyperparameters, storage formats, and chat templates. Finally, we will implement it in practice by fine-tuning Llama 3.1 8B in Google Colab with state-of-the-art optimization using Unsloth.

All the code used in this article is available on Google Colab and in the LLM Course. Special thanks to Daniel Han for answering my questions.

🔧 Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) is a method to improve and customize pre-trained LLMs. It involves retraining base models on a smaller dataset of instructions and answers. The main goal is to transform a basic model that predicts text into an assistant that can follow instructions and answer questions. SFT can also enhance the model's overall performance, add new knowledge, or adapt it to specific tasks and domains. Fine-tuned models can then go through an optional preference alignment stage (see my article about DPO) to remove unwanted responses, modify their style, and more.

The following figure shows an instruction sample. It includes a system prompt to steer the model, a user prompt to provide a task, and the output the model is expected to generate. You can find a list of high-quality open-source instruction datasets in the 💾 LLM Datasets GitHub repo.
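
Written out as plain text, such a sample looks like this (it is the same instruction we will reuse later when illustrating chat templates):

System prompt: You are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.

User prompt: Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on theirs device.

Output: Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirsdevice.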

Before considering SFT, I recommend trying prompt engineering techniques like few-shot prompting or retrieval augmented generation (RAG). In practice, these methods can solve many problems without the need for fine-tuning, using either closed-source or open-weight models (e.g., Llama 3.1 Instruct). If this approach doesn't meet your objectives (in terms of quality, cost, latency, etc.), then SFT becomes a viable option when instruction data is available. Note that SFT also offers benefits like additional control and customizability to create personalized LLMs.

However, SFT has limitations. It works best when leveraging knowledge already present in the base model. Learning completely new information, such as an unknown language, can be challenging and lead to more frequent hallucinations. For new domains unknown to the base model, it is recommended to first continue pre-training the model on a raw dataset from that domain, as sketched below.
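
As a rough sketch of what that looks like in practice (using the same SFTTrainer API we rely on later in this article; the dataset name is hypothetical and should point to a raw-text corpus with a "text" column):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "meta-llama/Meta-Llama-3.1-8B"  # the base model, before any instruction tuning
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

raw_dataset = load_dataset("your-org/your-domain-corpus", split="train")  # hypothetical domain corpus

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=raw_dataset,
    dataset_text_field="text",  # plain documents, no chat template applied
    max_seq_length=2048,
    args=TrainingArguments(output_dir="cpt-output", num_train_epochs=1, per_device_train_batch_size=1),
)
trainer.train()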

On the opposite end of the spectrum, instruct models (i.e., already fine-tuned models) can already be very close to your requirements. For example, a model might perform very well but state that it was trained by OpenAI or Meta instead of you. In this case, you might want to slightly steer the instruct model's behavior using preference alignment. By providing chosen and rejected samples for a small set of instructions (between 100 and 1000 samples), you can force the LLM to say that you trained it instead of OpenAI.
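
For illustration, a single preference sample for this kind of identity correction could look like the following (the prompt/chosen/rejected field names follow the convention used by most DPO trainers; the values are made up):

{
  "prompt": "Who trained you?",
  "chosen": "I was fine-tuned by Acme Labs on top of an open-weight Llama model.",
  "rejected": "I was trained by OpenAI."
}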

The three most popular SFT techniques are full fine-tuning, LoRA, and QLoRA.

Full fine-tuning is the most straightforward SFT technique. It involves retraining all parameters of a pre-trained model on an instruction dataset. This method often provides the best results but requires significant computational resources (several high-end GPUs are required to fine-tune an 8B model). Because it modifies the entire model, it is also the most destructive method and can lead to catastrophic forgetting of previous skills and knowledge.

Low-Rank Adaptation (LoRA) is a popular parameter-efficient fine-tuning technique. Instead of retraining the entire model, it freezes the weights and introduces small adapters (low-rank matrices) at each targeted layer. This allows LoRA to train a number of parameters that is drastically lower than full fine-tuning (less than 1%), reducing both memory usage and training time. This method is non-destructive since the original parameters are frozen, and adapters can then be switched or combined at will.
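
To make the idea concrete, here is a minimal sketch of a LoRA adapter wrapping a frozen linear layer (purely illustrative, not Unsloth's or PEFT's actual implementation):

import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA: y = Wx + (alpha / r) * B(A(x)), with the pre-trained W frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the original weights
        self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection to rank r
        self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection back to the output size
        nn.init.zeros_(self.B.weight)                         # start as a no-op: B @ A = 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))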

QLoRA (Quantization-aware Low-Rank Adaptation) is an extension of LoRA that offers even greater memory savings. It provides up to 33% additional memory reduction compared to standard LoRA, making it particularly useful when GPU memory is constrained. This increased efficiency comes at the cost of longer training times, with QLoRA typically taking about 39% more time to train than regular LoRA.

While QLoRA requires more training time, its substantial memory savings can make it the only viable option in scenarios where GPU memory is limited. For this reason, this is the technique we will use in the next section to fine-tune a Llama 3.1 8B model on Google Colab.

🦙 Fine-Tune Llama 3.1 8B

To efficiently fine-tune a Llama 3.1 8B model, we'll use the Unsloth library by Daniel and Michael Han. Thanks to its custom kernels, Unsloth provides 2x faster training and 60% memory use compared to other options, making it ideal in a constrained environment like Colab. Unfortunately, Unsloth only supports single-GPU settings at the moment. For multi-GPU settings, I recommend popular alternatives like TRL and Axolotl (both also include Unsloth as a backend).

In this example, we will QLoRA fine-tune it on the mlabonne/FineTome-100k dataset. It's a subset of arcee-ai/The-Tome (without arcee-ai/qwen2-72b-magpie-en) that I re-filtered using HuggingFaceFW/fineweb-edu-classifier. Note that this classifier wasn't designed for instruction data quality evaluation, but we can use it as a rough proxy. The resulting FineTome is an ultra-high quality dataset that includes conversations, reasoning problems, function calling, and more.
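
As a rough illustration of how such filtering can be done (a simplified sketch, not the exact pipeline used to build FineTome; the threshold is arbitrary), the classifier assigns each sample an educational-value score:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

clf_name = "HuggingFaceFW/fineweb-edu-classifier"
clf_tokenizer = AutoTokenizer.from_pretrained(clf_name)
clf_model = AutoModelForSequenceClassification.from_pretrained(clf_name)

def edu_score(text: str) -> float:
    # The classifier outputs a single regression logit, roughly on a 0-5 educational-value scale
    inputs = clf_tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = clf_model(**inputs).logits
    return logits.squeeze(-1).item()

# Keep only samples above an (arbitrary) quality threshold
print(edu_score("Gravity is the force that pulls objects toward each other.") >= 2.5)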

Let's start by installing all the required libraries.

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Once installed, we can import them as follows.

import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

Let's now load the model. Since we want to use QLoRA, I chose the pre-quantized unsloth/Meta-Llama-3.1-8B-bnb-4bit. This 4-bit precision version of meta-llama/Meta-Llama-3.1-8B is significantly smaller (5.4 GB) and faster to download compared to the original 16-bit precision model (16 GB). We load in NF4 format using the bitsandbytes library.

When loading the model, we must specify a maximum sequence length, which restricts its context window. Llama 3.1 supports up to 128k context length, but we will set it to 2,048 in this example since longer contexts consume more compute and VRAM. Finally, the dtype parameter (left as None below) automatically detects whether your GPU supports the BF16 format for more stability during training (this feature is restricted to Ampere and more recent GPUs).

max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
    dtype=None,
)

Now that our model is loaded in 4-bit precision, we want to prepare it for parameter-efficient fine-tuning with LoRA adapters. LoRA has three important parameters:

- Rank (r), which determines the size of the LoRA matrices and therefore the number of trainable parameters. Higher ranks can store more information but increase compute and memory costs.
- Alpha (α), a scaling factor applied to the adapter updates. It is commonly set equal to or double the rank.
- Target modules, the layers the adapters are attached to. LoRA was originally applied only to the attention projections, but extending it to the other linear layers (such as the feed-forward projections) improves quality at the cost of more trainable parameters.

Here, we set r=16, α=16, and target every linear module to maximize quality. We don't use dropout or biases, which speeds up training.

In addition, we will use Rank-Stabilized LoRA (rsLoRA), which modifies the scaling factor of LoRA adapters to be proportional to 1/√r instead of 1/r. This stabilizes learning (especially for higher adapter ranks) and allows for improved fine-tuning performance as rank increases. Gradient checkpointing is handled by Unsloth to offload input and output embeddings to disk and save VRAM.
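
In code, the difference is just the scaling factor applied to the adapter output (Unsloth and PEFT handle this internally when use_rslora=True; the numbers below use our r and α values):

r, alpha = 16, 16
standard_scaling = alpha / r          # 1.0 (classic LoRA scaling, shrinks quickly as r grows)
rslora_scaling = alpha / (r ** 0.5)   # 4.0 (rank-stabilized scaling, decays like 1/sqrt(r))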

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"], 
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)

With this LoRA configuration, we'll only train 42 million out of 8 billion parameters (0.5196%). This shows how much more efficient LoRA is compared to full fine-tuning.
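
You can verify this yourself by counting the parameters that still require gradients (the exact percentage may differ slightly because the frozen base weights are stored as packed 4-bit tensors):

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")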

Let's now load and prepare our dataset. Instruction datasets are stored in a particular format: it can be Alpaca, ShareGPT, OpenAI, etc. First, we want to parse this format to retrieve our instructions and answers. Our mlabonne/FineTome-100k dataset uses the ShareGPT format with a unique "conversations" column containing messages in JSONL. Unlike simpler formats like Alpaca, ShareGPT is ideal for storing multi-turn conversations, which is closer to how users interact with LLMs.
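
For reference, a single ShareGPT-style row looks roughly like this (the content is illustrative; the "from" and "value" keys are the ones we map to the tokenizer's role and content fields below):

{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Remove the spaces from the following sentence: ..."},
    {"from": "gpt", "value": "Itpreventsuserstosuspect..."}
  ]
}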

Once our instruction-answer pairs are parsed, we want to reformat them to follow a chat template. Chat templates are a way to structure conversations between users and models. They typically include special tokens to identify the beginning and the end of a message, who's speaking, etc. Base models don't have chat templates so we can choose any: ChatML, Llama3, Mistral, etc. In the open-source community, the ChatML template (originally from OpenAI) is a popular option. It simply adds two special tokens (<|im_start|> and <|im_end|>) to indicate who's speaking.

If we apply this template to the previous instruction sample, here's what we get:

<|im_start|>system
You are a helpful assistant, who always provide explanation. Think like you are answering to a five year old.<|im_end|>
<|im_start|>user
Remove the spaces from the following sentence: It prevents users to suspect that there are some hidden products installed on theirs device.
<|im_end|>
<|im_start|>assistant
Itpreventsuserstosuspectthattherearesomehiddenproductsinstalledontheirsdevice.<|im_end|>

In the following code block, we parse our ShareGPT dataset with the mapping parameter and include the ChatML template. We then load and process the entire dataset to apply the chat template to every conversation.

tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = dataset.map(apply_template, batched=True)

We're now ready to specify the training parameters for our run. I want to briefly introduce the most important hyperparameters (all of the values below appear in the code block that follows):

- Learning rate: controls how strongly the model updates its parameters. Too low, and training is slow and may stall; too high, and training can become unstable or degrade performance. We use 3e-4 with a linear scheduler that decays it to zero over the run.
- Batch size: per_device_train_batch_size=8 with gradient_accumulation_steps=2 gives an effective batch size of 16, improving gradient estimates at the cost of more memory.
- Number of epochs: we make a single pass over the dataset (num_train_epochs=1).
- Optimizer: adamw_8bit, an 8-bit variant of AdamW that reduces optimizer memory with little impact on quality.
- Weight decay and warmup steps: a small weight decay (0.01) regularizes the weights, and 10 warmup steps ramp the learning rate up gradually at the start of training.
- Packing: packing=True concatenates several short samples into one sequence of max_seq_length tokens, which makes training more efficient.

I trained the model on the entire dataset (100k samples) using an A100 GPU (40 GB of VRAM) on Google Colab. The training took 4 hours and 45 minutes. Of course, you can use smaller GPUs with less VRAM and a smaller batch size, but they're not nearly as fast. For example, it takes roughly 19 hours and 40 minutes on an L4 and a whopping 47 hours on a free T4.

In this case, I recommend only loading a subset of the dataset to speed up training. You can do it by modifying the previous code block, like dataset = load_dataset("mlabonne/FineTome-100k", split="train[:10000]") to only load 10k samples. Alternatively, you can use cheaper cloud GPU providers like Paperspace, RunPod, or Lambda Labs.

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        lr_scheduler_type="linear",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        seed=0,
    ),
)

trainer.train()

Now that the model is trained, let's test it with a simple prompt. This is not a rigorous evaluation but just a quick check to detect potential issues. We use FastLanguageModel.for_inference() to get 2x faster inference.

model = FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "Is 9.11 larger than 9.9?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True)

The model's response is "9.9", which is correct!

Let's now save our trained model. If you remember the part about LoRA and QLoRA, what we trained is not the model itself but a set of adapters. There are three save methods in Unsloth: lora to only save the adapters, and merged_16bit/merged_4bit to merge the adapters with the model in 16-bit/ 4-bit precision.

In the following, we merge them in 16-bit precision to maximize the quality. We first save it locally in the "model" directory and then upload it to the Hugging Face Hub. You can find the trained model on mlabonne/FineLlama-3.1-8B.

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("mlabonne/FineLlama-3.1-8B", tokenizer, save_method="merged_16bit")
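
If you only need the lightweight adapters (for example, to keep several task-specific adapters around the same base model), the same calls accept the lora save method:

# Adapter-only save: a few hundred MB instead of the full ~16 GB merged model
model.save_pretrained_merged("lora_model", tokenizer, save_method="lora")
# Optionally push them to your own repo (hypothetical repo id)
model.push_to_hub_merged("your-username/FineLlama-3.1-8B-LoRA", tokenizer, save_method="lora")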

Unsloth also allows you to directly convert your model into GGUF format. This is a quantization format created for llama.cpp and compatible with most inference engines, like LM Studio, Ollama, and oobabooga's text-generation-webui. Since you can specify different precisions (see my article about GGUF and llama.cpp), we'll loop over a list to quantize it in q2_k, q3_k_m, q4_k_m, q5_k_m, q6_k, q8_0 and upload these quants on Hugging Face. The mlabonne/FineLlama-3.1-8B-GGUF contains all our GGUFs.

quant_methods = ["q2_k", "q3_k_m", "q4_k_m", "q5_k_m", "q6_k", "q8_0"]
for quant in quant_methods:
    model.push_to_hub_gguf("mlabonne/FineLlama-3.1-8B-GGUF", tokenizer, quant)
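
To sanity-check a quant locally, you can load it with llama-cpp-python, for example (a sketch: the file name assumes the q4_k_m quant, and the chat template stored in the GGUF metadata is expected to be picked up automatically):

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="FineLlama-3.1-8B.Q4_K_M.gguf", n_ctx=2048)  # adjust the path to your quant
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
    max_tokens=128,
)
print(output["choices"][0]["message"]["content"])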

Congratulations, we fine-tuned a model end to end and uploaded quants you can now use in your favorite inference engine. Feel free to try the final model available on mlabonne/FineLlama-3.1-8B-GGUF. What to do now? Here are some ideas on how to use your model:

- Evaluate it on benchmarks, for example with the Open LLM Leaderboard or a framework like lm-evaluation-harness, to measure what the fine-tuning actually changed.
- Align it with preference data using DPO to refine its style and remove unwanted responses.
- Quantize it in other formats (we covered GGUF above; EXL2, AWQ, and GPTQ are common alternatives) for faster or cheaper inference.
- Deploy it, for example in a Hugging Face Space or behind an inference endpoint, so others can chat with it.

This article provided a comprehensive overview of supervised fine-tuning and how to apply it in practice to a Llama 3.1 8B model. By leveraging QLoRA's efficient memory usage, we managed to fine-tune an 8B LLM on a super high-quality dataset with limited GPU resources. We also provided more efficient alternatives for bigger runs and suggestions for further steps, including evaluation, preference alignment, quantization, and deployment.

I hope this guide was useful. If you're interested in learning more about LLMs, I recommend checking the LLM Course. If you enjoyed this article, follow me on X @maximelabonne and on Hugging Face @mlabonne. Good luck fine-tuning models!