Cutting the Cost of LLM Fine-Tuning with Unsloth: A Practical Guide

Introduction

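Unsloth is one of the most aggressive fine-tuning acceleration tools available today. Using hand-optimized Triton kernels and a collection of engineering tricks, it claims to speed up fine-tuning by 2-5x and cut VRAM usage by 50-80%.

Its biggest selling point: you can fine-tune a 7B model on the free T4 GPU (16 GB VRAM) in Google Colab.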

1. Installation

pip install unsloth
# Installs the unsloth build that matches your PyTorch version
# It automatically pulls in compatible versions of transformers and other dependencies

In a Colab environment, use the official one-line install:

# Install on Colab
!pip install unsloth

2. Loading the Model

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",  # model pre-quantized by Unsloth
    max_seq_length=2048,     # adjust to the length of your data
    dtype=None,              # None = auto-detect
    load_in_4bit=True,       # load with 4-bit quantization
)

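Unsloth has already pre-quantized many popular models, so you can use the versions with the -bnb-4bit suffix directly. The list of supported models is available at unsloth.ai.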
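
To see why 4-bit loading matters on a 16 GB T4, here is a rough back-of-the-envelope sketch. It assumes a round 7 billion parameters and counts only the weights; activations, LoRA adapters, optimizer state, and quantization metadata all add to the total, so read the results as lower bounds rather than measurements.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_7B = 7e9  # assumption: a round 7 billion parameters

fp16_gb = weight_memory_gb(PARAMS_7B, 16)  # half-precision weights
int4_gb = weight_memory_gb(PARAMS_7B, 4)   # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the half-precision weights alone take most of a T4's memory, while the 4-bit weights leave room for everything else training needs.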

3. Adding LoRA

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    target_modules=[         # which layers get LoRA adapters
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,          # 0 is the value Unsloth's fast path is optimized for
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized gradient checkpointing
    random_state=3407,
    use_rslora=False,
)
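
To get a feel for how small the trainable part is, the sketch below counts LoRA parameters for the configuration above. Each adapted linear layer adds two low-rank matrices, for a total of r * (in_features + out_features) parameters. The layer dimensions are assumptions taken from the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, key/value projection width 512, 28 layers); check them against model.config before relying on the numbers.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed Qwen2.5-7B dimensions; verify against model.config.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3584, 18944, 512, 28
RANK = 16  # matches r=16 in get_peft_model above

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

With these assumed dimensions the adapters come to roughly 40 million parameters, well under 1% of a 7B model.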

4. Preparing the Data

from datasets import Dataset

# Training data format: each record has instruction, input (optional), and output
data = [
    {"instruction": "Translate the following text into English", "input": "你好世界", "output": "Hello World"},
    {"instruction": "Translate the following text into English", "input": "今天天气不错", "output": "The weather is nice today"},
    # ... more data
]

# Convert to chat format using the tokenizer's chat template
def format_conversation(examples):
    texts = []
    for instruction, inp, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Append the optional input to the instruction on a new line
        content = instruction
        if inp:
            content += f"\n{inp}"
        messages = [
            {"role": "user", "content": content},
            {"role": "assistant", "content": output}
        ]
        text = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=False
        )
        texts.append(text)
    return {"text": texts}

dataset = Dataset.from_list(data)
dataset = dataset.map(format_conversation, batched=True)
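
Before mapping a large dataset, it can help to check individual records without loading a tokenizer. The helper below is a minimal sketch that mirrors the message-building step of format_conversation; the name build_messages and the validation rules are illustrative assumptions, not part of Unsloth or datasets.

```python
def build_messages(record: dict) -> list[dict]:
    """Turn one instruction/input/output record into chat messages.

    Hypothetical helper: raises ValueError if a required field is missing
    or empty, so bad records surface before training starts.
    """
    for key in ("instruction", "output"):
        if not record.get(key):
            raise ValueError(f"record is missing a non-empty '{key}' field")

    content = record["instruction"]
    if record.get("input"):  # input is optional
        content += "\n" + record["input"]

    return [
        {"role": "user", "content": content},
        {"role": "assistant", "content": record["output"]},
    ]


sample = {
    "instruction": "Translate the following text into English",
    "input": "你好世界",
    "output": "Hello World",
}
for message in build_messages(sample):
    print(message)
```

Running this over the raw list first makes it easier to catch empty outputs or missing keys than debugging a failed map call.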

5. Training

from trl import SFTTrainer
from transformers import TrainingArguments

# Note: this follows the older TRL signature. Newer TRL releases expect
# dataset_text_field and the sequence length in SFTConfig and rename
# tokenizer= to processing_class=, so adjust to your installed version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,          # run 60 steps first; increase if results look good
        learning_rate=2e-4,    # learning rate recommended by Unsloth
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="./output",
        optim="adamw_8bit",
        seed=3407,
    ),
)

trainer_stats = trainer.train()
print(f"Training finished in {trainer_stats.metrics['train_runtime']:.1f} seconds")
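
The batch settings above interact with max_steps, so it is worth working out how much data 60 steps actually covers. The sketch below does that arithmetic; the function names are illustrative, and the steps-per-epoch figure is an approximation that assumes a single GPU.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def samples_seen(max_steps: int, eff_batch: int) -> int:
    """Total samples processed after max_steps optimizer steps."""
    return max_steps * eff_batch


def steps_per_epoch(num_samples: int, eff_batch: int) -> int:
    """Approximate optimizer steps needed for one pass over the data."""
    return math.ceil(num_samples / eff_batch)


eff = effective_batch_size(per_device=2, grad_accum=4)
print(f"effective batch size:        {eff}")
print(f"samples seen in 60 steps:    {samples_seen(60, eff)}")
print(f"steps for 1 epoch, 1000 ex.: {steps_per_epoch(1000, eff)}")
```

With these settings, 60 steps cover 480 samples, so on a 1000-sample dataset the trial run sees less than half the data; roughly 125 steps would be one full epoch.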

Learning rates recommended by Unsloth, for reference:

Base model	Recommended learning rate
Qwen2.5	2e-4
Llama 3.2	2e-4
DeepSeek	1e-4
Mistral	2e-4

6. Testing Inference

# Switch to inference mode
FastLanguageModel.for_inference(model)

def ask(instruction, input_text=""):
    content = instruction
    if input_text:
        content += f"\n{input_text}"
    messages = [{"role": "user", "content": content}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(ask("Translate the following text into English:", "机器学习很有趣"))

7. Saving the Model

# Save the LoRA weights (only a few MB to a few dozen MB)
model.save_pretrained("./lora_weights")
tokenizer.save_pretrained("./lora_weights")

# Or save in GGUF format (for Ollama / llama.cpp)
model.save_pretrained_gguf("./gguf_model", tokenizer, quantization_method="q4_k_m")

Once saved as GGUF, the model can be loaded directly with Ollama:

ollama create my-model -f Modelfile
# Modelfile contents:
# FROM ./gguf_model/ggml-model-q4_k_m.gguf
# (the exact .gguf filename can vary across Unsloth versions; check the output directory)
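
If you prefer to generate the Modelfile from the same script that exports the GGUF, a minimal sketch could look like the following. The helper name write_modelfile is hypothetical, and the GGUF path is simply the one used in the text above; only the FROM directive is written, with no template or sampling parameters.

```python
import tempfile
from pathlib import Path


def write_modelfile(gguf_path: str, out_dir: str = ".") -> Path:
    """Write a one-line Ollama Modelfile pointing at a local GGUF file."""
    modelfile = Path(out_dir) / "Modelfile"
    modelfile.write_text(f"FROM {gguf_path}\n", encoding="utf-8")
    return modelfile


# Demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = write_modelfile("./gguf_model/ggml-model-q4_k_m.gguf", tmp)
    print(path.read_text(encoding="utf-8"), end="")
```

In a real run you would write the file next to the exported model and then call ollama create as shown above.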

8. The Full Workflow on a Free Colab GPU

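Open Google Colab and select the T4 GPU runtime
Run !pip install unsloth
Run the model loading, LoRA setup, data preparation, and training code above
After training, save the LoRA weights to Google Drive
Disconnect the runtime (so it stops consuming GPU quota)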
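
For the "save to Google Drive" step, one possible sketch is to copy the adapter directory into a mounted Drive folder. In Colab you would first mount Drive with `from google.colab import drive; drive.mount("/content/drive")`; the destination path /content/drive/MyDrive and the helper name backup_adapter are assumptions for illustration. The demo below uses a temporary directory so it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def backup_adapter(src_dir, dest_root) -> Path:
    """Copy a saved LoRA adapter directory under dest_root and return the copy."""
    src = Path(src_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"adapter directory not found: {src}")
    dest = Path(dest_root) / src.name
    # dirs_exist_ok lets a later run overwrite an earlier backup (Python 3.8+)
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# Demo with temporary folders standing in for ./lora_weights and Drive
with tempfile.TemporaryDirectory() as tmp:
    fake_adapter = Path(tmp) / "lora_weights"
    fake_adapter.mkdir()
    (fake_adapter / "adapter_config.json").write_text("{}", encoding="utf-8")

    copied = backup_adapter(fake_adapter, Path(tmp) / "drive")
    print(sorted(p.name for p in copied.iterdir()))
```

In Colab the call would look like backup_adapter("./lora_weights", "/content/drive/MyDrive"), assuming Drive is mounted at the default location.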

Performance reference on a Colab T4 (QLoRA fine-tuning of a 7B model):

Samples	Batch size	Training time	Gradient accumulation steps
200	2	~8 min	4
500	2	~20 min	4
1000	2	~40 min	4
2000	2	~80 min	4

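The free tier of Colab limits session length (roughly 3-4 hours of continuous use, after which the runtime may disconnect), so datasets of up to 2000 samples generally finish in time.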
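
The timings in the table scale linearly, at about 8 minutes per 200 samples. The sketch below extrapolates from that rate to check whether a dataset is likely to fit in one session. It assumes the same batch settings and sequence lengths as the table and a conservative 3-hour window; real runs will vary.

```python
def estimate_minutes(num_samples: int) -> float:
    """Extrapolate training time from the table's rate of ~8 min per 200 samples."""
    return num_samples * 8 / 200


def fits_in_session(num_samples: int, session_hours: float = 3.0) -> bool:
    """True if the estimated training time fits within one Colab session."""
    return estimate_minutes(num_samples) <= session_hours * 60


for n in (500, 2000, 5000):
    print(f"{n} samples: ~{estimate_minutes(n):.0f} min, fits: {fits_in_session(n)}")
```

Under these assumptions the limit is around 4500 samples per session, so larger datasets would need checkpointing and resuming across sessions.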

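9. Unsloth vs. LLaMA-Factory: Which to Choose

	Unsloth	LLaMA-Factory
Learning curve	Moderate (you write code)	Low (point-and-click Web UI)
Training speed	Fastest (2-5x speedup)	Baseline
VRAM usage	Lowest	Baseline
Supported models	30+	100+
7B on Colab	Easy	Possible, but Unsloth is more stable
Best for	Speed and low VRAM	Convenience and model variety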

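Choose Unsloth if you have a small GPU, want to run on Colab, or care most about training speed.

Choose LLaMA-Factory if you would rather not write code, need to fine-tune a less common model, or want an all-in-one toolchain.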