Chat Supervised Fine-Tuning (SFT)

This guide demonstrates supervised fine-tuning (SFT) on chat-formatted data against a running TuFT server. Full runnable code is in the examples/chat_sft/chat_sft.ipynb notebook. Beyond the general SFT workflow, the guide also documents common issues you may hit when using TuFT for SFT, with step-by-step guidance for completing an end-to-end run.


What You’ll Learn

  1. How to load chat datasets from HuggingFace and extract multi-turn messages

  2. How to format conversations using model chat templates (apply_chat_template)

  3. How to implement assistant-only loss masking and compute masked negative log-likelihood for evaluation

  4. How to construct Datum objects and run an end-to-end LoRA SFT loop via the TuFT server

  5. How to choose and tune LoRA rank and learning rate based on train/test curves


Table of Contents

  1. When to Use SFT vs. RL

  2. Datasets

  3. Minimal Training Example (SFT)

  4. Key Concepts

  5. Parameter Selection

  6. Q&A


When to Use SFT vs. RL

SFT vs. RL (high-level comparison):

| Topic | SFT (Supervised Fine-Tuning) | RL (Reinforcement Learning) |
| --- | --- | --- |
| Training signal | Demonstrations (target responses) | Reward / preferences (scalar or ranking) |
| Best for | Style, format, instruction following, domain behavior from curated answers | Aligning behavior to preferences/constraints, safety policies, multi-objective trade-offs |
| Data required | High-quality assistant responses | Reward model, preference pairs, or evaluators |
| Typical workflow | Often the first stage | Often follows SFT (SFT → RL) |
| Examples of training data / signal | Input-output pairs, e.g. prompt: “Rewrite as a polite email …” → target: “Dear … Sincerely …” | LLM-as-judge: rank A vs. B or score responses |

Rule of thumb

  • Use SFT when you can provide good “gold” assistant responses and want the model to imitate a clear target output.

  • Use RL when there is no single correct response, but you can define what is “better” via a reward or preference signal, often based on task requirements like helpfulness, safety, style, formatting, or tool-use behavior.


Datasets

This guide uses no_robots.

| Dataset | Source | Size | Train On | Use Case |
| --- | --- | --- | --- | --- |
| no_robots | HuggingFaceH4/no_robots | ~9.5K train + 500 test | All assistant messages (masked) | Quick experiments |

Minimal loader:

from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/no_robots")
train_data = [row["messages"] for row in ds["train"]]
test_data  = [row["messages"] for row in ds["test"]]

Each sample is a list of chat messages:

{"role": "user" | "assistant", "content": "..."}

Minimal Training Example (SFT)

TuFT (Tenant-unified FineTuning) is a multi-tenant system that provides a unified service API for fine-tuning large language models (LLMs). It supports the Tinker service API and can be used with the Tinker SDK. Unlike Tinker, TuFT can run on local GPUs; the experiments below were conducted on a local 2× NVIDIA A100-SXM4-80GB setup (Driver 550.54.15, CUDA 12.9). Before running the example, follow the Installation Guide to start the TuFT server locally.

Key TuFT calls (full code in examples/chat_sft/chat_sft.ipynb):

import tinker
from tinker import types

service_client = tinker.ServiceClient(base_url="http://localhost:10610", api_key=TINKER_API_KEY)

training_client = service_client.create_lora_training_client(
    base_model=BASE_MODEL,
    rank=LORA_RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

fwdbwd = training_client.forward_backward(datums, loss_fn="cross_entropy").result()
training_client.optim_step(types.AdamParams(learning_rate=LEARNING_RATE)).result()
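
These two calls are the core of the loop: forward_backward accumulates gradients for a batch, and optim_step applies an Adam update. A minimal epoch loop might look like the sketch below; BATCH_SIZE, NUM_EPOCHS, and the datums list are assumptions standing in for the notebook's actual setup (see the Datum Format section for how datums are built).

# Minimal loop sketch (illustrative values; the notebook also logs metrics).
BATCH_SIZE = 16
NUM_EPOCHS = 3

for epoch in range(NUM_EPOCHS):
    for i in range(0, len(datums), BATCH_SIZE):
        batch = datums[i : i + BATCH_SIZE]
        # Accumulate gradients for this batch, then take one Adam step.
        training_client.forward_backward(batch, loss_fn="cross_entropy").result()
        training_client.optim_step(types.AdamParams(learning_rate=LEARNING_RATE)).result()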

Key Concepts

Chat Formatting & Templates

We use the base model’s chat template so the prompt follows the same role/marker format seen during training.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
tokens = tokenizer.encode(text, add_special_tokens=False)

  • tokenize=False: return rendered text (string), not token IDs; we tokenize explicitly in the next line.

  • add_generation_prompt=False: don’t append the final “assistant start” marker; useful for training/encoding existing turns. (For inference, you typically set this to True so the model is prompted to generate the next assistant reply; see the sketch after this list.)

  • add_special_tokens=False: avoid duplicating special tokens since the chat template already includes the needed markers.
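
For inference, the same call with add_generation_prompt=True leaves the rendered text ending at the assistant-start marker. A minimal sketch (standard HuggingFace tokenizer usage, not TuFT-specific):

# Inference-time rendering: end the prompt at the assistant-start marker so the
# model generates the next assistant reply.
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
prompt_tokens = tokenizer.encode(prompt_text, add_special_tokens=False)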

Loss Masking (Assistant-only)

For chat SFT, we usually want the model to learn to produce assistant responses, not to predict the user prompt. We therefore build per-token weights:

  • tokens from assistant turns → weight = 1.0

  • tokens from user turns → weight = 0.0

Because training is next-token prediction, the mask must be aligned to the target tokens (the tokens being predicted). If we build weights for the original token stream tokens[0..N-1], then the loss at step t predicts tokens[t+1], so we use weights[1:] to align with target_tokens = tokens[1:].

def build_sft_example(messages, tokenizer, max_length=2048):
    # Build token stream + per-token weights (assistant=1, user=0)
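    # NOTE: for brevity this encodes raw message contents; a full pipeline would
    # render each turn through tokenizer.apply_chat_template (see above) so the
    # role markers are part of the token stream.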
    tokens, weights = [], []
    for msg in messages:
        turn_tokens = tokenizer.encode(msg["content"], add_special_tokens=False)
        tokens += turn_tokens
        weights += [1.0 if msg["role"] == "assistant" else 0.0] * len(turn_tokens)

    # Optional truncation
    tokens, weights = tokens[:max_length], weights[:max_length]

    # Next-token prediction: input[t] -> target[t] = tokens[t+1]
    input_tokens  = tokens[:-1]
    target_tokens = tokens[1:]

    # Align mask to targets (the predicted tokens)
    target_weights = weights[1:]

    return input_tokens, target_tokens, target_weights

Datum Format

Each conversation is converted into a next-token-prediction sample:

  • model_input: tokens [0..T-2]

  • target_tokens: tokens [1..T-1]

  • weights: mask applied on targets (assistant-only)

Example:

from tinker import types

datum = types.Datum(
    model_input=types.ModelInput.from_ints(input_tokens),
    loss_fn_inputs={
        "target_tokens": list(target_tokens),
        "weights": target_weights.tolist(),
    },
)
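
Putting the pieces together, a small helper (a sketch; the notebook may organize this differently) turns a list of conversations into datums:

def conversations_to_datums(conversations, tokenizer, max_length=2048):
    # Convert each chat conversation into a next-token-prediction Datum.
    datums = []
    for messages in conversations:
        input_tokens, target_tokens, target_weights = build_sft_example(
            messages, tokenizer, max_length=max_length
        )
        datums.append(types.Datum(
            model_input=types.ModelInput.from_ints(input_tokens),
            loss_fn_inputs={
                "target_tokens": list(target_tokens),
                "weights": list(target_weights),
            },
        ))
    return datums

# e.g. datums = conversations_to_datums(train_data, tokenizer)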

Loss Function

Training uses the following loss function:

  • loss_fn="cross_entropy"

TuFT returns per-token log probabilities (logprobs). From these we compute the masked negative log-likelihood (NLL):

\[ \mathrm{NLL}=\frac{\sum_{t}\bigl(-\log p(y_t)\bigr)\,w_t}{\sum_{t} w_t} \]

Minimal computation:

def masked_nll(loss_fn_outputs, datums):
    total_loss, total_w = 0.0, 0.0
    for out, d in zip(loss_fn_outputs, datums):
        for lp, w in zip(out["logprobs"], d.loss_fn_inputs["weights"]):
            total_loss += -lp * w
            total_w += w
    return total_loss / max(total_w, 1.0)
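
Usage against a forward_backward result might look like this; the loss_fn_outputs attribute name and logprobs iterating as plain floats are assumptions matching the masked_nll signature above:

# Compute masked NLL for the batch just processed.
fwdbwd = training_client.forward_backward(datums, loss_fn="cross_entropy").result()
train_nll = masked_nll(fwdbwd.loss_fn_outputs, datums)
print(f"masked NLL: {train_nll:.4f}")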

Parameter Selection

This section explains how to choose lora_rank and learning_rate and summarizes conclusions from the experiment results shown below. All experiments use Qwen/Qwen3-4B-Instruct-2507.

What do lora_rank and learning_rate do?

lora_rank (LoRA adapter rank) controls adapter capacity:

  • Higher rank = more trainable params → potentially better fit, more compute/memory, higher overfitting risk

  • Lower rank = cheaper, often sufficient for style/small behavior changes

learning_rate controls update step size:

  • Too high (e.g. 1e-3): fast but can be unstable/overfit

  • Too low (e.g. 1e-5): stable but slow

  • Middle (e.g. 1e-4): common default for LoRA SFT

Experimental conclusions from the plots

Based on Figure 1 (test NLL) and Figure 2 (train mean NLL):

  1. Very low LR (1e-5) converges much more slowly

  2. 1e-4 and 1e-3 improve quickly early

  3. Rank has diminishing returns beyond a point

  4. Best test losses often cluster around moderate rank + moderate/high LR

Note: exact “best” depends on stopping step and downstream generation quality (not only NLL).

Figure 1. Test NLL

Figure 2. Train mean NLL

Practical recommendations

  • Strong default: lora_rank = 8 or 32, learning_rate = 1e-4

  • Faster early progress (riskier): lora_rank = 8 or 32, learning_rate = 1e-3

  • If unstable/overfitting: lower the LR (1e-4 → 5e-5 → 1e-5) or lower the rank (32 → 8)

  • If the task is harder: try rank 32 before 128, keep the LR at 1e-4, and increase training steps before rank if possible. “Harder” means the learning problem is intrinsically more difficult (a more complex input→output mapping), such as stricter output constraints/format, longer context, more reasoning steps, or higher output diversity/ambiguity. It does not simply mean “more data”; more data usually just requires more training steps, not a higher LoRA rank.
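
To reproduce a comparison like Figures 1–2, a plain grid sweep over the two knobs is enough. run_sft below is a hypothetical wrapper around the client setup and training loop shown earlier, returning the final test NLL:

# Hypothetical helper: run_sft(lora_rank=..., learning_rate=...) -> final test NLL.
for rank in (8, 32, 128):
    for lr in (1e-5, 1e-4, 1e-3):
        test_nll = run_sft(lora_rank=rank, learning_rate=lr)
        print(f"rank={rank} lr={lr:.0e} -> test NLL {test_nll:.4f}")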


Q&A

(1) Dataset download fails due to network issues when accessing huggingface.co

If you see an error like:

MaxRetryError('HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded ...
(Caused by NewConnectionError(... [Errno 101] Network is unreachable))')

For Jupyter notebook users, add the following at the very top of the first cell:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

Then run “Restart Kernel and Clear All Outputs”.


(2) Invalid api_key

In the Tinker SDK, the environment variable TINKER_API_KEY takes precedence over the api_key= argument passed here:

service_client = tinker.ServiceClient(base_url=TINKER_BASE_URL, api_key=TINKER_API_KEY)

So if your code passes the correct key but you still get an invalid api_key error, either set the correct environment variable (via export TINKER_API_KEY=...) or clear it and rely on the api_key= argument:

unset TINKER_API_KEY
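
If you cannot change the shell environment (e.g. in a managed notebook), you can drop the stale variable from the current process instead. This sketch assumes the SDK reads the environment variable at client-creation time:

import os

# Remove any stale key from this process so the api_key= argument takes effect.
os.environ.pop("TINKER_API_KEY", None)
service_client = tinker.ServiceClient(base_url=TINKER_BASE_URL, api_key=TINKER_API_KEY)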

(3) Jupyter warning: TqdmWarning: IProgress not found...

If you see:

TqdmWarning: IProgress not found. Please update jupyter and ipywidgets.

Option A (recommended): install/upgrade Jupyter widgets

pip install -U ipywidgets jupyter

Then restart the kernel.

Option B: avoid widget-based tqdm in notebooks. Use the standard tqdm progress bar instead of tqdm.auto / tqdm.notebook:

from tqdm import tqdm

(4) OOM or slow training

If you run into out-of-memory (OOM) errors or training is too slow, reduce one or more of:

  • MAX_LENGTH

  • BATCH_SIZE

  • LORA_RANK

In most cases, lowering MAX_LENGTH gives the biggest memory/speed improvement, followed by BATCH_SIZE, then LORA_RANK.
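
As a concrete starting point (illustrative values only, not tuned):

# Conservative settings; adjust to your GPU memory.
MAX_LENGTH = 1024   # biggest lever: activation memory grows with sequence length
BATCH_SIZE = 8      # second lever: fewer sequences per step
LORA_RANK  = 8      # smallest lever: fewer adapter parameters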

(5) Add a virtual environment to Jupyter (register a new kernel)

If you’re working on a remote server, it’s often convenient to add your existing virtual environment (virtualenv/venv) as a selectable Jupyter kernel.

  1. Activate the virtual environment

source /path/to/venv/bin/activate

  2. Install ipykernel inside the environment

pip install ipykernel

  3. Register the environment as a Jupyter kernel

python -m ipykernel install --user --name=myproject --display-name "Python (myproject)"

  4. Select the kernel in Jupyter

  • In Jupyter Notebook/Lab: Kernel → Change Kernel → Python (myproject)