Deploy on Lambda Cloud

Lambda Cloud rents plain on-demand GPU VMs, billed per minute until you terminate them — no orchestration layer and no preemption. This guide takes you end to end — configure a TuFT server, launch it on a Lambda GPU, train a “talk like Yoda” LoRA on Qwen/Qwen3-0.6B from your laptop, and download the adapter — with no local GPU required.

What you’ll build

A TuFT server running on a Lambda GPU VM (auto-provisioned and self-bootstrapped via Docker), which you reach over an SSH tunnel and fine-tune from your laptop using the runnable code in examples/personality_sft/.

Lambda has no scale-to-zero

Unlike Modal, a Lambda VM bills continuously until you terminate it — there is no idle/auto-stop. Always tear the instance down when you’re done (see Step 5).

Prerequisites

  1. A Lambda Cloud account, API key, and SSH key. In the Lambda Cloud dashboard, generate an API key (API keys) and register an SSH public key (SSH keys) — you’ll use the matching private key to reach the server. See Lambda’s docs for details. Export the API key so the launcher can find it:

    export LAMBDA_API_KEY=secret_...
    
  2. The TuFT repo. The deploy helper talks to Lambda’s HTTP API; it only needs pyyaml locally (all GPU dependencies run inside the container on the VM):

    git clone https://github.com/agentscope-ai/TuFT
    cd TuFT
    pip install pyyaml
    

Step 1 — Configure the server

The deploy helper deploy/lambda/launch.py is config-file driven and mirrors the Modal launcher: you edit a standard tuft_config.yaml and run the script. Lambda infra goes in an optional lambda: section that is stripped before the server sees it.

Save this as yoda_lambda.yaml:

checkpoint_dir: ~/.cache/tuft/checkpoints   # mapped to the VM's /data; see Step 4 about durability
model_owner: cloud-user

supported_models:
  - model_name: Qwen/Qwen3-0.6B
    model_path: Qwen/Qwen3-0.6B            # HF id (downloaded on first launch) or a local path
    max_model_len: 4096
    tensor_parallel_size: 1
    colocate: true                         # single GPU: training + vLLM sampling share it
    sampling_memory_fraction: 0.4
    max_lora_rank: 16
    max_loras: 2

authorized_users:
  tml-REPLACE_WITH_A_STRONG_KEY: cloud-user  # clients send this as the X-API-Key header

persistence:
  mode: DISABLE

telemetry:
  enabled: false

# Lambda Cloud infra for deploy/lambda/launch.py (TuFT ignores this; it's stripped before the server sees it):
lambda:
  gpu: a100              # family hint; auto-pick prefers a100
  name: tuft-yoda
  # ssh_key: my-key      # default: your account's sole registered key
  # filesystem: tuft     # Lambda persistent filesystem for durable checkpoints (else ephemeral root disk)

Generate a real API key for authorized_users (it must start with tml-):

python -c "import secrets; print('tml-' + secrets.token_urlsafe(24))"

Pick a100, not a10, for training

The cheapest Lambda GPU (gpu_1x_a10, sm_86) has a known issue in the current TuFT image: serving works, but training returns null logprobs. Auto-select therefore prefers a100 and uses a10 only as a last resort. For this training example, stay on a100.

Step 2 — Launch the GPU server

Run the launcher. With no instance pinned, it auto-selects the cheapest available GPU that matches your gpu: hint, provisions it, and self-bootstraps TuFT in Docker via cloud-init (no manual SSH needed):

python deploy/lambda/launch.py --config yoda_lambda.yaml

It prints the chosen instance and, once the VM is up, a connect banner with an SSH-tunnel command and the instance id:

[launch] gpu_1x_a100_sxm4 in us-east-1 (~$1.99/hr), ssh_key=my-key, name=tuft-yoda
[launch] provisioning instance abcd...  (this takes a minute)
[launch] instance abcd1234... is active at 203.0.113.45

Connect securely over an SSH tunnel (recommended; keeps :10610 off the public net):
    ssh -N -L 10610:localhost:10610 ubuntu@203.0.113.45

Tip

Check or list instances any time with python deploy/lambda/launch.py --status. To reuse an existing instance instead of launching a new one, pass --instance-id <id>.

Step 3 — Train the Yoda LoRA from your laptop

Open the SSH tunnel printed above in one terminal (the first boot pulls the image and loads vLLM, which can take a few minutes):

ssh -N -L 10610:localhost:10610 ubuntu@203.0.113.45

Note

The training still runs on your laptoptrain.py is a CPU-only client that drives the loop over HTTP. The -L 10610:localhost:10610 flag forwards your laptop’s port 10610 to the server’s port 10610 on the VM, so http://localhost:10610 (used below) is only the local end of the tunnel: every request is carried over SSH to the remote GPU, where the training and sampling actually run. Keep this SSH terminal open for the whole run. (Prefer not to tunnel? You can point --base-url at http://<vm-ip>:10610 directly instead, but that exposes the API on the public internet.)

In a second terminal, confirm the server is healthy through the tunnel:

curl http://localhost:10610/api/v1/healthz
# {"status":"ok"}

The training script examples/personality_sft/train.py is a client that drives the loop over HTTP via the Tinker SDK — it needs only CPU-side dependencies:

pip install tinker transformers

The dataset is ~50 hand-authored (user, assistant-in-Yoda-voice) pairs in dataset.py. Only the assistant tokens get loss weight (the prompt is masked), so the model learns the voice, not the questions. A couple of example pairs:

YODA_PAIRS = [
    (
        "How do I stay motivated?",
        "Motivated, you wish to stay? Hmm. Small steps each day, take you must. "
        "Begin, and momentum follows. Do or do not — there is no try.",
    ),
    (
        "What's the best way to learn programming?",
        "Patient you must be, young one. Write code daily, you should. "
        "Fail often, learn faster you will. Much to learn, you still have.",
    ),
    # ... ~50 pairs total
]

Point the script at the tunnel (http://localhost:10610), pass the tml- key, and select the model (it must match the supported_models entry on the server):

python examples/personality_sft/train.py \
    --base-url http://localhost:10610 \
    --api-key tml-REPLACE_WITH_A_STRONG_KEY \
    --model Qwen/Qwen3-0.6B

The script samples the base model before training, runs 60 LoRA steps, then samples the trained adapter after so you can see the personality emerge:

[before] sampling base model on held-out prompts
   [base] Give me advice on becoming a better person.
      -> Focus on small consistent habits: be honest, listen to others, and learn from your mistakes.
   [base] What's the key to writing good software?
      -> Keep it readable and well-tested, prefer small functions, and document your intent.

[train] 60 steps, batch 8, lr 0.0001, rank 16
   step   0  loss=2.3987
   step  10  loss=1.7765
   step  50  loss=1.2410
   step  59  loss=1.1702

[save] sampler=tinker://abcd1234.../yoda-sampler
[save] run_id=abcd1234-...

[after] sampling the trained adapter
   [yoda] Give me advice on becoming a better person.
      -> Small steps each day, take you must. Honest, be. Listen more than speak, you should. Grow, you will, hmm.
   [yoda] What's the key to writing good software?
      -> Simple, keep it, you must. Read aloud your code, you should. Fail often, learn faster you will, young one.

Tip

Tunables: --lora-rank, --num-steps, --batch-size, --learning-rate, --max-length. Pass --no-before to skip the base-model sampling. Edit dataset.py to swap in a different character.

Step 4 — Download the adapter

Training writes a standard PEFT LoRA adapter to the server’s checkpoint_dir, which on the VM lives under /home/ubuntu/tuft-data/checkpoints. The script prints a run_id — copy the adapter off the instance with scp:

scp -r ubuntu@203.0.113.45:/home/ubuntu/tuft-data/checkpoints/<run_id> ./weights/
# -> ./weights/<run_id>/yoda-final/adapter/{adapter_config.json, adapter_model.safetensors}

Download before you terminate

By default checkpoints live on the instance’s ephemeral root disk — terminating the VM (Step 5) destroys them. Download first, or launch with a persistent filesystem: (a Lambda filesystem) so checkpoints survive termination.

Optionally merge the adapter into full model weights (needs torch, peft, transformers):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./weights/<run_id>/yoda-final/adapter").merge_and_unload()
merged.save_pretrained("./yoda-merged")   # standard HF model dir, servable by vLLM

Step 5 — Tear down

Lambda bills until you terminate, so always shut the instance down when finished:

python deploy/lambda/launch.py --down --instance-id abcd1234...
# or by name:
python deploy/lambda/launch.py --down --name tuft-yoda

Verify nothing is left running:

python deploy/lambda/launch.py --status

Tip

Lambda bills the GPU continuously, so a single laptop-driven run pays for a lot of idle time. Because TuFT is multi-tenant, you can point several concurrent jobs or users at the same instance (each with a key under authorized_users; raise max_loras for more adapters at once) to keep the GPU busy and split the cost. See Keeping the GPU busy.

Next steps

  • Try a bigger model (e.g. Qwen/Qwen3-4B) by changing both names in the config; the auto-picked a100 (80 GB) has ample room.

  • Prefer a scale-to-zero option? See the Modal guide.

  • Browse Lambda’s docs for instance types, regions, and filesystems.