trinity.trainer.verl.fsdp_checkpoint_manager module

FSDP checkpoint manager, modified from volcengine/verl.

class trinity.trainer.verl.fsdp_checkpoint_manager.FSDPCheckpointManager(*args, ray_namespace: str = '', trust_remote_code: bool = False, **kwargs)

Bases: FSDPCheckpointManager

An enhanced version of the original FSDP checkpoint manager that:

  1. Uploads model state dicts to a remote Synchronizer actor (either directly or via checkpoints).

  2. Offloads saving operations (model, optimizer, extra states) into background threads to avoid blocking the training loop.

This class is useful in distributed training scenarios where synchronization and non-blocking I/O are important.

__init__(*args, ray_namespace: str = '', trust_remote_code: bool = False, **kwargs)

register_checkpoint(new_path: str, max_ckpt_to_keep: int | None = None)

Register a successfully saved checkpoint and enforce the retention limit.

Adds the new checkpoint path to the tracked checkpoints and removes old checkpoints in excess of max_ckpt_to_keep.
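The retention behaviour described for register_checkpoint can be illustrated with a minimal sketch. It is not the Trinity implementation: the CheckpointRegistry class, its register method, the global_step_N directory names, and the choice to treat None as "keep everything" are all assumptions made for illustration.

```python
import os
import shutil
import tempfile


class CheckpointRegistry:
    """Hypothetical stand-in for the retention bookkeeping; not the Trinity class."""

    def __init__(self):
        self.saved_paths = []  # tracked checkpoint directories, oldest first

    def register(self, new_path, max_ckpt_to_keep=None):
        # Track the newly saved checkpoint.
        self.saved_paths.append(new_path)
        # Assumption: None or a non-positive limit means "keep everything".
        if max_ckpt_to_keep is None or max_ckpt_to_keep <= 0:
            return
        # Delete the oldest tracked checkpoints until the limit is met.
        while len(self.saved_paths) > max_ckpt_to_keep:
            oldest = self.saved_paths.pop(0)
            shutil.rmtree(oldest, ignore_errors=True)


# Usage: save three checkpoints while keeping at most two on disk.
root = tempfile.mkdtemp()
registry = CheckpointRegistry()
for step in (1, 2, 3):
    path = os.path.join(root, f"global_step_{step}")  # hypothetical layout
    os.makedirs(path)
    registry.register(path, max_ckpt_to_keep=2)

remaining = sorted(os.listdir(root))
```

After the third registration only the two most recent directories remain, because the oldest one is removed as soon as the tracked list exceeds the limit.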

upload_state_dict(global_step: int)

Uploads the full model state dictionary to the synchronizer actor for remote access.

Parameters:

global_step (int) – The current training step number.

save_state_dict(local_path: str, global_step: int = 0)

save_checkpoint(local_path: str, global_step: int = 0, max_ckpt_to_keep: int | None = None, save_as_hf: bool = False)

Modified from verl.utils.checkpoint.fsdp_checkpoint_manager.py:save_checkpoint

Saves the model checkpoint to disk and uses background threads to prevent blocking the main training loop.

Main improvements over the base class:

  • Uses separate threads for saving model/optimizer/extras.

  • Registers background work with CheckpointMonitor so trainer-side coordination can wait on state-dict and checkpoint completion.

Parameters:
  • local_path (str) – Local directory path to save the checkpoint.

  • global_step (int) – Current training step.

  • max_ckpt_to_keep (int, optional) – Maximum number of checkpoints to keep locally.

  • save_as_hf (bool) – Whether to force save the model in Hugging Face format.
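The threaded saving pattern that save_checkpoint describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code: ToyMonitor is a hypothetical stand-in for CheckpointMonitor (whose real interface is not shown on this page), and save_in_background and make_writer are invented helper names.

```python
import threading
import time


class ToyMonitor:
    """Hypothetical stand-in for CheckpointMonitor: tracks pending save jobs."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pending = 0
        self.finished = []

    def register(self):
        with self._lock:
            self.pending += 1

    def notify_finished(self, name):
        with self._lock:
            self.pending -= 1
            self.finished.append(name)


def save_in_background(name, write_fn, monitor):
    """Run write_fn on a worker thread and report completion to the monitor."""
    monitor.register()

    def _worker():
        try:
            write_fn()
        finally:
            # Report completion even if the write raised.
            monitor.notify_finished(name)

    thread = threading.Thread(target=_worker, name=f"save-{name}")
    thread.start()
    return thread


written = {}


def make_writer(name):
    def _write():
        time.sleep(0.01)  # placeholder for slow disk I/O
        written[name] = True
    return _write


monitor = ToyMonitor()
threads = [
    save_in_background(part, make_writer(part), monitor)
    for part in ("model", "optimizer", "extra")
]
# The training loop would continue here instead of blocking on I/O.
for t in threads:
    t.join()
```

Each part gets its own thread in this sketch, so a slow optimizer write does not delay the model write, and the monitor gives a coordinator a single place to check whether any save is still pending.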

wait_on_save_thread() → None

Wait for all background saving threads to complete.
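Waiting on background saves usually amounts to joining the recorded threads. A minimal sketch, assuming the manager keeps its in-flight threads in a list; the SaveThreads class and its submit and wait methods are hypothetical names, not part of the Trinity API:

```python
import threading


class SaveThreads:
    """Hypothetical holder for in-flight save threads; not the Trinity class."""

    def __init__(self):
        self._threads = []

    def submit(self, fn):
        # Start the work immediately and remember the thread for later joins.
        thread = threading.Thread(target=fn)
        thread.start()
        self._threads.append(thread)

    def wait(self):
        # Block until every recorded thread has finished, then forget them.
        for thread in self._threads:
            thread.join()
        self._threads.clear()


results = []
holder = SaveThreads()
for part in ("model", "optimizer", "extra"):
    # Bind part as a default argument so each thread sees its own value.
    holder.submit(lambda part=part: results.append(part))
holder.wait()
```

Calling a wait of this kind before shutdown, or before reading a checkpoint back, avoids observing partially written files.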