trinity.trainer.verl.fsdp_checkpoint_manager module

FSDP Checkpoint Manager. Modified from volcengine/verl

class trinity.trainer.verl.fsdp_checkpoint_manager.FSDPCheckpointManager(*args, ray_namespace: str = '', trust_remote_code: bool = False, **kwargs)[source]

Bases: FSDPCheckpointManager

An enhanced version of the original FSDP checkpoint manager that:

  1. Uploads model state dicts to a remote Synchronizer actor (either directly or via checkpoints).

  2. Offloads saving operations (model, optimizer, extra states) into background threads to avoid blocking the training loop.

This class is useful in distributed training scenarios where synchronization and non-blocking I/O are important.

__init__(*args, ray_namespace: str = '', trust_remote_code: bool = False, **kwargs)[source]
register_checkpoint(new_path: str, max_ckpt_to_keep: int | None = None)[source]

Register a successfully saved checkpoint and enforce retention limit.

Adds the new checkpoint path to the tracked list and removes the oldest checkpoints that exceed max_ckpt_to_keep.
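The retention behavior described above can be illustrated with a minimal sketch. It is not the Trinity or verl implementation: the CheckpointRegistry class, its previous_saved_paths attribute, and the choice to treat None or 0 as "keep everything" are assumptions made for illustration only.

```python
import os
import shutil
import tempfile


class CheckpointRegistry:
    """Hypothetical stand-in that tracks saved checkpoint directories."""

    def __init__(self):
        self.previous_saved_paths = []  # oldest first

    def register_checkpoint(self, new_path, max_ckpt_to_keep=None):
        # Track the newly saved checkpoint.
        self.previous_saved_paths.append(new_path)
        # Assumption: None or a non-positive limit means "keep everything".
        if not max_ckpt_to_keep or max_ckpt_to_keep <= 0:
            return
        # Delete the oldest checkpoints until the limit is satisfied.
        while len(self.previous_saved_paths) > max_ckpt_to_keep:
            oldest = self.previous_saved_paths.pop(0)
            shutil.rmtree(oldest, ignore_errors=True)


root = tempfile.mkdtemp()
registry = CheckpointRegistry()
for step in (1, 2, 3):
    path = os.path.join(root, f"global_step_{step}")
    os.makedirs(path)
    registry.register_checkpoint(path, max_ckpt_to_keep=2)

# Only the two most recent checkpoint directories remain on disk.
remaining = sorted(os.listdir(root))
```

Registering only after a save has succeeded means a failed or partial save never causes an older, valid checkpoint to be evicted.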

upload_state_dict(global_step: int)[source]

Uploads the full model state dictionary to the synchronizer actor for remote access.

Parameters:

global_step (int) -- The current training step number.

save_state_dict(local_path: str, global_step: int = 0)[source]
save_checkpoint(local_path: str, global_step: int = 0, max_ckpt_to_keep: int | None = None, save_as_hf: bool = False)[source]

Modified from verl.utils.checkpoint.fsdp_checkpoint_manager.py:save_checkpoint

Saves the model checkpoint to disk and uses background threads to prevent blocking the main training loop.

Main improvements over the base class:

  • Uses separate threads for saving model/optimizer/extras.

  • Registers background work with CheckpointMonitor so trainer-side coordination can wait on state-dict and checkpoint completion.

Parameters:
  • local_path (str) -- Local directory path to save the checkpoint.

  • global_step (int) -- Current training step.

  • max_ckpt_to_keep (int, optional) -- Maximum number of checkpoints to keep locally.

  • save_as_hf (bool) -- Whether to force save the model in Hugging Face format.

wait_on_save_thread() -> None[source]

Wait for all background saving threads to complete.
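The non-blocking save and the subsequent wait can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the class's actual code: BackgroundSaver, its _threads list, the _write helper, and the JSON file layout are all hypothetical, and the real manager saves FSDP model, optimizer, and extra states rather than plain dictionaries.

```python
import json
import os
import tempfile
import threading


class BackgroundSaver:
    """Hypothetical stand-in; not the Trinity implementation."""

    def __init__(self):
        self._threads = []

    def save_checkpoint(self, local_path, global_step, states):
        os.makedirs(local_path, exist_ok=True)
        for name, state in states.items():
            target = os.path.join(local_path, f"{name}.json")
            # One thread per artifact, so the caller returns without
            # waiting for the disk writes to finish.
            thread = threading.Thread(
                target=self._write, args=(target, state, global_step)
            )
            thread.start()
            self._threads.append(thread)

    @staticmethod
    def _write(target, state, global_step):
        with open(target, "w") as f:
            json.dump({"global_step": global_step, "state": state}, f)

    def wait_on_save_thread(self):
        # Block until every pending write has completed.
        for thread in self._threads:
            thread.join()
        self._threads.clear()


ckpt_dir = os.path.join(tempfile.mkdtemp(), "global_step_10")
saver = BackgroundSaver()
saver.save_checkpoint(
    ckpt_dir,
    10,
    {"model": {"w": 1.0}, "optim": {"lr": 0.1}, "extra": {"rng": 42}},
)
# Training could continue here; wait before reading the files or exiting.
saver.wait_on_save_thread()
saved_files = sorted(os.listdir(ckpt_dir))
```

In a pattern like this, the wait call belongs before anything that depends on the files being complete, such as starting the next save to the same path or shutting down the process.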