trinity.trainer.verl.megatron_engine module

trinity.trainer.verl.megatron_engine module#

Megatron-specific checkpoint and weight sync helpers for Trinity.

These helper functions are called by TrinityActorRolloutRefWorker to perform Megatron-specific operations that veRL 0.8’s engine does not provide natively:

  • megatron_save_state_dict: Save state dict for checkpoint sync

  • megatron_upload_state_dict: Upload state dict to Synchronizer (memory sync)

  • megatron_sync_weight_nccl: Broadcast params via NCCL

All functions receive the engine object (a McoreEngine instance from verl.workers.engine.megatron) which exposes:

  • engine.module: the Megatron model

  • engine.get_per_tensor_param(): generator of (name, tensor) pairs

  • engine.save_checkpoint() / engine.load_checkpoint()

trinity.trainer.verl.megatron_engine.megatron_save_state_dict(engine, local_path: str, global_step: int, coordinator: CheckpointCoordinator, logger)[source]#

Save Megatron actor model state dict for checkpoint-based weight sync.

Delegates to the engine’s built-in save_checkpoint for proper distributed checkpoint handling, but temporarily restricts the checkpoint manager’s checkpoint_save_contents to ["model"] so that only the actor model parameters are written to disk – optimizer and extra (rng / scheduler) states are skipped, since this path is used only for weight-sync to the rollout side, not for training resume. The save is wrapped with CheckpointMonitor notifications via the coordinator.

Note: Megatron’s save_checkpoint involves distributed barriers internally, so it runs synchronously. The coordinator’s save_sync is used to add Monitor notifications without background threading.

Parameters:
  • engine – The McoreEngine instance (engine.actor.engine).

  • local_path – Local directory path to save the state dict.

  • global_step – Current training step.

  • coordinator – CheckpointCoordinator for Monitor integration.

  • logger – Logger instance from the calling worker.

trinity.trainer.verl.megatron_engine.megatron_upload_state_dict(engine, synchronizer, global_step: int, logger)[source]#

Upload Megatron model state dict to Synchronizer for memory-based weight sync.

Iterates over per-tensor parameters and collects them on rank 0, then sends the full state dict to the Synchronizer actor.

Parameters:
  • engine – The McoreEngine instance (engine.actor.engine).

  • synchronizer – The Synchronizer Ray actor handle.

  • global_step – Current training step (used as version key).

trinity.trainer.verl.megatron_engine.megatron_sync_weight_nccl(engine, model_update_group)[source]#

Broadcast Megatron model parameters via NCCL.

Uses the engine’s get_per_tensor_param() to iterate over parameters and broadcasts each from rank 0.

Parameters:
  • engine – The McoreEngine instance (engine.actor.engine).

  • model_update_group – The NCCL process group for weight broadcast.