trinity.algorithm.advantage_fn.gigpo_advantage module

GiGPO (Group-in-Group Policy Optimization) advantage computation.

Reference:

Feng et al., “Group-in-Group Policy Optimization for LLM Agent Training”, arXiv:2505.10978.

class trinity.algorithm.advantage_fn.gigpo_advantage.GiGPOAdvantageFn(omega: float = 1.0, gamma: float = 1.0, fnorm: Literal['std', 'none'] = 'none', epsilon: float = 1e-06, step_reward_key: str = 'step_reward', env_state_hash_key: str = 'env_state_hash', **kwargs)

Bases: AdvantageFn, ExperienceOperator

Compute hierarchical GiGPO advantages for multi-turn agent experiences.

GiGPO combines episode-level relative advantages (GRPO-style over full trajectories) with step-level relative advantages within anchor-state groups. The combined scalar advantage is A = A_E + omega * A_S, then broadcast to tokens via action_mask.
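The combination and broadcast described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `combine_and_broadcast` is hypothetical, and `action_mask` is assumed to be a 0/1 list marking which tokens belong to the agent's action.

```python
from typing import List


def combine_and_broadcast(
    a_episode: float, a_step: float, omega: float, action_mask: List[int]
) -> List[float]:
    """Hypothetical helper: A = A_E + omega * A_S, broadcast to action tokens."""
    scalar_advantage = a_episode + omega * a_step
    # Tokens outside the action (mask == 0) receive zero advantage.
    return [scalar_advantage * m for m in action_mask]


# Two prompt tokens (mask 0) followed by three action tokens (mask 1).
token_advantages = combine_and_broadcast(
    a_episode=0.5, a_step=0.25, omega=2.0, action_mask=[0, 0, 1, 1, 1]
)
```

With `omega = 0` the step-level term is weighted out and every action token carries `A_E` alone.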

Workflows must set experience.info[env_state_hash_key] for step-level grouping and should set experience.info[step_reward_key] for per-step immediate rewards. See examples/gigpo_alfworld/README.md.
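As a rough illustration of the metadata a workflow attaches, the sketch below fills a plain dict that stands in for experience.info. The hashing scheme (SHA-256 over the observation text) and the helper name `anchor_hash` are assumptions made for this example; the referenced README describes the actual workflow.

```python
import hashlib


def anchor_hash(observation: str) -> str:
    """Hypothetical helper: map an environment observation to a stable hash."""
    return hashlib.sha256(observation.encode("utf-8")).hexdigest()


# A plain dict standing in for experience.info (assumption for illustration).
info = {}
info["env_state_hash"] = anchor_hash("You are in the kitchen. You see a fridge.")
info["step_reward"] = 0.0  # immediate reward r_t for this step
```

Steps that reach the same environment state produce the same hash, which is what lets them share an anchor-state group.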

omega

Weight on step-level advantage A_S.

gamma

Discount factor for discounted step returns R_t.

fnorm

Normalization mode: "std" (GRPO-style) or "none" (RLOO-style).

epsilon

Small constant added to the normalization denominator.

step_reward_key

Key in experience.info for immediate reward r_t.

env_state_hash_key

Key in experience.info for anchor state identity.

__init__(omega: float = 1.0, gamma: float = 1.0, fnorm: Literal['std', 'none'] = 'none', epsilon: float = 1e-06, step_reward_key: str = 'step_reward', env_state_hash_key: str = 'env_state_hash', **kwargs) → None

Initialize GiGPO advantage computation.

Parameters:
  • omega – Weight on step-level advantage A_S in the combined advantage.

  • gamma – Discount factor for R_t = sum_{k>=t} gamma^{k-t} r_k.

  • fnorm – Group normalization. "std" divides by standard deviation; "none" uses F_norm = 1 (paper default for agent benchmarks).

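  • epsilon – Small constant added to the normalization denominator for numerical stability, whether that denominator is the standard deviation ("std") or 1 ("none").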

  • step_reward_key – experience.info field for the immediate step reward.

  • env_state_hash_key – experience.info field for the anchor-state hash.

  • **kwargs – Ignored; accepted for registry compatibility.
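The discounted return R_t = sum_{k>=t} gamma^{k-t} r_k used by gamma can be computed with a single backward pass over the step rewards. The following is a minimal sketch in plain Python; the function name `discounted_returns` is hypothetical and not part of the documented API.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Hypothetical helper: R_t = r_t + gamma * R_{t+1}, with R_T = r_T."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses the already-computed R_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse reward arriving only at the final step.
step_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

With the default `gamma = 1.0` every step's return is simply the sum of the rewards from that step onward.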

process(exps: List[Experience]) → Tuple[List[Experience], Dict]

Compute GiGPO advantages for a batch of multi-step experiences.

Episode-level: group by task, compare trajectory returns R(tau) across runs (GRPO-style). Step-level: group by env_state_hash across the batch, compare discounted returns R_t; singleton anchors get A_S = 0.
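Both levels apply the same group-relative pattern: collect values that share a key, subtract the group mean, and optionally divide by the group's standard deviation. The sketch below illustrates that pattern under stated assumptions: the helper name `group_relative` is hypothetical, the population standard deviation is used (the library may use a different estimator), and inputs are simple `(key, value)` pairs rather than Experience objects.

```python
from collections import defaultdict
from statistics import pstdev
from typing import Hashable, List, Tuple


def group_relative(
    items: List[Tuple[Hashable, float]], fnorm: str = "none", epsilon: float = 1e-6
) -> List[float]:
    """Hypothetical helper: mean-centred advantage within each key group."""
    groups = defaultdict(list)
    for idx, (key, value) in enumerate(items):
        groups[key].append((idx, value))

    advantages = [0.0] * len(items)
    for members in groups.values():
        if len(members) < 2:
            continue  # singleton group: advantage stays 0
        values = [v for _, v in members]
        mean = sum(values) / len(values)
        # Assumption: population std for "std"; F_norm = 1 for "none".
        denom = (pstdev(values) if fnorm == "std" else 1.0) + epsilon
        for idx, v in members:
            advantages[idx] = (v - mean) / denom
    return advantages


# Step level: key = env_state_hash, value = discounted return R_t.
a_step = group_relative([("s0", 1.0), ("s0", 0.0), ("s1", 0.5)])
```

For the episode level, the same helper would be called with the task as the key and the trajectory return R(tau) as the value.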

Parameters:

exps – Multi-step experiences with eid.task, eid.run, eid.step, and optional anchor metadata in info.

Returns:

Experiences with advantages and returns set, plus logging metrics prefixed with gigpo/.

Return type:

Tuple[List[Experience], Dict]

classmethod compute_in_trainer() → bool

Whether advantages are computed in the trainer loop.

Returns:

False; GiGPO runs in the experience pipeline.

Return type:

bool

classmethod default_args() → Dict

Return default advantage_fn_args for GiGPO.

Returns:

Default hyperparameters for GiGPOAdvantageFn.

Return type:

Dict