trinity.algorithm.advantage_fn.gigpo_advantage module
GiGPO (Group-in-Group Policy Optimization) advantage computation.
- Reference:
Feng et al., “Group-in-Group Policy Optimization for LLM Agent Training”, arXiv:2505.10978.
- class trinity.algorithm.advantage_fn.gigpo_advantage.GiGPOAdvantageFn(omega: float = 1.0, gamma: float = 1.0, fnorm: Literal['std', 'none'] = 'none', epsilon: float = 1e-06, step_reward_key: str = 'step_reward', env_state_hash_key: str = 'env_state_hash', **kwargs)
Bases: AdvantageFn, ExperienceOperator
Compute hierarchical GiGPO advantages for multi-turn agent experiences.
GiGPO combines episode-level relative advantages (GRPO-style over full trajectories) with step-level relative advantages within anchor-state groups. The combined scalar advantage is
A = A_E + omega * A_S, which is then broadcast to tokens via action_mask.
Workflows must set experience.info[env_state_hash_key] for step-level grouping and should set experience.info[step_reward_key] for per-step immediate rewards. See examples/gigpo_alfworld/README.md.
- omega
Weight on step-level advantage A_S.
- gamma
Discount factor for discounted step returns R_t.
- fnorm
Normalization mode: "std" (GRPO) or "none" (RLOO-style).
- epsilon
Small constant added to the normalization denominator.
- step_reward_key
Key in experience.info for the immediate reward r_t.
- env_state_hash_key
Key in experience.info for the anchor-state identity.
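As a rough illustration of how the attributes above interact, the following sketch computes a group-relative advantage, combines the two levels, and spreads the scalar over tokens. The helpers `group_advantage`, `combine`, and `broadcast` are hypothetical names for this example and are not part of the module; the use of the population standard deviation and of mask multiplication for the broadcast are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_advantage(values: List[float], fnorm: str = "none", epsilon: float = 1e-6) -> List[float]:
    """Hypothetical helper: relative advantage of each value within one group."""
    mu = mean(values)
    # Assumption: "std" uses the population standard deviation; "none" uses F_norm = 1.
    f_norm = pstdev(values) if fnorm == "std" else 1.0
    return [(v - mu) / (f_norm + epsilon) for v in values]


def combine(a_e: float, a_s: float, omega: float = 1.0) -> float:
    """Combined scalar advantage A = A_E + omega * A_S."""
    return a_e + omega * a_s


def broadcast(advantage: float, action_mask: List[int]) -> List[float]:
    """Assumed broadcast: the scalar lands on action tokens and is zero elsewhere."""
    return [advantage * m for m in action_mask]


# Four runs of the same task with binary episode returns.
episode_returns = [1.0, 0.0, 0.0, 1.0]
a_e = group_advantage(episode_returns, fnorm="std")

# Combine the first run's episode-level advantage with an illustrative step-level value.
scalar = combine(a_e[0], a_s=0.25, omega=1.0)
token_advantages = broadcast(scalar, action_mask=[0, 0, 1, 1])
```

With omega = 0 the combined value reduces to the episode-level term alone, so omega controls how much weight the step-level signal receives.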
- __init__(omega: float = 1.0, gamma: float = 1.0, fnorm: Literal['std', 'none'] = 'none', epsilon: float = 1e-06, step_reward_key: str = 'step_reward', env_state_hash_key: str = 'env_state_hash', **kwargs) -> None
Initialize GiGPO advantage computation.
- Parameters:
omega – Weight on step-level advantage A_S in the combined advantage.
gamma – Discount factor for R_t = sum_{k>=t} gamma^{k-t} r_k.
fnorm – Group normalization. "std" divides by the standard deviation; "none" uses F_norm = 1 (paper default for agent benchmarks).
epsilon – Small constant added to the normalization denominator (the standard deviation, or 1) for numerical stability.
step_reward_key – experience.info field for the immediate step reward.
env_state_hash_key – experience.info field for the anchor-state hash.
**kwargs – Ignored; accepted for registry compatibility.
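The discounted return R_t = sum_{k>=t} gamma^{k-t} r_k from the gamma parameter can be computed in one backward pass over a trajectory's step rewards. This is a minimal sketch; `discounted_returns` is a hypothetical helper and not a function exposed by this module.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Hypothetical helper: R_t = r_t + gamma * R_{t+1}, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A three-step trajectory that is rewarded only on the final step.
sparse = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)

# With gamma = 1.0 (the default), every step sees the full remaining reward.
undiscounted = discounted_returns([0.0, 0.0, 1.0], gamma=1.0)
```

With gamma below 1, steps further from the reward receive a smaller return, which lets the step-level comparison favor actions that reach the reward sooner.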
- process(exps: List[Experience]) -> Tuple[List[Experience], Dict]
Compute GiGPO advantages for a batch of multi-step experiences.
Episode-level: group by task, compare trajectory returns R(tau) across runs (GRPO-style). Step-level: group by env_state_hash across the batch, compare discounted returns R_t; singleton anchors get A_S = 0.
- Parameters:
exps – Multi-step experiences with eid.task, eid.run, eid.step, and optional anchor metadata in info.
- Returns:
Experiences with advantages and returns set, plus logging metrics prefixed with gigpo/.
- Return type:
Tuple[List[Experience], Dict]
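The step-level grouping described for process() can be sketched with plain dictionaries standing in for Experience objects. Everything here is illustrative: `step_advantages` and the `"return"` field are hypothetical, the grouping is done over the whole batch as the description states, and the population standard deviation is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def step_advantages(steps: List[Dict], fnorm: str = "none", epsilon: float = 1e-6) -> List[float]:
    """Hypothetical helper: step-level advantage A_S within anchor-state groups."""
    # Collect the indices of all steps that share the same anchor-state hash.
    groups = defaultdict(list)
    for idx, step in enumerate(steps):
        groups[step["env_state_hash"]].append(idx)

    advantages = [0.0] * len(steps)
    for indices in groups.values():
        if len(indices) < 2:
            continue  # Singleton anchors keep A_S = 0.
        values = [steps[i]["return"] for i in indices]
        mu = sum(values) / len(values)
        if fnorm == "std":
            # Assumption: population standard deviation of the group.
            f_norm = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
        else:
            f_norm = 1.0
        for i, v in zip(indices, values):
            advantages[i] = (v - mu) / (f_norm + epsilon)
    return advantages


# Two steps share anchor state "a"; the step at anchor "b" has no peers.
batch = [
    {"env_state_hash": "a", "return": 1.0},
    {"env_state_hash": "a", "return": 0.0},
    {"env_state_hash": "b", "return": 0.7},
]
a_s = step_advantages(batch)
```

Steps that revisit the same environment state are compared against each other, so the action with the higher discounted return from that state gets a positive A_S and the other a negative one.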