trinity.algorithm.advantage_fn.grpo_advantage module

GRPO advantage computation

class trinity.algorithm.advantage_fn.grpo_advantage.GRPOAdvantageFn(epsilon: float = 1e-06)

Bases: AdvantageFn

GRPO advantage computation

__init__(epsilon: float = 1e-06) → None
classmethod default_args() → Dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict

class trinity.algorithm.advantage_fn.grpo_advantage.GRPOGroupedAdvantage(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group')

Bases: GroupAdvantage

An advantage class that calculates GRPO advantages.

__init__(epsilon: float = 1e-06, std_threshold: float | None = None, duplicate_experiences: bool = False, rank_penalty: float | None = None, std_cal_level: str = 'group') → None

Initialize the GRPO advantage function.

Parameters:
  • epsilon (float) – A small value to avoid division by zero.

  • std_threshold (Optional[float]) – If provided, groups whose reward standard deviation is equal to or below this threshold are skipped.

  • duplicate_experiences (bool) – If True, experiences from the remaining groups are duplicated so that the total experience count stays the same after groups are skipped. Only used when std_threshold is not None. See https://hkunlp.github.io/blog/2025/Polaris.

  • rank_penalty (Optional[float]) – A penalty applied to the rank of rewards to correct for bias (https://arxiv.org/pdf/2506.02355).

  • std_cal_level (str) – The scope over which the reward standard deviation used for normalization is calculated: ‘group’ (default; std is calculated per group) or ‘batch’ (std is calculated across the entire batch). The mean is always calculated per group. Combining a group-level mean with a batch-level standard deviation enables more robust reward shaping (https://arxiv.org/pdf/2508.08221v1).
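How std_threshold and duplicate_experiences interact can be illustrated with a small standalone sketch. It does not use the Trinity classes: the function name, the use of plain reward lists in place of Experience objects, the population standard deviation, and the random re-sampling used for padding are all assumptions made for illustration.

```python
import random
import statistics
from typing import Dict, List, Optional


def filter_low_variance_groups(
    groups: Dict[str, List[float]],
    std_threshold: Optional[float] = None,
    duplicate_experiences: bool = False,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Hypothetical helper: drop groups whose reward std is <= std_threshold."""
    if std_threshold is None:
        return groups  # no filtering requested

    kept = {
        gid: rewards
        for gid, rewards in groups.items()
        # Population std is an assumption; the documentation does not name the estimator.
        if statistics.pstdev(rewards) > std_threshold
    }
    if not duplicate_experiences or not kept:
        return kept

    # Assumption: restore the original count by re-sampling rewards from kept groups.
    rng = random.Random(seed)
    original_count = sum(len(r) for r in groups.values())
    kept_count = sum(len(r) for r in kept.values())
    padded = {gid: list(rewards) for gid, rewards in kept.items()}
    for _ in range(original_count - kept_count):
        gid = rng.choice(list(padded))
        padded[gid].append(rng.choice(kept[gid]))
    return padded


groups = {"task-a": [1.0, 1.0, 1.0, 1.0], "task-b": [0.0, 1.0, 0.0, 1.0]}
filtered = filter_low_variance_groups(groups, std_threshold=0.0)
padded = filter_low_variance_groups(
    groups, std_threshold=0.0, duplicate_experiences=True
)
```

Here "task-a" has identical rewards, so its standard deviation is zero and it is skipped. With duplication enabled, the surviving group is padded back to the original total of eight entries.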

group_experiences(exps)

Group experiences by a certain criterion.

Parameters:

exps (List[Experience]) – List of experiences to be grouped.

Returns:

A dictionary where keys are group identifiers and values are lists of experiences.

Return type:

Dict[str, List[Experience]]
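The return shape can be sketched with a toy grouping function. The ToyExperience class and its group_key field are hypothetical stand-ins; the documentation above does not say which attribute of Experience the real method groups by.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyExperience:
    """Hypothetical stand-in for Experience; field names are assumptions."""

    group_key: str
    reward: float


def group_by_key(exps: List[ToyExperience]) -> Dict[str, List[ToyExperience]]:
    """Collect experiences that share the same group key, preserving order."""
    groups: Dict[str, List[ToyExperience]] = defaultdict(list)
    for exp in exps:
        groups[exp.group_key].append(exp)
    return dict(groups)


exps = [
    ToyExperience("task-a", 1.0),
    ToyExperience("task-b", 0.0),
    ToyExperience("task-a", 0.0),
]
grouped = group_by_key(exps)
```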

calculate_group_advantage(group_id: str, exps: List[Experience], precomputed_std: Tensor | None = None) → Tuple[List[Experience], Dict]

Calculate advantages for a group of experiences.

Parameters:
  • group_id (str) – The identifier for the group of experiences.

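  • exps (List[Experience]) – List of experiences in the group.

  • precomputed_std (Optional[Tensor]) – A reward standard deviation computed outside the group, presumably the batch-level value used when std_cal_level is ‘batch’. Defaults to None.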

Returns:

A tuple containing the modified list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]
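A minimal sketch of the normalization, assuming the usual GRPO form (reward - group mean) / (std + epsilon). It works on plain floats rather than Experience objects and tensors. The function name, the metric keys, and the choice of population standard deviation are assumptions, not the library's confirmed behavior.

```python
import statistics
from typing import Dict, List, Optional, Tuple


def group_advantages(
    rewards: List[float],
    epsilon: float = 1e-6,
    precomputed_std: Optional[float] = None,
) -> Tuple[List[float], Dict[str, float]]:
    """Hypothetical helper: normalize rewards within one group."""
    mean = statistics.fmean(rewards)  # the mean is always taken per group
    # Use the externally supplied std if given, else the group's own std.
    std = precomputed_std if precomputed_std is not None else statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + epsilon) for r in rewards]
    metrics = {"reward_mean": mean, "reward_std": std}  # metric names are illustrative
    return advantages, metrics


batch = {"task-a": [0.0, 1.0], "task-b": [0.0, 10.0]}

# Group-level std: every group is scaled by its own spread.
per_group = {gid: group_advantages(r)[0] for gid, r in batch.items()}

# Batch-level std: one std over all rewards, passed in as precomputed_std.
batch_std = statistics.pstdev([r for rs in batch.values() for r in rs])
per_batch = {
    gid: group_advantages(r, precomputed_std=batch_std)[0]
    for gid, r in batch.items()
}
```

With group-level scaling both groups end up with advantages close to -1 and +1. With batch-level scaling the group with the wider reward range keeps proportionally larger advantages.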

process(exps)

Process a list of experiences and return a transformed list.

Parameters:

exps (List[Experience]) – List of experiences to process, which contains all experiences generated by the Explorer in one explore step.

Returns:

A tuple containing the processed list of experiences and a dictionary of metrics.

Return type:

Tuple[List[Experience], Dict]

classmethod default_args() → dict
Returns:

The default init arguments for the advantage function.

Return type:

Dict