trinity.common.rewards.dapo_reward module

Reward function with the Overlong Reward Shaping described in DAPO (https://arxiv.org/pdf/2503.14476).

class trinity.common.rewards.dapo_reward.MathDAPORewardFn(enable_overlong_penalty: bool | None = None, penalty_factor: float | None = None, max_response_length: int | None = None, cache_length: int | None = None)[source]

Bases: RewardFn

A reward function for math tasks that follows the definition in DAPO.

__init__(enable_overlong_penalty: bool | None = None, penalty_factor: float | None = None, max_response_length: int | None = None, cache_length: int | None = None) -> None[source]

Initialize DAPO math reward settings.

Parameters:
  • enable_overlong_penalty -- Whether to apply overlong response shaping.

  • penalty_factor -- Magnitude of the penalty applied to overlong responses.

  • max_response_length -- Maximum allowed response length in tokens.

  • cache_length -- Soft-penalty transition window in tokens.

compute_overlong_penalty(response_token)[source]

Compute the soft or hard length penalty for an overlong response.

Parameters:

response_token -- Response token ids.

Returns:

Length-based shaping value, where negative values penalize overlong outputs.

Return type:

float
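
The length shaping can be illustrated with a minimal standalone sketch based on the soft overlong punishment formula in the DAPO paper. This is not the module's source code: the function name overlong_penalty_sketch is hypothetical, and scaling the result by penalty_factor is an assumption about how that parameter is used. Responses up to max_response_length - cache_length tokens receive no penalty, responses inside the cache window receive a linearly growing penalty, and responses beyond max_response_length receive the full penalty.

```python
from typing import Sequence


def overlong_penalty_sketch(
    response_token: Sequence[int],
    max_response_length: int,
    cache_length: int,
    penalty_factor: float = 1.0,
) -> float:
    """Hypothetical sketch of DAPO-style soft overlong punishment.

    Assumption: penalty_factor scales the paper's [-1, 0] penalty range.
    """
    response_len = len(response_token)
    # Longest response that is still penalty-free.
    expected_len = max_response_length - cache_length

    if response_len <= expected_len:
        # Within the safe length: no shaping.
        return 0.0
    if response_len > max_response_length:
        # Hard penalty beyond the maximum length.
        return -penalty_factor
    # Soft penalty: decreases linearly from 0 to -penalty_factor
    # across the cache window.
    return (expected_len - response_len) / cache_length * penalty_factor


if __name__ == "__main__":
    # Example lengths around a 120-token limit with a 20-token window.
    for n in (100, 110, 120, 121):
        print(n, overlong_penalty_sketch([0] * n, 120, 20))
```

Under these assumptions the returned value is never positive, which matches the description above that negative values penalize overlong outputs.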