trinity.common.rewards.dapo_reward module
Reward function with the overlong reward shaping described in DAPO (https://arxiv.org/pdf/2503.14476).
- class trinity.common.rewards.dapo_reward.MathDAPORewardFn(enable_overlong_penalty: bool | None = None, penalty_factor: float | None = None, max_response_length: int | None = None, cache_length: int | None = None)
Bases: RewardFn
A reward function that follows the DAPO definition for math tasks.
- __init__(enable_overlong_penalty: bool | None = None, penalty_factor: float | None = None, max_response_length: int | None = None, cache_length: int | None = None) → None
Initialize DAPO math reward settings.
- Parameters:
enable_overlong_penalty – Whether to apply overlong response shaping.
penalty_factor – Magnitude for overlong penalties.
max_response_length – Maximum allowed response length in tokens.
cache_length – Soft-penalty transition window in tokens.
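The way these parameters interact can be illustrated with a minimal sketch of the soft overlong punishment from the DAPO paper: no penalty up to `max_response_length - cache_length` tokens, a linearly growing penalty inside the `cache_length` window, and the full penalty beyond `max_response_length`. The function name `overlong_penalty` is hypothetical and not part of this module, and scaling the result by `penalty_factor` is an assumption about how that parameter is applied, not a description of the actual `MathDAPORewardFn` implementation.

```python
def overlong_penalty(
    response_length: int,
    max_response_length: int,
    cache_length: int,
    penalty_factor: float = 1.0,
) -> float:
    """Hypothetical helper: soft overlong penalty in the style of DAPO.

    Returns a value in [-penalty_factor, 0.0]. Multiplying by
    penalty_factor is an assumption made for this sketch.
    """
    if not 0 < cache_length <= max_response_length:
        raise ValueError("cache_length must be in (0, max_response_length]")

    # Responses up to this length are not penalized at all.
    expected_length = max_response_length - cache_length

    if response_length <= expected_length:
        return 0.0
    if response_length <= max_response_length:
        # Linear ramp from 0 down to -penalty_factor across the window.
        return penalty_factor * (expected_length - response_length) / cache_length
    # Past the hard limit: full penalty.
    return -penalty_factor


if __name__ == "__main__":
    # Example lengths: 16384 unpenalized tokens plus a 4096-token window.
    for n in (16384, 18432, 20480, 30000):
        print(n, overlong_penalty(n, max_response_length=20480, cache_length=4096))
```

In this sketch the returned penalty would be added to the correctness reward, so an overlong but correct response still scores lower than a concise correct one.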