trinity.common.rewards.naive_dapo_score module
This file contains the naive DAPO reward function for math tasks, adapted from LLM360/Reasoning360.
- trinity.common.rewards.naive_dapo_score.normalize_final_answer(final_answer: str) → str
Normalize a final answer to a quantitative reasoning question.
- Parameters:
final_answer – The answer string to normalize
- Returns:
Normalized answer string
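The source does not list the individual normalization rules, so the following is only a minimal sketch of the kind of clean-up such a function might perform. The name `normalize_answer_sketch` and every rule in it (keeping the right-hand side of an equation, dropping `$`, unwrapping `\text{}` and `\boxed{}`, removing thousands separators) are assumptions for illustration, not the module's actual behavior.

```python
import re


def normalize_answer_sketch(final_answer: str) -> str:
    """Hypothetical normalization; the rules below are assumptions."""
    # Keep only the right-hand side of an equation such as "x = 5".
    answer = final_answer.split("=")[-1]
    # Drop math-mode dollar signs.
    answer = answer.replace("$", "")
    # Unwrap non-nested \text{...} and \boxed{...} wrappers.
    answer = re.sub(r"\\text\{([^{}]*)\}", r"\1", answer)
    answer = re.sub(r"\\boxed\{([^{}]*)\}", r"\1", answer)
    answer = answer.strip()
    # Remove thousands separators only when the result is a plain integer.
    if answer.replace(",", "").isdigit():
        answer = answer.replace(",", "")
    return answer
```

Under these assumed rules, `"$x = 1,000$"` and `"\\boxed{1000}"` both reduce to the same string, which is what makes a later string comparison meaningful.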
- trinity.common.rewards.naive_dapo_score.are_equal_under_sympy(ground_truth_normalized: str, given_normalized: str)
- trinity.common.rewards.naive_dapo_score.split_tuple(expr: str)
Split the elements of a tuple or interval, while handling well-formatted commas in large numbers.
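One way to read the description above is sketched below: commas that act as thousands separators are removed first, and only then is a bracketed expression split on the remaining commas. The names `strip_thousands_commas`, `split_tuple_sketch`, and `TUPLE_CHARS` are hypothetical, and the exact rules (for example, which delimiters count) are assumptions rather than the module's confirmed implementation.

```python
import re
from typing import List

TUPLE_CHARS = "()[]"  # assumed tuple/interval delimiters


def strip_thousands_commas(expr: str) -> str:
    """Remove commas that sit between a digit and exactly three digits."""
    pattern = re.compile(r"(\d),(\d\d\d)($|\D)")
    while True:
        # Repeat until stable, since one pass can miss adjacent groups.
        stripped = pattern.sub(r"\1\2\3", expr)
        if stripped == expr:
            return expr
        expr = stripped


def split_tuple_sketch(expr: str) -> List[str]:
    """Split "(a, b)" or "[a, b)" into elements; leave other input whole."""
    expr = strip_thousands_commas(expr)
    if not expr:
        return []
    is_wrapped = (
        len(expr) > 2
        and expr[0] in TUPLE_CHARS
        and expr[-1] in TUPLE_CHARS
        # Nested brackets are not split in this sketch.
        and not any(ch in TUPLE_CHARS for ch in expr[1:-1])
    )
    if is_wrapped:
        return [part.strip() for part in expr[1:-1].split(",")]
    return [expr]
```

With this approach `"(1,000, 2)"` yields two elements rather than three, because the comma inside `1,000` is removed before splitting.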
- trinity.common.rewards.naive_dapo_score.grade_answer(given_answer: str, ground_truth: str) → tuple[bool, str]
The answer is considered correct if (a) it normalizes to the same string as the ground truth answer, or (b) sympy can simplify the difference between the two expressions to 0.
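The two criteria can be illustrated with a short sketch. It is not the module's code: `grade_sketch` and `_normalize` are hypothetical names, the normalization is reduced to whitespace removal, and the sketch returns only a boolean because the meaning of the string in the real return tuple is not documented here. The sympy import is made optional purely so the sketch still runs where sympy is not installed.

```python
try:
    import sympy  # optional here; criterion (b) is skipped without it
except ImportError:
    sympy = None


def _normalize(text: str) -> str:
    # Stand-in normalization (assumption): remove all whitespace.
    return "".join(text.split())


def grade_sketch(given_answer: str, ground_truth: str) -> bool:
    given, truth = _normalize(given_answer), _normalize(ground_truth)
    # (a) Identical strings after normalization.
    if given == truth:
        return True
    # (b) The difference of the two expressions simplifies to zero.
    if sympy is None:
        return False
    try:
        # Note: sympify evaluates its input, so avoid untrusted strings.
        difference = sympy.sympify(given) - sympy.sympify(truth)
        return bool(sympy.simplify(difference) == 0)
    except Exception:
        # Unparseable or non-algebraic input counts as a mismatch.
        return False
```

Checking string equality first keeps the common case cheap and avoids parsing at all when the answers already match; the symbolic check only runs for answers that differ in form, such as `2*(x+1)` versus `2*x+2`.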
- trinity.common.rewards.naive_dapo_score.compute_score(solution_str: str, ground_truth: str) → Tuple[float, str]
Compute the reward score for a solution. This draws heavily from the LLM-as-judge and PRIME reward functions.
- Parameters:
solution_str – The solution string
ground_truth – The ground truth answer
extra_info – dict with additional info for the score computation
- Returns:
(reward score 1.0 or 0.0, extracted_model_output)
- Return type:
Tuple[float, str]
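The return shape described above, a binary reward paired with the extracted model output, can be sketched as follows. How the real function extracts the answer is not stated in this page; the sketch assumes the final answer appears in the last `\boxed{...}` of the solution and compares it by plain string equality. Both `last_boxed_content` and `score_sketch` are hypothetical names.

```python
from typing import Optional, Tuple

BOXED_PREFIX = "\\boxed{"


def last_boxed_content(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, or None if absent."""
    start = text.rfind(BOXED_PREFIX)
    if start == -1:
        return None
    content_start = start + len(BOXED_PREFIX)
    depth = 0
    # Begin at the opening brace and walk until it is balanced.
    for i in range(content_start - 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces


def score_sketch(solution_str: str, ground_truth: str) -> Tuple[float, str]:
    """Assumed scoring rule: 1.0 on an exact match, otherwise 0.0."""
    extracted = last_boxed_content(solution_str)
    if extracted is None:
        return 0.0, ""
    is_correct = extracted.strip() == ground_truth.strip()
    return (1.0 if is_correct else 0.0), extracted
```

Returning the extracted string alongside the score makes it possible to log what was actually graded, which helps when diagnosing a reward of 0.0.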