trinity.utils.metrics module

Unified metrics aggregation utilities for Trinity-RFT.

Metric keys may carry an aggregation-type suffix in the form name:agg. Supported suffixes: :mean, :sum, :max, :min, :last. Keys without a suffix default to mean aggregation.

class trinity.utils.metrics.AggType(*values)

Bases: str, Enum

MEAN = 'mean'

SUM = 'sum'

MAX = 'max'

MIN = 'min'

LAST = 'last'

trinity.utils.metrics.take_last(values: List[float]) → float

trinity.utils.metrics.group_numeric_metrics(metric_dicts: List[Dict[str, float]]) → Dict[Tuple[str, AggType], List[float]]

trinity.utils.metrics.group_metrics_by_canonical_key(metric_dicts: List[Dict[str, float]]) → Dict[str, Tuple[AggType, List[float]]]

trinity.utils.metrics.parse_metric_key(key: str) → Tuple[str, AggType]

Parse a metric key into (name, aggregation_type).

Examples

"reward" -> ("reward", AggType.MEAN)
"experience_count:sum" -> ("experience_count", AggType.SUM)
"model_version:last" -> ("model_version", AggType.LAST)
"some:unknown_suffix" -> ("some:unknown_suffix", AggType.MEAN)
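The parsing rule shown in the examples can be sketched as follows. This is a minimal illustration and not the module's actual source: the name parse_metric_key_sketch is hypothetical, and splitting on the last colon is an assumption that happens to reproduce the four documented examples.

```python
from enum import Enum
from typing import Tuple


class AggType(str, Enum):
    """Local copy of the documented enum so the sketch is self-contained."""

    MEAN = "mean"
    SUM = "sum"
    MAX = "max"
    MIN = "min"
    LAST = "last"


def parse_metric_key_sketch(key: str) -> Tuple[str, AggType]:
    """Hypothetical re-implementation of the documented parsing rule."""
    # Assumption: only the text after the LAST colon is treated as a suffix.
    name, sep, suffix = key.rpartition(":")
    if sep and name:
        try:
            return name, AggType(suffix)
        except ValueError:
            pass  # unknown suffix: keep the whole key and fall back to MEAN
    return key, AggType.MEAN


for key in ("reward", "experience_count:sum", "some:unknown_suffix"):
    print(key, "->", parse_metric_key_sketch(key))
```

An unrecognized suffix is not an error here: the whole key, colon included, becomes the metric name and is averaged, which matches the last documented example.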

trinity.utils.metrics.aggregate_metrics(metric_dicts: List[Dict[str, float]], prefix: str = '', default_output_stats: List[str] | None = None) → Dict[str, float]

Aggregate a list of metric dictionaries respecting per-key aggregation types.

Output keys depend on each key's aggregation type:

AggType.MEAN: {prefix}/{name}/mean, {prefix}/{name}/max, {prefix}/{name}/min (controlled by default_output_stats)
AggType.SUM: {prefix}/{name}/sum
AggType.MAX: {prefix}/{name}/max
AggType.MIN: {prefix}/{name}/min
AggType.LAST: {prefix}/{name}/last

Parameters:

metric_dicts – List of flat metric dictionaries (values must be numeric).
prefix – Optional prefix prepended as {prefix}/{name}/....
default_output_stats – Stats to output for MEAN metrics. Defaults to ["mean", "max", "min"].

Returns:

Flat dictionary of aggregated metrics ready for monitor logging.
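To make the output key layout concrete, here is a small self-contained sketch of the behavior described above. It does not import Trinity-RFT; aggregate_metrics_sketch is a hypothetical stand-in, and two details are assumptions rather than documented behavior: the suffix is split on the last colon, and an empty prefix is simply omitted from the output key.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Reducers for every statistic the documentation mentions.
_REDUCERS = {
    "mean": lambda v: sum(v) / len(v),
    "sum": sum,
    "max": max,
    "min": min,
    "last": lambda v: v[-1],
}


def aggregate_metrics_sketch(
    metric_dicts: List[Dict[str, float]],
    prefix: str = "",
    default_output_stats: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Hypothetical re-implementation of the documented aggregation rules."""
    stats = default_output_stats or ["mean", "max", "min"]

    # Group values by (name, aggregation type), keeping insertion order.
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for metrics in metric_dicts:
        for key, value in metrics.items():
            name, _, suffix = key.rpartition(":")
            if not name or suffix not in _REDUCERS:
                name, suffix = key, "mean"  # no or unknown suffix: MEAN
            grouped[(name, suffix)].append(float(value))

    out: Dict[str, float] = {}
    for (name, agg), values in grouped.items():
        # Assumption: an empty prefix is dropped rather than emitted as "/name".
        base = f"{prefix}/{name}" if prefix else name
        for stat in (stats if agg == "mean" else [agg]):
            out[f"{base}/{stat}"] = _REDUCERS[stat](values)
    return out


batches = [
    {"reward": 1.0, "experience_count:sum": 4, "model_version:last": 1},
    {"reward": 3.0, "experience_count:sum": 6, "model_version:last": 2},
]
result = aggregate_metrics_sketch(batches, prefix="rollout")
for key, value in result.items():
    print(key, value)
```

In this sketch a MEAN key fans out into one output per requested statistic, while every other aggregation type produces exactly one output whose final path segment names the aggregation.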

trinity.utils.metrics.aggregate_eval_metrics(metric_dicts: List[Dict[str, float]], prefix: str = '', output_stats: List[str] | None = None, detailed_stats: bool = False) → Dict[str, float]

Aggregate eval metrics with optional detailed statistics.

For MEAN metrics:

If detailed_stats=True: output mean/max/min/std per the output_stats list.
If detailed_stats=False: output only the mean value as {prefix}/{name}.

For non-MEAN metrics: same behavior as aggregate_metrics.
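The difference between the two modes for a MEAN metric can be sketched as below. The helper eval_mean_outputs_sketch is hypothetical and covers only the MEAN branch. Two points are assumptions not stated in this reference: that detailed statistics are written as {prefix}/{name}/{stat}, and that std is the population standard deviation.

```python
import statistics
from typing import Dict, List

# Assumption: "std" means population standard deviation (statistics.pstdev).
_EVAL_REDUCERS = {
    "mean": statistics.fmean,
    "max": max,
    "min": min,
    "std": statistics.pstdev,
}


def eval_mean_outputs_sketch(
    name: str,
    values: List[float],
    prefix: str,
    output_stats: List[str],
    detailed_stats: bool,
) -> Dict[str, float]:
    """Hypothetical illustration of the MEAN branch for eval metrics."""
    if not detailed_stats:
        # Compact mode: a single value keyed as {prefix}/{name}.
        return {f"{prefix}/{name}": statistics.fmean(values)}
    # Detailed mode: one entry per requested statistic (key layout assumed).
    return {
        f"{prefix}/{name}/{stat}": _EVAL_REDUCERS[stat](values)
        for stat in output_stats
    }


accuracy = [0.0, 1.0, 1.0, 0.0]
compact = eval_mean_outputs_sketch("accuracy", accuracy, "eval", ["mean", "std"], False)
detailed = eval_mean_outputs_sketch("accuracy", accuracy, "eval", ["mean", "std"], True)
print(compact)
print(detailed)
```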

trinity.utils.metrics.aggregate_run_level_metrics(metric_dicts: List[Dict[str, float]]) → Dict[str, float]

Aggregate experience-level metrics into a single run-level metric dict.

Unlike batch-level aggregation, this preserves the original key format (with :agg suffix if present) so that downstream task/batch aggregation can still see the aggregation type annotation.

Aggregation rules:

MEAN keys: averaged across experiences
SUM keys: summed across experiences
MAX keys: max across experiences
MIN keys: min across experiences
LAST keys: last value
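The rules above, together with the key-preserving behavior, can be sketched like this. The function aggregate_run_level_sketch is a hypothetical illustration rather than the module's code, and it assumes the suffix is whatever follows the last colon in the key.

```python
from typing import Callable, Dict, List


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Map each recognized suffix to its reducer.
_RUN_REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "mean": _mean,
    "sum": sum,
    "max": max,
    "min": min,
    "last": lambda v: v[-1],
}


def aggregate_run_level_sketch(metric_dicts: List[Dict[str, float]]) -> Dict[str, float]:
    """Hypothetical sketch: reduce per-experience metrics, keeping keys intact."""
    grouped: Dict[str, List[float]] = {}
    for metrics in metric_dicts:
        for key, value in metrics.items():
            grouped.setdefault(key, []).append(float(value))

    out: Dict[str, float] = {}
    for key, values in grouped.items():
        # Only a real ":suffix" selects a reducer; bare keys default to mean.
        suffix = key.rpartition(":")[2] if ":" in key else ""
        reducer = _RUN_REDUCERS.get(suffix, _mean)
        out[key] = reducer(values)  # the original key, suffix included, is kept
    return out


experiences = [
    {"reward": 0.0, "tokens:sum": 10, "turns:max": 2},
    {"reward": 1.0, "tokens:sum": 30, "turns:max": 5},
]
run_metrics = aggregate_run_level_sketch(experiences)
print(run_metrics)
```

Because tokens:sum keeps its suffix in the output, a later batch-level pass can still tell that the value should be summed rather than averaged.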

trinity.utils.metrics.bootstrap_metric(data: List[Any], subset_size: int, reduce_fns: List[Callable[[List[Any]], float]], n_bootstrap: int = 1000, seed: int = 42) → List[Tuple[float, float]]

Estimate metric statistics with bootstrap resampling.
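As a rough illustration of bootstrap resampling with this signature, the sketch below draws subset_size items with replacement n_bootstrap times and applies each reduce function to every draw. The reference does not say what the two floats in each returned tuple are; treating them as (mean, standard deviation) of the bootstrap estimates is an assumption, as is the use of Python's random module. bootstrap_metric_sketch is a hypothetical name.

```python
import random
import statistics
from typing import Any, Callable, List, Tuple


def bootstrap_metric_sketch(
    data: List[Any],
    subset_size: int,
    reduce_fns: List[Callable[[List[Any]], float]],
    n_bootstrap: int = 1000,
    seed: int = 42,
) -> List[Tuple[float, float]]:
    """Hypothetical sketch; returns an assumed (mean, std) pair per reduce fn."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    estimates: List[List[float]] = [[] for _ in reduce_fns]
    for _ in range(n_bootstrap):
        # Sample with replacement, as in a standard bootstrap.
        sample = rng.choices(data, k=subset_size)
        for bucket, fn in zip(estimates, reduce_fns):
            bucket.append(fn(sample))
    # Assumption: summarize each reducer by mean and population std.
    return [(statistics.fmean(b), statistics.pstdev(b)) for b in estimates]


scores = [0.0, 1.0, 1.0, 0.0, 1.0]
best_of_2, mean_of_2 = bootstrap_metric_sketch(
    scores, subset_size=2, reduce_fns=[max, statistics.fmean], n_bootstrap=200
)
print("best-of-2:", best_of_2)
print("mean-of-2:", mean_of_2)
```

With reduce_fns=[max], the first element of the tuple estimates the expected best score among subset_size sampled attempts, which is one common use of this kind of resampling in evaluation.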

trinity.utils.metrics.calculate_task_level_metrics(metrics: List[Dict[str, float]], is_eval: bool) → Dict[str, float]

Calculate task-level metrics from multiple runs of the same task.
