trinity.utils.metrics module

Unified metrics aggregation utilities for Trinity-RFT.

Metric keys may carry an aggregation-type suffix in the form name:agg. Supported suffixes: :mean, :sum, :max, :min, :last. Keys without a suffix default to mean aggregation.

class trinity.utils.metrics.AggType(*values)

Bases: str, Enum

MEAN = 'mean'
SUM = 'sum'
MAX = 'max'
MIN = 'min'
LAST = 'last'

trinity.utils.metrics.take_last(values: List[float]) -> float

trinity.utils.metrics.group_numeric_metrics(metric_dicts: List[Dict[str, float]]) -> Dict[Tuple[str, AggType], List[float]]

trinity.utils.metrics.group_metrics_by_canonical_key(metric_dicts: List[Dict[str, float]]) -> Dict[str, Tuple[AggType, List[float]]]

trinity.utils.metrics.parse_metric_key(key: str) -> Tuple[str, AggType]

Parse a metric key into (name, aggregation_type).

Examples

  • "reward" -> ("reward", AggType.MEAN)

  • "experience_count:sum" -> ("experience_count", AggType.SUM)

  • "model_version:last" -> ("model_version", AggType.LAST)

  • "some:unknown_suffix" -> ("some:unknown_suffix", AggType.MEAN)
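The parsing rule implied by these examples can be sketched as follows. This is a self-contained stand-in, not the library's implementation: parse_metric_key_sketch is a hypothetical name, and splitting on the last colon is an assumption that happens to reproduce the four documented cases.

```python
from enum import Enum
from typing import Tuple


class AggType(str, Enum):
    """Local copy of the documented enum so the sketch runs on its own."""
    MEAN = "mean"
    SUM = "sum"
    MAX = "max"
    MIN = "min"
    LAST = "last"


def parse_metric_key_sketch(key: str) -> Tuple[str, AggType]:
    """Hypothetical stand-in for parse_metric_key (assumption: split on last ':')."""
    name, sep, suffix = key.rpartition(":")
    if sep:
        try:
            # A recognized suffix is stripped from the name.
            return name, AggType(suffix)
        except ValueError:
            # Unknown suffix: fall through and keep the whole key.
            pass
    return key, AggType.MEAN


parsed = [
    parse_metric_key_sketch(k)
    for k in ("reward", "experience_count:sum",
              "model_version:last", "some:unknown_suffix")
]
```

Note that an unrecognized suffix is not an error in the documented examples: the full key, colon included, becomes the metric name and is aggregated as a mean.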

trinity.utils.metrics.aggregate_metrics(metric_dicts: List[Dict[str, float]], prefix: str = '', default_output_stats: List[str] | None = None) -> Dict[str, float]

Aggregate a list of metric dictionaries respecting per-key aggregation types.

Output keys depend on the aggregation type of each metric:
  • AggType.MEAN: {prefix}/{name}/mean, {prefix}/{name}/max, and {prefix}/{name}/min (controlled by default_output_stats).

  • AggType.SUM: {prefix}/{name}/sum.

  • AggType.MAX: {prefix}/{name}/max.

  • AggType.MIN: {prefix}/{name}/min.

  • AggType.LAST: {prefix}/{name}/last.

Parameters:
  • metric_dicts – List of flat metric dictionaries (values must be numeric).

  • prefix – Optional prefix prepended as {prefix}/{name}/....

  • default_output_stats – Stats to output for MEAN metrics. Defaults to ["mean", "max", "min"].

Returns:

Flat dictionary of aggregated metrics ready for monitor logging.
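To make the output key layout concrete, here is a minimal sketch of the documented behavior. It is not the library's code: aggregate_metrics_sketch, split_key, and REDUCERS are hypothetical names, and the example passes a non-empty prefix because the handling of an empty prefix is not specified above.

```python
from collections import defaultdict
from statistics import fmean
from typing import Callable, Dict, List, Optional, Tuple

# One reducer per supported suffix; "last" keeps the final value seen.
REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "mean": fmean,
    "sum": sum,
    "max": max,
    "min": min,
    "last": lambda values: values[-1],
}


def split_key(key: str) -> Tuple[str, str]:
    """Split 'name:agg'; unknown or missing suffixes default to 'mean'."""
    name, sep, suffix = key.rpartition(":")
    return (name, suffix) if sep and suffix in REDUCERS else (key, "mean")


def aggregate_metrics_sketch(
    metric_dicts: List[Dict[str, float]],
    prefix: str = "",
    default_output_stats: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Hypothetical stand-in for aggregate_metrics."""
    stats = default_output_stats or ["mean", "max", "min"]
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for metrics in metric_dicts:
        for key, value in metrics.items():
            grouped[split_key(key)].append(value)

    out: Dict[str, float] = {}
    for (name, agg), values in grouped.items():
        # MEAN metrics fan out into several stats; others emit exactly one.
        for stat in (stats if agg == "mean" else [agg]):
            out[f"{prefix}/{name}/{stat}"] = REDUCERS[stat](values)
    return out


batch = [
    {"reward": 1.0, "experience_count:sum": 4, "model_version:last": 2},
    {"reward": 0.0, "experience_count:sum": 6, "model_version:last": 3},
]
result = aggregate_metrics_sketch(batch, prefix="rollout")
```

With this input the sketch produces five keys: rollout/reward/mean, rollout/reward/max, rollout/reward/min, rollout/experience_count/sum, and rollout/model_version/last.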

trinity.utils.metrics.aggregate_eval_metrics(metric_dicts: List[Dict[str, float]], prefix: str = '', output_stats: List[str] | None = None, detailed_stats: bool = False) -> Dict[str, float]

Aggregate eval metrics with optional detailed statistics.

For MEAN metrics:
  • If detailed_stats=True: output the statistics named in output_stats (any of mean, max, min, std).

  • If detailed_stats=False: output only the mean value as {prefix}/{name}.

For non-MEAN metrics: same behavior as aggregate_metrics.
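The difference between the two modes for a single MEAN metric can be sketched as below. Several details are assumptions rather than documented behavior: the function name eval_mean_outputs_sketch, the {prefix}/{name}/{stat} key layout in detailed mode (borrowed from aggregate_metrics), the fallback list used when output_stats is None, and the use of population standard deviation for std.

```python
from statistics import fmean, pstdev
from typing import Callable, Dict, List, Optional

# Assumption: "std" is the population standard deviation.
STAT_FNS: Dict[str, Callable[[List[float]], float]] = {
    "mean": fmean,
    "max": max,
    "min": min,
    "std": pstdev,
}


def eval_mean_outputs_sketch(
    name: str,
    values: List[float],
    prefix: str,
    output_stats: Optional[List[str]] = None,
    detailed_stats: bool = False,
) -> Dict[str, float]:
    """Hypothetical illustration of how one MEAN eval metric is emitted."""
    if not detailed_stats:
        # Compact mode: a single key with no stat suffix.
        return {f"{prefix}/{name}": fmean(values)}
    # Assumption: fall back to mean and std when no list is given.
    stats = output_stats or ["mean", "std"]
    return {f"{prefix}/{name}/{stat}": STAT_FNS[stat](values) for stat in stats}


scores = [1.0, 0.0, 1.0, 0.0]
compact = eval_mean_outputs_sketch("accuracy", scores, prefix="eval")
detailed = eval_mean_outputs_sketch(
    "accuracy", scores, prefix="eval",
    output_stats=["mean", "std"], detailed_stats=True,
)
```

In compact mode the key carries no stat suffix, which keeps evaluation dashboards to one series per metric.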

trinity.utils.metrics.aggregate_run_level_metrics(metric_dicts: List[Dict[str, float]]) -> Dict[str, float]

Aggregate experience-level metrics into a single run-level metric dict.

Unlike batch-level aggregation, this preserves the original key format (with :agg suffix if present) so that downstream task/batch aggregation can still see the aggregation type annotation.

Aggregation rules:
  • MEAN keys: averaged across experiences

  • SUM keys: summed across experiences

  • MAX keys: max across experiences

  • MIN keys: min across experiences

  • LAST keys: the last value across experiences
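The rules above, together with the key-preserving behavior, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: aggregate_run_level_sketch is a hypothetical name, and the metric names tokens and turns are made up for the example.

```python
from collections import defaultdict
from statistics import fmean
from typing import Callable, Dict, List

REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "mean": fmean,
    "sum": sum,
    "max": max,
    "min": min,
    "last": lambda values: values[-1],
}


def aggregate_run_level_sketch(
    metric_dicts: List[Dict[str, float]],
) -> Dict[str, float]:
    """Hypothetical stand-in for aggregate_run_level_metrics."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for metrics in metric_dicts:
        for key, value in metrics.items():
            grouped[key].append(value)

    out: Dict[str, float] = {}
    for key, values in grouped.items():
        _, sep, suffix = key.rpartition(":")
        agg = suffix if sep and suffix in REDUCERS else "mean"
        # The output key is the ORIGINAL key, suffix included, so a later
        # task- or batch-level pass can still read the aggregation type.
        out[key] = REDUCERS[agg](values)
    return out


experiences = [
    {"reward": 1.0, "tokens:sum": 120, "turns:max": 3},
    {"reward": 0.0, "tokens:sum": 80, "turns:max": 5},
]
run_metrics = aggregate_run_level_sketch(experiences)
```

Because tokens:sum keeps its suffix in the result, feeding several such run-level dicts into a batch-level aggregator would still sum the token counts rather than average them.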

trinity.utils.metrics.bootstrap_metric(data: List[Any], subset_size: int, reduce_fns: List[Callable[[List[Any]], float]], n_bootstrap: int = 1000, seed: int = 42) -> List[Tuple[float, float]]

Estimate metric statistics with bootstrap resampling.
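The signature suggests one (float, float) pair per entry in reduce_fns. A minimal sketch of bootstrap resampling with that shape is given below. It assumes that each pair is the (mean, standard deviation) of the reduced values over all resamples and that subsets are drawn with replacement; neither point is stated in the documentation, and bootstrap_metric_sketch is a hypothetical name.

```python
import random
from statistics import fmean, pstdev
from typing import Any, Callable, List, Tuple


def bootstrap_metric_sketch(
    data: List[Any],
    subset_size: int,
    reduce_fns: List[Callable[[List[Any]], float]],
    n_bootstrap: int = 1000,
    seed: int = 42,
) -> List[Tuple[float, float]]:
    """Hypothetical stand-in for bootstrap_metric (assumed return: mean, std)."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    per_fn: List[List[float]] = [[] for _ in reduce_fns]
    for _ in range(n_bootstrap):
        # Assumption: sample subset_size items WITH replacement.
        subset = rng.choices(data, k=subset_size)
        for i, fn in enumerate(reduce_fns):
            per_fn[i].append(fn(subset))
    return [(fmean(vals), pstdev(vals)) for vals in per_fn]


# Example: estimate the expected mean and best-of-4 score over 8 outcomes.
outcomes = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
estimates = bootstrap_metric_sketch(outcomes, subset_size=4,
                                    reduce_fns=[fmean, max], n_bootstrap=200)
```

Fixing the seed makes repeated calls with the same arguments return identical estimates, which is useful when comparing evaluation runs.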

trinity.utils.metrics.calculate_task_level_metrics(metrics: List[Dict[str, float]], is_eval: bool) -> Dict[str, float]

Calculate task-level metrics from multiple runs of the same task.