Automatically evaluate and compare multiple models or AI agents without pre-existing test data. This end-to-end pipeline generates test queries, collects responses, and ranks models/agents through pairwise comparison.

Overview

Auto Arena is ideal for model comparison, agent pipeline testing, new domain evaluation, and rapid prototyping—all without preparing test data upfront.

No Test Data Required

Unlike traditional evaluation, Auto Arena generates its own test queries from the task description, eliminating the need for pre-existing test datasets.

The pipeline automates seven steps: generate test queries → collect responses → create evaluation rubrics → run pairwise comparisons → analyze results → generate report → create visualization.

| Step | Component | Description |
|------|-----------|-------------|
| 1 | `QueryGenerator` | Generate diverse test queries from task description |
| 2 | `ResponseCollector` | Collect responses from all target endpoints |
| 3 | `TaskBasedRubricGenerator` | Generate evaluation criteria for the task |
| 4 | `GradingRunner` | Run pairwise comparisons with judge model |
| 5 | `PairwiseAnalyzer` | Analyze results and produce rankings |
| 6 | `ReportGenerator` | Generate detailed Markdown evaluation report |
| 7 | `WinRateChartGenerator` | Create win rate visualization chart |

Quick Start

Get started with Auto Arena in just a few lines of code. Choose the approach that best fits your workflow:

The recommended way to run evaluations programmatically:

import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    print(f"Best Model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())

Run evaluations directly from the command line:

# Run evaluation with config file
python -m cookbooks.auto_arena --config config.yaml --save

# Resume from checkpoint (default behavior)
python -m cookbooks.auto_arena --config config.yaml --save

# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save

# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml --queries_file queries.json --save

Skip query generation by providing your own queries file—useful when you want to evaluate models on a specific set of questions.

Create a queries.json file with your test cases:

[
  {"query": "Translate: AI is transforming industries."},
  {"query": "Translate: The weather is nice today."},
  {"query": "Translate: How to learn programming effectively?"}
]

Optional Fields

The `category` and `difficulty` fields are optional: `{"query": "...", "category": "general", "difficulty": "easy"}`
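
If you prefer to build the file programmatically, here is a minimal sketch (plain Python, no Auto Arena imports); the category and difficulty values are illustrative:

import json

# Illustrative test cases; "category" and "difficulty" are optional fields.
queries = [
    {"query": "Translate: AI is transforming industries.", "category": "general", "difficulty": "easy"},
    {"query": "Translate: The weather is nice today.", "category": "general", "difficulty": "easy"},
    {"query": "Translate: How to learn programming effectively?", "category": "general", "difficulty": "medium"},
]

with open("queries.json", "w", encoding="utf-8") as f:
    json.dump(queries, f, ensure_ascii=False, indent=2)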

Then run the evaluation with your queries:

python -m cookbooks.auto_arena --config config.yaml --queries_file queries.json --save

All methods require a YAML configuration file. Here's a complete example:

# Task description
task:
  description: "English to Chinese translation assistant"
  scenario: "Users need to translate English content into fluent Chinese"

# Target endpoints to evaluate
target_endpoints:
  gpt4_baseline:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
    extra_params:
      temperature: 0.7

  qwen_candidate:
    base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
    api_key: "${DASHSCOPE_API_KEY}"
    model: "qwen-max"
    extra_params:
      temperature: 0.7

# Judge endpoint for pairwise evaluation
judge_endpoint:
  base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
  api_key: "${DASHSCOPE_API_KEY}"
  model: "qwen-max"
  extra_params:
    temperature: 0.1

# Query generation settings
query_generation:
  num_queries: 20
  seed_queries:
    - "Translate this paragraph into Chinese: 'AI is transforming industries.'"
  queries_per_call: 10
  temperature: 0.9

# Evaluation settings
evaluation:
  max_concurrency: 10
  timeout: 60

# Output settings
output:
  output_dir: "./evaluation_results"

Environment Variables

Use ${ENV_VAR} syntax to reference environment variables for sensitive data like API keys.
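
The sketch below illustrates the idea using os.path.expandvars; it is not the actual Auto Arena config loader, which may resolve placeholders differently:

import os
import yaml  # requires PyYAML

# Load config.yaml and substitute ${ENV_VAR} placeholders from the environment.
with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(os.path.expandvars(f.read()))

print(config["judge_endpoint"]["model"])  # e.g. "qwen-max"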

Component Guide

For fine-grained control, use individual pipeline components directly. The workflow below shows how each component connects:

Pipeline Components
  1. Generate Test Queries: Use `QueryGenerator` to create diverse test queries from your task description. Supports parallel generation, automatic deduplication, and optional Evol-Instruct complexity evolution.
  2. Collect Responses: Use `ResponseCollector` to query all target models/agents concurrently and gather their responses for comparison.
  3. Generate Evaluation Rubrics: Use `TaskBasedRubricGenerator` to automatically create evaluation criteria (accuracy, completeness, clarity, etc.) tailored to your specific task.
  4. Run Pairwise Evaluation: Use `AutoArenaPipeline` to orchestrate the full evaluation, comparing all response pairs and producing final rankings.

Code Examples for Each Step

Step 1: Generate Test Queries

from cookbooks.auto_arena.query_generator import QueryGenerator
from cookbooks.auto_arena.schema import TaskConfig, QueryGenerationConfig, OpenAIEndpoint

task = TaskConfig(
    description="Code review assistant for Python",
    scenario="Review code for bugs, style issues, and improvements"
)

judge_endpoint = OpenAIEndpoint(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    model="gpt-4"
)

query_config = QueryGenerationConfig(
    num_queries=20,
    seed_queries=["Review this Python function for bugs..."],
    enable_evolution=True,
    evolution_rounds=1
)

generator = QueryGenerator(judge_endpoint, task, query_config)
queries = await generator.generate()
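
Generated queries can be reused later with --queries_file. A minimal sketch for saving them, assuming each generated item exposes a query attribute (as used in Step 3 below):

import json

# Write the generated queries in the queries.json format shown in Quick Start.
with open("queries.json", "w", encoding="utf-8") as f:
    json.dump([{"query": q.query} for q in queries], f, ensure_ascii=False, indent=2)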

Step 2: Collect Responses

from cookbooks.auto_arena.response_collector import ResponseCollector
from cookbooks.auto_arena.schema import EvaluationConfig

# endpoint_a / endpoint_b are OpenAIEndpoint instances for the models under test
collector = ResponseCollector(
    target_endpoints={"model_a": endpoint_a, "model_b": endpoint_b},
    evaluation_config=EvaluationConfig(max_concurrency=10)
)
responses = await collector.collect(queries)

Step 3: Generate Evaluation Rubrics

from openjudge.generator.simple_rubric import TaskBasedRubricGenerator

# judge_model: the judge model instance used for rubric generation (constructed separately)
rubric_gen = TaskBasedRubricGenerator(
    model=judge_model,
    task_description=task.description,
    scenario=task.scenario,
)
rubrics = await rubric_gen.generate(sample_queries=[q.query for q in queries[:5]])
# Output: Accuracy, Completeness, Clarity criteria

Step 4: Run Full Evaluation

from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

# target_endpoints maps names to endpoints, as in the YAML config above
pipeline = AutoArenaPipeline(
    task_description="Code review assistant",
    target_endpoints=target_endpoints,
    judge_endpoint=judge_endpoint,
    num_queries=20
)
result = await pipeline.evaluate()

Advanced Topics

The EvaluationResult provides comprehensive ranking statistics:

| Field | Type | Description |
|-------|------|-------------|
| `rankings` | `List[Tuple[str, float]]` | Models sorted by win rate (best first) |
| `win_rates` | `Dict[str, float]` | Win rate for each model (0.0-1.0) |
| `win_matrix` | `Dict[str, Dict[str, float]]` | Head-to-head win rates between models |
| `best_pipeline` | `str` | Model with highest win rate |
| `total_queries` | `int` | Total number of test queries |
| `total_comparisons` | `int` | Total number of pairwise comparisons |
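
A minimal sketch of consuming these fields after pipeline.evaluate(), assuming they are exposed as attributes on the result object (the win_matrix indexing is an assumption based on its type):

result = await pipeline.evaluate()

# Overall ranking (best first) with per-model win rates
for rank, (model, win_rate) in enumerate(result.rankings, 1):
    print(f"{rank}. {model}: {win_rate:.1%}")

# Head-to-head win rates of the best model against each opponent
best = result.best_pipeline
for opponent, rate in result.win_matrix[best].items():
    print(f"{best} vs {opponent}: {rate:.1%}")

print(f"{result.total_queries} queries, {result.total_comparisons} comparisons")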

Sample Output

============================================================
AUTO ARENA EVALUATION RESULTS
============================================================
Task: English to Chinese translation assistant...
Queries: 20 | Comparisons: 80

Rankings:
  1. qwen_candidate      [################----] 80.0%
  2. gpt4_baseline       [########------------] 40.0%

Best Pipeline: qwen_candidate
============================================================

Output Files:

| File | Description |
|------|-------------|
| `evaluation_report.md` | Detailed Markdown report with analysis |
| `win_rate_chart.png` | Visual bar chart for presentations |
| `comparison_details.json` | Traceable pairwise comparison records |
| `evaluation_results.json` | Structured result data (JSON) |

Fine-tune query generation behavior:

| Option | Default | Description |
|--------|---------|-------------|
| `num_queries` | 20 | Total number of queries to generate |
| `queries_per_call` | 10 | Queries per API call (1-50) |
| `num_parallel_batches` | 3 | Number of parallel generation batches |
| `temperature` | 0.9 | Sampling temperature for diversity |
| `max_similarity` | 0.85 | Deduplication similarity threshold |
| `enable_evolution` | false | Enable Evol-Instruct complexity evolution |
| `evolution_rounds` | 1 | Number of evolution rounds (0-3) |
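
The same options can also be set programmatically through QueryGenerationConfig (see Step 1 above); the sketch below assumes the field names mirror the YAML keys, with illustrative values:

from cookbooks.auto_arena.schema import QueryGenerationConfig

# Values are illustrative; field names are assumed to mirror the YAML options above.
query_config = QueryGenerationConfig(
    num_queries=40,
    queries_per_call=10,
    num_parallel_batches=4,
    temperature=0.9,
    max_similarity=0.85,
    enable_evolution=True,
    evolution_rounds=1,
)
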
Enable Evol-Instruct for Harder Queries

Evol-Instruct progressively increases query complexity:

query_generation:
  enable_evolution: true
  evolution_rounds: 2
  complexity_levels:
    - "constraints"    # Add time, scope, or condition constraints
    - "reasoning"      # Require multi-step reasoning
    - "edge_cases"     # Include edge cases

Generate a comprehensive Markdown report with concrete examples:

report:
  enabled: true        # Enable report generation
  language: "zh"       # "zh" (Chinese) or "en" (English)
  include_examples: 3  # Examples per section (1-10)

The report includes Executive Summary, Ranking Explanation, Model Analysis, and Representative Cases.

All results are saved to the output directory:

evaluation_results/
├── evaluation_report.md      # Generated Markdown report
├── win_rate_chart.png        # Win rate visualization chart
├── comparison_details.json   # All pairwise comparison details
├── evaluation_results.json   # Final rankings and statistics
├── queries.json              # Generated test queries
├── responses.json            # Model responses
└── rubrics.json              # Evaluation criteria
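
A minimal sketch for loading the saved artifacts afterwards, assuming standard JSON content in each file:

import json
from pathlib import Path

out_dir = Path("evaluation_results")

# Final rankings and statistics
results = json.loads((out_dir / "evaluation_results.json").read_text(encoding="utf-8"))

# Individual pairwise comparison records, useful for auditing judge decisions
comparisons = json.loads((out_dir / "comparison_details.json").read_text(encoding="utf-8"))

# Inspect the top-level structure before digging in
print(type(results).__name__, type(comparisons).__name__)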

Example Report

View a real report: Oncology Medical Translation Evaluation

Automatically generate a beautiful bar chart showing model win rates:

report:
  chart:
    enabled: true          # Enable chart generation (default: true)
    title: null            # Custom title (auto-generated if not set)
    figsize: [12, 7]       # Figure size (width, height) in inches
    dpi: 150               # Image resolution (72-300)
    format: "png"          # Output format: png / svg / pdf
    show_values: true      # Show percentage values on bars
    highlight_best: true   # Highlight best model with accent color

Chart Features:

  • 🥇 Best model highlighted with orange diagonal stripes
  • 📊 Gray gradient for other models by rank
  • 🔢 Value labels on top of each bar
  • 🌏 CJK font support for Chinese/Japanese/Korean text

Win Rate Chart Example

Example: Oncology medical translation evaluation with 5 models

Evaluations automatically save checkpoints for resumption after interruptions:

# First run (interrupted)
python -m cookbooks.auto_arena --config config.yaml --save

# Resume from checkpoint (automatic)
python -m cookbooks.auto_arena --config config.yaml --save

# Start fresh (ignore checkpoint)
python -m cookbooks.auto_arena --config config.yaml --fresh --save

Checkpoint stages: QUERIES_GENERATED → RESPONSES_COLLECTED → RUBRICS_GENERATED → EVALUATION_COMPLETE

Best Practices

Do

  • Start with a clear task description that captures the core objective
  • Use seed queries to guide query generation style
  • Set num_queries to at least 20 for statistically meaningful results
  • Choose a strong judge model (at least as capable as models being evaluated)
  • Use --save flag to persist results for later analysis
  • Use the generated win rate chart for presentations and reports

Don't

  • Use a judge model weaker than the models being evaluated
  • Set max_concurrency too high for your API rate limits
  • Skip checkpoint resumption for long-running evaluations
  • Compare models with fundamentally different capabilities (e.g., text vs vision)

Related Topics: Pairwise Evaluation · Refine Data Quality · Create Custom Graders · Run Grading Tasks