Evaluate how accurately LLMs recommend real academic references. This benchmark verifies every paper citation against four authoritative academic databases—Crossref, PubMed, arXiv, and DBLP—providing objective, reproducible metrics to measure reference hallucination across models and disciplines.

What is Reference Hallucination Arena?

Reference Hallucination Arena is a benchmark designed to evaluate LLMs' ability to recommend real, verifiable academic papers. Unlike subjective evaluation tasks, this benchmark uses fully automated, objective verification: every reference generated by a model is checked against real-world academic databases.

The benchmark addresses a critical problem: when researchers ask LLMs for literature recommendations, models frequently "hallucinate" references—generating papers that sound plausible but do not actually exist. Reference Hallucination Arena quantifies this phenomenon across multiple models and academic disciplines.

The official evaluation dataset is available on HuggingFace: OpenJudge/ref-hallucination-arena.

Key Features:

| Feature | Description |
| --- | --- |
| Multi-source Verification | Cross-validates references against Crossref, PubMed, arXiv, and DBLP |
| Multi-discipline Coverage | Supports Computer Science, Biomedical, Physics, Chemistry, Social Science, Interdisciplinary, and more |
| Field-level Accuracy | Checks title, author, year, and DOI individually for fine-grained analysis |
| Strict Verification | All fields (title, author, year) must exactly match a real paper to count as VERIFIED |
| Tool-augmented Mode | Optional ReAct agent with Tavily web search to compare bare vs. tool-augmented hallucination rates |
| Year Constraint Support | Tests whether models respect temporal constraints (e.g., "papers after 2020") |
| Checkpoint Resume | Fine-grained per-item checkpointing for long-running evaluations |
| Objective Metrics | No subjective judgment: all scores are based on verifiable facts |

The evaluation pipeline consists of six automated steps:

| Step | Component | Description |
| --- | --- | --- |
| 1 | DatasetLoader | Load evaluation queries from a JSON/JSONL dataset |
| 2 | ResponseCollector | Collect BibTeX-formatted responses from target models (bare mode or tool-augmented ReAct mode) |
| 3 | BibExtractor | Extract structured references from model responses |
| 4 | CompositeVerifier | Verify each reference against Crossref/PubMed/arXiv/DBLP |
| 5 | ObjectiveScorer + RankingCalculator | Compute verification metrics and rank models |
| 6 | RefReportGenerator + RefChartGenerator | Generate the detailed report and visualization charts |

Dataset

The evaluation dataset is hosted on HuggingFace: OpenJudge/ref-hallucination-arena.

Each query item in the dataset follows this schema:

{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}

| Field | Required | Description |
| --- | --- | --- |
| query | Yes | The prompt text for reference recommendation |
| discipline | No | Academic discipline for verification routing |
| num_refs | No | Expected number of references (default: 5) |
| language | No | Query language: zh or en (default: zh) |
| year_constraint | No | Time constraint on recommended references |
| metadata | No | Arbitrary extra metadata (dict) |

Year Constraint Formats:

| Format | Example | Meaning |
| --- | --- | --- |
| Exact year | `{"exact": 2023}` | Only papers from 2023 |
| Year range | `{"min_year": 2020, "max_year": 2024}` | Papers between 2020 and 2024 |
| After a year | `{"min_year": 2020}` | Papers from 2020 onwards |
| Before a year | `{"max_year": 2015}` | Papers before 2015 |

You can download the dataset and use it directly, or create your own custom queries following the same format.
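A custom query file can be written in a few lines of Python. The topics and file name below are illustrative; only the schema fields come from the table above:

```python
import json

# Hypothetical custom queries following the dataset schema described above.
queries = [
    {
        "query": "Please recommend papers on graph neural networks for molecular property prediction.",
        "discipline": "chemistry",
        "num_refs": 5,
        "language": "en",
        "year_constraint": {"min_year": 2020, "max_year": 2024},
    },
    {
        "query": "Recommend survey papers on diffusion models.",
        "discipline": "computer_science",
        "num_refs": 3,
        "language": "en",
    },
]

# Write one JSON object per line (JSONL); a plain JSON array works as well.
with open("my_queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps(q, ensure_ascii=False) + "\n")
```

Point `dataset.path` in your config at the resulting file.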

How to Run the Evaluation

Follow this workflow to evaluate your models' reference recommendation capabilities:

Evaluation Workflow
  1. Prepare Dataset
     Download the official dataset from HuggingFace or create your own query file in JSON/JSONL format.
     ???+ example "Show Code"
    # Option 1: Use the bundled example queries
    ls cookbooks/ref_hallucination_arena/examples/queries_example.json
    
    # Option 2: Download from HuggingFace
    pip install huggingface_hub
    python -c "
    from huggingface_hub import hf_hub_download
    hf_hub_download(
        repo_id='OpenJudge/ref-hallucination-arena',
        filename='ref_hallucination_query.json',
        repo_type='dataset',
        local_dir='./data'
    )
    "
    
    Or download directly from [HuggingFace](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).
  2. Configure Endpoints
     Create a YAML configuration file defining target models, verification settings, and output options.
     ???+ example "Show Code"
    task:
      description: "Evaluate LLM reference recommendation capabilities"
    
    dataset:
      path: "./data/queries.json"
      shuffle: false        # Whether to shuffle queries before evaluation
      max_queries: null      # Max number of queries to use (null = use all)
    
    target_endpoints:
      # Bare mode (default): direct LLM call
      model_a:
        base_url: "https://api.example.com/v1"
        api_key: "${MODEL_A_API_KEY}"
        model: "model-a"
        system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format."
        max_concurrency: 5   # Per-endpoint concurrency (default: 5)
        extra_params:
          temperature: 0.3
    
      model_b:
        base_url: "https://api.example.com/v1"
        api_key: "${MODEL_B_API_KEY}"
        model: "model-b"
        system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format."
        max_concurrency: 8
        extra_params:
          temperature: 0.3
    
      # Tool-augmented mode: ReAct agent with Tavily web search
      model_b_with_tools:
        base_url: "https://api.example.com/v1"
        api_key: "${MODEL_B_API_KEY}"
        model: "model-b"
        extra_params:
          temperature: 0.3
        tool_config:
          enabled: true
          tavily_api_key: "${TAVILY_API_KEY}"
          max_iterations: 10        # ReAct iterations (1-30, default: 10)
          search_depth: "advanced"   # "basic" or "advanced"
    
    verification:
      max_workers: 10
      crossref_mailto: ""    # Email for Crossref polite pool
      pubmed_api_key: ""     # PubMed API key for higher rate limit
      timeout: 30            # Per-request timeout in seconds
    
    evaluation:
      timeout: 120           # Model API request timeout in seconds
      retry_times: 3         # Number of retry attempts
    
    output:
      output_dir: "./evaluation_results/ref_hallucination_arena"
    
    report:
      enabled: true
      language: "en"
    
  3. Run Evaluation
     Execute the pipeline via the CLI or the Python API. The pipeline supports checkpoint resume for long-running evaluations.
     ???+ example "Show Code"
     === "CLI"
    # Run evaluation with config file
    python -m cookbooks.ref_hallucination_arena --config config.yaml --save
    
    # Resume from checkpoint (default behavior)
    python -m cookbooks.ref_hallucination_arena --config config.yaml --save
    
    # Start fresh, ignore checkpoint
    python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save
    
    === "Python API"
    import asyncio
    from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline
    
    async def main():
        pipeline = RefArenaPipeline.from_config("config.yaml")
        result = await pipeline.evaluate()
    
        # Print rankings
        for rank, (model, score) in enumerate(result.rankings, 1):
            print(f"{rank}. {model}: {score:.1%}")
    
    asyncio.run(main())
    

Environment Variables

Use ${ENV_VAR} syntax in YAML config to reference environment variables for API keys. Never hardcode sensitive credentials in configuration files.
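For example, export the keys before launching the pipeline (the values shown are placeholders):

```shell
# Keys referenced by ${...} placeholders in the YAML config.
export MODEL_A_API_KEY="sk-placeholder-a"
export MODEL_B_API_KEY="sk-placeholder-b"
# Only needed for tool-augmented mode.
export TAVILY_API_KEY="tvly-placeholder"
```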

Interpreting Results

The primary metric is overall accuracy (also called verification rate)—the percentage of references where title, author, and year all exactly match a real paper. Models are ranked by: overall accuracy → year compliance rate → average confidence → completeness (descending):

============================================================
REFERENCE HALLUCINATION ARENA - RANKINGS
============================================================
  1. Model A          [################----] overall=78.4%  title=85.2%  author=80.1%  doi=52.3%  refs=50
  2. Model B          [###############-----] overall=75.2%  title=82.0%  author=77.5%  doi=48.7%  refs=50
  3. Model C          [##############------] overall=72.8%  title=80.3%  author=75.0%  doi=45.2%  refs=50
  4. Model D          [#############-------] overall=69.5%  title=78.1%  author=72.8%  doi=41.5%  refs=50
============================================================
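The ranking rule (overall accuracy, then year compliance, then average confidence, then completeness, all descending) amounts to a lexicographic tuple sort. A minimal sketch follows; the metric names and values here are illustrative, not the pipeline's actual data structures:

```python
# Illustrative per-model metrics; keys are assumptions for this sketch.
scores = {
    "model_a": {"overall": 0.784, "year_rate": 0.90, "confidence": 0.81, "completeness": 0.99},
    "model_b": {"overall": 0.784, "year_rate": 0.95, "confidence": 0.78, "completeness": 0.97},
    "model_c": {"overall": 0.728, "year_rate": 0.88, "confidence": 0.80, "completeness": 1.00},
}

# Sort descending on each tie-break key in priority order.
ranked = sorted(
    scores,
    key=lambda m: (
        scores[m]["overall"],
        scores[m]["year_rate"],
        scores[m]["confidence"],
        scores[m]["completeness"],
    ),
    reverse=True,
)
print(ranked)  # model_b breaks the accuracy tie on year compliance
```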

Benchmark Leaderboard

For real-world evaluation results on mainstream LLMs, visit the OpenJudge Leaderboard.

Interpretation:

  • > 75% — Excellent: Model rarely hallucinates references
  • 60-75% — Good: Most references are real, but some fabrication occurs
  • 40-60% — Fair: Significant hallucination, use with caution
  • < 40% — Poor: Model frequently fabricates references

Beyond overall rates, examine per-field accuracy for fine-grained insight:

Per-Field Accuracy (Model A):
  Title Accuracy  : 82.3%    # Percentage of titles matching real papers
  Author Accuracy : 68.5%    # Percentage of correct author lists
  Year Accuracy   : 71.2%    # Percentage of correct publication years
  DOI Accuracy    : 45.8%    # Percentage of valid DOIs

This breakdown reveals that models may get titles right but fabricate author names or DOIs—a common pattern where the model "remembers" a paper's topic but not its exact metadata.

Per-Discipline Performance shows which academic fields are most challenging:

Per-Discipline Overall Accuracy:
  Computer Science  : 81.2%
  Biomedical        : 74.5%
  Physics           : 70.3%
  Chemistry         : 65.8%
  Social Science    : 58.1%

Error Analysis

Analyze verification results to understand hallucination patterns and guide model selection:

Verification Status Categories

Each reference receives one of four verification statuses:

| Status | Meaning | Typical Cause |
| --- | --- | --- |
| VERIFIED | Reference confirmed as real | Paper found in academic databases with title, author, and year all strictly matching |
| SUSPECT | Partial match found | Title similar but author/year mismatch; may be a real paper with wrong details |
| NOT_FOUND | No match in any database | Likely a fabricated reference, or a real paper with incorrect metadata |
| ERROR | Verification failed | API timeout, rate limiting, or network issues |

Note: Under the current strict verification logic, a reference is only marked VERIFIED when all provided fields (title, author, year) exactly match a real paper. Partial matches (e.g., correct title but wrong authors) are counted as NOT_FOUND with match details preserved for per-field accuracy analysis.

Common Hallucination Patterns

| Pattern | Description | Detection |
| --- | --- | --- |
| Plausible fabrication | Paper sounds real but does not exist | High title similarity to real papers but no exact match |
| Author swapping | Correct paper title but wrong authors | Title verified but author accuracy low |
| Year shifting | Real paper but wrong publication year | Title/author match but year mismatch |
| DOI invention | Fabricated DOI that follows a valid format | DOI format is correct but resolves to nothing |
| Journal confusion | Real paper attributed to the wrong venue | Paper exists but was published in a different journal |

Programmatic Error Analysis

import json

# Load verification results
with open("evaluation_results/ref_hallucination_arena/verification_results.json") as f:
    results = json.load(f)

# Analyze hallucination patterns per model
for model_name, model_results in results.items():
    total = sum(r["total_refs"] for r in model_results)
    verified = sum(r["verified"] for r in model_results)
    not_found = sum(r["not_found"] for r in model_results)

    print(f"\n{model_name}:")
    print(f"  Total refs: {total}")
    print(f"  Verified: {verified} ({verified/total:.1%})")
    print(f"  Not found: {not_found} ({not_found/total:.1%})")

    # Per-discipline breakdown
    by_discipline = {}
    for r in model_results:
        d = r.get("discipline", "unknown")
        if d not in by_discipline:
            by_discipline[d] = {"total": 0, "verified": 0}
        by_discipline[d]["total"] += r["total_refs"]
        by_discipline[d]["verified"] += r["verified"]

    for d, stats in by_discipline.items():
        rate = stats["verified"] / stats["total"] if stats["total"] > 0 else 0
        print(f"  {d}: {rate:.1%} ({stats['verified']}/{stats['total']})")

Improving Model Performance

Based on error analysis, consider these strategies:

| Error Pattern | Root Cause | Solution |
| --- | --- | --- |
| Low verification rate overall | Model lacks factual grounding | Enable tool-augmented mode with web search, or use RAG-capable models |
| High NOT_FOUND rate with partial matches | Partial knowledge of papers | Strengthen the system prompt to require exact metadata |
| Poor DOI accuracy | DOIs are hard to memorize | Ask models to omit DOIs if uncertain |
| Discipline-specific weakness | Domain knowledge gaps | Use domain-specialized models for specific fields |
| Year constraint violations | Model ignores temporal restrictions | Emphasize time constraints in the prompt |
| Tool mode reaches max iterations | Insufficient search depth | Increase max_iterations in tool_config (up to 30) |

Output Files

All results are saved to the configured output directory:

evaluation_results/ref_hallucination_arena/
├── evaluation_report.md          # Detailed Markdown report (bilingual zh/en)
├── evaluation_results.json       # Final rankings, per-field accuracy, and scores
├── verification_chart.png        # Per-field accuracy breakdown bar chart (Title/Author/Year/DOI/Overall)
├── discipline_chart.png          # Per-discipline overall accuracy grouped bar chart
├── queries.json                  # Loaded evaluation queries
├── responses.json                # Raw model responses
├── extracted_refs.json           # Extracted BibTeX references
├── verification_results.json     # Detailed per-reference verification results
└── checkpoint.json               # Pipeline checkpoint for resume

Advanced Topics

The CompositeVerifier checks references against four academic databases with discipline-aware routing:

| Source | Coverage | Best For |
| --- | --- | --- |
| Crossref | Broadest coverage (130M+ records) | General academic papers with DOIs |
| PubMed | Biomedical and life sciences | Medical, biological, and health papers |
| arXiv | Preprints in STEM fields | Computer science, physics, mathematics |
| DBLP | Computer science bibliography | CS conferences and journals |

A reference is marked as VERIFIED only when all of the following strict checks pass against a real paper in any of the four sources:

  • Title: Normalized exact match (lowercase, strip punctuation/HTML, compare word sequences)
  • Author: Every author last name the model provides must appear in the real author list
  • Year: Publication year must be identical

The verification order depends on the query's discipline. For example, biomedical queries check Crossref → PubMed → arXiv → DBLP, while computer_science queries check Crossref → DBLP → arXiv → PubMed. When a DOI is present, Crossref is always tried first regardless of discipline.
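The strict checks above can be sketched in a few lines. This is an illustration of the matching semantics, not the CompositeVerifier's actual implementation; the normalization details (tag stripping, punctuation handling, last-name extraction) are assumptions:

```python
import re

def normalize_title(title: str) -> list:
    """Lowercase, strip HTML tags and punctuation, compare as a word sequence."""
    title = re.sub(r"<[^>]+>", " ", title)          # drop HTML tags
    title = re.sub(r"[^\w\s]", " ", title.lower())  # drop punctuation
    return title.split()

def strict_match(ref: dict, paper: dict) -> bool:
    """Sketch of the three strict checks: title, authors, year."""
    if normalize_title(ref["title"]) != normalize_title(paper["title"]):
        return False
    # Every author last name the model provided must appear among the
    # real paper's author last names.
    real_last_names = {a.split()[-1].lower() for a in paper["authors"]}
    if not all(a.split()[-1].lower() in real_last_names for a in ref["authors"]):
        return False
    return ref["year"] == paper["year"]

ref = {"title": "Attention Is All You Need!",
       "authors": ["Ashish Vaswani", "Noam Shazeer"],
       "year": 2017}
paper = {"title": "Attention is all you need",
         "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
         "year": 2017}
print(strict_match(ref, paper))  # True: titles normalize equal, author subset matches, years equal
```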

Increase Verification Rate Limits

Provide optional credentials to get higher API rate limits:

verification:
  crossref_mailto: "your-email@example.com"  # Join Crossref polite pool
  pubmed_api_key: "your-pubmed-api-key"       # Higher PubMed rate limit

Evaluations automatically save fine-grained checkpoints. Both response collection (Step 2) and reference verification (Step 4) support per-item checkpointing, so interrupted runs lose at most one item of progress:

# First run (interrupted after verifying 500/1000 items)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (automatically picks up at item 501)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh (ignore checkpoint)
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save

Checkpoint stages: QUERIES_LOADED → RESPONSES_COLLECTING → RESPONSES_COLLECTED → REFS_EXTRACTED → VERIFICATION_IN_PROGRESS → VERIFICATION_COMPLETE → EVALUATION_COMPLETE

The pipeline supports an optional tool-augmented mode where models use a ReAct agent with Tavily web search to find and verify real papers before recommending them. This enables direct comparison of "bare model" vs. "tool-augmented" hallucination rates for the same model.

target_endpoints:
  # Same model, bare mode (no tools)
  model_a_bare:
    base_url: "https://api.example.com/v1"
    api_key: "${MODEL_A_API_KEY}"
    model: "model-a"
    extra_params:
      temperature: 0.3

  # Same model, tool-augmented mode
  model_a_with_tools:
    base_url: "https://api.example.com/v1"
    api_key: "${MODEL_A_API_KEY}"
    model: "model-a"
    extra_params:
      temperature: 0.3
    tool_config:
      enabled: true
      tavily_api_key: "${TAVILY_API_KEY}"
      max_iterations: 10
      search_depth: "advanced"

| Parameter | Default | Description |
| --- | --- | --- |
| enabled | false | Set to true to activate tool-augmented mode |
| tavily_api_key | null | Tavily API key (falls back to the TAVILY_API_KEY environment variable) |
| max_iterations | 10 | Maximum ReAct reasoning iterations (1–30) |
| search_depth | "advanced" | Tavily search depth: "basic" or "advanced" |

When the ReAct agent exhausts its iterations without producing BibTeX output, the pipeline automatically runs a fallback summarization step—one additional LLM call without tools—so the model can synthesize all gathered search results into proper BibTeX format.

Separate Prompts for Tool Mode

When no custom system_prompt is set, the pipeline automatically uses a different default prompt for tool-augmented mode that instructs the model to search and verify papers before recommending them.

The system prompt controls how models format their reference output. Use the {num_refs} placeholder to dynamically insert the expected number of references:

target_endpoints:
  my_model:
    base_url: "https://api.example.com/v1"
    api_key: "${API_KEY}"
    model: "my-model"
    system_prompt: |
      You are an academic literature recommendation expert.
      Based on the user's research topic, recommend {num_refs}
      real, high-quality academic papers. Output each paper in
      standard BibTeX format with title, author, year,
      journal/booktitle, and doi fields.

BibTeX Format is Critical

The pipeline extracts references using BibTeX parsing. Ensure your system prompt explicitly requests BibTeX-formatted output for reliable extraction.

When no custom system_prompt is provided, the pipeline uses built-in defaults in both Chinese and English, selected based on the query's language field. Tool-augmented mode uses separate default prompts that include instructions for web search.
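The placeholder substitution itself is simple string templating. Whether the pipeline uses Python's `str.format` internally is an assumption; this only illustrates the semantics of `{num_refs}`:

```python
# Template as it would appear in the YAML config's system_prompt.
template = ("You are an academic literature recommendation expert. "
            "Recommend {num_refs} real papers in BibTeX format.")

# Each query's num_refs field (default 5) fills the placeholder.
prompt = template.format(num_refs=5)
print(prompt)
```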

Generate a comprehensive Markdown report with concrete examples:

report:
  enabled: true        # Enable report generation
  language: "zh"       # "zh" (Chinese) or "en" (English)
  include_examples: 3  # Examples per section (1-10)
  chart:
    enabled: true          # Generate visualization charts
    orientation: "vertical"  # "horizontal" or "vertical"
    show_values: true      # Show values on bars
    highlight_best: true   # Highlight best-performing model

The report includes Executive Summary, Per-Field Accuracy Breakdown, Model Rankings, Per-Discipline Analysis, Verification Source Distribution, and Representative Cases.

Best Practices

Do

  • Use the official dataset from HuggingFace for reproducible and comparable results
  • Set temperature: 0.3 or lower for more deterministic reference generation
  • Provide crossref_mailto to join the Crossref polite pool for better rate limits
  • Use --save flag to persist all intermediate results for later analysis
  • Include diverse disciplines in your evaluation queries for comprehensive assessment
  • Use {num_refs} placeholder in system prompts to control reference count
  • Use tool-augmented mode to compare bare vs. search-assisted hallucination rates for the same model
  • Set per-endpoint max_concurrency based on each provider's rate limit

Don't

  • Set max_concurrency too high—this may trigger API rate limits on verification services
  • Skip checkpoint resumption for large-scale evaluations (hundreds of queries × many models)
  • Compare models with different system prompts unless intentionally testing prompt effects
  • Ignore per-discipline results—aggregate scores can mask discipline-specific weaknesses
  • Set tool_config.max_iterations too high for tool-augmented mode—this increases latency and cost significantly

Next Steps