Evaluate how accurately LLMs recommend real academic references. This benchmark verifies every paper citation against four authoritative academic databases—Crossref, PubMed, arXiv, and DBLP—providing objective, reproducible metrics to measure reference hallucination across models and disciplines.
## What is Reference Hallucination Arena?
Reference Hallucination Arena is a benchmark designed to evaluate LLMs' ability to recommend real, verifiable academic papers. Unlike subjective evaluation tasks, this benchmark uses fully automated, objective verification: every reference generated by a model is checked against real-world academic databases.
The benchmark addresses a critical problem: when researchers ask LLMs for literature recommendations, models frequently "hallucinate" references—generating papers that sound plausible but do not actually exist. Reference Hallucination Arena quantifies this phenomenon across multiple models and academic disciplines.
The official evaluation dataset is available on HuggingFace: [OpenJudge/ref-hallucination-arena](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).
**Key Features:**
| Feature | Description |
|---|---|
| Multi-source Verification | Cross-validates references against Crossref, PubMed, arXiv, and DBLP |
| Multi-discipline Coverage | Supports Computer Science, Biomedical, Physics, Chemistry, Social Science, Interdisciplinary, and more |
| Field-level Accuracy | Checks title, author, year, and DOI individually for fine-grained analysis |
| Strict Verification | All fields (title, author, year) must exactly match a real paper to count as VERIFIED |
| Tool-augmented Mode | Optional ReAct agent with Tavily web search to compare bare vs. tool-augmented hallucination rates |
| Year Constraint Support | Tests whether models respect temporal constraints (e.g., "papers after 2020") |
| Checkpoint Resume | Fine-grained per-item checkpointing for long-running evaluations |
| Objective Metrics | No subjective judgment—all scores are based on verifiable facts |
The evaluation pipeline consists of six automated steps:
| Step | Component | Description |
|---|---|---|
| 1 | `DatasetLoader` | Load evaluation queries from JSON/JSONL dataset |
| 2 | `ResponseCollector` | Collect BibTeX-formatted responses from target models (bare mode or tool-augmented ReAct mode) |
| 3 | `BibExtractor` | Extract structured references from model responses |
| 4 | `CompositeVerifier` | Verify each reference against Crossref/PubMed/arXiv/DBLP |
| 5 | `ObjectiveScorer` + `RankingCalculator` | Compute verification metrics and rank models |
| 6 | `RefReportGenerator` + `RefChartGenerator` | Generate detailed report and visualization charts |
## Dataset
The evaluation dataset is hosted on HuggingFace: [OpenJudge/ref-hallucination-arena](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).
Each query item in the dataset follows this schema:
```json
{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}
```
| Field | Required | Description |
|---|---|---|
| `query` | Yes | The prompt text for reference recommendation |
| `discipline` | No | Academic discipline for verification routing |
| `num_refs` | No | Expected number of references (default: 5) |
| `language` | No | Query language: `zh` or `en` (default: `zh`) |
| `year_constraint` | No | Time constraint on recommended references |
| `metadata` | No | Arbitrary extra metadata (dict) |
**Year Constraint Formats:**

| Format | Example | Meaning |
|---|---|---|
| Exact year | `{"exact": 2023}` | Only papers from 2023 |
| Year range | `{"min_year": 2020, "max_year": 2024}` | Papers between 2020 and 2024 |
| After a year | `{"min_year": 2020}` | Papers from 2020 onwards |
| Before a year | `{"max_year": 2015}` | Papers before 2015 |
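In code, interpreting a `year_constraint` dict reduces to a few comparisons. A minimal sketch of the semantics in the table above (the `satisfies_year_constraint` helper is illustrative, not part of the pipeline API):

```python
def satisfies_year_constraint(year: int, constraint: dict | None) -> bool:
    """Check a publication year against a year_constraint dict (illustrative helper)."""
    if not constraint:
        return True  # No constraint: any year passes
    if "exact" in constraint:
        return year == constraint["exact"]
    # Range or open-ended bounds; a missing bound means "no limit" on that side
    if year < constraint.get("min_year", year):
        return False
    if year > constraint.get("max_year", year):
        return False
    return True

assert satisfies_year_constraint(2023, {"exact": 2023})
assert satisfies_year_constraint(2022, {"min_year": 2020, "max_year": 2024})
assert not satisfies_year_constraint(2016, {"max_year": 2015})
```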
You can download the dataset and use it directly, or create your own custom queries following the same format.
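For example, a short script that writes a small custom query file in this schema (the topics and file name here are purely illustrative):

```python
import json

queries = [
    {
        "query": "Please recommend papers on graph neural networks for drug discovery.",
        "discipline": "biomedical",
        "num_refs": 5,
        "language": "en",
        "year_constraint": {"min_year": 2021},
    },
    {
        "query": "Please recommend papers on diffusion models for image generation.",
        "discipline": "computer_science",
        "num_refs": 3,
        "language": "en",
    },
]

# JSONL: one query object per line (a plain JSON list also works)
with open("my_queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps(q, ensure_ascii=False) + "\n")
```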
## How to Run the Evaluation
Follow this workflow to evaluate your models' reference recommendation capabilities:
- **Prepare Dataset**

    Download the official dataset from HuggingFace or create your own query file in JSON/JSONL format.

    ???+ example "Show Code"

        ```bash
        # Option 1: Use the bundled example queries
        ls cookbooks/ref_hallucination_arena/examples/queries_example.json

        # Option 2: Download from HuggingFace
        pip install huggingface_hub
        python -c "
        from huggingface_hub import hf_hub_download
        hf_hub_download(
            repo_id='OpenJudge/ref-hallucination-arena',
            filename='ref_hallucination_query.json',
            repo_type='dataset',
            local_dir='./data'
        )
        "
        ```

        Or download directly from [HuggingFace](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).

- **Configure Endpoints**

    Create a YAML configuration file defining target models, verification settings, and output options.

    ???+ example "Show Code"

        ```yaml
        task:
          description: "Evaluate LLM reference recommendation capabilities"

        dataset:
          path: "./data/queries.json"
          shuffle: false    # Whether to shuffle queries before evaluation
          max_queries: null # Max number of queries to use (null = use all)

        target_endpoints:
          # Bare mode (default): direct LLM call
          model_a:
            base_url: "https://api.example.com/v1"
            api_key: "${MODEL_A_API_KEY}"
            model: "model-a"
            system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format."
            max_concurrency: 5 # Per-endpoint concurrency (default: 5)
            extra_params:
              temperature: 0.3

          model_b:
            base_url: "https://api.example.com/v1"
            api_key: "${MODEL_B_API_KEY}"
            model: "model-b"
            system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format."
            max_concurrency: 8
            extra_params:
              temperature: 0.3

          # Tool-augmented mode: ReAct agent with Tavily web search
          model_b_with_tools:
            base_url: "https://api.example.com/v1"
            api_key: "${MODEL_B_API_KEY}"
            model: "model-b"
            extra_params:
              temperature: 0.3
            tool_config:
              enabled: true
              tavily_api_key: "${TAVILY_API_KEY}"
              max_iterations: 10       # ReAct iterations (1-30, default: 10)
              search_depth: "advanced" # "basic" or "advanced"

        verification:
          max_workers: 10
          crossref_mailto: "" # Email for Crossref polite pool
          pubmed_api_key: ""  # PubMed API key for higher rate limit
          timeout: 30         # Per-request timeout in seconds

        evaluation:
          timeout: 120   # Model API request timeout in seconds
          retry_times: 3 # Number of retry attempts

        output:
          output_dir: "./evaluation_results/ref_hallucination_arena"

        report:
          enabled: true
          language: "en"
        ```

- **Run Evaluation**

    Execute the pipeline via CLI or Python API. The pipeline supports checkpoint resume for long-running evaluations.

    ???+ example "Show Code"

        === "CLI"

            ```bash
            # Run evaluation with config file
            python -m cookbooks.ref_hallucination_arena --config config.yaml --save

            # Resume from checkpoint (default behavior)
            python -m cookbooks.ref_hallucination_arena --config config.yaml --save

            # Start fresh, ignore checkpoint
            python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save
            ```

        === "Python API"

            ```python
            import asyncio

            from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline

            async def main():
                pipeline = RefArenaPipeline.from_config("config.yaml")
                result = await pipeline.evaluate()

                # Print rankings
                for rank, (model, score) in enumerate(result.rankings, 1):
                    print(f"{rank}. {model}: {score:.1%}")

            asyncio.run(main())
            ```
!!! tip "Environment Variables"

    Use `${ENV_VAR}` syntax in the YAML config to reference environment variables for API keys. Never hardcode sensitive credentials in configuration files.
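In typical implementations, this kind of substitution is a regex pass over the raw YAML before parsing. A minimal sketch of the idea (the `load_config` helper is hypothetical; the pipeline performs its own resolution internally):

```python
import os
import re

import yaml  # pip install pyyaml

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def load_config(path: str) -> dict:
    """Load a YAML file, replacing ${ENV_VAR} tokens with values from the environment."""
    with open(path) as f:
        raw = f.read()
    # Unset variables are replaced with an empty string in this sketch
    expanded = _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), ""), raw)
    return yaml.safe_load(expanded)
```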
## Interpreting Results
The primary metric is **overall accuracy** (also called verification rate)—the percentage of references where title, author, and year all exactly match a real paper. Models are ranked by overall accuracy, with ties broken by year compliance rate, then average confidence, then completeness (all descending):
```text
============================================================
REFERENCE HALLUCINATION ARENA - RANKINGS
============================================================
1. Model A [################----] overall=78.4% title=85.2% author=80.1% doi=52.3% refs=50
2. Model B [###############-----] overall=75.2% title=82.0% author=77.5% doi=48.7% refs=50
3. Model C [##############------] overall=72.8% title=80.3% author=75.0% doi=45.2% refs=50
4. Model D [#############-------] overall=69.5% title=78.1% author=72.8% doi=41.5% refs=50
============================================================
```
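The tie-breaking order maps naturally onto a tuple sort key. An illustrative sketch (the field names are assumptions for the example, not the pipeline's actual data model):

```python
models = [
    {"name": "Model A", "overall": 0.784, "year_compliance": 0.91, "confidence": 0.88, "completeness": 0.95},
    {"name": "Model B", "overall": 0.784, "year_compliance": 0.89, "confidence": 0.85, "completeness": 0.93},
]

# Sort by overall accuracy, breaking ties left to right; all criteria descending
rankings = sorted(
    models,
    key=lambda m: (m["overall"], m["year_compliance"], m["confidence"], m["completeness"]),
    reverse=True,
)
for rank, m in enumerate(rankings, 1):
    print(f"{rank}. {m['name']}: {m['overall']:.1%}")
# Model A ranks first: equal overall accuracy, higher year compliance
```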
!!! tip "Benchmark Leaderboard"

    For real-world evaluation results on mainstream LLMs, visit the OpenJudge Leaderboard.
**Interpretation:**

- **> 75%** — Excellent: Model rarely hallucinates references
- **60–75%** — Good: Most references are real, but some fabrication occurs
- **40–60%** — Fair: Significant hallucination, use with caution
- **< 40%** — Poor: Model frequently fabricates references
Beyond overall rates, examine per-field accuracy for fine-grained insight:
```text
Per-Field Accuracy (Model A):
  Title Accuracy  : 82.3%   # Percentage of titles matching real papers
  Author Accuracy : 68.5%   # Percentage of correct author lists
  Year Accuracy   : 71.2%   # Percentage of correct publication years
  DOI Accuracy    : 45.8%   # Percentage of valid DOIs
```
This breakdown reveals that models may get titles right but fabricate author names or DOIs—a common pattern where the model "remembers" a paper's topic but not its exact metadata.
**Per-Discipline Performance** shows which academic fields are most challenging:

```text
Per-Discipline Overall Accuracy:
  Computer Science : 81.2%
  Biomedical       : 74.5%
  Physics          : 70.3%
  Chemistry        : 65.8%
  Social Science   : 58.1%
```
## Error Analysis

Analyze verification results to understand hallucination patterns and guide model selection.

### Verification Status Categories
Each reference receives one of four verification statuses:
| Status | Meaning | Typical Cause |
|---|---|---|
| VERIFIED | Reference confirmed as real | Paper found in academic databases with title, author, and year all strictly matching |
| SUSPECT | Partial match found | Title similar but author/year mismatch; may be a real paper with wrong details |
| NOT_FOUND | No match in any database | Likely a fabricated reference, or a real paper with incorrect metadata |
| ERROR | Verification failed | API timeout, rate limiting, or network issues |
!!! note

    Under the current strict verification logic, a reference is only marked `VERIFIED` when all provided fields (title, author, year) exactly match a real paper. Partial matches (e.g., correct title but wrong authors) are counted as `NOT_FOUND`, with match details preserved for per-field accuracy analysis.
### Common Hallucination Patterns
| Pattern | Description | Detection |
|---|---|---|
| Plausible fabrication | Paper sounds real but does not exist | High title similarity to real papers but no exact match |
| Author swapping | Correct paper title but wrong authors | Title verified but author accuracy low |
| Year shifting | Real paper but wrong publication year | Title/author match but year mismatch |
| DOI invention | Fabricated DOI that follows valid format | DOI format is correct but resolves to nothing |
| Journal confusion | Real paper attributed to wrong venue | Paper exists but published in different journal |
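To spot-check the "DOI invention" pattern yourself, you can query the Crossref REST API, which returns HTTP 404 for DOIs it has no record of. A small sketch (the `doi_exists_in_crossref` helper is illustrative, not part of the pipeline):

```python
import urllib.error
import urllib.parse
import urllib.request

def doi_exists_in_crossref(doi: str, timeout: float = 10.0) -> bool:
    """Return True if Crossref has a record for this DOI (illustrative spot check)."""
    url = f"https://api.crossref.org/works/{urllib.parse.quote(doi)}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404: Crossref has no record of this DOI

print(doi_exists_in_crossref("10.1000/deliberately-fake-doi"))  # Expected: False
```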
### Programmatic Error Analysis

The saved `verification_results.json` can be sliced directly by model and discipline:
```python
import json

# Load verification results
with open("evaluation_results/ref_hallucination_arena/verification_results.json") as f:
    results = json.load(f)

# Analyze hallucination patterns per model
for model_name, model_results in results.items():
    total = sum(r["total_refs"] for r in model_results)
    verified = sum(r["verified"] for r in model_results)
    not_found = sum(r["not_found"] for r in model_results)

    print(f"\n{model_name}:")
    print(f"  Total refs: {total}")
    print(f"  Verified:   {verified} ({verified/total:.1%})")
    print(f"  Not found:  {not_found} ({not_found/total:.1%})")

    # Per-discipline breakdown
    by_discipline = {}
    for r in model_results:
        d = r.get("discipline", "unknown")
        if d not in by_discipline:
            by_discipline[d] = {"total": 0, "verified": 0}
        by_discipline[d]["total"] += r["total_refs"]
        by_discipline[d]["verified"] += r["verified"]

    for d, stats in by_discipline.items():
        rate = stats["verified"] / stats["total"] if stats["total"] > 0 else 0
        print(f"  {d}: {rate:.1%} ({stats['verified']}/{stats['total']})")
```
## Improving Model Performance
Based on error analysis, consider these strategies:
| Error Pattern | Root Cause | Solution |
|---|---|---|
| Low verification rate overall | Model lacks factual grounding | Enable tool-augmented mode with web search, or use RAG-capable models |
| High `NOT_FOUND` rate with partial matches | Partial knowledge of papers | Strengthen the system prompt to require exact metadata |
| Poor DOI accuracy | DOIs are hard to memorize | Ask models to omit DOIs if uncertain |
| Discipline-specific weakness | Domain knowledge gaps | Use domain-specialized models for specific fields |
| Year constraint violations | Model ignores temporal restrictions | Emphasize time constraints in the prompt |
| Tool mode reaches max iterations | Insufficient search depth | Increase `max_iterations` in `tool_config` (up to 30) |
## Output Files
All results are saved to the configured output directory:
```text
evaluation_results/ref_hallucination_arena/
├── evaluation_report.md        # Detailed Markdown report (bilingual zh/en)
├── evaluation_results.json     # Final rankings, per-field accuracy, and scores
├── verification_chart.png      # Per-field accuracy breakdown bar chart (Title/Author/Year/DOI/Overall)
├── discipline_chart.png        # Per-discipline overall accuracy grouped bar chart
├── queries.json                # Loaded evaluation queries
├── responses.json              # Raw model responses
├── extracted_refs.json         # Extracted BibTeX references
├── verification_results.json   # Detailed per-reference verification results
└── checkpoint.json             # Pipeline checkpoint for resume
```
## Advanced Topics
The `CompositeVerifier` checks references against four academic databases with discipline-aware routing:
| Source | Coverage | Best For |
|---|---|---|
| Crossref | Broadest coverage (130M+ records) | General academic papers with DOIs |
| PubMed | Biomedical and life sciences | Medical, biological, and health papers |
| arXiv | Preprints in STEM fields | Computer science, physics, mathematics |
| DBLP | Computer science bibliography | CS conferences and journals |
A reference is marked as `VERIFIED` only when all of the following strict checks pass against a real paper in any of the four sources:

- **Title**: Normalized exact match (lowercase, strip punctuation/HTML, compare word sequences)
- **Author**: Every author last name the model provides must appear in the real author list
- **Year**: Publication year must be identical
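A minimal sketch of those three checks (the helpers and the dict shapes below are illustrative; the actual `CompositeVerifier` logic may differ in detail):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip HTML tags and punctuation, and collapse whitespace."""
    title = re.sub(r"<[^>]+>", " ", title)           # Drop HTML tags
    title = re.sub(r"[^\w\s]", " ", title.lower())   # Drop punctuation
    return " ".join(title.split())                   # Collapse whitespace

def strict_match(ref: dict, record: dict) -> bool:
    """All three checks must pass for a reference to count as VERIFIED."""
    # 1. Title: normalized exact match
    if normalize_title(ref["title"]) != normalize_title(record["title"]):
        return False
    # 2. Author: every last name the model provided must appear in the real author list
    real_last_names = {name.split()[-1].lower() for name in record["authors"]}
    if not all(name.split()[-1].lower() in real_last_names for name in ref["authors"]):
        return False
    # 3. Year: must be identical
    return ref["year"] == record["year"]
```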
The verification order depends on the query's discipline. For example, `biomedical` queries check Crossref → PubMed → arXiv → DBLP, while `computer_science` queries check Crossref → DBLP → arXiv → PubMed. When a DOI is present, Crossref is always tried first regardless of discipline.
Evaluations automatically save fine-grained checkpoints. Both response collection (Step 2) and reference verification (Step 4) support per-item checkpointing, so interrupted runs lose at most one item of progress:
```bash
# First run (interrupted after verifying 500/1000 items)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (automatically picks up at item 501)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh (ignore checkpoint)
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save
```
Checkpoint stages: `QUERIES_LOADED` → `RESPONSES_COLLECTING` → `RESPONSES_COLLECTED` → `REFS_EXTRACTED` → `VERIFICATION_IN_PROGRESS` → `VERIFICATION_COMPLETE` → `EVALUATION_COMPLETE`
The pipeline supports an optional tool-augmented mode where models use a ReAct agent with Tavily web search to find and verify real papers before recommending them. This enables direct comparison of "bare model" vs. "tool-augmented" hallucination rates for the same model.
```yaml
target_endpoints:
  # Same model, bare mode (no tools)
  model_a_bare:
    base_url: "https://api.example.com/v1"
    api_key: "${MODEL_A_API_KEY}"
    model: "model-a"
    extra_params:
      temperature: 0.3

  # Same model, tool-augmented mode
  model_a_with_tools:
    base_url: "https://api.example.com/v1"
    api_key: "${MODEL_A_API_KEY}"
    model: "model-a"
    extra_params:
      temperature: 0.3
    tool_config:
      enabled: true
      tavily_api_key: "${TAVILY_API_KEY}"
      max_iterations: 10
      search_depth: "advanced"
```
| Parameter | Default | Description |
|---|---|---|
| `enabled` | `false` | Set to `true` to activate tool-augmented mode |
| `tavily_api_key` | `null` | Tavily API key (falls back to the `TAVILY_API_KEY` env var) |
| `max_iterations` | `10` | Maximum ReAct reasoning iterations (1–30) |
| `search_depth` | `"advanced"` | Tavily search depth: `"basic"` or `"advanced"` |
When the ReAct agent exhausts its iterations without producing BibTeX output, the pipeline automatically runs a fallback summarization step—one additional LLM call without tools—so the model can synthesize all gathered search results into proper BibTeX format.
!!! note "Separate Prompts for Tool Mode"

    When no custom `system_prompt` is set, the pipeline automatically uses a different default prompt for tool-augmented mode that instructs the model to search and verify papers before recommending them.
The system prompt controls how models format their reference output. Use the `{num_refs}` placeholder to dynamically insert the expected number of references:
```yaml
target_endpoints:
  my_model:
    base_url: "https://api.example.com/v1"
    api_key: "${API_KEY}"
    model: "my-model"
    system_prompt: |
      You are an academic literature recommendation expert.
      Based on the user's research topic, recommend {num_refs}
      real, high-quality academic papers. Output each paper in
      standard BibTeX format with title, author, year,
      journal/booktitle, and doi fields.
```
!!! warning "BibTeX Format is Critical"

    The pipeline extracts references using BibTeX parsing. Ensure your system prompt explicitly requests BibTeX-formatted output for reliable extraction.
When no custom `system_prompt` is provided, the pipeline uses built-in defaults in both Chinese and English, selected based on the query's `language` field. Tool-augmented mode uses separate default prompts that include instructions for web search.
Generate a comprehensive Markdown report with concrete examples, plus visualization charts, via the `report` and `chart` sections:
```yaml
report:
  enabled: true            # Enable report generation
  language: "zh"           # "zh" (Chinese) or "en" (English)
  include_examples: 3      # Examples per section (1-10)

chart:
  enabled: true            # Generate visualization charts
  orientation: "vertical"  # "horizontal" or "vertical"
  show_values: true        # Show values on bars
  highlight_best: true     # Highlight best-performing model
```
The report includes Executive Summary, Per-Field Accuracy Breakdown, Model Rankings, Per-Discipline Analysis, Verification Source Distribution, and Representative Cases.
## Best Practices

### Do
- Use the official dataset from HuggingFace for reproducible and comparable results
- Set `temperature: 0.3` or lower for more deterministic reference generation
- Provide `crossref_mailto` to join the Crossref polite pool for better rate limits
- Use the `--save` flag to persist all intermediate results for later analysis
- Include diverse disciplines in your evaluation queries for comprehensive assessment
- Use the `{num_refs}` placeholder in system prompts to control reference count
- Use tool-augmented mode to compare bare vs. search-assisted hallucination rates for the same model
- Set per-endpoint `max_concurrency` based on each provider's rate limit
### Don't
- Set `max_concurrency` too high—this may trigger API rate limits on verification services
- Skip checkpoint resumption for large-scale evaluations (hundreds of queries × many models)
- Compare models with different system prompts unless intentionally testing prompt effects
- Ignore per-discipline results—aggregate scores can mask discipline-specific weaknesses
- Set `tool_config.max_iterations` too high for tool-augmented mode—this increases latency and cost significantly
## Next Steps
- Auto Arena — Automatically compare models with generated queries
- Refine Data Quality — Improve model outputs using grader feedback
- Create Custom Graders — Build custom evaluation pipelines