Evaluate how accurately LLMs recommend real academic references. This benchmark verifies every paper citation against four authoritative academic databases—Crossref, PubMed, arXiv, and DBLP—providing objective, reproducible metrics to measure reference hallucination across models and disciplines.
## What is Reference Hallucination Arena?
Reference Hallucination Arena is a benchmark designed to evaluate LLMs' ability to recommend real, verifiable academic papers. Unlike subjective evaluation tasks, this benchmark uses fully automated, objective verification: every reference generated by a model is checked against real-world academic databases.
The benchmark addresses a critical problem: when researchers ask LLMs for literature recommendations, models frequently "hallucinate" references—generating papers that sound plausible but do not actually exist. Reference Hallucination Arena quantifies this phenomenon across multiple models and academic disciplines.
The official evaluation dataset is available on HuggingFace: [OpenJudge/ref-hallucination-arena](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).
**Key Features:**
| Feature | Description |
|---|---|
| Multi-source Verification | Cross-validates references against Crossref, PubMed, arXiv, and DBLP |
| Multi-discipline Coverage | Supports Computer Science, Biomedical, Physics, Chemistry, Social Science, Interdisciplinary, and more |
| Field-level Accuracy | Checks title, author, year, and DOI individually for fine-grained analysis |
| Strict Verification | All fields (title, author, year) must exactly match a real paper to count as VERIFIED |
| Tool-augmented Mode | Optional ReAct agent with Tavily web search to compare bare vs. tool-augmented hallucination rates |
| Year Constraint Support | Tests whether models respect temporal constraints (e.g., "papers after 2020") |
| Checkpoint Resume | Fine-grained per-item checkpointing for long-running evaluations |
| Objective Metrics | No subjective judgment—all scores are based on verifiable facts |
The evaluation pipeline consists of six automated steps:
| Step | Component | Description |
|---|---|---|
| 1 | `DatasetLoader` | Load evaluation queries from JSON/JSONL dataset |
| 2 | `ResponseCollector` | Collect BibTeX-formatted responses from target models (bare mode or tool-augmented ReAct mode) |
| 3 | `BibExtractor` | Extract structured references from model responses |
| 4 | `CompositeVerifier` | Verify each reference against Crossref/PubMed/arXiv/DBLP |
| 5 | `ObjectiveScorer` + `RankingCalculator` | Compute verification metrics and rank models |
| 6 | `RefReportGenerator` + `RefChartGenerator` | Generate detailed report and visualization charts |
## Dataset
The evaluation dataset is hosted on HuggingFace: [OpenJudge/ref-hallucination-arena](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).
Each query item in the dataset follows this schema:
```json
{
  "query": "Please recommend papers on Transformer architectures for NLP.",
  "discipline": "computer_science",
  "num_refs": 5,
  "language": "en",
  "year_constraint": {"min_year": 2020}
}
```
| Field | Required | Description |
|---|---|---|
| `query` | Yes | The prompt text for reference recommendation |
| `discipline` | No | Academic discipline for verification routing |
| `num_refs` | No | Expected number of references (default: 5) |
| `language` | No | Query language: `zh` or `en` (default: `zh`) |
| `year_constraint` | No | Time constraint on recommended references |
| `metadata` | No | Arbitrary extra metadata (dict) |
**Year Constraint Formats:**

| Format | Example | Meaning |
|---|---|---|
| Exact year | `{"exact": 2023}` | Only papers from 2023 |
| Year range | `{"min_year": 2020, "max_year": 2024}` | Papers between 2020 and 2024 |
| After a year | `{"min_year": 2020}` | Papers from 2020 onwards |
| Before a year | `{"max_year": 2015}` | Papers before 2015 |
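In code, interpreting a `year_constraint` dict reduces to a few comparisons. A minimal sketch of the semantics in the table above (the `satisfies_year_constraint` helper is illustrative, not part of the pipeline API):

```python
def satisfies_year_constraint(year: int, constraint: dict | None) -> bool:
    """Check a publication year against a year_constraint dict (illustrative helper)."""
    if not constraint:
        return True  # No constraint: any year passes
    if "exact" in constraint:
        return year == constraint["exact"]
    # Range or open-ended bounds; a missing bound means "no limit" on that side
    if year < constraint.get("min_year", year):
        return False
    if year > constraint.get("max_year", year):
        return False
    return True

assert satisfies_year_constraint(2023, {"exact": 2023})
assert satisfies_year_constraint(2022, {"min_year": 2020, "max_year": 2024})
assert not satisfies_year_constraint(2016, {"max_year": 2015})
```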
You can download the dataset and use it directly, or create your own custom queries following the same format.
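For example, a short script that writes a small custom query file in this schema (the topics and file name here are purely illustrative):

```python
import json

queries = [
    {
        "query": "Please recommend papers on graph neural networks for drug discovery.",
        "discipline": "biomedical",
        "num_refs": 5,
        "language": "en",
        "year_constraint": {"min_year": 2021},
    },
    {
        "query": "Please recommend papers on diffusion models for image generation.",
        "discipline": "computer_science",
        "num_refs": 3,
        "language": "en",
    },
]

# JSONL: one query object per line (a plain JSON list also works)
with open("my_queries.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps(q, ensure_ascii=False) + "\n")
```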
## How to Run the Evaluation
Follow this workflow to evaluate your models' reference recommendation capabilities:
- **Prepare Dataset**

    Download the official dataset from HuggingFace or create your own query file in JSON/JSONL format.

    ???+ example "Show Code"

        ```bash
        # Option 1: Use the bundled example queries
        ls cookbooks/ref_hallucination_arena/examples/queries_example.json

        # Option 2: Download from HuggingFace
        pip install huggingface_hub
        python -c "
        from huggingface_hub import hf_hub_download
        hf_hub_download(
            repo_id='OpenJudge/ref-hallucination-arena',
            filename='ref_hallucination_query.json',
            repo_type='dataset',
            local_dir='./data'
        )
        "
        ```

        Or download directly from [HuggingFace](https://huggingface.co/datasets/OpenJudge/ref-hallucination-arena).

- **Configure Endpoints**

    Create a YAML configuration file defining target models, verification settings, and output options.

    ???+ example "Show Code"

        ```yaml
        task:
          description: "Evaluate LLM reference recommendation capabilities"

        dataset:
          path: "./data/queries.json"
          shuffle: false    # Whether to shuffle queries before evaluation
          max_queries: null # Max number of queries to use (null = use all)

        target_endpoints:
          # Bare mode (default): direct LLM call
          model_a:
            base_url: "https://api.example.com/v1"
            api_key: "${MODEL_A_API_KEY}"
            model: "model-a"
            system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format."
            max_concurrency: 5 # Per-endpoint concurrency (default: 5)
            extra_params:
              temperature: 0.3

          model_b:
            base_url: "https://api.example.com/v1"
            api_key: "${MODEL_B_API_KEY}"
            model: "model-b"
            system_prompt: "You are an academic literature recommendation expert. Recommend {num_refs} real papers in BibTeX format."
            max_concurrency: 8
            extra_params:
              temperature: 0.3

          # Tool-augmented mode: ReAct agent with Tavily web search
          model_b_with_tools:
            base_url: "https://api.example.com/v1"
            api_key: "${MODEL_B_API_KEY}"
            model: "model-b"
            extra_params:
              temperature: 0.3
            tool_config:
              enabled: true
              tavily_api_key: "${TAVILY_API_KEY}"
              max_iterations: 10       # ReAct iterations (1-30, default: 10)
              search_depth: "advanced" # "basic" or "advanced"

        verification:
          max_workers: 10
          crossref_mailto: "" # Email for Crossref polite pool
          pubmed_api_key: ""  # PubMed API key for higher rate limit
          timeout: 30         # Per-request timeout in seconds

        evaluation:
          timeout: 120   # Model API request timeout in seconds
          retry_times: 3 # Number of retry attempts

        output:
          output_dir: "./evaluation_results/ref_hallucination_arena"

        report:
          enabled: true
          language: "en"
        ```

- **Run Evaluation**

    Execute the pipeline via CLI or Python API. The pipeline supports checkpoint resume for long-running evaluations.

    ???+ example "Show Code"

        === "CLI"

            ```bash
            # Run evaluation with config file
            python -m cookbooks.ref_hallucination_arena --config config.yaml --save

            # Resume from checkpoint (default behavior)
            python -m cookbooks.ref_hallucination_arena --config config.yaml --save

            # Start fresh, ignore checkpoint
            python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save
            ```

        === "Python API"

            ```python
            import asyncio

            from cookbooks.ref_hallucination_arena.pipeline import RefArenaPipeline

            async def main():
                pipeline = RefArenaPipeline.from_config("config.yaml")
                result = await pipeline.evaluate()

                # Print rankings
                for rank, (model, score) in enumerate(result.rankings, 1):
                    print(f"{rank}. {model}: {score:.1%}")

            asyncio.run(main())
            ```
!!! tip "Environment Variables"

    Use `${ENV_VAR}` syntax in the YAML config to reference environment variables for API keys. Never hardcode sensitive credentials in configuration files.
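In typical implementations, this kind of substitution is a regex pass over the raw YAML before parsing. A minimal sketch of the idea (the `load_config` helper is hypothetical; the pipeline performs its own resolution internally):

```python
import os
import re

import yaml  # pip install pyyaml

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def load_config(path: str) -> dict:
    """Load a YAML file, replacing ${ENV_VAR} tokens with values from the environment."""
    with open(path) as f:
        raw = f.read()
    # Unset variables are replaced with an empty string in this sketch
    expanded = _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), ""), raw)
    return yaml.safe_load(expanded)
```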
## Interpreting Results
The primary metric is **overall accuracy** (also called verification rate)—the percentage of references where title, author, and year all exactly match a real paper. Models are ranked by overall accuracy, with ties broken by year compliance rate, then average confidence, then completeness (all descending):
```text
============================================================
REFERENCE HALLUCINATION ARENA - RANKINGS
============================================================
1. Model A [################----] overall=78.4% title=85.2% author=80.1% doi=52.3% refs=50
2. Model B [###############-----] overall=75.2% title=82.0% author=77.5% doi=48.7% refs=50
3. Model C [##############------] overall=72.8% title=80.3% author=75.0% doi=45.2% refs=50
4. Model D [#############-------] overall=69.5% title=78.1% author=72.8% doi=41.5% refs=50
============================================================
```
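The tie-breaking order maps naturally onto a tuple sort key. An illustrative sketch (the field names are assumptions for the example, not the pipeline's actual data model):

```python
models = [
    {"name": "Model A", "overall": 0.784, "year_compliance": 0.91, "confidence": 0.88, "completeness": 0.95},
    {"name": "Model B", "overall": 0.784, "year_compliance": 0.89, "confidence": 0.85, "completeness": 0.93},
]

# Sort by overall accuracy, breaking ties left to right; all criteria descending
rankings = sorted(
    models,
    key=lambda m: (m["overall"], m["year_compliance"], m["confidence"], m["completeness"]),
    reverse=True,
)
for rank, m in enumerate(rankings, 1):
    print(f"{rank}. {m['name']}: {m['overall']:.1%}")
# Model A ranks first: equal overall accuracy, higher year compliance
```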
!!! tip "Benchmark Leaderboard"

    For real-world evaluation results on mainstream LLMs, visit the OpenJudge Leaderboard.
**Interpretation:**

- **> 75%** — Excellent: Model rarely hallucinates references
- **60–75%** — Good: Most references are real, but some fabrication occurs
- **40–60%** — Fair: Significant hallucination, use with caution
- **< 40%** — Poor: Model frequently fabricates references
Beyond overall rates, examine per-field accuracy for fine-grained insight:
```text
Per-Field Accuracy (Model A):
  Title Accuracy  : 82.3%   # Percentage of titles matching real papers
  Author Accuracy : 68.5%   # Percentage of correct author lists
  Year Accuracy   : 71.2%   # Percentage of correct publication years
  DOI Accuracy    : 45.8%   # Percentage of valid DOIs
```
This breakdown reveals that models may get titles right but fabricate author names or DOIs—a common pattern where the model "remembers" a paper's topic but not its exact metadata.
**Per-Discipline Performance** shows which academic fields are most challenging:

```text
Per-Discipline Overall Accuracy:
  Computer Science : 81.2%
  Biomedical       : 74.5%
  Physics          : 70.3%
  Chemistry        : 65.8%
  Social Science   : 58.1%
```
## Error Analysis

Analyze verification results to understand hallucination patterns and guide model selection.

### Verification Status Categories
Each reference receives one of four verification statuses:
| Status | Meaning | Typical Cause |
|---|---|---|
| VERIFIED | Reference confirmed as real | Paper found in academic databases with title, author, and year all strictly matching |
| SUSPECT | Partial match found | Title similar but author/year mismatch; may be a real paper with wrong details |
| NOT_FOUND | No match in any database | Likely a fabricated reference, or a real paper with incorrect metadata |
| ERROR | Verification failed | API timeout, rate limiting, or network issues |
!!! note

    Under the current strict verification logic, a reference is only marked `VERIFIED` when all provided fields (title, author, year) exactly match a real paper. Partial matches (e.g., correct title but wrong authors) are counted as `NOT_FOUND`, with match details preserved for per-field accuracy analysis.
### Common Hallucination Patterns
| Pattern | Description | Detection |
|---|---|---|
| Plausible fabrication | Paper sounds real but does not exist | High title similarity to real papers but no exact match |
| Author swapping | Correct paper title but wrong authors | Title verified but author accuracy low |
| Year shifting | Real paper but wrong publication year | Title/author match but year mismatch |
| DOI invention | Fabricated DOI that follows valid format | DOI format is correct but resolves to nothing |
| Journal confusion | Real paper attributed to wrong venue | Paper exists but published in different journal |
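To spot-check the "DOI invention" pattern yourself, you can query the Crossref REST API, which returns HTTP 404 for DOIs it has no record of. A small sketch (the `doi_exists_in_crossref` helper is illustrative, not part of the pipeline):

```python
import urllib.error
import urllib.parse
import urllib.request

def doi_exists_in_crossref(doi: str, timeout: float = 10.0) -> bool:
    """Return True if Crossref has a record for this DOI (illustrative spot check)."""
    url = f"https://api.crossref.org/works/{urllib.parse.quote(doi)}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404: Crossref has no record of this DOI

print(doi_exists_in_crossref("10.1000/deliberately-fake-doi"))  # Expected: False
```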
### Programmatic Error Analysis

The saved `verification_results.json` can be sliced directly by model and discipline:
```python
import json

# Load verification results
with open("evaluation_results/ref_hallucination_arena/verification_results.json") as f:
    results = json.load(f)

# Analyze hallucination patterns per model
for model_name, model_results in results.items():
    total = sum(r["total_refs"] for r in model_results)
    verified = sum(r["verified"] for r in model_results)
    not_found = sum(r["not_found"] for r in model_results)

    print(f"\n{model_name}:")
    print(f"  Total refs: {total}")
    print(f"  Verified:   {verified} ({verified/total:.1%})")
    print(f"  Not found:  {not_found} ({not_found/total:.1%})")

    # Per-discipline breakdown
    by_discipline = {}
    for r in model_results:
        d = r.get("discipline", "unknown")
        if d not in by_discipline:
            by_discipline[d] = {"total": 0, "verified": 0}
        by_discipline[d]["total"] += r["total_refs"]
        by_discipline[d]["verified"] += r["verified"]

    for d, stats in by_discipline.items():
        rate = stats["verified"] / stats["total"] if stats["total"] > 0 else 0
        print(f"  {d}: {rate:.1%} ({stats['verified']}/{stats['total']})")
```
## Improving Model Performance
Based on error analysis, consider these strategies:
| Error Pattern | Root Cause | Solution |
|---|---|---|
| Low verification rate overall | Model lacks factual grounding | Enable tool-augmented mode with web search, or use RAG-capable models |
| High `NOT_FOUND` rate with partial matches | Partial knowledge of papers | Strengthen the system prompt to require exact metadata |
| Poor DOI accuracy | DOIs are hard to memorize | Ask models to omit DOIs if uncertain |
| Discipline-specific weakness | Domain knowledge gaps | Use domain-specialized models for specific fields |
| Year constraint violations | Model ignores temporal restrictions | Emphasize time constraints in the prompt |
| Tool mode reaches max iterations | Insufficient search depth | Increase `max_iterations` in `tool_config` (up to 30) |
## Output Files
All results are saved to the configured output directory:
```text
evaluation_results/ref_hallucination_arena/
├── evaluation_report.md        # Detailed Markdown report (bilingual zh/en)
├── evaluation_results.json     # Final rankings, per-field accuracy, and scores
├── verification_chart.png      # Per-field accuracy breakdown bar chart (Title/Author/Year/DOI/Overall)
├── discipline_chart.png        # Per-discipline overall accuracy grouped bar chart
├── queries.json                # Loaded evaluation queries
├── responses.json              # Raw model responses
├── extracted_refs.json         # Extracted BibTeX references
├── verification_results.json   # Detailed per-reference verification results
└── checkpoint.json             # Pipeline checkpoint for resume
```
## Advanced Topics
The `CompositeVerifier` checks references against four academic databases with discipline-aware routing:
| Source | Coverage | Best For |
|---|---|---|
| Crossref | Broadest coverage (130M+ records) | General academic papers with DOIs |
| PubMed | Biomedical and life sciences | Medical, biological, and health papers |
| arXiv | Preprints in STEM fields | Computer science, physics, mathematics |
| DBLP | Computer science bibliography | CS conferences and journals |
A reference is marked as `VERIFIED` only when all of the following strict checks pass against a real paper in any of the four sources:

- **Title**: Normalized exact match (lowercase, strip punctuation/HTML, compare word sequences)
- **Author**: Every author last name the model provides must appear in the real author list
- **Year**: Publication year must be identical
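A minimal sketch of those three checks (the helpers and the dict shapes below are illustrative; the actual `CompositeVerifier` logic may differ in detail):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip HTML tags and punctuation, and collapse whitespace."""
    title = re.sub(r"<[^>]+>", " ", title)           # Drop HTML tags
    title = re.sub(r"[^\w\s]", " ", title.lower())   # Drop punctuation
    return " ".join(title.split())                   # Collapse whitespace

def strict_match(ref: dict, record: dict) -> bool:
    """All three checks must pass for a reference to count as VERIFIED."""
    # 1. Title: normalized exact match
    if normalize_title(ref["title"]) != normalize_title(record["title"]):
        return False
    # 2. Author: every last name the model provided must appear in the real author list
    real_last_names = {name.split()[-1].lower() for name in record["authors"]}
    if not all(name.split()[-1].lower() in real_last_names for name in ref["authors"]):
        return False
    # 3. Year: must be identical
    return ref["year"] == record["year"]
```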
The verification order depends on the query's discipline. For example, `biomedical` queries check Crossref → PubMed → arXiv → DBLP, while `computer_science` queries check Crossref → DBLP → arXiv → PubMed. When a DOI is present, Crossref is always tried first regardless of discipline.
Evaluations automatically save fine-grained checkpoints. Both response collection (Step 2) and reference verification (Step 4) support per-item checkpointing, so interrupted runs lose at most one item of progress:
```bash
# First run (interrupted after verifying 500/1000 items)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Resume from checkpoint (automatically picks up at item 501)
python -m cookbooks.ref_hallucination_arena --config config.yaml --save

# Start fresh (ignore checkpoint)
python -m cookbooks.ref_hallucination_arena --config config.yaml --fresh --save
```
Checkpoint stages: `QUERIES_LOADED` → `RESPONSES_COLLECTING` → `RESPONSES_COLLECTED` → `REFS_EXTRACTED` → `VERIFICATION_IN_PROGRESS` → `VERIFICATION_COMPLETE` → `EVALUATION_COMPLETE`
The pipeline supports an optional tool-augmented mode where models use a ReAct agent with Tavily web search to find and verify real papers before recommending them. This enables direct comparison of "bare model" vs. "tool-augmented" hallucination rates for the same model.
```yaml
target_endpoints:
  # Same model, bare mode (no tools)
  model_a_bare:
    base_url: "https://api.example.com/v1"
    api_key: "${MODEL_A_API_KEY}"
    model: "model-a"
    extra_params:
      temperature: 0.3

  # Same model, tool-augmented mode
  model_a_with_tools:
    base_url: "https://api.example.com/v1"
    api_key: "${MODEL_A_API_KEY}"
    model: "model-a"
    extra_params:
      temperature: 0.3
    tool_config:
      enabled: true
      tavily_api_key: "${TAVILY_API_KEY}"
      max_iterations: 10
      search_depth: "advanced"
```
| Parameter | Default | Description |
|---|---|---|
| `enabled` | `false` | Set to `true` to activate tool-augmented mode |
| `tavily_api_key` | `null` | Tavily API key (falls back to the `TAVILY_API_KEY` env var) |
| `max_iterations` | `10` | Maximum ReAct reasoning iterations (1–30) |
| `search_depth` | `"advanced"` | Tavily search depth: `"basic"` or `"advanced"` |
When the ReAct agent exhausts its iterations without producing BibTeX output, the pipeline automatically runs a fallback summarization step—one additional LLM call without tools—so the model can synthesize all gathered search results into proper BibTeX format.
!!! note "Separate Prompts for Tool Mode"

    When no custom `system_prompt` is set, the pipeline automatically uses a different default prompt for tool-augmented mode that instructs the model to search and verify papers before recommending them.
The system prompt controls how models format their reference output. Use the `{num_refs}` placeholder to dynamically insert the expected number of references:
```yaml
target_endpoints:
  my_model:
    base_url: "https://api.example.com/v1"
    api_key: "${API_KEY}"
    model: "my-model"
    system_prompt: |
      You are an academic literature recommendation expert.
      Based on the user's research topic, recommend {num_refs}
      real, high-quality academic papers. Output each paper in
      standard BibTeX format with title, author, year,
      journal/booktitle, and doi fields.
```
!!! warning "BibTeX Format is Critical"

    The pipeline extracts references using BibTeX parsing. Ensure your system prompt explicitly requests BibTeX-formatted output for reliable extraction.
When no custom `system_prompt` is provided, the pipeline uses built-in defaults in both Chinese and English, selected based on the query's `language` field. Tool-augmented mode uses separate default prompts that include instructions for web search.
Generate a comprehensive Markdown report with concrete examples, plus visualization charts, via the `report` and `chart` sections:
```yaml
report:
  enabled: true            # Enable report generation
  language: "zh"           # "zh" (Chinese) or "en" (English)
  include_examples: 3      # Examples per section (1-10)

chart:
  enabled: true            # Generate visualization charts
  orientation: "vertical"  # "horizontal" or "vertical"
  show_values: true        # Show values on bars
  highlight_best: true     # Highlight best-performing model
```
The report includes Executive Summary, Per-Field Accuracy Breakdown, Model Rankings, Per-Discipline Analysis, Verification Source Distribution, and Representative Cases.
## Best Practices

### Do
- Use the official dataset from HuggingFace for reproducible and comparable results
- Set `temperature: 0.3` or lower for more deterministic reference generation
- Provide `crossref_mailto` to join the Crossref polite pool for better rate limits
- Use the `--save` flag to persist all intermediate results for later analysis
- Include diverse disciplines in your evaluation queries for comprehensive assessment
- Use the `{num_refs}` placeholder in system prompts to control reference count
- Use tool-augmented mode to compare bare vs. search-assisted hallucination rates for the same model
- Set per-endpoint `max_concurrency` based on each provider's rate limit
### Don't
- Set `max_concurrency` too high—this may trigger API rate limits on verification services
- Skip checkpoint resumption for large-scale evaluations (hundreds of queries × many models)
- Compare models with different system prompts unless intentionally testing prompt effects
- Ignore per-discipline results—aggregate scores can mask discipline-specific weaknesses
- Set `tool_config.max_iterations` too high for tool-augmented mode—this increases latency and cost significantly
## Next Steps
- Auto Arena — Automatically compare models with generated queries
- Refine Data Quality — Improve model outputs using grader feedback
- Create Custom Graders — Build custom evaluation pipelines