Automatically review academic papers using LLM-powered evaluation. This end-to-end pipeline performs correctness detection, scholarly review, criticality verification, and bibliography validation.
Overview
Paper Review automates the academic paper review process, providing comprehensive evaluation across multiple dimensions. Ideal for pre-submission checks, conference paper screening, research quality assurance, and reference verification.
Multimodal PDF Support
This pipeline leverages native PDF understanding capabilities of modern LLMs, eliminating the need for PDF-to-text conversion and preserving figure, table, and formula context.
The pipeline automates five evaluation phases:
| Phase | Component | Description |
|---|---|---|
| 1 | Safety Checks | Detect jailbreaking attempts and format violations |
| 2 | CorrectnessGrader | Identify logical errors, contradictions, and factual issues |
| 3 | ReviewGrader | Generate comprehensive scholarly review |
| 4 | CriticalityGrader | Verify and classify detected issues by severity |
| 5 | BibChecker | Validate bibliography entries against CrossRef |
Quick Start
Get started with Paper Review in just a few lines of code:
```python
import asyncio
import os

from cookbooks.paper_review import PaperReviewPipeline, PipelineConfig, generate_report


async def main():
    config = PipelineConfig(
        model_name="gemini-3-pro-preview",  # Recommended
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        base_url=os.environ.get("OPENAI_BASE_URL", ""),  # OpenAI-compatible proxy
        timeout=1500,
        enable_safety_checks=True,
        enable_correctness=True,
        enable_review=True,
        enable_criticality=True,
    )

    pipeline = PaperReviewPipeline(config)

    # Review paper and generate report
    result, report = await pipeline.review_and_report(
        pdf_input="paper.pdf",
        paper_name="My Research Paper",
        output_path="review_report.md",
    )

    print(f"Review Score: {result.review.score}/6")
    print("Report saved to: review_report.md")


asyncio.run(main())
```
To validate references against CrossRef, enable bibliography verification and provide a `.bib` file:

```python
import asyncio
import os

from cookbooks.paper_review import PaperReviewPipeline, PipelineConfig


async def main():
    config = PipelineConfig(
        model_name="gemini-3-pro-preview",  # Recommended
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        base_url=os.environ.get("OPENAI_BASE_URL", ""),  # OpenAI-compatible proxy
        enable_bib_verification=True,
        crossref_mailto="your-email@example.com",  # For CrossRef API
    )

    pipeline = PaperReviewPipeline(config)

    result = await pipeline.review_paper(
        pdf_input="paper.pdf",
        bib_path="references.bib",
    )

    # Check bibliography verification results
    if result.bib_verification:
        for bib_file, summary in result.bib_verification.items():
            print(f"Bibliography: {bib_file}")
            print(f"  Total references: {summary.total_references}")
            print(f"  Verified: {summary.verified}")
            print(f"  Suspect: {summary.suspect}")
            print(f"  Verification rate: {summary.verification_rate:.1%}")


asyncio.run(main())
```
Review papers directly from arXiv-style TeX source packages:
```python
import asyncio
import os

from cookbooks.paper_review import PaperReviewPipeline, PipelineConfig


async def main():
    config = PipelineConfig(
        model_name="gemini-3-pro-preview",  # Recommended
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        base_url=os.environ.get("OPENAI_BASE_URL", ""),  # OpenAI-compatible proxy
        enable_bib_verification=True,
        crossref_mailto="your-email@example.com",
    )

    pipeline = PaperReviewPipeline(config)

    # Process .tar.gz or .zip package
    result = await pipeline.review_tex_package("paper_source.tar.gz")

    # Access TeX package info
    if result.tex_info:
        print(f"Main TeX file: {result.tex_info.main_tex}")
        print(f"Total TeX files: {result.tex_info.total_files}")
        print(f"BibTeX files: {result.tex_info.bib_files}")
        print(f"Figures: {result.tex_info.figures}")


asyncio.run(main())
```
Recommended Models
For optimal paper review quality, we recommend using advanced reasoning models:
| Model | Provider | Best For |
|---|---|---|
| `gemini-3-pro-preview` | Google | Comprehensive review with excellent multimodal understanding |
| `gpt-5.2` | OpenAI | Deep logical analysis and nuanced critique |
Model Characteristics
GPT-5.2 excels at detecting subtle errors and inconsistencies, making it ideal for rigorous correctness checking. Gemini-3-Pro, by contrast, tends to be overly generous with praise and may overlook critical issues; use it when you need comprehensive coverage, but verify its findings with a more critical model.
Model Requirements
Paper review requires models with strong reasoning capabilities and native PDF/image understanding. Smaller models may miss subtle logical errors or produce superficial reviews.
Use OpenAI-Compatible Proxy Services
We recommend using OpenAI-compatible API proxy services (such as OpenRouter, Together AI, or other third-party providers) to access various models through a unified interface. This approach offers several benefits:
- Unified API: Access multiple model providers (Google, OpenAI, Anthropic, etc.) through a single OpenAI-compatible endpoint
- Cost Optimization: Many proxy services offer competitive pricing and pay-as-you-go billing
- Simplified Integration: No need to manage multiple API keys and endpoints
- Fallback Support: Easily switch between models without code changes
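For example, here is a minimal sketch of pointing `PipelineConfig` at a proxy endpoint. The base URL and the provider-prefixed model name below are placeholders that depend on the service you use, not values required by the pipeline:

```python
import os

from cookbooks.paper_review import PipelineConfig

# Placeholder values: substitute your proxy's endpoint and model identifier.
config = PipelineConfig(
    model_name="google/gemini-3-pro-preview",      # provider-prefixed name (example)
    api_key=os.environ.get("OPENAI_API_KEY", ""),  # key issued by the proxy
    base_url="https://openrouter.ai/api/v1",       # OpenAI-compatible endpoint (example)
)
```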
Component Guide
Pipeline Configuration
The PipelineConfig controls which evaluation phases to run:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | str | `"gpt-4o"` | LLM model identifier |
| `api_key` | str | `""` | API key for the model provider |
| `base_url` | str | `None` | Custom API base URL |
| `temperature` | float | `0.7` | Sampling temperature |
| `timeout` | int | `1500` | Request timeout in seconds |
| `enable_safety_checks` | bool | `True` | Run jailbreaking and format checks |
| `enable_correctness` | bool | `True` | Run correctness detection |
| `enable_review` | bool | `True` | Generate scholarly review |
| `enable_criticality` | bool | `True` | Verify detected issues |
| `enable_bib_verification` | bool | `True` | Validate bibliography |
| `crossref_mailto` | str | `None` | Email for CrossRef API (higher rate limits) |
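For instance, a sketch using only the flags above to run just the safety and correctness phases:

```python
from cookbooks.paper_review import PipelineConfig

# Keep safety and correctness checks; skip review, criticality, and bibliography.
config = PipelineConfig(
    model_name="gemini-3-pro-preview",
    api_key="your-api-key",
    enable_safety_checks=True,
    enable_correctness=True,
    enable_review=False,
    enable_criticality=False,
    enable_bib_verification=False,
)
```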
Graders
Each grader evaluates a specific aspect of the paper:
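The grader examples below assume a `pdf_base64` variable holding the paper as base64-encoded PDF bytes. One way to prepare it is sketched here; the `load_pdf_base64` helper is illustrative, not part of the package:

```python
import base64

# Illustrative helper: read a PDF from disk and base64-encode it
# for use as the `pdf_data` argument in the grader examples below.
def load_pdf_base64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

pdf_base64 = load_pdf_base64("paper.pdf")
```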
CorrectnessGrader
Detects logical errors, contradictions, and factual issues.
Output Score (1-3):
| Score | Meaning |
|---|---|
| 1 | No significant errors detected |
| 2 | Minor issues found |
| 3 | Major errors detected |
```python
from cookbooks.paper_review.graders import CorrectnessGrader

grader = CorrectnessGrader(model)
result = await grader.aevaluate(pdf_data=pdf_base64)

print(f"Score: {result.score}")
print(f"Key issues: {result.metadata.get('key_issues', [])}")
```
ReviewGrader
Generates a comprehensive scholarly review.
Output Score (1-6): Follows standard academic conference scoring:
| Score | Rating |
|---|---|
| 6 | Strong Accept |
| 5 | Accept |
| 4 | Weak Accept |
| 3 | Borderline |
| 2 | Weak Reject |
| 1 | Reject |
```python
from cookbooks.paper_review.graders import ReviewGrader

grader = ReviewGrader(model)
result = await grader.aevaluate(pdf_data=pdf_base64)

print(f"Score: {result.score}/6")
print(f"Review: {result.metadata.get('review_text', '')}")
```
CriticalityGrader
Verifies detected issues and classifies them by severity.
Issue Classification:
- Major: Critical issues affecting paper validity
- Minor: Issues that don't invalidate core contributions
- False Positives: Initially flagged issues where the paper turns out to be correct
```python
from cookbooks.paper_review.graders import CriticalityGrader

grader = CriticalityGrader(model)
result = await grader.aevaluate(
    pdf_data=pdf_base64,
    correctness_result=correctness_result,  # From CorrectnessGrader
)

print(f"Major issues: {result.metadata.get('major_issues', [])}")
print(f"Minor issues: {result.metadata.get('minor_issues', [])}")
print(f"False positives: {result.metadata.get('false_positives', [])}")
```
BibChecker
Validates bibliography entries against CrossRef database.
```python
from cookbooks.paper_review.processors import BibChecker

checker = BibChecker(mailto="your-email@example.com")
results = checker.check_bib_file("references.bib")

print(f"Total: {results['total_references']}")
print(f"Verified: {results['verified']}")
print(f"Suspect: {results['suspect']}")
```
Verification Status:
| Status | Description |
|---|---|
| `verified` | Reference found in CrossRef with matching metadata |
| `suspect` | Reference not found or metadata mismatch |
| `error` | Verification failed (API error, malformed entry) |
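As a follow-up to the example above, a small sketch that summarizes the counters returned by `check_bib_file` (field names are those shown above; per-entry details may differ):

```python
# Summarize the counters returned by BibChecker.check_bib_file().
total = results["total_references"]
verified = results["verified"]
suspect = results["suspect"]

if total:
    print(f"Verification rate: {verified / total:.1%}")
if suspect:
    print(f"{suspect} reference(s) should be checked manually against the original sources.")
```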
Output Format
PaperReviewResult
The pipeline returns a structured PaperReviewResult:
```python
class PaperReviewResult:
    is_safe: bool                    # Whether paper passed safety checks
    safety_issues: List[str]         # Detected safety issues
    correctness: CorrectnessResult   # Correctness detection result
    review: ReviewResult             # Scholarly review result
    criticality: CriticalityResult   # Issue classification result
    format_compliant: bool           # Whether format is acceptable
    bib_verification: Dict[str, BibVerificationSummary]  # Per-file results
    tex_info: TexPackageInfo         # TeX package metadata (if applicable)
    metadata: Dict[str, Any]         # Additional metadata
```
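For illustration, a sketch that consumes these fields after the pipeline returns (only attributes listed above are used):

```python
# Inspect the structured result returned by the pipeline.
if not result.is_safe:
    print("Safety issues detected:")
    for issue in result.safety_issues:
        print(f"  - {issue}")

print(f"Correctness score: {result.correctness.score}")  # 1-3, lower is better
print(f"Review score: {result.review.score}/6")           # 1-6 conference-style score
print(f"Format compliant: {result.format_compliant}")
```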
Markdown Report
Use generate_report() to create a human-readable Markdown report:
```python
from cookbooks.paper_review import generate_report

report = generate_report(
    result,
    paper_name="My Research Paper",
    output_path="review_report.md",  # Optional: save to file
)
```
Report Sections:
- Paper Information - Basic metadata
- Safety Status - Jailbreaking and format check results
- Correctness Analysis - Detected errors with severity indicators
- Scholarly Review - Full review text and score
- Bibliography Verification - Reference validation summary
- TeX Package Info - Source package details (if applicable)
Severity Indicators:
- 🔴 Major issue (critical)
- 🟡 Minor issue (non-critical)
- 🟢 No issues / Verified
Best Practices
Do
- Use recommended models (gemini-3-pro-preview or gpt-5.2) for best results
- Provide bibliography files when available for comprehensive verification
- Set a generous timeout (1500s+) for long papers
- Include your email in `crossref_mailto` for higher API rate limits
- Review the criticality classification to distinguish real issues from false positives
Don't
- Use smaller/faster models for paper review (quality will suffer)
- Skip safety checks in production (prevents prompt injection)
- Rely solely on automated review (human oversight is essential)
- Ignore `suspect` references without manual verification
Performance Tips:
- For papers > 30 pages, increase `timeout` to 2000+ seconds (see the sketch after this list)
- Use TeX package review when source is available (better bibliography extraction)
- Run individual graders separately if you only need specific evaluations
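A sketch of the first tip above, using only parameters documented in Pipeline Configuration:

```python
from cookbooks.paper_review import PipelineConfig

# For long papers (> 30 pages), allow more time per request.
config = PipelineConfig(
    model_name="gemini-3-pro-preview",
    api_key="your-api-key",
    timeout=2400,  # seconds; 2000+ recommended for long papers
)
```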
Examples
Find complete examples in the cookbooks/paper_review/examples/ directory:
| Example | Description |
|---|---|
| `single_paper_review.py` | Basic PDF review with report generation |
| `bib_verification.py` | Standalone bibliography verification |
| `tex_package_review.py` | Review from TeX source package |
| `correctness_check.py` | Correctness detection only |
Related Topics: Zero-Shot Evaluation · Create Custom Graders · Run Grading Tasks