Overview
Why OpenJudge?
OpenJudge is an open-source evaluation framework for AI applications (e.g., AI agents or chatbots) designed to evaluate quality and drive continuous application optimization.
In practice, application excellence depends on a trustworthy evaluation workflow: Collect test data → Define graders → Run evaluation at scale → Analyze weaknesses → Iterate quickly.
OpenJudge provides ready-to-use graders and supports generating scenario-specific rubrics (as graders), making this workflow simpler, more rigorous, and easy to integrate into your existing stack.
It can also convert grading results into reward signals to help you fine-tune and optimize your application.
Key Features
- Systematic & Quality-Assured Grader Library: Access 50+ production-ready graders featuring a comprehensive taxonomy, rigorously validated for reliable performance.
- Multi-Scenario Coverage: Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks via specialized graders. Explore Supported Scenarios→
- Holistic Agent Evaluation: Beyond final outcomes, we assess the entire lifecycle—including trajectories and specific components (Memory, Reflection, Tool Use). Agent Lifecycle Evaluation →
- Quality Assurance: Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. View Benchmark Datasets→
- Flexible Grader Building: Choose the build method that fits your requirements:
- Customization: Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader. Custom Grader Development Guide →
- Zero-shot Rubrics Generation: Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries—the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping. Zero-shot Rubrics Generation Guide →
- Data-driven Rubrics Generation: Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. Data-driven Rubrics Generation Guide →
- Training Judge Models: Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. Train Judge Models →
- Easy Integration: Using mainstream observability platforms like LangSmith or Langfuse? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We also provide integrations with training frameworks like VERL for RL training.
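To make the customization path above concrete, here is a minimal sketch of a rule-based grader. The `GradeResult` and `KeywordGrader` names and the `grade` method are assumptions made for illustration, not OpenJudge's actual interface; see the Custom Grader Development Guide for the real one.

```python
from dataclasses import dataclass

# Hypothetical shapes for illustration only -- OpenJudge's real base
# classes and method names may differ.
@dataclass
class GradeResult:
    score: float  # normalized to [0, 1]
    reason: str   # human-readable explanation

class KeywordGrader:
    """Rule-based grader: score = fraction of required keywords mentioned."""
    def __init__(self, required_keywords):
        self.required = [k.lower() for k in required_keywords]

    def grade(self, response: str) -> GradeResult:
        text = response.lower()
        hits = [k for k in self.required if k in text]
        score = len(hits) / len(self.required) if self.required else 1.0
        return GradeResult(score, f"matched {len(hits)}/{len(self.required)} keywords")

grader = KeywordGrader(["refund", "policy"])
result = grader.grade("Our refund policy allows returns within 30 days.")
print(result.score)  # 1.0
```

The same pattern extends to any explicit rule or heuristic you can express in Python.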
Quick Tutorials
Auto Arena
Compare models/agents without test data: Generate queries, collect responses, and rank via pairwise evaluation.
Evaluate An AI Agent
Agent lifecycle evaluation: Assess response, trajectory, tool usage, planning, memory, and reflection.
Build Rewards for Training
Quality reward signals: Aggregate graders with custom weighting for model alignment.
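The reward-building idea above amounts to combining several per-grader scores into one signal. A minimal sketch of weighted aggregation (the function name and dict-based signature are assumptions for illustration, not OpenJudge's API):

```python
def aggregate_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-grader scores; weights are renormalized
    over the graders actually present in `scores`."""
    total_w = sum(weights.get(name, 0.0) for name in scores)
    if total_w == 0:
        raise ValueError("no positive weights for the provided scores")
    return sum(scores[n] * weights.get(n, 0.0) for n in scores) / total_w

reward = aggregate_rewards(
    {"helpfulness": 0.8, "safety": 1.0, "format": 0.5},
    {"helpfulness": 2.0, "safety": 1.0, "format": 1.0},
)
print(round(reward, 3))  # (0.8*2 + 1.0*1 + 0.5*1) / 4 = 0.775
```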
More Tutorials
Built-in Graders
Agent
Agent graders for evaluating various aspects of AI agent behavior. These graders assess action selection, tool usage, memory management, planning, reflection, and overall trajectory quality.
General Tasks
Assess fundamental capabilities such as instruction following, text quality, safety guardrails, and format.
Multimodal
Vision-language graders for evaluating AI responses involving images. These graders assess image-text coherence, image helpfulness, and text-to-image generation quality.
Math & Code
Specialized graders for evaluating code generation and mathematical problem-solving capabilities. These graders assess syntax correctness, execution results, code style, and mathematical expression accuracy.
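One of the checks listed above, syntax correctness, can be done deterministically without an LLM. A sketch using Python's standard `ast` module (the `check_python_syntax` helper is a hypothetical name, not an OpenJudge grader):

```python
import ast

def check_python_syntax(code: str) -> tuple[bool, str]:
    """Return (ok, message) for a Python code snippet, via ast.parse."""
    try:
        ast.parse(code)
        return True, "syntax OK"
    except SyntaxError as e:
        return False, f"line {e.lineno}: {e.msg}"

ok, msg = check_python_syntax("def add(a, b):\n    return a + b\n")
print(ok)       # True
bad_ok, bad_msg = check_python_syntax("def broken(:\n    pass")
print(bad_ok)   # False
```

Execution-result checks would additionally run the code in a sandbox and compare outputs, which is out of scope for this sketch.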
Text
Algorithm-based graders for text similarity and matching. Fast, deterministic, and zero-cost evaluation using BLEU, ROUGE, F1, regex, and 15+ similarity algorithms.
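As an example of why these algorithmic graders are fast and deterministic, here is a token-overlap F1 in plain Python, the SQuAD-style matching metric (a generic sketch, not OpenJudge's implementation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference string."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat sat on the mat"))  # 0.666...
```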
Format
Format validation graders for structured outputs. Validate JSON syntax, check length constraints, detect repetition, and verify reasoning tags for chain-of-thought.
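The checks listed above are all cheap, deterministic operations. A sketch combining them (function name, dict keys, and the repetition heuristic are assumptions for illustration, not OpenJudge's graders):

```python
import json
import re

def validate_format(text: str, max_chars: int = 2000) -> dict:
    """Run a few cheap format checks on a model response."""
    checks = {}
    try:
        json.loads(text)
        checks["valid_json"] = True
    except json.JSONDecodeError:
        checks["valid_json"] = False
    checks["within_length"] = len(text) <= max_chars
    # crude repetition heuristic: a 20+ char chunk repeated 3x in a row
    checks["no_repetition"] = not re.search(r"(.{20,}?)\1\1", text, re.S)
    # chain-of-thought tag check
    checks["has_reasoning"] = bool(re.search(r"<think>.*?</think>", text, re.S))
    return checks

print(validate_format('{"answer": 42}'))
```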
Build Graders
Customization
Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or Prompt templates to quickly define your own grader.
Generate Rubrics
Auto-generate evaluation criteria. Use Zero-Shot generation from task descriptions, or Data-Driven generation to learn rubrics from labeled preference data.
Train Judge Models
Massive data and need peak performance? Train dedicated judge models using SFT, Bradley-Terry, or GRPO. Supports both scalar rewards and generative evaluation with reasoning.
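To illustrate the Bradley-Terry objective mentioned above (the model, not OpenJudge's training pipeline), here is a minimal minorization-maximization fit of per-model strengths from a pairwise win matrix:

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths via MM updates.
    wins[i][j] = number of times model i beat model j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize so strengths sum to 1
    return p

# model 0 beat model 1 eight times out of ten
strengths = bradley_terry([[0, 8], [2, 0]])
print(strengths)  # [0.8, 0.2]
```

A judge model trained on a Bradley-Terry loss learns the same preference structure, but from response pairs rather than a precomputed win matrix.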
Integrations
LangSmith
Build external evaluation pipelines for LangSmith. Wrap OpenJudge graders as LangSmith evaluators and run batch evaluations with GradingRunner.
Langfuse
Fetch traces from Langfuse, evaluate with OpenJudge graders, and push scores back. Supports batch processing and score aggregation.
VERL
Integrate OpenJudge graders as reward functions for VERL RL training. Supports batch processing and async evaluation at scale.
Applications
Data Refinement
Automate the curation of high-quality datasets. Use Graders to filter, rank, and synthesize training data for Supervised Fine-Tuning (SFT).
Pairwise Evaluation
Compare and rank multiple model outputs using LLM-based pairwise comparisons. Compute win rates, generate win matrices, and identify the best-performing models.
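The win-matrix and win-rate computation described above can be sketched in a few lines; here the `judge` callable stands in for an LLM-based pairwise comparator, and the function name is a hypothetical illustration:

```python
from itertools import combinations

def win_stats(models, judge):
    """Round-robin pairwise comparison.
    judge(a, b) returns the winner's name (assumed deterministic here)."""
    matrix = {a: {b: 0 for b in models if b != a} for a in models}
    for a, b in combinations(models, 2):
        winner = judge(a, b)
        loser = b if winner == a else a
        matrix[winner][loser] += 1
    games = len(models) - 1
    win_rates = {m: sum(matrix[m].values()) / games for m in models}
    return matrix, win_rates

# toy judge: lexicographically smaller name wins (stand-in for an LLM judge)
matrix, rates = win_stats(["model-a", "model-b", "model-c"],
                          judge=lambda a, b: min(a, b))
print(rates["model-a"])  # 1.0
```

In practice each pair would be judged multiple times with position swapping to reduce order bias.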
Running Graders
Run Grading Tasks
Orchestrate evaluations at scale with GradingRunner. Configure data mapping, control concurrency, and aggregate results from multiple graders into unified scores.
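The orchestration pattern above (bounded concurrency, multi-grader aggregation) can be sketched with stdlib asyncio; `run_graders` and the toy graders are assumptions for illustration, not GradingRunner's actual API:

```python
import asyncio

async def run_graders(samples, graders, max_concurrency=8):
    """Grade every sample with every grader, bounded by a semaphore,
    then average per-sample scores into a unified score."""
    sem = asyncio.Semaphore(max_concurrency)

    async def grade_one(sample, grader):
        async with sem:
            return await grader(sample)

    results = []
    for sample in samples:
        scores = await asyncio.gather(*(grade_one(sample, g) for g in graders))
        results.append({"sample": sample, "score": sum(scores) / len(scores)})
    return results

# toy graders standing in for real ones
async def length_grader(s):
    return min(len(s) / 10, 1.0)

async def vowel_grader(s):
    return sum(c in "aeiou" for c in s) / max(len(s), 1)

out = asyncio.run(run_graders(["hello world"], [length_grader, vowel_grader]))
print(round(out[0]["score"], 3))  # 0.636
```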
Analyze Grader Results
Transform raw scores into actionable insights. Examine score distributions, measure consistency, and compare performance against ground truth labels.
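As a rough sketch of the analysis step above, assuming binary pass/fail ground truth and a score threshold (the `analyze` function and its report keys are hypothetical, not OpenJudge's API):

```python
from statistics import mean, stdev

def analyze(scores, labels=None, threshold=0.5):
    """Summarize a grader's score distribution and, when ground-truth
    pass/fail labels are given, its agreement with them."""
    report = {"mean": mean(scores), "stdev": stdev(scores),
              "min": min(scores), "max": max(scores)}
    if labels is not None:
        preds = [s >= threshold for s in scores]
        report["accuracy"] = mean(p == l for p, l in zip(preds, labels))
    return report

report = analyze([0.9, 0.2, 0.7, 0.4], labels=[True, False, True, True])
print(report["accuracy"])  # 0.75
```

With richer labels, the same loop extends to correlation or Cohen's kappa against human annotations.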