Extend OpenJudge beyond built-in evaluators by creating custom graders or training judge models. Build domain-specific evaluation logic that seamlessly integrates with OpenJudge's evaluation pipeline.
Why Build Custom Graders?
While OpenJudge provides 50+ pre-built graders, custom graders enable you to evaluate industry-specific criteria (legal, medical, financial), implement proprietary scoring logic, and train models that learn from your preference data. They also help optimize costs by replacing expensive API judges with self-hosted models while maintaining consistent evaluation standards across applications.
Building Approaches
OpenJudge supports three paths for creating custom graders, each optimized for different scenarios.
| Approach | Time to Deploy | Data Required | Best For | Cost Profile |
|---|---|---|---|---|
| Create Custom Graders | Minutes | None | Quick prototyping, domain-specific logic | Pay-per-query (API) or free (code-based) |
| Generate from Data | 1-4 hours | 50-500 examples | Iterative refinement, transparent rubrics | Medium setup + pay-per-query |
| Train Judge Models | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
Use this decision tree to choose the right approach based on your data availability and requirements:
```text
                 START
                   │
                   ▼
        ┌─────────────────────┐
        │   Have evaluation   │
        │  data with labels?  │
        └─────┬─────────┬─────┘
          YES │         │ NO
              ▼         ▼
   ┌──────────────┐   ┌──────────────────┐
   │ Want to      │   │ Need evaluation  │
   │ train model? │   │ now?             │
   └──┬───────┬───┘   └───┬────────────┬─┘
  YES │       │ NO    YES │            │ NO
      ▼       ▼           ▼            ▼
 ┌───────┐ ┌───────────┐ ┌─────────┐ ┌──────────┐
 │ Train │ │ Generator │ │ Custom  │ │  Define  │
 │ Model │ │ (Rubric)  │ │ Graders │ │ criteria │
 └───┬───┘ └─────┬─────┘ └────┬────┘ └────┬─────┘
     │           │            │           │
     └───────────┴──────┬─────┴───────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │  Use in evaluation pipeline   │
        │  (GradingRunner, batch eval)  │
        └───────────────────────────────┘
```
Choose based on your situation:
- Have labeled data + need automation? → Train a judge model
- Have data + need fast iteration? → Generate rubrics from data
- No data + need immediate results? → Create custom graders
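Whichever branch you land on, the resulting grader feeds the same downstream pipeline. The loop below sketches the shape of batch grading; it is illustrative only, and the record format and `grade()` callable are assumptions. The real GradingRunner interface is covered in Run Grading Tasks.

```python
# Illustrative batch grading loop. A production setup would hand the
# grader to GradingRunner (see Run Grading Tasks); the record shape and
# grade() callable here are assumptions for illustration.
responses = [
    {"id": 1, "text": "Paris is the capital of France."},
    {"id": 2, "text": "The capital of France is Lyon."},
]

def grade(text: str) -> float:
    # Stand-in for any grader produced by the three approaches above.
    return 1.0 if "Paris" in text else 0.0

results = [{"id": r["id"], "score": grade(r["text"])} for r in responses]
print(results)  # [{'id': 1, 'score': 1.0}, {'id': 2, 'score': 0.0}]
```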
Approach 1: Create Custom Graders
Define evaluation logic using LLM judges or code-based functions, with no training required. LLM-based graders use models such as qwen3-32b with custom prompts to evaluate domain-specific criteria. Code-based graders implement deterministic logic: checking response length, keyword presence, format validity, or compliance requirements.
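As a concrete sketch, here is what a deterministic code-based grader can look like. The `GraderResult` container and `grade()` method are illustrative assumptions, not OpenJudge's actual interface; see Create Custom Graders for the real base classes.

```python
# Illustrative code-based grader: deterministic checks for length and
# required keywords. The GraderResult/grade() interface shown here is an
# assumption for illustration, not necessarily OpenJudge's actual API.
from dataclasses import dataclass

@dataclass
class GraderResult:
    score: float   # normalized to [0, 1]
    passed: bool
    reason: str

class ComplianceGrader:
    """Checks that a response stays within a length budget and includes a
    required disclaimer keyword (e.g. for financial or medical content)."""

    def __init__(self, max_words: int = 300, required_keywords=("disclaimer",)):
        self.max_words = max_words
        self.required_keywords = required_keywords

    def grade(self, response: str) -> GraderResult:
        words = response.split()
        missing = [kw for kw in self.required_keywords
                   if kw.lower() not in response.lower()]
        checks = [len(words) <= self.max_words, not missing]
        score = sum(checks) / len(checks)
        reason = (f"{len(words)}/{self.max_words} words; "
                  f"missing keywords: {missing or 'none'}")
        return GraderResult(score=score, passed=score == 1.0, reason=reason)

# Usage
grader = ComplianceGrader(max_words=50)
print(grader.grade("Short answer. Disclaimer: not financial advice."))
```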
Learn more: Create Custom Graders → | Built-in Graders →
Approach 2: Generate Rubrics as Graders
Automatically generate evaluation rubrics and turn them into graders. Two modes are available: Simple Rubric generates rubrics from a task description (zero-shot, no data required), while Iterative Rubric learns from 50-500 labeled examples to extract scoring patterns. Both produce explicit rubrics that explain their scoring decisions, making them ideal when you need transparency and rapid refinement.
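The sketch below shows how an explicit rubric can drive an LLM judge prompt. The rubric structure and prompt template are illustrative assumptions; the rubrics OpenJudge generates may be formatted differently.

```python
# Sketch of a rubric used as a grader prompt. The rubric format and the
# prompt assembly are illustrative; OpenJudge's generated rubrics may differ.
RUBRIC = [
    ("Accuracy",     "Claims are factually correct and verifiable.",    0.5),
    ("Completeness", "All parts of the question are addressed.",        0.3),
    ("Clarity",      "Response is well organized and easy to follow.",  0.2),
]

def build_judge_prompt(question: str, response: str) -> str:
    criteria = "\n".join(
        f"- {name} (weight {weight}): {desc}" for name, desc, weight in RUBRIC
    )
    return (
        "Score the response on each criterion from 1-5, then report a "
        "weighted total.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Question: {question}\nResponse: {response}"
    )

print(build_judge_prompt("What is HTTPS?", "HTTPS is HTTP over TLS..."))
```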
Learn more: Generate Rubrics as Graders →
Approach 3: Train Judge Models
Train neural networks on preference data to learn evaluation criteria automatically. Three objectives are supported: Bradley-Terry (preference pairs), Generative Pointwise (absolute scores), and Generative Pairwise (comparison decisions). Training requires 1K-100K examples and 1-3 days, but delivers highly consistent evaluation at roughly 10x lower per-query cost, making it ideal for high-volume scenarios exceeding 1M queries per month.
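For intuition, the core of Bradley-Terry training is a pairwise log-likelihood: the judge model assigns each response a scalar score, and the loss pushes chosen responses above rejected ones. The PyTorch sketch below is a toy illustration with random features, not OpenJudge's training pipeline, which handles tokenization, batching, and the reward-head architecture.

```python
# Minimal Bradley-Terry training step on preference pairs (toy example).
import torch
import torch.nn.functional as F

def bt_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor):
    # P(chosen > rejected) = sigmoid(s_c - s_r); maximize its log-likelihood.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy reward model: maps a feature vector to a scalar quality score.
model = torch.nn.Linear(16, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in features for a batch of 8 (chosen, rejected) response pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = bt_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
print(f"Bradley-Terry loss: {loss.item():.4f}")
```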
Learn more: Train Judge Models →
Next Steps
- Create Custom Graders — Build graders using LLM or code-based logic
- Generate Rubrics as Graders — Automatically generate graders from a task description or labeled data
- Train Judge Models — Train SFT, Bradley-Terry, or GRPO judge models
- Built-in Graders — Explore pre-built graders to customize
- Run Grading Tasks — Deploy graders at scale with batch workflows