Extend OpenJudge beyond built-in evaluators by creating custom graders or training judge models. Build domain-specific evaluation logic that seamlessly integrates with OpenJudge's evaluation pipeline.
Why Build Custom Graders?
While OpenJudge provides 50+ pre-built graders, custom graders enable you to evaluate industry-specific criteria (legal, medical, financial), implement proprietary scoring logic, and train models that learn from your preference data. They also help optimize costs by replacing expensive API judges with self-hosted models while maintaining consistent evaluation standards across applications.
Building Approaches
OpenJudge supports three paths for creating custom graders, each optimized for different scenarios.
| Approach | Time to Deploy | Data Required | Best For | Cost Profile |
|---|---|---|---|---|
| Create Custom Graders | Minutes | None | Quick prototyping, domain-specific logic | Pay-per-query (API) or free (code-based) |
| Generate from Data | 1-4 hours | 50-500 examples | Iterative refinement, transparent rubrics | Medium setup + pay-per-query |
| Train Judge Models | 1-3 days | 1K-100K pairs | High-volume production (>1M queries/month) | High upfront, 10x lower per-query |
Use this decision tree to choose the right approach based on your data availability and requirements:
```text
                 START
                   │
                   ▼
        ┌─────────────────────┐
        │   Have evaluation   │
        │  data with labels?  │
        └─────┬─────────┬─────┘
          YES │         │ NO
              ▼         ▼
   ┌──────────────┐   ┌──────────────────┐
   │ Want to      │   │ Need evaluation  │
   │ train model? │   │ now?             │
   └──┬───────┬───┘   └───┬────────────┬─┘
  YES │       │ NO    YES │            │ NO
      ▼       ▼           ▼            ▼
 ┌───────┐ ┌───────────┐ ┌─────────┐ ┌──────────┐
 │ Train │ │ Generator │ │ Custom  │ │  Define  │
 │ Model │ │ (Rubric)  │ │ Graders │ │ criteria │
 └───┬───┘ └─────┬─────┘ └────┬────┘ └────┬─────┘
     │           │            │           │
     └───────────┴──────┬─────┴───────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │  Use in evaluation pipeline   │
        │  (GradingRunner, batch eval)  │
        └───────────────────────────────┘
```
Choose based on your situation:
- Have labeled data + need automation? → Train a judge model
- Have data + need fast iteration? → Generate rubrics from data
- No data + need immediate results? → Create custom graders
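Whichever branch you land on, the resulting grader feeds the same downstream pipeline. The loop below sketches the shape of batch grading; it is illustrative only, and the record format and `grade()` callable are assumptions. The real GradingRunner interface is covered in Run Grading Tasks.

```python
# Illustrative batch grading loop. A production setup would hand the
# grader to GradingRunner (see Run Grading Tasks); the record shape and
# grade() callable here are assumptions for illustration.
responses = [
    {"id": 1, "text": "Paris is the capital of France."},
    {"id": 2, "text": "The capital of France is Lyon."},
]

def grade(text: str) -> float:
    # Stand-in for any grader produced by the three approaches above.
    return 1.0 if "Paris" in text else 0.0

results = [{"id": r["id"], "score": grade(r["text"])} for r in responses]
print(results)  # [{'id': 1, 'score': 1.0}, {'id': 2, 'score': 0.0}]
```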
Approach 1: Create Custom Graders
Define evaluation logic using LLM judges or code-based functions, with no training required. LLM-based graders use models such as qwen3-32b with custom prompts to evaluate domain-specific criteria. Code-based graders implement deterministic logic: checking response length, keyword presence, format validity, or compliance requirements.
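As a concrete sketch, here is what a deterministic code-based grader can look like. The `GraderResult` container and `grade()` method are illustrative assumptions, not OpenJudge's actual interface; see Create Custom Graders for the real base classes.

```python
# Illustrative code-based grader: deterministic checks for length and
# required keywords. The GraderResult/grade() interface shown here is an
# assumption for illustration, not necessarily OpenJudge's actual API.
from dataclasses import dataclass

@dataclass
class GraderResult:
    score: float   # normalized to [0, 1]
    passed: bool
    reason: str

class ComplianceGrader:
    """Checks that a response stays within a length budget and includes a
    required disclaimer keyword (e.g. for financial or medical content)."""

    def __init__(self, max_words: int = 300, required_keywords=("disclaimer",)):
        self.max_words = max_words
        self.required_keywords = required_keywords

    def grade(self, response: str) -> GraderResult:
        words = response.split()
        missing = [kw for kw in self.required_keywords
                   if kw.lower() not in response.lower()]
        checks = [len(words) <= self.max_words, not missing]
        score = sum(checks) / len(checks)
        reason = (f"{len(words)}/{self.max_words} words; "
                  f"missing keywords: {missing or 'none'}")
        return GraderResult(score=score, passed=score == 1.0, reason=reason)

# Usage
grader = ComplianceGrader(max_words=50)
print(grader.grade("Short answer. Disclaimer: not financial advice."))
```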
Learn more: Create Custom Graders → | Built-in Graders →
Approach 2: Generate Rubrics as Graders
Automatically generate evaluation rubrics and turn them into graders. Two modes are available: Simple Rubric generates rubrics from a task description (zero-shot, no data required), while Iterative Rubric learns from 50-500 labeled examples to extract scoring patterns. Both produce explicit rubrics that explain their scoring decisions, making them ideal when you need transparency and rapid refinement.
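The sketch below shows how an explicit rubric can drive an LLM judge prompt. The rubric structure and prompt template are illustrative assumptions; the rubrics OpenJudge generates may be formatted differently.

```python
# Sketch of a rubric used as a grader prompt. The rubric format and the
# prompt assembly are illustrative; OpenJudge's generated rubrics may differ.
RUBRIC = [
    ("Accuracy",     "Claims are factually correct and verifiable.",    0.5),
    ("Completeness", "All parts of the question are addressed.",        0.3),
    ("Clarity",      "Response is well organized and easy to follow.",  0.2),
]

def build_judge_prompt(question: str, response: str) -> str:
    criteria = "\n".join(
        f"- {name} (weight {weight}): {desc}" for name, desc, weight in RUBRIC
    )
    return (
        "Score the response on each criterion from 1-5, then report a "
        "weighted total.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Question: {question}\nResponse: {response}"
    )

print(build_judge_prompt("What is HTTPS?", "HTTPS is HTTP over TLS..."))
```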
Learn more: Generate Rubrics as Graders →
Approach 3: Train Judge Models
Train neural networks on preference data to learn evaluation criteria automatically. Three objectives are supported: Bradley-Terry (preference pairs), Generative Pointwise (absolute scores), and Generative Pairwise (comparison decisions). Training requires 1K-100K examples and 1-3 days, but delivers highly consistent evaluation at roughly 10x lower per-query cost, making it ideal for high-volume scenarios exceeding 1M queries per month.
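For intuition, the core of Bradley-Terry training is a pairwise log-likelihood: the judge model assigns each response a scalar score, and the loss pushes chosen responses above rejected ones. The PyTorch sketch below is a toy illustration with random features, not OpenJudge's training pipeline, which handles tokenization, batching, and the reward-head architecture.

```python
# Minimal Bradley-Terry training step on preference pairs (toy example).
import torch
import torch.nn.functional as F

def bt_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor):
    # P(chosen > rejected) = sigmoid(s_c - s_r); maximize its log-likelihood.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy reward model: maps a feature vector to a scalar quality score.
model = torch.nn.Linear(16, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in features for a batch of 8 (chosen, rejected) response pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = bt_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
print(f"Bradley-Terry loss: {loss.item():.4f}")
```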
Learn more: Train Judge Models →
Next Steps
- Create Custom Graders — Build graders using LLM or code-based logic
- Generate Rubrics as Graders — Automatically generate graders from a task description or labeled data
- Train Judge Models — Train SFT, Bradley-Terry, or GRPO judge models
- Built-in Graders — Explore pre-built graders to customize
- Run Grading Tasks — Deploy graders at scale with batch workflows