A, B, or tie.
Each (prompt, A, B, winner) row is the input format for reward-model training and DPO. Agno produces the rows; the trainer itself is out of scope.
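A minimal sketch of that row and of one way to hand it to a trainer. It assumes the trainer expects prompt/chosen/rejected records (the layout TRL's DPOTrainer uses for preference data) and that ties are dropped because they carry no preference signal. `PreferenceRow`, `to_dpo_record`, and `to_jsonl` are hypothetical names, not Agno APIs.

```python
import json
from dataclasses import dataclass
from typing import Literal, Optional

Winner = Literal["A", "B", "tie"]


@dataclass
class PreferenceRow:
    """One pairwise comparison: a prompt, two candidate responses, and the verdict."""
    prompt: str
    A: str
    B: str
    winner: Winner


def to_dpo_record(row: PreferenceRow) -> Optional[dict]:
    """Map a row to prompt/chosen/rejected keys (assumed trainer format).

    Ties say nothing about which response is better, so they are dropped.
    """
    if row.winner == "tie":
        return None
    chosen, rejected = (row.A, row.B) if row.winner == "A" else (row.B, row.A)
    return {"prompt": row.prompt, "chosen": chosen, "rejected": rejected}


def to_jsonl(rows: list[PreferenceRow]) -> str:
    """Serialize the non-tie rows as JSON Lines, one record per line."""
    records = (to_dpo_record(row) for row in rows)
    return "\n".join(json.dumps(rec) for rec in records if rec is not None)
```

Keeping the raw row (with ties) and converting only at export time means the same comparisons can feed a reward model, which may use ties, and a DPO run, which cannot.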
Add a rationale
A rationale per comparison gives annotators something to audit and helps debug a noisy reward model.
Score against a rubric
When preference should follow explicit criteria, put the rubric in the instructions and keep the output binary.
Picking the shape
| You need | Schema |
|---|---|
| Bare preference label | Literal["A", "B", "tie"] |
| Preference plus justification | Add a rationale field |
| Criteria-driven preference | Rubric in instructions, binary output |
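The three shapes in the table can be sketched with standard-library types. In Agno the schema would typically be a Pydantic model given to the judge agent as its structured output, with the rubric text as its instructions; this sketch stays dependency-free instead. `parse_rationale` and `rubric_instructions` are hypothetical helpers, not Agno APIs.

```python
import json
from dataclasses import dataclass
from typing import Literal, get_args

# Shape 1: bare preference label.
Preference = Literal["A", "B", "tie"]

# Shape 3: criteria-driven preference. The rubric lives in the instructions,
# so the output is binary and the judge must commit to one side.
BinaryPreference = Literal["A", "B"]


@dataclass
class RationalePreference:
    """Shape 2: preference plus a justification annotators can audit."""
    winner: Preference
    rationale: str


def parse_rationale(raw: str) -> RationalePreference:
    """Validate a judge's JSON reply against shape 2 (hypothetical helper)."""
    data = json.loads(raw)
    winner = data.get("winner")
    if winner not in get_args(Preference):
        raise ValueError(f"unexpected winner label: {winner!r}")
    rationale = str(data.get("rationale", "")).strip()
    if not rationale:
        raise ValueError("empty rationale leaves nothing to audit")
    return RationalePreference(winner=winner, rationale=rationale)


def rubric_instructions(criteria: list[str]) -> str:
    """Render rubric criteria into judge instructions for shape 3 (hypothetical helper)."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    lines = ["Compare response A and response B against these criteria:"]
    lines += [f"{i}. {c}" for i, c in enumerate(criteria, start=1)]
    lines.append(f"Answer with exactly one of: {', '.join(get_args(BinaryPreference))}.")
    return "\n".join(lines)
```

Deriving the allowed labels from the `Literal` types keeps the validator and the instruction text in sync if a shape changes.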
Reducing position bias
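A minimal sketch of the swap check for this section, assuming a judge is any callable that takes (prompt, first, second) and returns the winning position; in practice each judge would wrap an Agno agent call. `Verdict`, `Judge`, and `swapped_compare` are hypothetical names.

```python
from typing import Callable, Literal, Optional

Verdict = Literal["A", "B", "tie"]
# Hypothetical judge signature: (prompt, first_response, second_response) -> winning position.
Judge = Callable[[str, str, str], Verdict]

# Maps a verdict from the swapped run back to the original A/B labels.
_FLIP = {"A": "B", "B": "A", "tie": "tie"}


def swapped_compare(
    judge: Judge, prompt: str, a: str, b: str, second_judge: Optional[Judge] = None
) -> Optional[Verdict]:
    """Compare A and B in both orders; keep the verdict only if the runs agree.

    The second run shows B first, so its verdict is flipped back before comparing.
    Pass second_judge to use a different provider for the swapped order.
    Returns None on disagreement so the pair can go to adjudication.
    """
    forward = judge(prompt, a, b)
    backward = _FLIP[(second_judge or judge)(prompt, b, a)]
    return forward if forward == backward else None
```

Returning `None` instead of forcing a tie keeps order-dependent verdicts visible, so they can be routed to adjudication rather than silently entering the training set.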
A single judge can favor whichever response is shown first. Run the comparison twice with A and B swapped, or send both orderings to two providers and adjudicate when they disagree. See the Quality pipeline for the two-model agreement pattern.
Next steps
| Task | Guide |
|---|---|
| Score a single response | LLM as judge |
| Adjudicate disagreements | Quality pipeline |