Preference data for RLHF

Given a prompt and two candidate responses, pick the better one. Constrain the verdict to A, B, or tie.

from typing import Literal

from agno.agent import Agent
from agno.models.google import Gemini
from pydantic import BaseModel, Field


class Preference(BaseModel):
    winner: Literal["A", "B", "tie"] = Field(
        ..., description="Which response is better, or 'tie' if equal"
    )


agent = Agent(
    model=Gemini(id="gemini-3.5-flash"),
    instructions=(
        "Decide which response better answers the prompt. Return 'A', 'B', "
        "or 'tie'. Use 'tie' only when the two are genuinely "
        "indistinguishable in quality."
    ),
    output_schema=Preference,
)


def build_input(prompt: str, a: str, b: str) -> str:
    return f"Prompt:\n{prompt}\n\nResponse A:\n{a}\n\nResponse B:\n{b}"


prompt = "Explain why the sky is blue, in one sentence."
a = "Shorter blue wavelengths scatter more off air molecules, so the sky looks blue."
b = "Because of physics."
result = agent.run(build_input(prompt, a, b)).content
# Typically Preference(winner='A'); the verdict comes from the model and can vary

Each (prompt, A, B, winner) row is one pairwise comparison, the raw material for reward-model training and DPO. Agno produces the rows; the trainer is out of scope.
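
Most preference trainers want a chosen/rejected pair rather than an A/B label, so a small conversion step usually sits between these rows and the trainer. The sketch below is one way to do it. The helper `to_dpo_row` and the key names `prompt`, `chosen`, and `rejected` are assumptions for illustration; check the format your trainer expects. Dropping ties is also a choice made here, not a rule.

```python
from typing import Literal, Optional, TypedDict

Winner = Literal["A", "B", "tie"]


class DPORow(TypedDict):
    # Key names are an assumption; adjust them to your trainer's format.
    prompt: str
    chosen: str
    rejected: str


def to_dpo_row(prompt: str, a: str, b: str, winner: Winner) -> Optional[DPORow]:
    """Map one comparison to a chosen/rejected pair (hypothetical helper)."""
    if winner == "A":
        return {"prompt": prompt, "chosen": a, "rejected": b}
    if winner == "B":
        return {"prompt": prompt, "chosen": b, "rejected": a}
    # A tie carries no preference signal, so this sketch skips it.
    return None


row = to_dpo_row(
    "Explain why the sky is blue, in one sentence.",
    "Shorter blue wavelengths scatter more off air molecules, so the sky looks blue.",
    "Because of physics.",
    "A",
)
```

Callers should filter out the `None` results before writing the dataset.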

Add a rationale

A rationale per comparison gives annotators something to audit and helps debug a noisy reward model.

from typing import Literal

from pydantic import BaseModel, Field


class Preference(BaseModel):
    winner: Literal["A", "B", "tie"] = Field(..., description="Better response")
    rationale: str = Field(..., description="Why the winner is better")
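
For auditing, it can help to store each comparison together with its rationale in a format annotators can open directly. The following is a minimal sketch that writes records as JSON Lines using only the standard library. The helper `to_jsonl`, the record keys, and the rationale text are hand-written for illustration and are not model output.

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize comparison records, one JSON object per line (hypothetical helper)."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = [
    {
        "prompt": "Explain why the sky is blue, in one sentence.",
        "a": "Shorter blue wavelengths scatter more off air molecules, so the sky looks blue.",
        "b": "Because of physics.",
        "winner": "A",
        # Placeholder rationale written by hand for this example.
        "rationale": "A names the scattering mechanism; B gives no explanation.",
    }
]

audit_log = to_jsonl(records)
```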

Score against a rubric

When preference should follow explicit criteria, put the rubric in the instructions and keep the output a single label: A, B, or tie.

instructions = """\
Compare the two responses on these criteria, in priority order:
1. Correctness - is the information accurate
2. Completeness - does it fully answer the prompt
3. Clarity - is it easy to follow

Return the response that wins on the highest-priority criterion where
they differ. Use 'tie' only if they are equal on all three.
"""
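
If the criteria change between projects, the instruction text can be generated from an ordered list instead of edited by hand. This is a sketch under that assumption; `build_rubric_instructions` and `CRITERIA` are hypothetical names, and the wording mirrors the rubric above.

```python
def build_rubric_instructions(criteria: list[tuple[str, str]]) -> str:
    """Render (name, question) pairs as a numbered rubric in priority order."""
    lines = ["Compare the two responses on these criteria, in priority order:"]
    for rank, (name, question) in enumerate(criteria, start=1):
        lines.append(f"{rank}. {name} - {question}")
    lines.append("")
    lines.append("Return the response that wins on the highest-priority criterion where")
    lines.append(f"they differ. Use 'tie' only if they are equal on all {len(criteria)}.")
    return "\n".join(lines) + "\n"


# Order matters: earlier entries take priority when responses differ.
CRITERIA = [
    ("Correctness", "is the information accurate"),
    ("Completeness", "does it fully answer the prompt"),
    ("Clarity", "is it easy to follow"),
]

instructions = build_rubric_instructions(CRITERIA)
```

The resulting string can be passed as `instructions` in the same way as the hand-written version.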

Picking the shape

You need	Schema
Bare preference label	`Literal["A", "B", "tie"]`
Preference plus justification	Add a `rationale` field
Criteria-driven preference	Rubric in instructions, same three-way label

Reducing position bias

A single judge can favor whichever response is shown first. Run the comparison twice with A and B swapped, or send both orderings to two providers and adjudicate. See the Quality pipeline for the two-model agreement pattern.
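
The swap check can be sketched independently of any model. Below, `judge` is any callable that takes (prompt, first, second) and returns 'A', 'B', or 'tie' for the order it was shown; in practice it would wrap a call such as `agent.run(...)`. The names `debiased_winner`, `always_first`, and `prefers_longer` are hypothetical, and returning `None` on disagreement is an assumption that leaves adjudication to a later step.

```python
from typing import Callable, Literal, Optional

Winner = Literal["A", "B", "tie"]
Judge = Callable[[str, str, str], Winner]  # (prompt, first, second) -> verdict

# In the swapped run, the judge's 'A' refers to the original B, and vice versa.
_SWAP: dict[Winner, Winner] = {"A": "B", "B": "A", "tie": "tie"}


def debiased_winner(judge: Judge, prompt: str, a: str, b: str) -> Optional[Winner]:
    """Judge both orderings; return the verdict only if the two runs agree."""
    forward = judge(prompt, a, b)
    backward = _SWAP[judge(prompt, b, a)]  # map back to the original labels
    return forward if forward == backward else None


def always_first(prompt: str, first: str, second: str) -> Winner:
    """Toy judge with pure position bias: always picks the first response."""
    return "A"


def prefers_longer(prompt: str, first: str, second: str) -> Winner:
    """Toy judge that ignores position and picks the longer response."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


biased = debiased_winner(always_first, "p", "long answer", "no")
stable = debiased_winner(prefers_longer, "p", "long answer", "no")
```

Here `biased` comes back as `None` because the verdict flips with the order, while `stable` keeps its label. Rows that return `None` are the ones worth sending to a second judge or a human.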

Next steps

Task	Guide
Score a single response	LLM as judge
Adjudicate disagreements	Quality pipeline

Developer Resources

Pairwise preference cookbook
