winference

Win rate calibration under non-transitivity.

When you run an LLM arena and report “Model A beats Model B 62% of the time,” is that number calibrated? And does it still hold when your users ask different questions than your evaluation set?

If model strengths vary across task types, aggregate win rates can exhibit non-transitive preferences: A beats B, B beats C, yet C beats A. Standard Bradley-Terry / Elo assumes this never happens, and when it does, your calibration breaks — especially under distribution shift.
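A minimal numeric illustration (not part of winference, and the 62% figures are made up): three models whose pairwise win rates form a cycle. Any scalar rating s_m predicts P(i beats j) = sigmoid(s_i − s_j), so the log-odds around any cycle must telescope to zero — which the observed rates violate.

```python
# Hypothetical 3-model arena with a rock-paper-scissors cycle.
import math

# Observed win rates: A beats B, B beats C, C beats A, each 62% of the time.
win_rate = {("A", "B"): 0.62, ("B", "C"): 0.62, ("C", "A"): 0.62}

# Under any scalar rating, log-odds are differences of scores, so the sum
# (s_A - s_B) + (s_B - s_C) + (s_C - s_A) around the cycle is exactly 0.
logit = lambda p: math.log(p / (1 - p))
cycle_sum = sum(logit(p) for p in win_rate.values())

print(f"log-odds around the cycle: {cycle_sum:.3f}")  # → 1.469, not 0
```

A strictly positive cycle sum is the signature of non-transitivity that no single-score model can fit.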

winference provides two approaches to calibrating win rates in the presence of non-transitivity, plus diagnostics to decide which one you need.

Installation

pip install winference

The Two Approaches

A) Hodge Decomposition

Decomposes the pairwise comparison matrix into:

  • Gradient (transitive): a potential s_i per model such that the log-odds ≈ s_i − s_j. This part can be calibrated to a scalar ranking.

  • Curl (cyclic): rock-paper-scissors structure that cannot be represented by any linear ranking.

Calibrate win rates from the gradient component. Report the curl fraction as the share of variance your calibration ignores.

Use when: Cycles persist even after conditioning on task category — the non-transitivity is irreducible.
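The gradient/curl split above can be sketched in plain numpy (illustrative only; the Y values are made up, and winference's own API is shown in the Quickstart). On a complete tournament the least-squares potential is just the row mean of the antisymmetric log-odds matrix, and the residual is the cyclic component:

```python
import numpy as np

# Antisymmetric log-odds matrix: Y[i, j] = logit P(model i beats model j).
Y = np.array([
    [ 0.0,  0.8, -0.5],
    [-0.8,  0.0,  0.8],
    [ 0.5, -0.8,  0.0],
])

# Gradient part: least-squares potentials s with Y_ij ≈ s_i - s_j.
# On a complete graph this reduces to the row mean of Y.
s = Y.mean(axis=1)
gradient = s[:, None] - s[None, :]
curl = Y - gradient  # residual: the cyclic component

# The two parts are orthogonal, so squared mass decomposes exactly.
total = np.sum(Y**2)
print(f"transitive variance: {np.sum(gradient**2) / total:.0%}")  # → 4%
print(f"cyclic variance:     {np.sum(curl**2) / total:.0%}")      # → 96%
```

A curl fraction this large means a scalar ranking explains almost none of the observed preferences, so the cyclic share must be reported alongside any calibrated win rate.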

B) Heterogeneous Groups

Test whether model strengths differ across prompt categories (math, creative, coding, …) using a likelihood-ratio test. If the test rejects, fit Bradley-Terry per category. Win rates for any target distribution are then:

\[P(A > B | \pi^*) = \sum_k \pi^*_k \cdot \sigma(\theta_{A,k} - \theta_{B,k})\]

This gives you composable calibration: swap in any target distribution without refitting.

Use when: Non-transitivity dissolves when you condition on prompt category.
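The mixture formula above can be sketched in plain Python (the theta values here are hypothetical, not winference output): per-category Bradley-Terry strengths are reweighted by a target prompt distribution π*, so changing the deployment mix never requires refitting.

```python
import math

sigmoid = lambda x: 1 / (1 + math.exp(-x))

# theta[model][category]: per-category Bradley-Terry strengths (made up).
theta = {
    "A": {"math": 1.2, "creative": -0.3, "coding": 0.5},
    "B": {"math": 0.1, "creative":  0.9, "coding": 0.4},
}

def win_probability(a, b, target_distribution):
    """P(a beats b) under a target category mix pi*:
    sum_k pi*_k * sigmoid(theta[a][k] - theta[b][k])."""
    return sum(
        pi_k * sigmoid(theta[a][k] - theta[b][k])
        for k, pi_k in target_distribution.items()
    )

# Swap in any deployment mix without refitting:
p = win_probability("A", "B", {"math": 0.7, "creative": 0.1, "coding": 0.2})
print(f"{p:.3f}")  # → 0.653
```

Note the order of operations: the sigmoid is applied per category, then averaged — averaging the thetas first would collapse back to a single (miscalibrated) Bradley-Terry model.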

Quickstart

from winference import (
    TournamentGraph, BradleyTerry, HodgeDecomposition,
    GroupTest, GroupCalibrator, expected_calibration_error,
)
from winference.simulate import simulate_llm_arena

# 1. Simulate (or load) arena data
data = simulate_llm_arena()
comparisons = data["comparisons"]
categories  = data["categories"]
models      = data["models"]

# 2. Graph triage: is non-transitivity a problem?
tg = TournamentGraph(models)
for a, b, w in comparisons:
    tg.add_result(a, b, w)

print(tg.summary())
# → {'nontransitivity_index': 0.83, 'cyclic_triples': 7, ...}

# 3a. Hodge decomposition
hd = HodgeDecomposition(models)
result = hd.fit(tg.win_rate_matrix(), weights=tg.counts)
print(f"Transitive: {result.transitive_variance:.0%}")
print(f"Cyclic:     {result.cyclic_variance:.0%}")

# 3b. Group heterogeneity test
groups = sorted(set(categories))
gt = GroupTest(models, groups)
gt.fit(comparisons, categories)
print(gt.test_result())
# → {'statistic': 342.1, 'p_value': 1.2e-63, 'reject_at_05': True}

# Composable win rates
gc = GroupCalibrator(gt)
p_math_heavy = gc.win_probability(
    "ZetaMath", "DeltaWrite",
    target_distribution={"reasoning": 0.7, "creative_writing": 0.15, "coding": 0.15},
)
