winference
==========

**Win rate calibration under non-transitivity.**

When you run an LLM arena and report "Model A beats Model B 62% of the time," is that number *calibrated*? And does it still hold when your users ask different questions than your evaluation set?

If model strengths vary across task types, aggregate win rates can exhibit **non-transitive** preferences: A beats B, B beats C, but C beats A. Standard Bradley-Terry / Elo assumes this doesn't happen, and when it does, your calibration breaks, especially under distribution shift.

``winference`` provides two approaches to calibrating win rates in the presence of non-transitivity, plus diagnostics to decide which one you need.

Installation
------------

.. code-block:: bash

   pip install winference

The Two Approaches
------------------

A) Hodge Decomposition
~~~~~~~~~~~~~~~~~~~~~~

Decomposes the pairwise comparison matrix into:

- **Gradient** (transitive): a potential ``s_i`` per model such that the log-odds ≈ ``s_i − s_j``. This part *can* be calibrated to a scalar ranking.
- **Curl** (cyclic): rock-paper-scissors structure that *cannot* be represented by any linear ranking.

Calibrate win rates from the gradient component. Report the curl fraction as the share of variance your calibration ignores.

**Use when:** Cycles persist even after conditioning on task category: the non-transitivity is irreducible.

B) Heterogeneous Groups
~~~~~~~~~~~~~~~~~~~~~~~

Test whether model strengths differ across prompt categories (math, creative, coding, ...) using a **likelihood-ratio test**. If so, fit Bradley-Terry per category. Win rates for any target distribution are then:

.. math::

   P(A > B \mid \pi^*) = \sum_k \pi^*_k \cdot \sigma(\theta_{A,k} - \theta_{B,k})

This gives you **composable** calibration: swap in any target distribution without refitting.

**Use when:** Non-transitivity dissolves when you condition on prompt category.

Quickstart
----------

.. code-block:: python

   from winference import (
       TournamentGraph,
       BradleyTerry,
       HodgeDecomposition,
       GroupTest,
       GroupCalibrator,
       expected_calibration_error,
   )
   from winference.simulate import simulate_llm_arena

   # 1. Simulate (or load) arena data
   data = simulate_llm_arena()
   comparisons = data["comparisons"]
   categories = data["categories"]
   models = data["models"]

   # 2. Graph triage: is non-transitivity a problem?
   tg = TournamentGraph(models)
   for a, b, w in comparisons:
       tg.add_result(a, b, w)
   print(tg.summary())
   # → {'nontransitivity_index': 0.83, 'cyclic_triples': 7, ...}

   # 3a. Hodge decomposition
   hd = HodgeDecomposition(models)
   result = hd.fit(tg.win_rate_matrix(), weights=tg.counts)
   print(f"Transitive: {result.transitive_variance:.0%}")
   print(f"Cyclic: {result.cyclic_variance:.0%}")

   # 3b. Group heterogeneity test
   groups = sorted(set(categories))
   gt = GroupTest(models, groups)
   gt.fit(comparisons, categories)
   print(gt.test_result())
   # → {'statistic': 342.1, 'p_value': 1.2e-63, 'reject_at_05': True}

   # Composable win rates
   gc = GroupCalibrator(gt)
   p_math_heavy = gc.win_probability(
       "ZetaMath", "DeltaWrite",
       target_distribution={"reasoning": 0.7, "creative_writing": 0.15, "coding": 0.15},
   )

.. toctree::
   :maxdepth: 2
   :caption: API Reference

   api/tournament
   api/bradley_terry
   api/hodge
   api/groups
   api/calibration
   api/simulate

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
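
As a worked illustration of the composable win-rate formula from the Heterogeneous Groups approach, the sketch below evaluates the mixture of per-category Bradley-Terry probabilities by hand, using only the standard library. It is not ``winference``'s implementation: the helper ``mixture_win_probability`` and the per-category strengths are hypothetical, chosen only to show how the same fitted strengths yield different win rates under different target distributions.

.. code-block:: python

   import math


   def sigmoid(x):
       """Logistic function: maps a strength difference to a win probability."""
       return 1.0 / (1.0 + math.exp(-x))


   def mixture_win_probability(theta_a, theta_b, target_distribution):
       """Sum over categories k of pi*_k * sigmoid(theta_a[k] - theta_b[k])."""
       total = sum(target_distribution.values())
       if not math.isclose(total, 1.0, abs_tol=1e-9):
           raise ValueError("target_distribution must sum to 1")
       return sum(
           weight * sigmoid(theta_a[k] - theta_b[k])
           for k, weight in target_distribution.items()
       )


   # Hypothetical per-category strengths (illustrative, not fitted values).
   theta_a = {"reasoning": 1.0, "creative_writing": -1.0, "coding": 0.0}
   theta_b = {"reasoning": 0.0, "creative_writing": 0.0, "coding": 0.0}

   # Same strengths, two different target distributions, no refitting.
   p_reasoning_heavy = mixture_win_probability(
       theta_a, theta_b,
       {"reasoning": 0.7, "creative_writing": 0.15, "coding": 0.15},
   )
   p_creative_heavy = mixture_win_probability(
       theta_a, theta_b,
       {"reasoning": 0.15, "creative_writing": 0.7, "coding": 0.15},
   )
   print(f"reasoning-heavy: {p_reasoning_heavy:.3f}")
   print(f"creative-heavy:  {p_creative_heavy:.3f}")

With these made-up strengths, model A is favored on a reasoning-heavy mix and disfavored on a creative-heavy one, which is the kind of reversal a single aggregate win rate hides.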