Group Testing

Heterogeneous group testing and per-group calibration.

Tests whether model strengths are constant across prompt categories (homogeneous BT) or differ by category (heterogeneous BT). If heterogeneity is warranted, fits per-category BT models and provides composable win rate predictions for any target distribution over categories.

The formal test is a likelihood-ratio test:

H0: theta_{i,k} = theta_i for all i, k (homogeneous)
H1: theta_{i,k} free (heterogeneous)

Lambda = -2(l0 - l1) is asymptotically chi2 with (K-1)(N-1) degrees of freedom, where K is the number of groups and N the number of models.
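As a quick check of the arithmetic, the p-value in the example below follows from the chi-squared survival function. For df = 2 this has the closed form exp(-x/2), so no stats library is needed (values are the illustrative ones from the example, not library output):

```python
import math

# Worked example: Lambda = 14.2 with df = (K-1)(N-1) = (2-1)(3-1) = 2.
# For 2 degrees of freedom, the chi-squared survival function is exp(-x/2).
lam = 14.2
p_value = math.exp(-lam / 2)
print(round(p_value, 4))  # 0.0008
```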

class winference.groups.GroupTest(models, groups)[source]

Bases: object

Likelihood-ratio test for heterogeneity across prompt groups.

Parameters:
  • models (list[str]) – Model identifiers.

  • groups (list[str]) – Unique group/category labels.

Examples

>>> gt = GroupTest(models=["A","B","C"], groups=["math","creative"])
>>> gt.fit(comparisons, group_labels)
>>> print(gt.test_result())
{'statistic': 14.2, 'df': 2, 'p_value': 0.0008}
fit(comparisons, group_labels, reg=0.0001)[source]

Fit null (pooled) and alternative (per-group) BT models.

Parameters:
  • comparisons (list[tuple[str, str, bool]]) – List of (model_a, model_b, a_wins) tuples.

  • group_labels (list[str]) – Category label for each comparison, same length as comparisons.

  • reg (float, default: 0.0001) – Regularisation for BT fitting.

Return type:

Self

Returns:

Self for method chaining.

Raises:

ValueError – If comparisons and group_labels have different lengths.
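The expected input shape can be sketched with plain Python data (model and group names here are illustrative, not from the library):

```python
# Each comparison is a (model_a, model_b, a_wins) tuple; group_labels
# aligns index-for-index with comparisons, as fit() requires.
comparisons = [
    ("A", "B", True),    # A beat B on a math prompt
    ("B", "C", False),   # C beat B on a math prompt
    ("A", "C", True),    # A beat C on a creative prompt
]
group_labels = ["math", "math", "creative"]

# fit() raises ValueError when the lengths differ, so validate up front.
assert len(comparisons) == len(group_labels)
```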

test_result()[source]

Likelihood-ratio test for group heterogeneity.

Returns:

  • statistic – LRT statistic Lambda.

  • df – degrees of freedom.

  • p_value – p-value under the chi2 null.

  • reject_at_05 – whether H0 is rejected at the 0.05 level (bool).

Return type:

dict[str, float | int | bool]

Raises:

RuntimeError – If fit() has not been called.

per_group_strengths()[source]

Return {group: {model: theta}} for each fitted group.

Return type:

dict[str, dict[str, float]]

class winference.groups.GroupCalibrator(group_test)[source]

Bases: object

Composable win rate calibration using per-group BT models.

After fitting per-group BT models, compute win rates for any target distribution over groups:

P(i > j | pi*) = sum_k pi*_k * sigmoid(theta_{i,k} - theta_{j,k})

This is the key advantage: calibration that transfers under distribution shift.
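The composition formula is simple enough to verify by hand. A minimal sketch, assuming illustrative per-group strength differences (model A stronger on math, weaker on creative prompts) and an even target distribution:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative gaps theta_{a,k} - theta_{b,k} per group k.
theta_diff = {"math": 1.0, "creative": -0.5}
target = {"math": 0.5, "creative": 0.5}  # pi*

# P(a > b | pi*) = sum_k pi*_k * sigmoid(theta_{a,k} - theta_{b,k})
p = sum(w * sigmoid(theta_diff[k]) for k, w in target.items())
print(round(p, 3))  # 0.554
```

Shifting the target weights toward "math" pushes the composite probability toward sigmoid(1.0) ≈ 0.73; shifting toward "creative" pulls it toward sigmoid(-0.5) ≈ 0.38. That is the transfer-under-shift property in action.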

Parameters:

group_test (GroupTest) – A fitted GroupTest object.

Raises:

RuntimeError – If the GroupTest has not been fitted.

win_probability(model_a, model_b, target_distribution=None)[source]

Composite win probability under a target group distribution.

Parameters:
  • model_a (str) – First model name.

  • model_b (str) – Second model name.

  • target_distribution (dict[str, float] | None, default: None) – Dict mapping group to weight. Weights are normalised internally. If None, uses the empirical distribution from the training data.

Return type:

float

Returns:

Composite win probability P(model_a beats model_b).

win_probability_matrix(target_distribution=None)[source]

NxN matrix of composite win probabilities; entry (i, j) is the composite probability that model i beats model j under the target distribution.

Parameters:

target_distribution (dict[str, float] | None, default: None)

Return type:

ndarray[tuple[Any, ...], dtype[double]]
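How such a matrix can be assembled from per-group strengths is sketched below with illustrative names and values (not library internals). Because sigmoid(-x) = 1 - sigmoid(x), the matrix satisfies M + M.T = 1 with 0.5 on the diagonal:

```python
import numpy as np

models = ["A", "B", "C"]
# Illustrative per-group strengths theta_{i,k}.
theta = {
    "math":     {"A": 1.2, "B": 0.3, "C": -0.4},
    "creative": {"A": 0.1, "B": 0.8, "C": 0.2},
}
pi = {"math": 0.6, "creative": 0.4}  # target distribution pi*

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# M[i, j] = sum_k pi_k * sigmoid(theta_{i,k} - theta_{j,k})
M = np.array([
    [sum(pi[k] * sigmoid(theta[k][a] - theta[k][b]) for k in pi)
     for b in models]
    for a in models
])

# Every matchup and its mirror image sum to 1; the diagonal is 0.5.
assert np.allclose(M + M.T, 1.0)
```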

sensitivity_analysis(model_a, model_b, n_draws=1000, concentration=1.0)[source]

How much does P(a > b) vary as the target distribution changes?

Draws random target distributions from Dirichlet(concentration) and reports the range and std of the composite win probability.

Parameters:
  • model_a (str)

  • model_b (str)

  • n_draws (int, default: 1000)

  • concentration (float, default: 1.0)

Return type:

dict[str, float]
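The mechanics of the sensitivity analysis can be sketched directly with NumPy (per-group strength gaps here are illustrative assumptions). Since each target distribution is a point on the simplex, every composite probability is bounded by the per-group extremes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative per-group gaps theta_{a,k} - theta_{b,k} (math, creative).
gaps = np.array([1.0, -0.5])
per_group_p = sigmoid(gaps)  # P(a > b) within each group

# Draw random target distributions from Dirichlet(concentration) and
# compose the win probability under each draw.
n_draws, concentration = 1000, 1.0
draws = rng.dirichlet(alpha=np.full(len(gaps), concentration), size=n_draws)
composite = draws @ per_group_p

print(composite.min(), composite.max(), composite.std())
```

A small spread (std) means the verdict between the two models is robust to how prompts are mixed; a wide range means the winner depends on the target distribution.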