Group Testing
Heterogeneous group testing and per-group calibration.
Tests whether model strengths are constant across prompt categories (homogeneous Bradley-Terry, BT) or differ by category (heterogeneous BT). If the test supports heterogeneity, fits per-category BT models and provides composable win rate predictions for any target distribution over categories.
- The formal test is a likelihood-ratio test:
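H0: theta_{i,k} = theta_i for all i, k (homogeneous)
H1: theta_{i,k} free (heterogeneous)
Lambda = -2(l0 - l1) ~ chi2 with (K-1)(N-1) degrees of freedom,
where l0 and l1 are the maximized log-likelihoods under H0 and H1, K is the number of groups, and N is the number of models. The chi2 distribution is the asymptotic distribution of Lambda under H0.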
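The arithmetic of the test can be sketched in plain Python. This is an illustration only, not winference's implementation: the helper names `chi2_sf` and `likelihood_ratio_test` are hypothetical, and the two log-likelihood values are made up so that the statistic matches the one shown in the GroupTest example (14.2 with 2 degrees of freedom).

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, z):
    # P(a, z) = z^a e^(-z) / Gamma(a) * sum_n z^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    log_prefactor = a * math.log(z) - z - math.lgamma(a)
    return max(0.0, 1.0 - total * math.exp(log_prefactor))


def likelihood_ratio_test(loglik_null, loglik_alt, n_models, n_groups):
    """Lambda = -2(l0 - l1) with (K-1)(N-1) degrees of freedom."""
    statistic = -2.0 * (loglik_null - loglik_alt)
    df = (n_groups - 1) * (n_models - 1)
    return {"statistic": statistic, "df": df, "p_value": chi2_sf(statistic, df)}


# Illustrative (made-up) log-likelihoods for 3 models and 2 groups.
result = likelihood_ratio_test(
    loglik_null=-107.1, loglik_alt=-100.0, n_models=3, n_groups=2
)
print(result)
```

With these inputs the p-value rounds to 0.0008. The series used in `chi2_sf` loses relative precision for very small tail probabilities, so a statistics library is a better choice in practice.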
- class winference.groups.GroupTest(models, groups)
Bases: object
Likelihood-ratio test for heterogeneity across prompt groups.
- Parameters:
Examples
>>> gt = GroupTest(models=["A","B","C"], groups=["math","creative"])
>>> gt.fit(comparisons, group_labels)
>>> print(gt.test_result())
{'statistic': 14.2, 'df': 2, 'p_value': 0.0008}
- fit(comparisons, group_labels, reg=0.0001)
Fit null (pooled) and alternative (per-group) BT models.
- Parameters:
- Return type:
Self
- Returns:
Self for method chaining.
- Raises:
ValueError – If comparisons and group_labels have different lengths.
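To make the pooled versus per-group fit concrete, here is a minimal sketch under stated assumptions; it is not the library's code. It assumes each comparison is a `(winner, loser)` tuple, that `reg` is an L2 penalty on the strengths, and that plain gradient ascent is good enough. The function `fit_bt` and its `lr` and `steps` parameters are hypothetical.

```python
import math


def fit_bt(comparisons, models, reg=1e-4, lr=0.5, steps=2000):
    """Fit Bradley-Terry strengths by gradient ascent; return (theta, loglik)."""
    theta = {m: 0.0 for m in models}
    n = max(len(comparisons), 1)
    for _ in range(steps):
        # Gradient of the L2 penalty -reg/2 * theta^2 (assumed form of `reg`).
        grad = {m: -reg * theta[m] for m in models}
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for m in models:
            theta[m] += lr * grad[m] / n
    # Unpenalized log-likelihood: sum of log sigmoid(theta_w - theta_l).
    loglik = sum(
        -math.log1p(math.exp(theta[loser] - theta[winner]))
        for winner, loser in comparisons
    )
    return theta, loglik


models = ["A", "B"]
groups = ["math", "creative"]
# Toy data: A wins 8 of 10 on math, B wins 8 of 10 on creative.
comparisons = [("A", "B")] * 8 + [("B", "A")] * 2 + [("B", "A")] * 8 + [("A", "B")] * 2
group_labels = ["math"] * 10 + ["creative"] * 10
if len(comparisons) != len(group_labels):
    raise ValueError("comparisons and group_labels must have the same length")

# Null model: one set of strengths for all comparisons.
_, loglik_null = fit_bt(comparisons, models)

# Alternative model: separate strengths per group, log-likelihoods summed.
loglik_alt = 0.0
for g in groups:
    subset = [c for c, label in zip(comparisons, group_labels) if label == g]
    _, ll = fit_bt(subset, models)
    loglik_alt += ll

statistic = -2.0 * (loglik_null - loglik_alt)
print(loglik_null, loglik_alt, statistic)
```

In this toy data the pooled record is an even 10 to 10, so the null fit sees no difference between the models, while the per-group fits recover opposite preferences. That gap is what the likelihood-ratio statistic measures.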
- class winference.groups.GroupCalibrator(group_test)
Bases: object
Composable win rate calibration using per-group BT models.
After fitting per-group BT models, compute win rates for any target distribution over groups:
P(i > j | pi*) = sum_k pi*_k * sigmoid(theta_{i,k} - theta_{j,k})
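This is the key advantage: because the per-group strengths are fitted once and only the weights pi*_k change, the calibration transfers under distribution shift without refitting.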
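The composite formula is a weighted average of per-group sigmoids, which the following sketch evaluates directly. It does not call winference: the function `composite_win_probability`, the nested `theta` dictionary layout, and the strength values are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def composite_win_probability(theta, model_a, model_b, target_distribution):
    """P(a > b | pi*) = sum_k pi*_k * sigmoid(theta[k][a] - theta[k][b])."""
    total_weight = sum(target_distribution.values())
    if not math.isclose(total_weight, 1.0):
        raise ValueError("target_distribution must sum to 1")
    return sum(
        weight * sigmoid(theta[group][model_a] - theta[group][model_b])
        for group, weight in target_distribution.items()
    )


# Hypothetical per-group strengths: A is stronger on math, B on creative.
theta = {
    "math": {"A": 1.0, "B": 0.0},
    "creative": {"A": -0.5, "B": 0.5},
}

p_math_heavy = composite_win_probability(
    theta, "A", "B", {"math": 0.8, "creative": 0.2}
)
p_creative_heavy = composite_win_probability(
    theta, "A", "B", {"math": 0.2, "creative": 0.8}
)
print(p_math_heavy, p_creative_heavy)
```

The same strengths favor A under the math-heavy mix and B under the creative-heavy mix, which a single pooled strength per model could not express.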
- Parameters:
group_test (GroupTest) – A fitted GroupTest object.
- Raises:
RuntimeError – If the GroupTest has not been fitted.
- win_probability(model_a, model_b, target_distribution=None)
Composite win probability under a target group distribution.
- Parameters:
- Return type:
- Returns:
Composite win probability P(model_a beats model_b).