Calibration Metrics

Calibration metrics for pairwise win rate predictions.

When your model says “A beats B with probability 0.65”, does A actually win 65% of the time? These tools check.

winference.calibration.expected_calibration_error(predicted, observed, n_bins=10)[source]

Expected Calibration Error (ECE).

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

  • n_bins (int, default: 10) – Number of bins.

Return type:

float

Returns:

Weighted average |predicted - observed| across bins.
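For intuition, here is a minimal NumPy sketch of what ECE computes (an illustration, not the library's implementation):

```python
import numpy as np

def ece_sketch(predicted, observed, n_bins=10):
    """Sketch of ECE: bin predictions, take |mean predicted - observed
    win rate| per bin, and average the gaps weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each prediction to a bin; clip so p == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(predicted, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(predicted[mask].mean() - observed[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

pred = np.array([0.6, 0.6, 0.9, 0.9])
obs = np.array([1.0, 0.0, 1.0, 1.0])
# Bin around 0.6: gap |0.6 - 0.5| = 0.1; bin around 0.9: gap |0.9 - 1.0| = 0.1
print(round(ece_sketch(pred, obs), 6))  # 0.1
```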

winference.calibration.brier_score(predicted, observed)[source]

Brier score: mean squared error of probability predictions.

Lower is better. Perfect calibration → 0. Random guessing → 0.25.

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

Return type:

float
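As a quick illustration, the score is a one-liner in NumPy (a sketch, not the library source):

```python
import numpy as np

# Brier score is just mean squared error on probabilities.
def brier_sketch(predicted, observed):
    return float(np.mean((predicted - observed) ** 2))

pred = np.array([0.9, 0.1, 0.8])
obs = np.array([1.0, 0.0, 1.0])
print(round(brier_sketch(pred, obs), 6))  # (0.01 + 0.01 + 0.04) / 3 = 0.02

# A constant 0.5 prediction on balanced outcomes scores 0.25.
print(brier_sketch(np.array([0.5, 0.5]), np.array([1.0, 0.0])))  # 0.25
```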

winference.calibration.log_loss(predicted, observed)[source]

Binary cross-entropy loss. Lower is better.

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

Return type:

float
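A minimal sketch of binary cross-entropy (illustrative only; whether the library clips probabilities is not documented here, so the `eps` guard below is an assumption):

```python
import numpy as np

def log_loss_sketch(predicted, observed, eps=1e-15):
    """Sketch of binary cross-entropy; clipping guards against log(0)."""
    p = np.clip(predicted, eps, 1 - eps)
    return float(-np.mean(observed * np.log(p) + (1 - observed) * np.log(1 - p)))

pred = np.array([0.9, 0.1])
obs = np.array([1.0, 0.0])
# Both predictions are confident and correct: loss = -ln(0.9)
print(round(log_loss_sketch(pred, obs), 4))  # 0.1054
```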

winference.calibration.reliability_diagram(predicted, observed, n_bins=10, ax=None, label='', color=None)[source]

Reliability diagram data and optional plot.

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

  • n_bins (int, default: 10) – Number of bins.

  • ax (Any, default: None) – Matplotlib Axes. If provided, plot on it.

  • label (str, default: '') – Label for the legend.

  • color (str | None, default: None) – Color for the plot line.

Return type:

dict[str, Any]

Returns:

Dict with ‘bin_midpoints’, ‘bin_accuracy’, ‘bin_counts’, ‘ece’.

winference.calibration.compare_calibration(methods, observed, n_bins=10)[source]

Compare calibration across multiple prediction methods.

Parameters:
  • methods – Mapping from method name to array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

  • n_bins (int, default: 10) – Number of bins.

Returns:

Dict mapping method name to {ece, brier, logloss} scores.

Return type:

dict[str, dict[str, float]]
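Its behavior can be approximated by looping the scalar metrics over each method (a sketch; the library presumably reuses its own `expected_calibration_error`, `brier_score`, and `log_loss`, and the method names below are made up for the example):

```python
import numpy as np

def compare_sketch(methods, observed, n_bins=10):
    """Sketch: compute {ece, brier, logloss} for each named method."""
    def ece(p):
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        return sum(
            (idx == b).mean() * abs(p[idx == b].mean() - observed[idx == b].mean())
            for b in range(n_bins) if (idx == b).any()
        )
    out = {}
    for name, p in methods.items():
        q = np.clip(p, 1e-15, 1 - 1e-15)  # guard log(0)
        out[name] = {
            "ece": float(ece(p)),
            "brier": float(np.mean((p - observed) ** 2)),
            "logloss": float(-np.mean(observed * np.log(q)
                                      + (1 - observed) * np.log(1 - q))),
        }
    return out

observed = np.array([1.0, 0.0, 1.0, 1.0])
scores = compare_sketch({
    "elo": np.array([0.7, 0.4, 0.8, 0.6]),
    "coin_flip": np.full(4, 0.5),
}, observed)
for name, row in scores.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```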