Calibration Metrics

Calibration metrics for pairwise win rate predictions.

When your model says “A beats B with probability 0.65”, does A actually win 65% of the time? These tools check.

winference.calibration.expected_calibration_error(predicted, observed, n_bins=10)[source]

Expected Calibration Error (ECE).

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

  • n_bins (int, default: 10) – Number of bins.

Return type:

float

Returns:

Weighted average |predicted - observed| across bins.
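For intuition, here is a minimal NumPy sketch of what ECE computes (an illustration, not the library's implementation):

```python
import numpy as np

def ece_sketch(predicted, observed, n_bins=10):
    """Sketch of ECE: bin predictions, take |mean predicted - observed
    win rate| per bin, and average the gaps weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each prediction to a bin; clip so p == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(predicted, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(predicted[mask].mean() - observed[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

pred = np.array([0.6, 0.6, 0.9, 0.9])
obs = np.array([1.0, 0.0, 1.0, 1.0])
# Bin around 0.6: gap |0.6 - 0.5| = 0.1; bin around 0.9: gap |0.9 - 1.0| = 0.1
print(round(ece_sketch(pred, obs), 6))  # 0.1
```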

winference.calibration.brier_score(predicted, observed)[source]

Brier score: mean squared error of probability predictions.

Lower is better. Perfect calibration → 0. Random guessing → 0.25.

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

Return type:

float
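As a quick illustration, the score is a one-liner in NumPy (a sketch, not the library source):

```python
import numpy as np

# Brier score is just mean squared error on probabilities.
def brier_sketch(predicted, observed):
    return float(np.mean((predicted - observed) ** 2))

pred = np.array([0.9, 0.1, 0.8])
obs = np.array([1.0, 0.0, 1.0])
print(round(brier_sketch(pred, obs), 6))  # (0.01 + 0.01 + 0.04) / 3 = 0.02

# A constant 0.5 prediction on balanced outcomes scores 0.25.
print(brier_sketch(np.array([0.5, 0.5]), np.array([1.0, 0.0])))  # 0.25
```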

winference.calibration.log_loss(predicted, observed)[source]

Binary cross-entropy loss. Lower is better.

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

Return type:

float
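A minimal sketch of binary cross-entropy (illustrative only; whether the library clips probabilities is not documented here, so the `eps` guard below is an assumption):

```python
import numpy as np

def log_loss_sketch(predicted, observed, eps=1e-15):
    """Sketch of binary cross-entropy; clipping guards against log(0)."""
    p = np.clip(predicted, eps, 1 - eps)
    return float(-np.mean(observed * np.log(p) + (1 - observed) * np.log(1 - p)))

pred = np.array([0.9, 0.1])
obs = np.array([1.0, 0.0])
# Both predictions are confident and correct: loss = -ln(0.9)
print(round(log_loss_sketch(pred, obs), 4))  # 0.1054
```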

winference.calibration.reliability_diagram(predicted, observed, n_bins=10, ax=None, label='', color=None)[source]

Reliability diagram data and optional plot.

Parameters:
  • predicted (ndarray[tuple[Any, ...], dtype[double]]) – Array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

  • n_bins (int, default: 10) – Number of bins.

  • ax (Any, default: None) – Matplotlib Axes. If provided, plot on it.

  • label (str, default: '') – Label for the legend.

  • color (str | None, default: None) – Color for the plot line.

Return type:

dict[str, Any]

Returns:

Dict with ‘bin_midpoints’, ‘bin_accuracy’, ‘bin_counts’, ‘ece’.

winference.calibration.compare_calibration(methods, observed, n_bins=10)[source]

Compare calibration across multiple prediction methods.

Parameters:
  • methods – Mapping from method name to array of predicted probabilities.

  • observed (ndarray[tuple[Any, ...], dtype[double]]) – Array of binary outcomes.

  • n_bins (int, default: 10) – Number of bins.

Returns:

Dict mapping method name to {ece, brier, logloss} scores.

Return type:

dict[str, dict[str, float]]
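Its behavior can be approximated by looping the scalar metrics over each method (a sketch; the library presumably reuses its own `expected_calibration_error`, `brier_score`, and `log_loss`, and the method names below are made up for the example):

```python
import numpy as np

def compare_sketch(methods, observed, n_bins=10):
    """Sketch: compute {ece, brier, logloss} for each named method."""
    def ece(p):
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        return sum(
            (idx == b).mean() * abs(p[idx == b].mean() - observed[idx == b].mean())
            for b in range(n_bins) if (idx == b).any()
        )
    out = {}
    for name, p in methods.items():
        q = np.clip(p, 1e-15, 1 - 1e-15)  # guard log(0)
        out[name] = {
            "ece": float(ece(p)),
            "brier": float(np.mean((p - observed) ** 2)),
            "logloss": float(-np.mean(observed * np.log(q)
                                      + (1 - observed) * np.log(1 - q))),
        }
    return out

observed = np.array([1.0, 0.0, 1.0, 1.0])
scores = compare_sketch({
    "elo": np.array([0.7, 0.4, 0.8, 0.6]),
    "coin_flip": np.full(4, 0.5),
}, observed)
for name, row in scores.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```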