Calibration Metrics¶
This module provides various metrics for evaluating calibration quality.
Calibration Error Metrics¶
Mean Calibration Error¶
- calibre.mean_calibration_error(y_true: ndarray, y_pred: ndarray)[source]¶
Calculate the mean calibration error.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
- Returns:
mce (float) – Mean calibration error.
- Raises:
ValueError – If arrays have different shapes.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> mean_calibration_error(y_true, y_pred)
0.26
Binned Calibration Error¶
- calibre.binned_calibration_error(y_true: ndarray, y_pred: ndarray, x: ndarray | None = None, n_bins: int = 10, strategy: str = 'uniform', return_details: bool = False)[source]¶
Calculate binned calibration error.
- Parameters:
y_true – Ground truth values.
y_pred – Predicted values.
x – Input features for binning. If None, y_pred is used for binning.
n_bins – Number of bins.
strategy – Strategy for binning (contrasted in the sketch after the example below):
- ‘uniform’: Bins with uniform widths.
- ‘quantile’: Bins with approximately equal counts.
return_details – If True, return bin details (bin centers, counts, mean predictions, mean truths).
- Returns:
bce (float or dict) – Binned calibration error. If return_details is True, returns a dictionary with BCE and bin details.
- Raises:
ValueError – If arrays have different lengths or unknown binning strategy.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> binned_calibration_error(y_true, y_pred, n_bins=2)
0.05
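The practical difference between the two strategy values is where the bin edges fall. The snippet below is an illustrative numpy sketch, not the library's internal code; it only contrasts uniform-width edges with quantile edges for the same set of predictions.

import numpy as np

y_pred = np.array([0.05, 0.1, 0.15, 0.2, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99])
n_bins = 4

# 'uniform': edges equally spaced over [0, 1]; sparse regions end up with few samples
uniform_edges = np.linspace(0.0, 1.0, n_bins + 1)

# 'quantile': edges at empirical quantiles; each bin holds roughly equal counts
quantile_edges = np.quantile(y_pred, np.linspace(0.0, 1.0, n_bins + 1))

print(uniform_edges)    # 0.0, 0.25, 0.5, 0.75, 1.0
print(quantile_edges)   # edges crowd where the predictions are dense

Quantile binning is usually preferable when predictions cluster in a narrow range, since uniform bins can otherwise be empty or nearly empty.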
Expected Calibration Error¶
- calibre.expected_calibration_error(y_true: ndarray, y_pred: ndarray, n_bins: int = 10)[source]¶
Calculate Expected Calibration Error (ECE).
The ECE is a weighted average of the absolute calibration error across bins, where each bin’s weight is proportional to the number of samples in the bin. A hand-computed sketch of this definition follows the example below.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
n_bins – Number of bins for discretizing predictions.
- Returns:
ece (float) – Expected Calibration Error.
- Raises:
ValueError – If arrays have different lengths.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> expected_calibration_error(y_true, y_pred, n_bins=2)
0.12
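For intuition, the weighted average can be written out by hand. The helper below is only an illustrative, textbook-style sketch assuming uniform-width bins over [0, 1]; the library's bin-edge handling may differ, so its reported values can deviate from this sketch.

import numpy as np

def ece_sketch(y_true, y_pred, n_bins=10):
    # Assign each prediction to a uniform-width bin over [0, 1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_pred, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # |mean predicted probability - observed frequency|, weighted by bin size
        gap = abs(y_pred[mask].mean() - y_true[mask].mean())
        ece += (mask.sum() / len(y_pred)) * gap
    return ece

Use expected_calibration_error for actual evaluation; ece_sketch only illustrates the definition.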
Maximum Calibration Error¶
- calibre.maximum_calibration_error(y_true: ndarray, y_pred: ndarray, n_bins: int = 10)[source]¶
Calculate Maximum Calibration Error (MCE).
The MCE is the maximum absolute difference between the average predicted probability and the fraction of positive samples in any bin. A sketch of this computation follows the example below.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
n_bins – Number of bins for discretizing predictions.
- Returns:
mce (float) – Maximum Calibration Error.
- Raises:
ValueError – If arrays have different lengths.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> maximum_calibration_error(y_true, y_pred, n_bins=2)
0.2
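The only difference from ECE is the reduction over bins: MCE takes the worst per-bin gap instead of the sample-weighted average. Again an illustrative, textbook-style sketch rather than the library's implementation:

import numpy as np

def mce_sketch(y_true, y_pred, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_pred, edges[1:-1])
    gaps = [abs(y_pred[bin_ids == b].mean() - y_true[bin_ids == b].mean())
            for b in range(n_bins) if (bin_ids == b).any()]
    # Worst-case bin gap, rather than the weighted average used by ECE
    return max(gaps)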
Scoring Metrics¶
Brier Score¶
- calibre.brier_score(y_true: ndarray, y_pred: ndarray)[source]¶
Calculate the Brier score.
The Brier score is a proper scoring rule that measures the mean squared difference between predicted probabilities and the actual outcomes. A minimal sketch of this definition follows the example below.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
- Returns:
score (float) – Brier score (lower is better).
- Raises:
ValueError – If arrays have different lengths.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> brier_score(y_true, y_pred)
0.142
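Because the Brier score is just the mean squared difference described above, a one-line numpy sketch captures the textbook definition (the library may differ in input validation or edge handling):

import numpy as np

def brier_sketch(y_true, y_pred):
    # Mean squared difference between predicted probabilities and binary outcomes
    return np.mean((y_pred - y_true) ** 2)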
Calibration Curve¶
- calibre.calibration_curve(y_true: ndarray, y_pred: ndarray, n_bins: int = 10, strategy: str = 'uniform')[source]¶
Compute the calibration curve for binary classification.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
n_bins – Number of bins for discretizing predictions.
strategy – Strategy for binning:
- ‘uniform’: Bins with uniform widths.
- ‘quantile’: Bins with approximately equal counts.
- Returns:
prob_true (ndarray of shape (n_bins,)) – The true fraction of positive samples in each bin.
prob_pred (ndarray of shape (n_bins,)) – The mean predicted probability in each bin.
counts (ndarray of shape (n_bins,)) – The number of samples in each bin.
- Raises:
ValueError – If arrays have different lengths or unknown binning strategy.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0])
>>> y_pred = np.array([0.1, 0.9, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 0.9, 0.1])
>>> prob_true, prob_pred, counts = calibration_curve(y_true, y_pred, n_bins=5)
Statistical Metrics¶
Correlation Metrics¶
- calibre.correlation_metrics(y_true: ndarray, y_pred: ndarray, x: ndarray | None = None, y_orig: ndarray | None = None)[source]¶
Calculate correlation metrics between the ground truth, the predicted/calibrated values, and, when provided, the input features and the original uncalibrated predictions.
- Parameters:
y_true – Ground truth values.
y_pred – Predicted/calibrated values.
x – Input features.
y_orig – Original uncalibrated predictions.
- Returns:
correlations (dict) – Dictionary of correlation metrics.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> y_orig = np.array([0.1, 0.6, 0.9, 0.3, 0.5])
>>> correlation_metrics(y_true, y_pred, y_orig=y_orig)
{'spearman_corr_to_y_true': 0.6708203932499371, 'spearman_corr_to_y_orig': 0.9}
Unique Value Counts¶
- calibre.unique_value_counts(y_pred: ndarray, y_orig: ndarray | None = None, precision: int = 6)[source]¶
Count unique values in predictions.
- Parameters:
y_pred – Predicted/calibrated values.
y_orig – Original uncalibrated predictions.
precision – Decimal precision for rounding.
- Returns:
counts (dict) – Dictionary with counts of unique values.
Examples
>>> import numpy as np
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.2, 0.7])
>>> y_orig = np.array([0.1, 0.6, 0.9, 0.2, 0.5])
>>> unique_value_counts(y_pred, y_orig)
{'n_unique_y_pred': 3, 'n_unique_y_orig': 5, 'unique_value_ratio': 0.6}
Usage Examples¶
Basic Evaluation¶
from calibre import (
    mean_calibration_error,
    expected_calibration_error,
    brier_score
)
import numpy as np
# Example data
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0.1, 0.3, 0.6, 0.8, 0.9])
# Calculate metrics
mce = mean_calibration_error(y_true, y_pred)
ece = expected_calibration_error(y_true, y_pred, n_bins=5)
bs = brier_score(y_true, y_pred)
print(f"Mean Calibration Error: {mce:.4f}")
print(f"Expected Calibration Error: {ece:.4f}")
print(f"Brier Score: {bs:.4f}")
Comprehensive Evaluation¶
from calibre import (
    binned_calibration_error,
    correlation_metrics,
    unique_value_counts
)
# Binned calibration with details
bce, details = binned_calibration_error(
    y_true, y_pred,
    n_bins=10,
    return_details=True
)
print(f"Binned Calibration Error: {bce:.4f}")
print(f"Bin centers: {details['bin_centers']}")
print(f"Bin accuracies: {details['bin_accuracies']}")
# Correlation analysis
corr = correlation_metrics(y_true, y_pred)
print(f"Spearman correlation: {corr['spearman_corr_to_y_true']:.4f}")
# Granularity analysis
counts = unique_value_counts(y_pred)
print(f"Unique values: {counts['n_unique_y_pred']}")
Plotting Calibration Curves¶
import matplotlib.pyplot as plt
from calibre import calibration_curve
# Generate calibration curve data
fraction_pos, mean_pred, counts = calibration_curve(
    y_true, y_pred, n_bins=10
)
# Plot reliability diagram
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(mean_pred, fraction_pos, 'bo-', label='Model')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Plot')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()