Calibration Metrics¶
This module provides various metrics for evaluating calibration quality.
Calibration Error Metrics¶
Mean Calibration Error¶
- calibre.mean_calibration_error(y_true: ndarray, y_pred: ndarray)[source]¶
Calculate the mean calibration error.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
- Returns:
mce (float) – Mean calibration error.
- Raises:
ValueError – If arrays have different shapes.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> mean_calibration_error(y_true, y_pred)
0.26
Binned Calibration Error¶
- calibre.binned_calibration_error(y_true: ndarray, y_pred: ndarray, x: ndarray | None = None, n_bins: int = 10, strategy: str = 'uniform', return_details: bool = False)[source]¶
Calculate binned calibration error.
- Parameters:
y_true – Ground truth values.
y_pred – Predicted values.
x – Input features for binning. If None, y_pred is used for binning.
n_bins – Number of bins.
strategy – Strategy for binning (contrasted in the sketch after the example below):
- ‘uniform’: Bins with uniform widths.
- ‘quantile’: Bins with approximately equal counts.
return_details – If True, return bin details (bin centers, counts, mean predictions, mean truths).
- Returns:
bce (float or dict) – Binned calibration error. If return_details is True, returns a dictionary with BCE and bin details.
- Raises:
ValueError – If arrays have different lengths or unknown binning strategy.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> binned_calibration_error(y_true, y_pred, n_bins=2)
0.05
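The practical difference between the two strategy values is where the bin edges fall. The snippet below is an illustrative numpy sketch, not the library's internal code; it only contrasts uniform-width edges with quantile edges for the same set of predictions.

import numpy as np

y_pred = np.array([0.05, 0.1, 0.15, 0.2, 0.7, 0.8, 0.85, 0.9, 0.95, 0.99])
n_bins = 4

# 'uniform': edges equally spaced over [0, 1]; sparse regions end up with few samples
uniform_edges = np.linspace(0.0, 1.0, n_bins + 1)

# 'quantile': edges at empirical quantiles; each bin holds roughly equal counts
quantile_edges = np.quantile(y_pred, np.linspace(0.0, 1.0, n_bins + 1))

print(uniform_edges)    # 0.0, 0.25, 0.5, 0.75, 1.0
print(quantile_edges)   # edges crowd where the predictions are dense

Quantile binning is usually preferable when predictions cluster in a narrow range, since uniform bins can otherwise be empty or nearly empty.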
Expected Calibration Error¶
- calibre.expected_calibration_error(y_true: ndarray, y_pred: ndarray, n_bins: int = 10)[source]¶
Calculate Expected Calibration Error (ECE).
The ECE is a weighted average of the absolute calibration error across bins, where each bin’s weight is proportional to the number of samples in the bin. A hand-computed sketch of this definition follows the example below.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
n_bins – Number of bins for discretizing predictions.
- Returns:
ece (float) – Expected Calibration Error.
- Raises:
ValueError – If arrays have different lengths.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> expected_calibration_error(y_true, y_pred, n_bins=2)
0.12
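For intuition, the weighted average can be written out by hand. The helper below is only an illustrative, textbook-style sketch assuming uniform-width bins over [0, 1]; the library's bin-edge handling may differ, so its reported values can deviate from this sketch.

import numpy as np

def ece_sketch(y_true, y_pred, n_bins=10):
    # Assign each prediction to a uniform-width bin over [0, 1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_pred, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # |mean predicted probability - observed frequency|, weighted by bin size
        gap = abs(y_pred[mask].mean() - y_true[mask].mean())
        ece += (mask.sum() / len(y_pred)) * gap
    return ece

Use expected_calibration_error for actual evaluation; ece_sketch only illustrates the definition.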
Maximum Calibration Error¶
- calibre.maximum_calibration_error(y_true: ndarray, y_pred: ndarray, n_bins: int = 10)[source]¶
Calculate Maximum Calibration Error (MCE).
The MCE is the maximum absolute difference between the average predicted probability and the fraction of positive samples in any bin. A sketch of this computation follows the example below.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
n_bins – Number of bins for discretizing predictions.
- Returns:
mce (float) – Maximum Calibration Error.
- Raises:
ValueError – If arrays have different lengths.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> maximum_calibration_error(y_true, y_pred, n_bins=2)
0.2
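The only difference from ECE is the reduction over bins: MCE takes the worst per-bin gap instead of the sample-weighted average. Again an illustrative, textbook-style sketch rather than the library's implementation:

import numpy as np

def mce_sketch(y_true, y_pred, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_pred, edges[1:-1])
    gaps = [abs(y_pred[bin_ids == b].mean() - y_true[bin_ids == b].mean())
            for b in range(n_bins) if (bin_ids == b).any()]
    # Worst-case bin gap, rather than the weighted average used by ECE
    return max(gaps)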
Scoring Metrics¶
Brier Score¶
- calibre.brier_score(y_true: ndarray, y_pred: ndarray)[source]¶
Calculate the Brier score.
The Brier score is a proper scoring rule that measures the mean squared difference between predicted probabilities and the actual outcomes. A minimal sketch of this definition follows the example below.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
- Returns:
score (float) – Brier score (lower is better).
- Raises:
ValueError – If arrays have different lengths.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> brier_score(y_true, y_pred)
0.142
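Because the Brier score is just the mean squared difference described above, a one-line numpy sketch captures the textbook definition (the library may differ in input validation or edge handling):

import numpy as np

def brier_sketch(y_true, y_pred):
    # Mean squared difference between predicted probabilities and binary outcomes
    return np.mean((y_pred - y_true) ** 2)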
Calibration Curve¶
- calibre.calibration_curve(y_true: ndarray, y_pred: ndarray, n_bins: int = 10, strategy: str = 'uniform')[source]¶
Compute the calibration curve for binary classification.
- Parameters:
y_true – Ground truth values (0 or 1 for binary classification).
y_pred – Predicted probabilities.
n_bins – Number of bins for discretizing predictions.
strategy – Strategy for binning:
- ‘uniform’: Bins with uniform widths.
- ‘quantile’: Bins with approximately equal counts.
- Returns:
prob_true (ndarray of shape (n_bins,)) – The true fraction of positive samples in each bin.
prob_pred (ndarray of shape (n_bins,)) – The mean predicted probability in each bin.
counts (ndarray of shape (n_bins,)) – The number of samples in each bin.
- Raises:
ValueError – If arrays have different lengths or unknown binning strategy.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0])
>>> y_pred = np.array([0.1, 0.9, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 0.9, 0.1])
>>> prob_true, prob_pred, counts = calibration_curve(y_true, y_pred, n_bins=5)
Statistical Metrics¶
Correlation Metrics¶
- calibre.correlation_metrics(y_true: ndarray, y_pred: ndarray, x: ndarray | None = None, y_orig: ndarray | None = None)[source]¶
Calculate correlation metrics between the ground truth, the predicted/calibrated values, and, when provided, the input features and the original uncalibrated predictions.
- Parameters:
y_true – Ground truth values.
y_pred – Predicted/calibrated values.
x – Input features.
y_orig – Original uncalibrated predictions.
- Returns:
correlations (dict) – Dictionary of correlation metrics.
Examples
>>> import numpy as np
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.4, 0.6])
>>> y_orig = np.array([0.1, 0.6, 0.9, 0.3, 0.5])
>>> correlation_metrics(y_true, y_pred, y_orig=y_orig)
{'spearman_corr_to_y_true': 0.6708203932499371, 'spearman_corr_to_y_orig': 0.9}
Unique Value Counts¶
- calibre.unique_value_counts(y_pred: ndarray, y_orig: ndarray | None = None, precision: int = 6)[source]¶
Count unique values in predictions.
- Parameters:
y_pred – Predicted/calibrated values.
y_orig – Original uncalibrated predictions.
precision – Decimal precision for rounding.
- Returns:
counts (dict) – Dictionary with counts of unique values.
Examples
>>> import numpy as np
>>> y_pred = np.array([0.2, 0.7, 0.8, 0.2, 0.7])
>>> y_orig = np.array([0.1, 0.6, 0.9, 0.2, 0.5])
>>> unique_value_counts(y_pred, y_orig)
{'n_unique_y_pred': 3, 'n_unique_y_orig': 5, 'unique_value_ratio': 0.6}
Usage Examples¶
Basic Evaluation¶
from calibre import (
    mean_calibration_error,
    expected_calibration_error,
    brier_score
)
import numpy as np
# Example data
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0.1, 0.3, 0.6, 0.8, 0.9])
# Calculate metrics
mce = mean_calibration_error(y_true, y_pred)
ece = expected_calibration_error(y_true, y_pred, n_bins=5)
bs = brier_score(y_true, y_pred)
print(f"Mean Calibration Error: {mce:.4f}")
print(f"Expected Calibration Error: {ece:.4f}")
print(f"Brier Score: {bs:.4f}")
Comprehensive Evaluation¶
from calibre import (
    binned_calibration_error,
    correlation_metrics,
    unique_value_counts
)
# Binned calibration with details
bce, details = binned_calibration_error(
    y_true, y_pred,
    n_bins=10,
    return_details=True
)
print(f"Binned Calibration Error: {bce:.4f}")
print(f"Bin centers: {details['bin_centers']}")
print(f"Bin accuracies: {details['bin_accuracies']}")
# Correlation analysis
corr = correlation_metrics(y_true, y_pred)
print(f"Spearman correlation: {corr['spearman_corr_to_y_true']:.4f}")
# Granularity analysis
counts = unique_value_counts(y_pred)
print(f"Unique values: {counts['n_unique_y_pred']}")
Plotting Calibration Curves¶
import matplotlib.pyplot as plt
from calibre import calibration_curve
# Generate calibration curve data
fraction_pos, mean_pred, counts = calibration_curve(
    y_true, y_pred, n_bins=10
)
# Plot reliability diagram
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(mean_pred, fraction_pos, 'bo-', label='Model')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Plot')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()