{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Performance Comparison\n", "\n", "This notebook provides systematic performance comparison of different calibration methods across various scenarios.\n", "\n", "**What you'll learn:**\n", "1. **Method Comparison**: How different calibrators perform on the same data\n", "2. **Scenario Analysis**: Performance across overconfident, underconfident, and distorted predictions\n", "3. **Computational Efficiency**: Speed and memory usage comparison\n", "4. **Method Selection**: Guidelines for choosing the right calibrator\n", "\n", "**When to use this notebook:** Use this to understand which calibration method works best for your type of data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.calibration import CalibratedClassifierCV\n", "\n", "# Import all calibre calibrators\n", "from calibre import (\n", " IsotonicCalibrator,\n", " NearlyIsotonicCalibrator, \n", " SplineCalibrator,\n", " RelaxedPAVACalibrator,\n", " RegularizedIsotonicCalibrator,\n", " SmoothedIsotonicCalibrator\n", ")\n", "\n", "# Import metrics\n", "from calibre import (\n", " mean_calibration_error, \n", " expected_calibration_error,\n", " brier_score,\n", " calibration_curve\n", ")\n", "\n", "np.random.seed(42)\n", "plt.style.use('default')\n", "print(\"โœ… All imports successful!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Generate Test Scenarios\n", "\n", "We'll create different types of miscalibrated predictions that commonly occur in ML:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_overconfident_predictions(n=1000):\n", " \"\"\"Simulate overconfident neural network predictions.\"\"\"\n", " # True probabilities\n", " p_true = np.random.beta(2, 2, n) \n", " y_true = np.random.binomial(1, p_true)\n", " \n", " # Overconfident predictions (push toward extremes)\n", " y_pred = np.clip(p_true ** 0.5, 0.01, 0.99)\n", " \n", " return y_pred, y_true\n", "\n", "def generate_underconfident_predictions(n=1000):\n", " \"\"\"Simulate underconfident random forest predictions.\"\"\"\n", " # True probabilities\n", " p_true = np.random.beta(2, 2, n)\n", " y_true = np.random.binomial(1, p_true)\n", " \n", " # Underconfident predictions (shrink toward 0.5)\n", " y_pred = 0.5 + 0.4 * (p_true - 0.5)\n", " y_pred = np.clip(y_pred, 0.01, 0.99)\n", " \n", " return y_pred, y_true\n", "\n", "def generate_temperature_scaled_predictions(n=1000):\n", " \"\"\"Simulate predictions that need temperature scaling.\"\"\"\n", " # True probabilities \n", " p_true = np.random.beta(2, 2, n)\n", " y_true = np.random.binomial(1, p_true)\n", " \n", " # Apply temperature scaling effect\n", " logits = np.log(p_true / (1 - p_true + 1e-8))\n", " scaled_logits = logits / 2.0 # Temperature = 2.0\n", " y_pred = 1 / (1 + np.exp(-scaled_logits))\n", " \n", " return y_pred, y_true\n", "\n", "# Generate test scenarios\n", "scenarios = {\n", " 'Overconfident NN': generate_overconfident_predictions(),\n", " 'Underconfident RF': generate_underconfident_predictions(),\n", " 'Temperature Scaled': generate_temperature_scaled_predictions()\n", "}\n", "\n", "print(\"๐Ÿ“Š Generated test 
scenarios:\")\n", "for name, (y_pred, y_true) in scenarios.items():\n", " ece = expected_calibration_error(y_true, y_pred)\n", " print(f\"{name:18}: ECE = {ece:.4f}, Range = [{y_pred.min():.3f}, {y_pred.max():.3f}]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Define Calibrators to Compare\n", "\n", "Let's compare all available calibration methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define calibrators to test\n", "calibrators = {\n", " 'Isotonic': IsotonicCalibrator(),\n", " 'Nearly Isotonic': NearlyIsotonicCalibrator(),\n", " 'Spline': SplineCalibrator(n_splines=10), \n", " 'Relaxed PAVA': RelaxedPAVACalibrator(),\n", " 'Regularized': RegularizedIsotonicCalibrator(),\n", " 'Smoothed': SmoothedIsotonicCalibrator()\n", "}\n", "\n", "# Also compare against sklearn's implementation\n", "from sklearn.isotonic import IsotonicRegression\n", "\n", "def sklearn_isotonic_calibrate(y_pred_train, y_train, y_pred_test):\n", " \"\"\"Sklearn isotonic regression for comparison.\"\"\"\n", " iso = IsotonicRegression(out_of_bounds='clip')\n", " iso.fit(y_pred_train, y_train)\n", " return iso.transform(y_pred_test)\n", "\n", "print(f\"๐Ÿ“‹ Testing {len(calibrators)} calibration methods\")\n", "for name in calibrators.keys():\n", " print(f\" โ€ข {name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Performance Comparison Across Scenarios\n", "\n", "Now let's systematically compare all methods on all scenarios:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def evaluate_calibrator(calibrator, y_pred_train, y_train, y_pred_test, y_test):\n", " \"\"\"Evaluate a single calibrator and return metrics.\"\"\"\n", " try:\n", " # Time the fitting\n", " start_time = time.time()\n", " calibrator.fit(y_pred_train, y_train)\n", " fit_time = time.time() - start_time\n", " \n", " # Time the transformation\n", " start_time = time.time() \n", " y_pred_cal = calibrator.transform(y_pred_test)\n", " transform_time = time.time() - start_time\n", " \n", " # Calculate metrics\n", " ece = expected_calibration_error(y_test, y_pred_cal)\n", " mce = mean_calibration_error(y_test, y_pred_cal)\n", " brier = brier_score(y_test, y_pred_cal)\n", " \n", " # Check bounds and monotonicity\n", " bounds_valid = np.all(y_pred_cal >= 0) and np.all(y_pred_cal <= 1)\n", " \n", " # Test monotonicity on sorted data\n", " x_test = np.linspace(0, 1, 100)\n", " y_mono_test = calibrator.transform(x_test)\n", " violations = np.sum(np.diff(y_mono_test) < -1e-8)\n", " \n", " return {\n", " 'ece': ece,\n", " 'mce': mce, \n", " 'brier': brier,\n", " 'fit_time': fit_time,\n", " 'transform_time': transform_time,\n", " 'bounds_valid': bounds_valid,\n", " 'monotonicity_violations': violations,\n", " 'calibrated_predictions': y_pred_cal\n", " }\n", " except Exception as e:\n", " return {\n", " 'error': str(e),\n", " 'ece': np.inf,\n", " 'mce': np.inf,\n", " 'brier': np.inf,\n", " 'fit_time': np.inf,\n", " 'transform_time': np.inf,\n", " 'bounds_valid': False,\n", " 'monotonicity_violations': np.inf\n", " }\n", "\n", "# Run comparison\n", "results = {}\n", "\n", "for scenario_name, (y_pred, y_true) in scenarios.items():\n", " print(f\"\\n๐Ÿงช Testing scenario: {scenario_name}\")\n", " \n", " # Split data for calibration\n", " y_pred_train, y_pred_test, y_train, y_test = train_test_split(\n", " y_pred, y_true, test_size=0.5, random_state=42\n", " )\n", " \n", " # Baseline (uncalibrated)\n", " baseline_ece 
"    baseline_ece = expected_calibration_error(y_test, y_pred_test)\n", "    baseline_mce = mean_calibration_error(y_test, y_pred_test)\n", "    baseline_brier = brier_score(y_test, y_pred_test)\n", "\n",
"    scenario_results = {\n", "        'Uncalibrated': {\n", "            'ece': baseline_ece,\n", "            'mce': baseline_mce,\n", "            'brier': baseline_brier,\n", "            'fit_time': 0,\n", "            'transform_time': 0,\n", "            'bounds_valid': True,\n", "            'monotonicity_violations': 0\n", "        }\n", "    }\n", "\n",
"    # Test each calibrator\n", "    for cal_name, calibrator in calibrators.items():\n", "        print(f\"  Testing {cal_name}...\", end='')\n", "        result = evaluate_calibrator(calibrator, y_pred_train, y_train, y_pred_test, y_test)\n", "        scenario_results[cal_name] = result\n", "\n", "        if 'error' in result:\n", "            print(f\" ❌ Failed: {result['error']}\")\n", "        else:\n", "            improvement = baseline_ece - result['ece']\n", "            print(f\" ✅ ECE: {result['ece']:.4f} (Δ{improvement:+.4f})\")\n", "\n", "    results[scenario_name] = scenario_results\n", "\n", "print(\"\\n✅ Performance comparison complete!\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Create Performance Summary\n", "\n", "Let's visualize the results:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create summary DataFrame\n", "summary_data = []\n", "\n", "for scenario, scenario_results in results.items():\n", "    for method, metrics in scenario_results.items():\n", "        if 'error' not in metrics:\n", "            summary_data.append({\n", "                'Scenario': scenario,\n", "                'Method': method,\n", "                'ECE': metrics['ece'],\n", "                'MCE': metrics['mce'],\n", "                'Brier Score': metrics['brier'],\n", "                'Fit Time (s)': metrics['fit_time'],\n", "                'Transform Time (s)': metrics['transform_time'],\n", "                'Bounds Valid': metrics['bounds_valid'],\n", "                'Violations': metrics['monotonicity_violations']\n", "            })\n", "\n", "df_summary = pd.DataFrame(summary_data)\n", "\n",
"# Create comprehensive visualization\n", "fig, axes = plt.subplots(2, 3, figsize=(18, 12))\n", "axes = axes.flatten()\n", "\n", "# 1. ECE comparison by scenario\n", "scenarios_list = list(scenarios.keys())\n", "methods = [m for m in df_summary['Method'].unique() if m != 'Uncalibrated']\n", "\n", "ece_matrix = []\n", "for scenario in scenarios_list:\n", "    row = []\n", "    for method in methods:\n", "        ece = df_summary[(df_summary['Scenario'] == scenario) &\n", "                         (df_summary['Method'] == method)]['ECE'].values\n", "        row.append(ece[0] if len(ece) > 0 else np.nan)\n", "    ece_matrix.append(row)\n", "\n",
"im = axes[0].imshow(ece_matrix, cmap='RdYlGn_r', aspect='auto')\n", "axes[0].set_xticks(range(len(methods)))\n", "axes[0].set_xticklabels(methods, rotation=45, ha='right')\n", "axes[0].set_yticks(range(len(scenarios_list)))\n", "axes[0].set_yticklabels(scenarios_list)\n", "axes[0].set_title('Expected Calibration Error (ECE)')\n", "plt.colorbar(im, ax=axes[0], label='ECE')\n", "\n",
"# 2. ECE improvement (relative to uncalibrated)\n", "improvement_data = []\n", "for scenario in scenarios_list:\n", "    uncal_ece = df_summary[(df_summary['Scenario'] == scenario) &\n", "                           (df_summary['Method'] == 'Uncalibrated')]['ECE'].values[0]\n", "    row = []\n", "    for method in methods:\n", "        cal_ece = df_summary[(df_summary['Scenario'] == scenario) &\n", "                             (df_summary['Method'] == method)]['ECE'].values\n", "        if len(cal_ece) > 0:\n", "            improvement = (uncal_ece - cal_ece[0]) / uncal_ece * 100\n", "            row.append(improvement)\n", "        else:\n", "            row.append(0)\n", "    improvement_data.append(row)\n", "\n",
"im2 = axes[1].imshow(improvement_data, cmap='RdYlGn', aspect='auto', vmin=0)\n", "axes[1].set_xticks(range(len(methods)))\n", "axes[1].set_xticklabels(methods, rotation=45, ha='right')\n", "axes[1].set_yticks(range(len(scenarios_list)))\n", "axes[1].set_yticklabels(scenarios_list)\n", "axes[1].set_title('ECE Improvement (%)')\n", "plt.colorbar(im2, ax=axes[1], label='Improvement %')\n", "\n",
"# 3. Computational efficiency\n", "fit_times = df_summary[df_summary['Method'] != 'Uncalibrated'].groupby('Method')['Fit Time (s)'].mean()\n", "bars = axes[2].bar(range(len(fit_times)), fit_times.values)\n", "axes[2].set_xticks(range(len(fit_times)))\n", "axes[2].set_xticklabels(fit_times.index, rotation=45, ha='right')\n", "axes[2].set_title('Average Fit Time')\n", "axes[2].set_ylabel('Time (seconds)')\n", "\n",
"# 4. Brier Score comparison\n", "brier_by_method = df_summary.groupby('Method')['Brier Score'].mean().sort_values()\n", "axes[3].bar(range(len(brier_by_method)), brier_by_method.values,\n", "            color='lightcoral')\n", "axes[3].set_xticks(range(len(brier_by_method)))\n", "axes[3].set_xticklabels(brier_by_method.index, rotation=45, ha='right')\n", "axes[3].set_title('Average Brier Score')\n", "axes[3].set_ylabel('Brier Score (lower is better)')\n", "\n",
"# 5. Monotonicity violations\n", "violations = df_summary[df_summary['Method'] != 'Uncalibrated'].groupby('Method')['Violations'].max()\n", "colors = ['red' if v > 0 else 'green' for v in violations.values]\n", "axes[4].bar(range(len(violations)), violations.values, color=colors)\n", "axes[4].set_xticks(range(len(violations)))\n", "axes[4].set_xticklabels(violations.index, rotation=45, ha='right')\n", "axes[4].set_title('Monotonicity Violations (max)')\n", "axes[4].set_ylabel('Number of violations')\n", "\n",
"# 6. Overall ranking\n", "# Calculate composite score (lower is better)\n", "ranking_data = df_summary[df_summary['Method'] != 'Uncalibrated'].groupby('Method').agg({\n", "    'ECE': 'mean',\n", "    'Brier Score': 'mean',\n", "    'Fit Time (s)': 'mean',\n", "    'Violations': 'max'\n", "})\n", "\n", "# Normalize and combine (simple equal weighting)\n", "ranking_data_norm = ranking_data.copy()\n", "for col in ranking_data_norm.columns:\n", "    ranking_data_norm[col] = (ranking_data_norm[col] - ranking_data_norm[col].min()) / \\\n", "                             (ranking_data_norm[col].max() - ranking_data_norm[col].min() + 1e-8)\n", "\n",
"composite_score = ranking_data_norm.mean(axis=1).sort_values()\n", "axes[5].bar(range(len(composite_score)), composite_score.values, color='gold')\n", "axes[5].set_xticks(range(len(composite_score)))\n", "axes[5].set_xticklabels(composite_score.index, rotation=45, ha='right')\n", "axes[5].set_title('Overall Ranking (lower is better)')\n", "axes[5].set_ylabel('Composite Score')\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"📊 Performance visualization complete!\")" ] },
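{ "cell_type": "markdown", "metadata": {}, "source": [ "The heatmaps are useful for spotting patterns, but exact values are easier to compare in a table. The next cell is a small addition that only assumes the `df_summary` DataFrame built above and prints the same results numerically." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tabulate ECE by scenario and method (same data as the heatmaps above)\n", "ece_table = df_summary.pivot_table(index='Scenario', columns='Method', values='ECE')\n", "print('ECE by scenario and method:')\n", "print(ece_table.round(4).to_string())\n", "\n", "# Mean metrics per method, averaged across scenarios\n", "mean_metrics = df_summary.groupby('Method')[['ECE', 'MCE', 'Brier Score', 'Fit Time (s)']].mean()\n", "print('\\nMean metrics per method:')\n", "print(mean_metrics.round(4).to_string())" ] },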
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Method Selection Guidelines\n", "\n", "Based on the results, here are guidelines for choosing the right calibrator:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "print(\"📋 CALIBRATION METHOD SELECTION GUIDE\")\nprint(\"=\" * 50)\n\n# Find the best performer for each metric, averaged across scenarios\ncalibrated_methods = df_summary[df_summary['Method'] != 'Uncalibrated']\nif len(calibrated_methods) > 0:\n    mean_by_method = calibrated_methods.groupby('Method').mean(numeric_only=True)\n    best_ece = mean_by_method['ECE'].idxmin()\n    best_brier = mean_by_method['Brier Score'].idxmin()\n    fastest = mean_by_method['Fit Time (s)'].idxmin()\n\n    # Maximum monotonicity violations observed per method\n    violations = calibrated_methods.groupby('Method')['Violations'].max()\n    zero_violation = violations[violations == 0].index.tolist()\n    most_robust = ', '.join(zero_violation) if zero_violation else violations.idxmin()\n\n    print(f\"🏆 Best Mean ECE (Calibration Quality): {best_ece}\")\n    print(f\"🏆 Best Mean Brier Score (Overall Accuracy): {best_brier}\")\n    print(f\"⚡ Fastest Fitting: {fastest}\")\n    print(f\"🛡️ Most Robust (Monotonicity): {most_robust}\")\nelse:\n    print(\"⚠️ No calibrated methods found in results\")\n\nprint(\"\\n🎯 RECOMMENDATIONS:\")\n\n# Calculate average ECE improvements over the uncalibrated baseline\nmethods = [m for m in df_summary['Method'].unique() if m != 'Uncalibrated']\nscenarios_list = list(scenarios.keys())\n\navg_improvements = {}\nfor method in methods:\n    improvements = []\n    for scenario in scenarios_list:\n        uncal_data = df_summary[(df_summary['Scenario'] == scenario) &\n                                (df_summary['Method'] == 'Uncalibrated')]\n        cal_data = df_summary[(df_summary['Scenario'] == scenario) &\n                              (df_summary['Method'] == method)]\n\n        if len(uncal_data) > 0 and len(cal_data) > 0:\n            uncal_ece = uncal_data['ECE'].values[0]\n            cal_ece = cal_data['ECE'].values[0]\n            improvement = uncal_ece - cal_ece\n            improvements.append(improvement)\n\n    if improvements:\n        avg_improvements[method] = np.mean(improvements)\n\n# Sort by average improvement\nsorted_methods = sorted(avg_improvements.items(), key=lambda x: x[1], reverse=True)\n\nprint(\"\\n🥇 OVERALL RANKING (by ECE improvement):\")\nfor i, (method, improvement) in enumerate(sorted_methods):\n    method_data = df_summary[df_summary['Method'] == method]\n    if len(method_data) > 0:\n        fit_time = method_data['Fit Time (s)'].mean()\n        violations_count = method_data['Violations'].max()\n\n        print(f\"{i+1}. {method}:\")\n        print(f\"   • Avg ECE improvement: {improvement:.4f}\")\n        print(f\"   • Avg fit time: {fit_time:.4f}s\")\n        print(f\"   • Monotonicity violations: {violations_count}\")\n\nprint(\"\\n💡 USAGE GUIDELINES:\")\nprint(\"• **General purpose**: Use IsotonicCalibrator (classic, reliable)\")\nprint(\"• **Best performance**: Use RegularizedIsotonicCalibrator (often best ECE)\")\nprint(\"• **Smooth curves**: Use SplineCalibrator (no staircase effects)\")\nprint(\"• **Speed critical**: Use IsotonicCalibrator (typically fastest)\")\nprint(\"• **Small datasets**: Use RelaxedPAVACalibrator (handles limited data)\")\nprint(\"• **Noise robustness**: Use SmoothedIsotonicCalibrator (reduces overfitting)\")\n\nprint(\"\\n⚠️ IMPORTANT NOTES:\")\nprint(\"• Always enable diagnostics to understand calibration behavior\")\nprint(\"• Test multiple methods and pick the best for your specific data\")\nprint(\"• Consider computational constraints for real-time applications\")\nprint(\"• Validate on held-out data to avoid overfitting to the calibration set\")\n\nprint(\"\\n\" + \"=\" * 50)" },
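{ "cell_type": "markdown", "metadata": {}, "source": [ "To act on \"test multiple methods and pick the best for your specific data\", the next cell sketches one way to automate that choice: fit every candidate on a training split and keep the one with the lowest ECE on a held-out validation split. The helper name, split size, and demo scenario are arbitrary choices for illustration; the cell only relies on the `fit`/`transform` calls and metrics already used above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def select_best_calibrator(candidates, y_pred, y_true, val_size=0.3, seed=0):\n", "    \"\"\"Fit each candidate calibrator and keep the one with the lowest validation ECE.\"\"\"\n", "    p_train, p_val, t_train, t_val = train_test_split(\n", "        y_pred, y_true, test_size=val_size, random_state=seed\n", "    )\n", "    scores = {}\n", "    for name, cal in candidates.items():\n", "        try:\n", "            cal.fit(p_train, t_train)\n", "            scores[name] = expected_calibration_error(t_val, cal.transform(p_val))\n", "        except Exception as exc:\n", "            print(f'  {name} failed: {exc}')\n", "    best = min(scores, key=scores.get)\n", "    return best, scores\n", "\n",
"# Demo on one scenario; in practice, pass your own model's validation predictions\n", "demo_pred, demo_true = scenarios['Overconfident NN']\n", "best_name, val_scores = select_best_calibrator(calibrators, demo_pred, demo_true)\n", "\n", "for name, score in sorted(val_scores.items(), key=lambda kv: kv[1]):\n", "    print(f'{name:18}: validation ECE = {score:.4f}')\n", "print(f'\\nSelected calibrator: {best_name}')" ] },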
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "🎯 **Performance Summary:**\n", "- In these synthetic scenarios, every method markedly improves calibration over the uncalibrated baseline\n", "- Different methods excel in different scenarios\n", "- Computational overhead is generally minimal\n", "\n", "📊 **Method Characteristics:**\n", "- **Isotonic**: Fast, reliable baseline\n", "- **Nearly Isotonic**: Flexible, handles challenging cases\n", "- **Spline**: Smooth curves, good for visualization\n", "- **Regularized**: Often best calibration quality\n", "- **Relaxed PAVA**: Robust to small datasets\n", "- **Smoothed**: Reduces staircase effects\n", "\n", "🔍 **Selection Strategy:**\n", "1. Start with IsotonicCalibrator as a baseline\n", "2. Try RegularizedIsotonicCalibrator for best performance\n", "3. Use SplineCalibrator if you need smooth curves\n", "4. Enable diagnostics to understand behavior\n", "5. Validate on separate test data\n", "\n", "➡️ **Next Steps:**\n", "- Apply these insights to your specific use case\n", "- Experiment with different scenarios\n", "- Use diagnostics to troubleshoot edge cases" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }