Performance Benchmarks
======================

This section provides performance comparisons and benchmarks for different calibration methods.

Benchmark Notebook
------------------

The most comprehensive benchmarks are available in the interactive Jupyter notebook:

- **Location**: ``examples/benchmark.ipynb`` in the repository
- **Content**: Visual comparisons, quantitative metrics, performance analysis
- **Usage**: Clone the repository and run the notebook locally

Accessing the Benchmark Notebook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Clone repository
   git clone https://github.com/finite-sample/calibre.git
   cd calibre

   # Install dependencies
   pip install -e ".[dev]"

   # Start Jupyter
   jupyter notebook examples/benchmark.ipynb

Method Comparison Summary
-------------------------

Based on extensive benchmarking across different datasets and scenarios:

Performance Summary Table
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Calibration Method Performance
   :header-rows: 1
   :widths: 25 15 15 15 15 15

   * - Method
     - Calibration Error
     - Granularity Preservation
     - Computational Speed
     - Robustness
     - Use Case
   * - Nearly Isotonic (strict)
     - ★★★★★
     - ★★☆☆☆
     - ★★★☆☆
     - ★★★★☆
     - High-stakes decisions
   * - Nearly Isotonic (relaxed)
     - ★★★★☆
     - ★★★★☆
     - ★★★☆☆
     - ★★★★☆
     - Balanced approach
   * - I-Spline
     - ★★★★☆
     - ★★★☆☆
     - ★★☆☆☆
     - ★★★☆☆
     - Smooth calibration
   * - Relaxed PAVA
     - ★★★☆☆
     - ★★★★★
     - ★★★★★
     - ★★★★★
     - Large datasets
   * - Regularized Isotonic
     - ★★★☆☆
     - ★★★☆☆
     - ★★★★☆
     - ★★★☆☆
     - Smooth results needed
   * - Smoothed Isotonic
     - ★★★☆☆
     - ★★★☆☆
     - ★★★★☆
     - ★★★★☆
     - Visualization
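For reference, each row of the table maps to a constructor call like those used throughout this page. The parameter values below are taken from the examples in this section and are illustrative rather than tuned recommendations ("strict" is shown with a large ``lam``, "relaxed" with a small one):

.. code-block:: python

   from calibre import (
       NearlyIsotonicRegression,
       ISplineCalibrator,
       RelaxedPAVA,
       RegularizedIsotonicRegression,
       SmoothedIsotonicRegression,
   )

   methods = {
       'Nearly Isotonic (strict)': NearlyIsotonicRegression(lam=10.0, method='path'),
       'Nearly Isotonic (relaxed)': NearlyIsotonicRegression(lam=0.1, method='path'),
       'I-Spline': ISplineCalibrator(n_splines=10, degree=3, cv=3),
       'Relaxed PAVA': RelaxedPAVA(percentile=10, adaptive=True),
       'Regularized Isotonic': RegularizedIsotonicRegression(alpha=0.1),
       'Smoothed Isotonic': SmoothedIsotonicRegression(window_length=7, poly_order=3),
   }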
Detailed Performance Analysis
-----------------------------

Calibration Error Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import time

   import numpy as np
   import matplotlib.pyplot as plt
   from sklearn.datasets import make_classification
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split

   from calibre import (
       NearlyIsotonicRegression,
       ISplineCalibrator,
       RelaxedPAVA,
       RegularizedIsotonicRegression,
       SmoothedIsotonicRegression,
       mean_calibration_error,
       expected_calibration_error
   )

   def comprehensive_benchmark(n_datasets=10, n_samples=2000):
       """Run a comprehensive benchmark across multiple datasets."""
       calibrators = {
           'Nearly Isotonic (λ=10)': NearlyIsotonicRegression(lam=10.0, method='path'),
           'Nearly Isotonic (λ=1)': NearlyIsotonicRegression(lam=1.0, method='path'),
           'Nearly Isotonic (λ=0.1)': NearlyIsotonicRegression(lam=0.1, method='path'),
           'I-Spline': ISplineCalibrator(n_splines=10, degree=3, cv=3),
           'Relaxed PAVA': RelaxedPAVA(percentile=10, adaptive=True),
           'Regularized Isotonic': RegularizedIsotonicRegression(alpha=0.1),
           'Smoothed Isotonic': SmoothedIsotonicRegression(window_length=7, poly_order=3)
       }

       results = {name: {'mce': [], 'ece': [], 'time': []} for name in calibrators}

       for dataset_idx in range(n_datasets):
           print(f"\rProcessing dataset {dataset_idx + 1}/{n_datasets}", end='')

           # Generate a dataset with varying characteristics
           X, y = make_classification(
               n_samples=n_samples, n_features=20, n_informative=15,
               n_redundant=2, random_state=dataset_idx * 42
           )
           X_train, X_test, y_train, y_test = train_test_split(
               X, y, test_size=0.5, random_state=dataset_idx
           )

           # Train the base model
           model = RandomForestClassifier(n_estimators=100, random_state=dataset_idx)
           model.fit(X_train, y_train)
           y_pred = model.predict_proba(X_test)[:, 1]

           # Test each calibrator
           for name, calibrator in calibrators.items():
               try:
                   start_time = time.time()

                   # Fit and transform
                   calibrator.fit(y_pred, y_test)
                   y_cal = calibrator.transform(y_pred)

                   end_time = time.time()

                   # Calculate metrics
                   mce = mean_calibration_error(y_test, y_cal)
                   ece = expected_calibration_error(y_test, y_cal, n_bins=10)

                   results[name]['mce'].append(mce)
                   results[name]['ece'].append(ece)
                   results[name]['time'].append(end_time - start_time)
               except Exception as e:
                   print(f"\nError with {name}: {e}")
                   results[name]['mce'].append(np.nan)
                   results[name]['ece'].append(np.nan)
                   results[name]['time'].append(np.nan)

       print()  # New line after the progress indicator
       return results

   # Run the benchmark
   benchmark_results = comprehensive_benchmark(n_datasets=5, n_samples=1000)

   # Display results
   print("\nBenchmark Results (Mean ± Std):")
   print(f"{'Method':<25} {'MCE':<15} {'ECE':<15} {'Time (ms)':<15}")
   print("-" * 75)

   for name, metrics in benchmark_results.items():
       mce_str = f"{np.nanmean(metrics['mce']):.3f}±{np.nanstd(metrics['mce']):.3f}"
       ece_str = f"{np.nanmean(metrics['ece']):.3f}±{np.nanstd(metrics['ece']):.3f}"
       # Convert seconds to milliseconds for display
       time_str = f"{np.nanmean(metrics['time']) * 1000:.1f}±{np.nanstd(metrics['time']) * 1000:.1f}"

       print(f"{name:<25} {mce_str:<15} {ece_str:<15} {time_str:<15}")
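If you prefer to analyze the raw ``benchmark_results`` dictionary programmatically rather than reading the printed table, it can be flattened into a single summary table. A minimal sketch, assuming ``pandas`` is installed (it is not a Calibre dependency):

.. code-block:: python

   import numpy as np
   import pandas as pd  # assumed available; not required by Calibre itself

   # Collapse the per-dataset metric lists into one mean value per method,
   # then sort methods by expected calibration error.
   summary = pd.DataFrame({
       name: {
           'MCE': np.nanmean(m['mce']),
           'ECE': np.nanmean(m['ece']),
           'Time (ms)': np.nanmean(m['time']) * 1000,
       }
       for name, m in benchmark_results.items()
   }).T.sort_values('ECE')

   print(summary.round(4))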
Scalability Analysis
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import time

   def scalability_benchmark():
       """Test performance across different dataset sizes."""
       dataset_sizes = [500, 1000, 2000, 5000, 10000]
       methods = {
           'Nearly Isotonic': NearlyIsotonicRegression(lam=1.0, method='path'),
           'Relaxed PAVA': RelaxedPAVA(percentile=10),
           'Regularized Isotonic': RegularizedIsotonicRegression(alpha=0.1)
       }

       timing_results = {method: [] for method in methods}

       for n_samples in dataset_sizes:
           print(f"Testing with {n_samples} samples...")

           # Generate data
           X, y = make_classification(n_samples=n_samples, n_features=20, random_state=42)
           X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

           # Train the base model
           model = RandomForestClassifier(n_estimators=50, random_state=42)
           model.fit(X_train, y_train)
           y_pred = model.predict_proba(X_test)[:, 1]

           for method_name, calibrator in methods.items():
               # Time the full fit-and-transform process
               start_time = time.time()
               calibrator.fit(y_pred, y_test)
               y_cal = calibrator.transform(y_pred)
               end_time = time.time()

               timing_results[method_name].append(end_time - start_time)

       # Plot results
       plt.figure(figsize=(10, 6))
       for method_name, times in timing_results.items():
           plt.plot(dataset_sizes, times, 'o-', label=method_name, linewidth=2)

       plt.xlabel('Dataset Size')
       plt.ylabel('Time (seconds)')
       plt.title('Calibration Method Scalability')
       plt.legend()
       plt.grid(True, alpha=0.3)
       plt.yscale('log')
       plt.show()

       return timing_results

   # Run the scalability test
   scalability_results = scalability_benchmark()

Dataset-Specific Performance
----------------------------

Performance on Different Data Types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def dataset_specific_benchmark():
       """Test performance on different types of datasets."""
       datasets = {
           'balanced': lambda: make_classification(
               n_samples=2000, n_features=20, weights=[0.5, 0.5], random_state=42
           ),
           'imbalanced': lambda: make_classification(
               n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
           ),
           'high_dim': lambda: make_classification(
               n_samples=2000, n_features=100, n_informative=20, random_state=42
           ),
           'low_info': lambda: make_classification(
               n_samples=2000, n_features=20, n_informative=5,
               n_redundant=10, random_state=42
           )
       }

       calibrators = {
           'Nearly Isotonic': NearlyIsotonicRegression(lam=1.0),
           'Relaxed PAVA': RelaxedPAVA(percentile=10),
           'I-Spline': ISplineCalibrator(n_splines=8, cv=3)
       }

       results = {}

       for dataset_name, dataset_func in datasets.items():
           print(f"\nTesting on {dataset_name} dataset:")

           X, y = dataset_func()
           X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

           # Train the base model
           model = RandomForestClassifier(n_estimators=100, random_state=42)
           model.fit(X_train, y_train)
           y_pred = model.predict_proba(X_test)[:, 1]

           dataset_results = {}

           for cal_name, calibrator in calibrators.items():
               try:
                   calibrator.fit(y_pred, y_test)
                   y_cal = calibrator.transform(y_pred)

                   mce = mean_calibration_error(y_test, y_cal)
                   ece = expected_calibration_error(y_test, y_cal)

                   dataset_results[cal_name] = {'mce': mce, 'ece': ece}
                   print(f"  {cal_name}: MCE={mce:.4f}, ECE={ece:.4f}")
               except Exception as e:
                   print(f"  {cal_name}: Failed - {e}")
                   dataset_results[cal_name] = {'mce': np.nan, 'ece': np.nan}

           results[dataset_name] = dataset_results

       return results

   # Run the dataset-specific benchmark
   dataset_results = dataset_specific_benchmark()

Robustness Analysis
-------------------

Noise Sensitivity
~~~~~~~~~~~~~~~~~

.. code-block:: python

   def noise_sensitivity_test():
       """Test calibrator robustness to different noise levels."""
       noise_levels = [0.0, 0.05, 0.1, 0.2, 0.3]
       calibrators = {
           'Nearly Isotonic': NearlyIsotonicRegression(lam=1.0),
           'Relaxed PAVA': RelaxedPAVA(percentile=15),  # Slightly higher for noise
           'Regularized Isotonic': RegularizedIsotonicRegression(alpha=0.5)
       }

       results = {name: [] for name in calibrators}

       for noise_level in noise_levels:
           print(f"Testing noise level: {noise_level}")

           # Generate clean data
           X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
           X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

           # Train the base model
           model = RandomForestClassifier(n_estimators=100, random_state=42)
           model.fit(X_train, y_train)
           y_pred_clean = model.predict_proba(X_test)[:, 1]

           # Add noise to the predictions, keeping them in [0, 1]
           noise = np.random.normal(0, noise_level, len(y_pred_clean))
           y_pred_noisy = np.clip(y_pred_clean + noise, 0, 1)

           for name, calibrator in calibrators.items():
               try:
                   calibrator.fit(y_pred_noisy, y_test)
                   y_cal = calibrator.transform(y_pred_noisy)
                   mce = mean_calibration_error(y_test, y_cal)
                   results[name].append(mce)
               except Exception:
                   results[name].append(np.nan)

       # Plot results
       plt.figure(figsize=(10, 6))
       for name, mce_values in results.items():
           plt.plot(noise_levels, mce_values, 'o-', label=name, linewidth=2)

       plt.xlabel('Noise Level')
       plt.ylabel('Mean Calibration Error')
       plt.title('Robustness to Prediction Noise')
       plt.legend()
       plt.grid(True, alpha=0.3)
       plt.show()

       return results

   # Run the noise sensitivity test
   noise_results = noise_sensitivity_test()
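Because ``noise_sensitivity_test`` draws a single unseeded noise vector per level, the resulting curves can vary between runs. Below is a minimal sketch of a more stable variant that averages MCE over several seeded draws; the helper ``averaged_noise_mce`` is hypothetical, built from the same pieces as the test above:

.. code-block:: python

   def averaged_noise_mce(calibrator, y_pred_clean, y_test, noise_level, n_repeats=10):
       """Hypothetical helper: average MCE over several seeded noise draws."""
       rng = np.random.default_rng(0)  # fixed seed for reproducibility
       mces = []
       for _ in range(n_repeats):
           noise = rng.normal(0, noise_level, len(y_pred_clean))
           y_pred_noisy = np.clip(y_pred_clean + noise, 0, 1)
           calibrator.fit(y_pred_noisy, y_test)
           mces.append(mean_calibration_error(y_test, calibrator.transform(y_pred_noisy)))
       return float(np.mean(mces))

   # Example: a more stable MCE estimate for Relaxed PAVA at noise level 0.1
   # (y_pred_clean and y_test as produced inside noise_sensitivity_test)
   # stable_mce = averaged_noise_mce(RelaxedPAVA(percentile=15), y_pred_clean, y_test, 0.1)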
Memory Usage Analysis
---------------------

.. code-block:: python

   import os

   import psutil

   def memory_usage_benchmark():
       """Analyze memory usage of different calibrators."""

       def get_memory_usage():
           """Return the resident set size (RSS) of this process in MB."""
           process = psutil.Process(os.getpid())
           return process.memory_info().rss / 1024 / 1024

       # Generate a large dataset
       X, y = make_classification(n_samples=50000, n_features=20, random_state=42)
       X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

       model = RandomForestClassifier(n_estimators=100, random_state=42)
       model.fit(X_train, y_train)
       y_pred = model.predict_proba(X_test)[:, 1]

       calibrators = {
           'Nearly Isotonic (CVX)': NearlyIsotonicRegression(lam=1.0, method='cvx'),
           'Nearly Isotonic (Path)': NearlyIsotonicRegression(lam=1.0, method='path'),
           'Relaxed PAVA': RelaxedPAVA(percentile=10),
           'Regularized Isotonic': RegularizedIsotonicRegression(alpha=0.1)
       }

       memory_results = {}

       for name, calibrator in calibrators.items():
           print(f"Testing memory usage for {name}...")

           # Measure baseline memory
           baseline_memory = get_memory_usage()

           try:
               # Fit the calibrator
               calibrator.fit(y_pred, y_test)

               # Measure memory after fitting (RSS; an approximation of peak usage)
               peak_memory = get_memory_usage()

               # Transform the data
               y_cal = calibrator.transform(y_pred)

               # Measure final memory
               final_memory = get_memory_usage()

               memory_results[name] = {
                   'peak_usage': peak_memory - baseline_memory,
                   'final_usage': final_memory - baseline_memory
               }
           except Exception as e:
               print(f"Failed: {e}")
               memory_results[name] = {'peak_usage': np.nan, 'final_usage': np.nan}

       # Display results
       print("\nMemory Usage Results:")
       print(f"{'Method':<25} {'Peak (MB)':<12} {'Final (MB)':<12}")
       print("-" * 50)

       for name, usage in memory_results.items():
           print(f"{name:<25} {usage['peak_usage']:<12.1f} {usage['final_usage']:<12.1f}")

       return memory_results

   # Run the memory benchmark
   memory_results = memory_usage_benchmark()

Benchmark Reproduction
----------------------

To reproduce these benchmarks:

1. **Install Calibre with development dependencies**:

   .. code-block:: bash

      pip install -e ".[dev]"

2. **Run the interactive benchmark notebook**:

   .. code-block:: bash

      jupyter notebook examples/benchmark.ipynb

3. **Execute individual benchmark functions** from this documentation.

4. **Customize benchmarks** for your specific datasets and use cases; a starting-point sketch is given at the end of this page.

The benchmark notebook provides additional visualizations, interactive plots, and more detailed analysis that complements the examples shown here.
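As a starting point for step 4 above, the following sketch wraps Calibre's metrics around predictions from your own model. The function name ``benchmark_on_my_data`` and the choice of ``RelaxedPAVA`` as the default calibrator are illustrative assumptions, not library API:

.. code-block:: python

   from calibre import (
       RelaxedPAVA,
       mean_calibration_error,
       expected_calibration_error,
   )

   def benchmark_on_my_data(y_pred_val, y_true_val, calibrator=None):
       """Hypothetical helper: fit a calibrator on held-out predictions
       and report the same metrics used throughout this page."""
       if calibrator is None:
           calibrator = RelaxedPAVA(percentile=10)  # example choice, not a recommendation
       calibrator.fit(y_pred_val, y_true_val)
       y_cal = calibrator.transform(y_pred_val)
       return {
           'mce': mean_calibration_error(y_true_val, y_cal),
           'ece': expected_calibration_error(y_true_val, y_cal, n_bins=10),
       }

Note that, as in the benchmarks above, this fits and evaluates the calibrator on the same held-out split; for a stricter protocol, reserve a separate evaluation split.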