🔬 Interactive Demo: Deep Dive into Threshold Behavior¶
This notebook lets you explore the mathematical foundations behind optimal threshold selection and shows why the library's breakpoint-based approach works.
🎯 Key Learning Objectives¶
Piecewise-constant: Why metrics only change at specific points
Breakpoints: The unique probabilities where metrics can change
Optimization challenges: Why continuous methods can fail
Algorithm insights: How breakpoint-based methods guarantee the global optimum
🚀 Quick Start¶
Run the cells below to start exploring how different data characteristics affect optimal thresholds. (In this documentation build, interactive widgets are replaced by static plots.)
[1]:
# Import optimal_cutoffs functions
import sys
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
from scipy import optimize
from optimal_cutoffs import optimize_thresholds
from optimal_cutoffs.metrics import compute_metric_at_threshold
# Create alias for backward compatibility with notebook code
def _metric_score(y_true, y_prob, threshold, metric):
return compute_metric_at_threshold(y_true, y_prob, threshold, metric)
# Set up matplotlib - use inline plots, falling back to a non-interactive backend if the magic fails
try:
get_ipython().run_line_magic('matplotlib', 'inline')
INTERACTIVE_MODE = False
print("Using static plots for documentation build")
except Exception:
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
INTERACTIVE_MODE = False
print("Using non-interactive backend")
plt.style.use('default')
Using static plots for documentation build
1. Basic Demonstration¶
Let’s start with a simple example to see the piecewise-constant behavior of a threshold metric:
[2]:
# Example data
y_true = np.array([0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
print("Example data:")
print(f"True labels: {y_true}")
print(f"Probabilities: {y_prob}")
print(f"\nUnique probabilities (breakpoints): {np.unique(y_prob)}")
Example data:
True labels: [0 0 1 1 0 1 0]
Probabilities: [0.1 0.3 0.4 0.6 0.7 0.8 0.9]
Unique probabilities (breakpoints): [0.1 0.3 0.4 0.6 0.7 0.8 0.9]
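Because the score can only change when the threshold crosses one of these values, any two thresholds that fall between the same pair of breakpoints give identical predictions and therefore identical scores. A quick check, reusing the _metric_score helper and the arrays defined above (not an executed cell, just a sketch you can paste in):

# Two thresholds inside the same interval (between the breakpoints 0.4 and 0.6)
t_a, t_b = 0.45, 0.55
f1_a = _metric_score(y_true, y_prob, t_a, 'f1')
f1_b = _metric_score(y_true, y_prob, t_b, 'f1')
print(f"F1 at t={t_a}: {f1_a:.3f}")
print(f"F1 at t={t_b}: {f1_b:.3f}")
assert f1_a == f1_b  # identical predictions -> identical confusion matrix -> identical score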
[3]:
def plot_piecewise_metric(y_true, y_prob, metric='f1', title_suffix=''):
"""Plot a metric vs threshold showing piecewise-constant behavior."""
# Generate dense threshold grid for smooth plotting
thresholds = np.linspace(0.05, 0.95, 500)
scores = [_metric_score(y_true, y_prob, t, metric) for t in thresholds]
# Find breakpoints (unique probabilities)
breakpoints = np.unique(y_prob)
breakpoint_scores = [_metric_score(y_true, y_prob, t, metric) for t in breakpoints]
# Find optimal threshold
result = optimize_thresholds(y_true, y_prob, metric=metric, method='sort_scan')
optimal_threshold = result.thresholds[0] # Get scalar value from array
optimal_score = _metric_score(y_true, y_prob, optimal_threshold, metric)
# Create plot
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
# Plot the metric function
ax.plot(thresholds, scores, 'b-', linewidth=2, label=f'{metric.upper()} Score')
# Mark breakpoints
ax.scatter(breakpoints, breakpoint_scores, color='red', s=80, zorder=5,
label=f'Breakpoints ({len(breakpoints)} points)')
# Mark optimal
ax.scatter([optimal_threshold], [optimal_score], color='green', s=150,
marker='*', zorder=6, label=f'Optimal (t={optimal_threshold:.3f})')
# Add vertical lines at breakpoints
for bp in breakpoints:
ax.axvline(x=bp, color='red', linestyle='--', alpha=0.3)
ax.set_xlabel('Decision Threshold')
ax.set_ylabel(f'{metric.upper()} Score')
ax.set_title(f'Piecewise-Constant Nature of {metric.upper()} Score{title_suffix}')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_ylim(0, 1.05)
plt.tight_layout()
plt.show()
return fig, optimal_threshold, optimal_score
# Plot F1 score for our example
fig, opt_thresh, opt_score = plot_piecewise_metric(y_true, y_prob, 'f1')
print(f"\nOptimal F1 threshold: {opt_thresh:.3f} (F1 = {opt_score:.3f})")
Optimal F1 threshold: 0.350 (F1 = 0.750)
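Although the plot samples 500 thresholds, the curve can take at most one value per interval between consecutive breakpoints, i.e. at most len(breakpoints) + 1 distinct values in total. A quick check with the same data and helper:

# Count how many distinct F1 values the dense grid actually produces
grid = np.linspace(0.05, 0.95, 500)
grid_scores = np.array([_metric_score(y_true, y_prob, t, 'f1') for t in grid])
print(f"Distinct F1 values over 500 grid points: {len(np.unique(grid_scores))}")
print(f"Upper bound (number of breakpoints + 1): {len(np.unique(y_prob)) + 1}")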
2. Interactive Exploration¶
The examples below show how changing the data affects the piecewise-constant structure (the documentation build uses static plots in place of interactive sliders):
[4]:
def create_static_demo():
"""Create static examples showing piecewise-constant behavior with different data characteristics."""
print("📊 STATIC EXAMPLES: Different Data Characteristics")
print("=" * 55)
# Example 1: Small imbalanced dataset
print("\n1️⃣ Small Imbalanced Dataset (5 samples, 20% positive)")
np.random.seed(42)
y_ex1 = np.array([0, 0, 0, 1, 1])
p_ex1 = np.array([0.1, 0.3, 0.4, 0.7, 0.9])
fig1, opt1, score1 = plot_piecewise_metric(y_ex1, p_ex1, 'f1',
title_suffix='\nSmall Imbalanced Dataset')
print(f" → Optimal F1: {opt1:.3f} (score = {score1:.3f})")
print(f" → Breakpoints: {len(np.unique(p_ex1))} unique probabilities")
# Example 2: Larger balanced dataset
print("\n2️⃣ Larger Balanced Dataset (20 samples, ~50% positive)")
np.random.seed(123)
y_ex2 = np.random.randint(0, 2, 20)
p_ex2 = np.random.beta(2, 2, 20) # Bell-shaped distribution
# Sort for cleaner visualization
sort_idx = np.argsort(p_ex2)
y_ex2, p_ex2 = y_ex2[sort_idx], p_ex2[sort_idx]
fig2, opt2, score2 = plot_piecewise_metric(y_ex2, p_ex2, 'f1',
title_suffix='\nLarger Balanced Dataset')
print(f" → Optimal F1: {opt2:.3f} (score = {score2:.3f})")
print(f" → Breakpoints: {len(np.unique(p_ex2))} unique probabilities")
# Example 3: Precision vs Recall trade-off
print("\n3️⃣ Precision vs Recall Comparison")
y_ex3 = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_ex3 = np.array([0.1, 0.3, 0.4, 0.6, 0.65, 0.8, 0.85, 0.9])
# Compare different metrics on same data
metrics_to_compare = ['precision', 'recall', 'f1']
print(f" Data: {len(y_ex3)} samples, {y_ex3.sum()} positive")
for metric in metrics_to_compare:
result = optimize_thresholds(y_ex3, p_ex3, metric=metric)
optimal_thresh = result.thresholds[0]
optimal_score = _metric_score(y_ex3, p_ex3, optimal_thresh, metric)
print(f" → {metric.capitalize()}: t={optimal_thresh:.3f}, score={optimal_score:.3f}")
# Plot the trade-off
thresholds = np.linspace(0.05, 0.95, 100)
precision_scores = [_metric_score(y_ex3, p_ex3, t, 'precision') for t in thresholds]
recall_scores = [_metric_score(y_ex3, p_ex3, t, 'recall') for t in thresholds]
f1_scores = [_metric_score(y_ex3, p_ex3, t, 'f1') for t in thresholds]
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(thresholds, precision_scores, 'g-', linewidth=2, label='Precision')
ax.plot(thresholds, recall_scores, 'r-', linewidth=2, label='Recall')
ax.plot(thresholds, f1_scores, 'b-', linewidth=2, label='F1 Score')
# Mark optimal points
for metric, color in zip(['precision', 'recall', 'f1'], ['green', 'red', 'blue']):
result = optimize_thresholds(y_ex3, p_ex3, metric=metric)
opt_t = result.thresholds[0]
opt_s = _metric_score(y_ex3, p_ex3, opt_t, metric)
ax.scatter([opt_t], [opt_s], color=color, s=150, marker='*',
edgecolors='black', zorder=5)
ax.set_xlabel('Decision Threshold')
ax.set_ylabel('Metric Score')
ax.set_title('Precision vs Recall Trade-off\nStars show optimal thresholds for each metric')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_ylim(0, 1.05)
plt.tight_layout()
plt.show()
print("\n💡 Key Insights:")
print(" • Precision optimal: High threshold (fewer false positives)")
print(" • Recall optimal: Low threshold (fewer false negatives)")
print(" • F1 optimal: Balanced trade-off between precision and recall")
# Run the static demo
create_static_demo()
📊 STATIC EXAMPLES: Different Data Characteristics
=======================================================
1️⃣ Small Imbalanced Dataset (5 samples, 40% positive)
→ Optimal threshold: 0.550 (F1 = 1.000)
→ Breakpoints: 5 unique probabilities
2️⃣ Larger Balanced Dataset (20 samples, ~50% positive)
→ Optimal threshold: 0.221 (F1 = 0.621)
→ Breakpoints: 20 unique probabilities
3️⃣ Precision vs Recall Comparison
Data: 8 samples, 4 positive
→ Precision: t=0.875, score=1.000
→ Recall: t=0.350, score=1.000
→ F1: t=0.350, score=0.800
💡 Key Insights:
• Precision optimal: High threshold (fewer false positives)
• Recall optimal: Low threshold (fewer false negatives)
• F1 optimal: Balanced trade-off between precision and recall
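To see the trade-off in numbers rather than in a plot, we can score a permissive and a conservative threshold on the example-3 data (re-created here because it was local to create_static_demo; the two thresholds are the recall- and precision-optimal values reported above):

# Re-create the example-3 data from the demo above
y_ex3 = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_ex3 = np.array([0.1, 0.3, 0.4, 0.6, 0.65, 0.8, 0.85, 0.9])

for t in (0.35, 0.875):  # permissive vs. conservative threshold
    prec = _metric_score(y_ex3, p_ex3, t, 'precision')
    rec = _metric_score(y_ex3, p_ex3, t, 'recall')
    print(f"t = {t:.3f}: precision = {prec:.3f}, recall = {rec:.3f}")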
3. Optimization Methods Comparison¶
Let’s compare different optimization approaches on the same data:
[5]:
def compare_optimization_methods(y_true, y_prob, metric='f1'):
"""Compare different threshold optimization methods."""
print(f"Comparing optimization methods for {metric.upper()} score...\n")
# Method 1: Sort-scan algorithm (our recommended approach)
result_sort_scan = optimize_thresholds(y_true, y_prob, metric=metric, method='sort_scan')
thresh_sort_scan = result_sort_scan.thresholds[0] # Get scalar value from array
score_sort_scan = _metric_score(y_true, y_prob, thresh_sort_scan, metric)
# Method 2: scipy.optimize.minimize_scalar (continuous optimization)
result = optimize.minimize_scalar(
lambda t: -_metric_score(y_true, y_prob, t, metric),
bounds=(0, 1),
method='bounded'
)
thresh_minimize = result.x
score_minimize = _metric_score(y_true, y_prob, thresh_minimize, metric)
# Method 3: With fallback (what our 'minimize' method actually does)
result_fallback = optimize_thresholds(y_true, y_prob, metric=metric, method='minimize')
thresh_fallback = result_fallback.thresholds[0] # Get scalar value from array
score_fallback = _metric_score(y_true, y_prob, thresh_fallback, metric)
# Display results
methods = [
('Sort-Scan Algorithm', thresh_sort_scan, score_sort_scan),
('minimize_scalar Only', thresh_minimize, score_minimize),
('With Fallback', thresh_fallback, score_fallback)
]
for name, threshold, score in methods:
print(f"{name:18} | Threshold: {threshold:.4f} | {metric.upper()}: {score:.4f}")
# Create visualization
thresholds = np.linspace(0.01, 0.99, 500)
scores = [_metric_score(y_true, y_prob, t, metric) for t in thresholds]
unique_probs = np.unique(y_prob)
unique_scores = [_metric_score(y_true, y_prob, t, metric) for t in unique_probs]
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
# Plot metric function
ax.plot(thresholds, scores, 'b-', linewidth=1.5, alpha=0.7, label=f'{metric.upper()} Score')
# Plot breakpoints
ax.scatter(unique_probs, unique_scores, color='lightcoral', s=30, alpha=0.6,
label=f'Breakpoints ({len(unique_probs)} points)')
# Plot results from different methods
colors = ['green', 'red', 'blue']
    markers = ['*', 'X', 'D']  # filled markers, so edge colors apply cleanly
for (name, threshold, score), color, marker in zip(methods, colors, markers, strict=False):
ax.scatter([threshold], [score], color=color, s=120, marker=marker,
zorder=5, label=f'{name}\n(t={threshold:.3f})', edgecolors='black')
ax.set_xlabel('Decision Threshold')
ax.set_ylabel(f'{metric.upper()} Score')
ax.set_title('Comparison of Optimization Methods')
ax.grid(True, alpha=0.3)
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
return methods
# Test with example data
np.random.seed(123)
n = 15
y_test = np.random.randint(0, 2, n)
y_prob_test = np.random.beta(2, 2, n)
print(f"Test data: {n} samples with {len(np.unique(y_prob_test))} unique probabilities\n")
results = compare_optimization_methods(y_test, y_prob_test, 'f1')
Test data: 15 samples with 15 unique probabilities
Comparing optimization methods for F1 score...
Sort-Scan Algorithm | Threshold: 0.1977 | F1: 0.6316
minimize_scalar Only | Threshold: 0.2016 | F1: 0.6316
With Fallback | Threshold: 0.1855 | F1: 0.6316
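The "With Fallback" row combines both ideas: run a continuous optimizer, then make sure the answer is at least as good as the best breakpoint. The sketch below illustrates that strategy; it is a simplified illustration of the idea, not the library's actual method='minimize' implementation:

# Illustrative fallback strategy: continuous optimization backed by breakpoint evaluation.
# This is a sketch of the idea described above, not the library's internal code.
def threshold_with_fallback(y_true, y_prob, metric='f1'):
    # Step 1: continuous optimization (may settle anywhere on a flat plateau)
    res = optimize.minimize_scalar(
        lambda t: -_metric_score(y_true, y_prob, t, metric),
        bounds=(0, 1), method='bounded',
    )
    # Step 2: discrete candidates -- every breakpoint, plus 0.0 for "predict all positive"
    candidates = [res.x, 0.0]
    candidates.extend(np.unique(y_prob))
    # Keep whichever candidate scores best; this can never be worse than step 1 alone
    return max(candidates, key=lambda t: _metric_score(y_true, y_prob, t, metric))

t_fb = threshold_with_fallback(y_test, y_prob_test, 'f1')
print(f"Fallback threshold: {t_fb:.4f}, F1 = {_metric_score(y_test, y_prob_test, t_fb, 'f1'):.4f}")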
4. Why the Fallback Mechanism Works¶
The key insight is that a metric can only change value at the unique predicted probabilities, so evaluating it at those breakpoints is guaranteed to find the best achievable score. Any threshold strictly between two adjacent breakpoints produces exactly the same predictions as one of them, which is also why the optimizer may report a threshold from inside the optimal plateau (such as the t = 0.4 below) rather than a breakpoint itself. The cell below works through this on a small example:
[6]:
def demonstrate_optimal_at_breakpoints():
"""Show that the optimal threshold is always at a breakpoint."""
# Create example with clear optimal point
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.3, 0.6, 0.7, 0.8, 0.9])
print("Demonstrating that optimal threshold is at a breakpoint...\n")
print(f"Data: labels = {y_true}")
print(f" probs = {y_prob}\n")
# Import the correct function
from optimal_cutoffs.metrics_core import confusion_matrix_at_threshold
# Evaluate F1 at each unique probability
unique_probs = np.unique(y_prob)
print("F1 score at each unique probability (breakpoint):")
    for prob in unique_probs:
f1 = _metric_score(y_true, y_prob, prob, 'f1')
tp, tn, fp, fn = confusion_matrix_at_threshold(y_true, y_prob, prob)
print(f" t = {prob:.1f}: F1 = {f1:.3f} | TP={tp}, TN={tn}, FP={fp}, FN={fn}")
# Find optimal
result = optimize_thresholds(y_true, y_prob, metric='f1')
optimal_thresh = result.thresholds[0] # Get scalar value from array
optimal_f1 = _metric_score(y_true, y_prob, optimal_thresh, 'f1')
print(f"\n→ Optimal: t = {optimal_thresh:.1f}, F1 = {optimal_f1:.3f}")
# Now test a threshold between breakpoints
between_thresh = 0.65 # Between 0.6 and 0.7
between_f1 = _metric_score(y_true, y_prob, between_thresh, 'f1')
print(f"\nFor comparison, at t = {between_thresh:.2f} (between breakpoints):")
print(f" F1 = {between_f1:.3f} (same as t = 0.6 because both give same predictions)")
# Visualize predictions at different thresholds
print("\nPrediction vectors:")
for thresh in [0.6, 0.65, 0.7]:
        predictions = (y_prob > thresh).astype(int)  # strict '>' comparison, matching the metric computation above
print(f" t = {thresh:.2f}: {predictions}")
print("\n→ Note: t=0.6 and t=0.65 give the same predictions, hence same F1!")
demonstrate_optimal_at_breakpoints()
Demonstrating that optimal threshold is at a breakpoint...
Data: labels = [0 0 1 1 0 1]
probs = [0.2 0.3 0.6 0.7 0.8 0.9]
F1 score at each unique probability (breakpoint):
t = 0.2: F1 = 0.750 | TP=3.0, TN=1.0, FP=2.0, FN=0.0
t = 0.3: F1 = 0.857 | TP=3.0, TN=2.0, FP=1.0, FN=0.0
t = 0.6: F1 = 0.667 | TP=2.0, TN=2.0, FP=1.0, FN=1.0
t = 0.7: F1 = 0.400 | TP=1.0, TN=2.0, FP=1.0, FN=2.0
t = 0.8: F1 = 0.500 | TP=1.0, TN=3.0, FP=0.0, FN=2.0
t = 0.9: F1 = 0.000 | TP=0.0, TN=3.0, FP=0.0, FN=3.0
→ Optimal: t = 0.4, F1 = 0.857
For comparison, at t = 0.65 (between breakpoints):
F1 = 0.667 (same as t = 0.6 because both give same predictions)
Prediction vectors:
t = 0.60: [0 0 0 1 1 1]
t = 0.65: [0 0 0 1 1 1]
t = 0.70: [0 0 0 1 1 1]
→ Note: t=0.6 and t=0.65 give the same predictions, hence same F1!
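Stated a bit more formally (using the strict greater-than comparison implied by the confusion-matrix outputs above): the prediction for sample j at threshold t is 1[p_j > t]. If p(1) < p(2) < … < p(k) are the k unique probabilities, every prediction, and therefore the confusion matrix and any metric M built from it, is constant on each of the k + 1 intervals (−∞, p(1)), [p(1), p(2)), …, [p(k), ∞). The maximum of M over all thresholds is therefore the maximum of M over one representative per interval, for example the k breakpoints themselves plus any single value below p(1). This is why a finite scan is guaranteed to find the global optimum.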
5. Multiple Metrics Comparison¶
Different metrics often have different optimal thresholds:
[7]:
def compare_multiple_metrics(y_true, y_prob):
"""Show how different metrics have different optimal thresholds."""
metrics = ['accuracy', 'f1', 'precision', 'recall']
colors = ['blue', 'red', 'green', 'orange']
thresholds = np.linspace(0.05, 0.95, 200)
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
results = {}
for metric, color in zip(metrics, colors, strict=False):
# Calculate scores across threshold range
scores = [_metric_score(y_true, y_prob, t, metric) for t in thresholds]
ax.plot(thresholds, scores, color=color, linewidth=2, label=metric.capitalize())
# Find optimal threshold
result = optimize_thresholds(y_true, y_prob, metric=metric)
        optimal_thresh = result.thresholds[0]  # Get scalar value from array
optimal_score = _metric_score(y_true, y_prob, optimal_thresh, metric)
# Mark optimal point
ax.scatter([optimal_thresh], [optimal_score], color=color, s=150,
marker='*', zorder=5, edgecolors='black', linewidth=1)
results[metric] = (optimal_thresh, optimal_score)
# Add breakpoint lines
unique_probs = np.unique(y_prob)
for prob in unique_probs:
ax.axvline(x=prob, color='gray', linestyle='--', alpha=0.3)
ax.set_xlabel('Decision Threshold')
ax.set_ylabel('Metric Score')
ax.set_title('Different Metrics Have Different Optimal Thresholds\n' +
'(Stars show optimal points, dashed lines show breakpoints)')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_ylim(0, 1.05)
plt.tight_layout()
plt.show()
# Print results
print("Optimal thresholds by metric:")
for metric, (thresh, score) in results.items():
print(f" {metric:9}: t = {thresh:.3f}, score = {score:.3f}")
return results
# Demo with well-separated data
y_demo = np.array([0, 0, 0, 1, 1, 1])
p_demo = np.array([0.1, 0.3, 0.4, 0.6, 0.8, 0.9])
print(f"Demo data: labels = {y_demo}")
print(f" probs = {p_demo}\n")
metric_results = compare_multiple_metrics(y_demo, p_demo)
Demo data: labels = [0 0 0 1 1 1]
probs = [0.1 0.3 0.4 0.6 0.8 0.9]
Optimal thresholds by metric:
accuracy : t = 0.406, score = 1.000
f1 : t = 0.500, score = 1.000
precision: t = 0.850, score = 1.000
recall : t = 0.500, score = 1.000
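Note that this demo data is perfectly separable: every positive has a higher probability (0.6, 0.8, 0.9) than every negative (0.1, 0.3, 0.4). Any threshold in the gap between 0.4 and 0.6 therefore classifies everything correctly, and the reported values (0.406 for accuracy, 0.500 for F1 and recall) are simply points inside that optimal plateau; precision also stays at 1.000 at higher thresholds, which is why its reported optimum (0.850) sits outside the gap. A quick check, reusing _metric_score:

# Every threshold strictly between the highest negative (0.4) and the lowest positive (0.6)
# separates the demo data perfectly
for t in (0.41, 0.50, 0.59):
    acc = _metric_score(y_demo, p_demo, t, 'accuracy')
    f1 = _metric_score(y_demo, p_demo, t, 'f1')
    print(f"t = {t:.2f}: accuracy = {acc:.3f}, F1 = {f1:.3f}")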
6. Practical Implications¶
Key Takeaways¶
Piecewise-Constant Nature: Classification metrics only change at unique probability values
Optimization Challenge: Continuous optimizers can get stuck in flat regions and miss the global optimum
Smart Solution: Evaluate metrics at all unique probabilities (guaranteed global optimum)
Fallback Mechanism: Combine continuous optimization with discrete evaluation for robustness
Metric Differences: Different metrics often have different optimal thresholds
When This Matters Most¶
Imbalanced datasets: Default 0.5 threshold is often far from optimal
Cost-sensitive decisions: When false positives and false negatives have different costs (see the sketch after this list)
Metric optimization: When you need to maximize a specific metric (F1, precision, recall)
Model deployment: When converting probabilities to hard predictions
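The same breakpoint logic extends to cost-sensitive decisions, because total misclassification cost also depends on the threshold only through the confusion matrix and is therefore piecewise-constant as well. The sketch below uses made-up costs and a hypothetical expected_cost helper; it is not part of the library's API:

# Hypothetical cost-sensitive threshold selection via breakpoint enumeration.
# The cost values below are illustrative; set them to your application's real costs.
def expected_cost(y_true, y_prob, threshold, cost_fp=1.0, cost_fn=5.0):
    preds = (y_prob > threshold).astype(int)  # strict comparison, as in the examples above
    fp = int(np.sum((preds == 1) & (y_true == 0)))
    fn = int(np.sum((preds == 0) & (y_true == 1)))
    return cost_fp * fp + cost_fn * fn

def best_cost_threshold(y_true, y_prob, **costs):
    # Total cost only changes at the unique probabilities, so those are the only
    # candidates worth checking (plus 0.0 for the "predict everything positive" case)
    candidates = np.concatenate(([0.0], np.unique(y_prob)))
    return min(candidates, key=lambda t: expected_cost(y_true, y_prob, t, **costs))

print(best_cost_threshold(y_demo, p_demo, cost_fp=1.0, cost_fn=5.0))

The same pattern works for any objective that is a function of the confusion matrix alone.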
Computational Efficiency¶
The breakpoint-based sort-and-scan approach is very efficient:
Time complexity: O(n log n) to sort the probabilities, plus a single O(n) scan that evaluates the metric at every breakpoint
Candidate count: at most k = n unique probabilities; k equals n for continuous scores (as in the demo below) and is smaller when probabilities are tied or quantized
Guaranteed optimum: No risk of local minima or convergence issues
[8]:
# Final demonstration: efficiency comparison
import time
def efficiency_demo():
"""Demonstrate the efficiency of sort-scan vs continuous optimization."""
# Generate larger dataset
np.random.seed(42)
n_samples = 1000
y_large = np.random.randint(0, 2, n_samples)
p_large = np.random.beta(2, 2, n_samples)
n_unique = len(np.unique(p_large))
print(f"Efficiency test with {n_samples} samples, {n_unique} unique probabilities\n")
methods = [
('sort_scan', 'Sort-Scan Algorithm'),
('minimize', 'Minimize with Fallback'),
('gradient', 'Gradient Method')
]
for method_code, method_name in methods:
start_time = time.time()
result = optimize_thresholds(y_large, p_large, metric='f1', method=method_code)
end_time = time.time()
threshold = result.thresholds[0] # Get scalar value from array
score = _metric_score(y_large, p_large, threshold, 'f1')
duration = end_time - start_time
print(f"{method_name:20} | Time: {duration:.4f}s | F1: {score:.4f} | Threshold: {threshold:.4f}")
print(f"\n→ Sort-scan algorithm evaluates only {n_unique} points vs {n_samples} samples!")
efficiency_demo()
Gradient optimization is ineffective for piecewise-constant metrics. Use sort_scan instead.
Efficiency test with 1000 samples, 1000 unique probabilities
Sort-Scan Algorithm | Time: 0.0006s | F1: 0.6759 | Threshold: 0.0146
Minimize with Fallback | Time: 0.0068s | F1: 0.6730 | Threshold: 0.1186
Gradient Method | Time: 0.0012s | F1: 0.5267 | Threshold: 0.5031
→ Sort-scan evaluates all 1000 candidate thresholds in a single sorted pass over 1000 samples.
Conclusion¶
This notebook demonstrated the piecewise-constant nature of classification metrics and why this creates challenges for traditional optimization methods. The optimal-classification-cutoffs library addresses these challenges through:
Smart algorithms that leverage the mathematical structure of the problem
Fallback mechanisms that ensure robust optimization
Efficient implementation that scales well with dataset size
What’s Next?¶
Now that you understand the mathematical foundations:
Apply to your data: Use the techniques from our examples
01_quickstart.py: Get immediate performance improvements
02_business_value.py: Optimize for real business metrics
03_multiclass.py: Handle complex multi-class scenarios