🔬 Interactive Demo: Deep Dive into Threshold Behavior

ROI: Understand why optimal thresholds work and build intuition
Time: 20+ minutes of exploration and discovery
Previous: Run 01_quickstart.py → 02_business_value.py → 03_multiclass.py first

This interactive notebook lets you explore the mathematical foundations behind optimal threshold selection. Perfect for understanding why the library works so well!

🎯 Key Learning Objectives

  • Piecewise-constant: Why metrics only change at specific points

  • Breakpoints: The unique probabilities where metrics can change

  • Optimization challenges: Why continuous methods can fail

  • Algorithm insights: How smart methods guarantee global optimum

🚀 Quick Start

Run the cells below to start exploring. In this documentation build the plots render statically (the setup cell falls back to static output); the examples show how different data characteristics affect optimal thresholds.

[1]:
# Import optimal_cutoffs functions
import sys

import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display
from scipy import optimize

from optimal_cutoffs import optimize_thresholds
from optimal_cutoffs.metrics import compute_metric_at_threshold

# Create alias for backward compatibility with notebook code
def _metric_score(y_true, y_prob, threshold, metric):
    return compute_metric_at_threshold(y_true, y_prob, threshold, metric)

# Set up matplotlib: inline plots under IPython, otherwise fall back to the non-interactive Agg backend
try:
    get_ipython().run_line_magic('matplotlib', 'inline')
    INTERACTIVE_MODE = False
    print("Using static plots for documentation build")
except Exception:
    import matplotlib
    matplotlib.use('Agg')  # Non-interactive backend
    INTERACTIVE_MODE = False
    print("Using non-interactive backend")

plt.style.use('default')
Using static plots for documentation build

1. Basic Demonstration

Let’s start with a simple example to see the piecewise-constant behavior of a threshold metric:

[2]:
# Example data
y_true = np.array([0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

print("Example data:")
print(f"True labels:  {y_true}")
print(f"Probabilities: {y_prob}")
print(f"\nUnique probabilities (breakpoints): {np.unique(y_prob)}")
Example data:
True labels:  [0 0 1 1 0 1 0]
Probabilities: [0.1 0.3 0.4 0.6 0.7 0.8 0.9]

Unique probabilities (breakpoints): [0.1 0.3 0.4 0.6 0.7 0.8 0.9]
[3]:
def plot_piecewise_metric(y_true, y_prob, metric='f1', title_suffix=''):
    """Plot a metric vs threshold showing piecewise-constant behavior."""

    # Generate dense threshold grid for smooth plotting
    thresholds = np.linspace(0.05, 0.95, 500)
    scores = [_metric_score(y_true, y_prob, t, metric) for t in thresholds]

    # Find breakpoints (unique probabilities)
    breakpoints = np.unique(y_prob)
    breakpoint_scores = [_metric_score(y_true, y_prob, t, metric) for t in breakpoints]

    # Find optimal threshold
    result = optimize_thresholds(y_true, y_prob, metric=metric, method='sort_scan')
    optimal_threshold = result.thresholds[0]  # Get scalar value from array
    optimal_score = _metric_score(y_true, y_prob, optimal_threshold, metric)

    # Create plot
    fig, ax = plt.subplots(1, 1, figsize=(12, 6))

    # Plot the metric function
    ax.plot(thresholds, scores, 'b-', linewidth=2, label=f'{metric.upper()} Score')

    # Mark breakpoints
    ax.scatter(breakpoints, breakpoint_scores, color='red', s=80, zorder=5,
              label=f'Breakpoints ({len(breakpoints)} points)')

    # Mark optimal
    ax.scatter([optimal_threshold], [optimal_score], color='green', s=150,
              marker='*', zorder=6, label=f'Optimal (t={optimal_threshold:.3f})')

    # Add vertical lines at breakpoints
    for bp in breakpoints:
        ax.axvline(x=bp, color='red', linestyle='--', alpha=0.3)

    ax.set_xlabel('Decision Threshold')
    ax.set_ylabel(f'{metric.upper()} Score')
    ax.set_title(f'Piecewise-Constant Nature of {metric.upper()} Score{title_suffix}')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim(0, 1.05)

    plt.tight_layout()
    plt.show()

    return fig, optimal_threshold, optimal_score

# Plot F1 score for our example
fig, opt_thresh, opt_score = plot_piecewise_metric(y_true, y_prob, 'f1')
print(f"\nOptimal F1 threshold: {opt_thresh:.3f} (F1 = {opt_score:.3f})")
[Figure: piecewise-constant F1 score vs. decision threshold, with breakpoints marked in red and the optimal threshold starred]

Optimal F1 threshold: 0.350 (F1 = 0.750)

2. Interactive Exploration

The cell below walks through a set of static examples showing how different data characteristics affect the piecewise-constant structure:

[4]:
def create_static_demo():
    """Create static examples showing piecewise-constant behavior with different data characteristics."""

    print("📊 STATIC EXAMPLES: Different Data Characteristics")
    print("=" * 55)

    # Example 1: Small imbalanced dataset
    print("\n1️⃣ Small Imbalanced Dataset (5 samples, 20% positive)")
    np.random.seed(42)
    y_ex1 = np.array([0, 0, 0, 1, 1])
    p_ex1 = np.array([0.1, 0.3, 0.4, 0.7, 0.9])
    fig1, opt1, score1 = plot_piecewise_metric(y_ex1, p_ex1, 'f1',
                                               title_suffix='\nSmall Imbalanced Dataset')
    print(f"   → Optimal F1: {opt1:.3f} (score = {score1:.3f})")
    print(f"   → Breakpoints: {len(np.unique(p_ex1))} unique probabilities")

    # Example 2: Larger balanced dataset
    print("\n2️⃣ Larger Balanced Dataset (20 samples, ~50% positive)")
    np.random.seed(123)
    y_ex2 = np.random.randint(0, 2, 20)
    p_ex2 = np.random.beta(2, 2, 20)  # Bell-shaped distribution
    # Sort for cleaner visualization
    sort_idx = np.argsort(p_ex2)
    y_ex2, p_ex2 = y_ex2[sort_idx], p_ex2[sort_idx]

    fig2, opt2, score2 = plot_piecewise_metric(y_ex2, p_ex2, 'f1',
                                               title_suffix='\nLarger Balanced Dataset')
    print(f"   → Optimal F1: {opt2:.3f} (score = {score2:.3f})")
    print(f"   → Breakpoints: {len(np.unique(p_ex2))} unique probabilities")

    # Example 3: Precision vs Recall trade-off
    print("\n3️⃣ Precision vs Recall Comparison")
    y_ex3 = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    p_ex3 = np.array([0.1, 0.3, 0.4, 0.6, 0.65, 0.8, 0.85, 0.9])

    # Compare different metrics on same data
    metrics_to_compare = ['precision', 'recall', 'f1']
    print(f"   Data: {len(y_ex3)} samples, {y_ex3.sum()} positive")

    for metric in metrics_to_compare:
        result = optimize_thresholds(y_ex3, p_ex3, metric=metric)
        optimal_thresh = result.thresholds[0]
        optimal_score = _metric_score(y_ex3, p_ex3, optimal_thresh, metric)
        print(f"   → {metric.capitalize()}: t={optimal_thresh:.3f}, score={optimal_score:.3f}")

    # Plot the trade-off
    thresholds = np.linspace(0.05, 0.95, 100)
    precision_scores = [_metric_score(y_ex3, p_ex3, t, 'precision') for t in thresholds]
    recall_scores = [_metric_score(y_ex3, p_ex3, t, 'recall') for t in thresholds]
    f1_scores = [_metric_score(y_ex3, p_ex3, t, 'f1') for t in thresholds]

    fig, ax = plt.subplots(1, 1, figsize=(12, 6))
    ax.plot(thresholds, precision_scores, 'g-', linewidth=2, label='Precision')
    ax.plot(thresholds, recall_scores, 'r-', linewidth=2, label='Recall')
    ax.plot(thresholds, f1_scores, 'b-', linewidth=2, label='F1 Score')

    # Mark optimal points
    for metric, color in zip(['precision', 'recall', 'f1'], ['green', 'red', 'blue']):
        result = optimize_thresholds(y_ex3, p_ex3, metric=metric)
        opt_t = result.thresholds[0]
        opt_s = _metric_score(y_ex3, p_ex3, opt_t, metric)
        ax.scatter([opt_t], [opt_s], color=color, s=150, marker='*',
                  edgecolors='black', zorder=5)

    ax.set_xlabel('Decision Threshold')
    ax.set_ylabel('Metric Score')
    ax.set_title('Precision vs Recall Trade-off\nStars show optimal thresholds for each metric')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim(0, 1.05)

    plt.tight_layout()
    plt.show()

    print("\n💡 Key Insights:")
    print("   • Precision optimal: High threshold (fewer false positives)")
    print("   • Recall optimal: Low threshold (fewer false negatives)")
    print("   • F1 optimal: Balanced trade-off between precision and recall")

# Run the static demo
create_static_demo()
📊 STATIC EXAMPLES: Different Data Characteristics
=======================================================

1️⃣ Small Imbalanced Dataset (5 samples, 40% positive)
[Figure: F1 vs. threshold for the small imbalanced dataset]
   → Optimal F1 threshold: 0.550 (F1 = 1.000)
   → Breakpoints: 5 unique probabilities

2️⃣ Larger Balanced Dataset (20 samples, ~50% positive)
[Figure: F1 vs. threshold for the larger balanced dataset]
   → Optimal F1 threshold: 0.221 (F1 = 0.621)
   → Breakpoints: 20 unique probabilities

3️⃣ Precision vs Recall Comparison
   Data: 8 samples, 4 positive
   → Precision: t=0.875, score=1.000
   → Recall: t=0.350, score=1.000
   → F1: t=0.350, score=0.800
[Figure: precision, recall, and F1 vs. threshold; stars mark the optimal threshold for each metric]

💡 Key Insights:
   • Precision optimal: High threshold (fewer false positives)
   • Recall optimal: Low threshold (fewer false negatives)
   • F1 optimal: Balanced trade-off between precision and recall

3. Optimization Methods Comparison

Let’s compare different optimization approaches on the same data:

[5]:
def compare_optimization_methods(y_true, y_prob, metric='f1'):
    """Compare different threshold optimization methods."""

    print(f"Comparing optimization methods for {metric.upper()} score...\n")

    # Method 1: Sort-scan algorithm (our recommended approach)
    result_sort_scan = optimize_thresholds(y_true, y_prob, metric=metric, method='sort_scan')
    thresh_sort_scan = result_sort_scan.thresholds[0]  # Get scalar value from array
    score_sort_scan = _metric_score(y_true, y_prob, thresh_sort_scan, metric)

    # Method 2: scipy.optimize.minimize_scalar (continuous optimization)
    result = optimize.minimize_scalar(
        lambda t: -_metric_score(y_true, y_prob, t, metric),
        bounds=(0, 1),
        method='bounded'
    )
    thresh_minimize = result.x
    score_minimize = _metric_score(y_true, y_prob, thresh_minimize, metric)

    # Method 3: With fallback (what our 'minimize' method actually does)
    result_fallback = optimize_thresholds(y_true, y_prob, metric=metric, method='minimize')
    thresh_fallback = result_fallback.thresholds[0]  # Get scalar value from array
    score_fallback = _metric_score(y_true, y_prob, thresh_fallback, metric)

    # Display results
    methods = [
        ('Sort-Scan Algorithm', thresh_sort_scan, score_sort_scan),
        ('minimize_scalar Only', thresh_minimize, score_minimize),
        ('With Fallback', thresh_fallback, score_fallback)
    ]

    for name, threshold, score in methods:
        print(f"{name:18} | Threshold: {threshold:.4f} | {metric.upper()}: {score:.4f}")

    # Create visualization
    thresholds = np.linspace(0.01, 0.99, 500)
    scores = [_metric_score(y_true, y_prob, t, metric) for t in thresholds]

    unique_probs = np.unique(y_prob)
    unique_scores = [_metric_score(y_true, y_prob, t, metric) for t in unique_probs]

    fig, ax = plt.subplots(1, 1, figsize=(12, 6))

    # Plot metric function
    ax.plot(thresholds, scores, 'b-', linewidth=1.5, alpha=0.7, label=f'{metric.upper()} Score')

    # Plot breakpoints
    ax.scatter(unique_probs, unique_scores, color='lightcoral', s=30, alpha=0.6,
              label=f'Breakpoints ({len(unique_probs)} points)')

    # Plot results from different methods
    colors = ['green', 'red', 'blue']
    markers = ['*', 'X', 'D']  # filled 'X' marker so the black edge color applies (plain 'x' has no face)

    for (name, threshold, score), color, marker in zip(methods, colors, markers, strict=False):
        ax.scatter([threshold], [score], color=color, s=120, marker=marker,
                  zorder=5, label=f'{name}\n(t={threshold:.3f})', edgecolors='black')

    ax.set_xlabel('Decision Threshold')
    ax.set_ylabel(f'{metric.upper()} Score')
    ax.set_title('Comparison of Optimization Methods')
    ax.grid(True, alpha=0.3)
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

    plt.tight_layout()
    plt.show()

    return methods

# Test with example data
np.random.seed(123)
n = 15
y_test = np.random.randint(0, 2, n)
y_prob_test = np.random.beta(2, 2, n)

print(f"Test data: {n} samples with {len(np.unique(y_prob_test))} unique probabilities\n")
results = compare_optimization_methods(y_test, y_prob_test, 'f1')
Test data: 15 samples with 15 unique probabilities

Comparing optimization methods for F1 score...

Sort-Scan Algorithm    | Threshold: 0.1977 | F1: 0.6316
minimize_scalar Only   | Threshold: 0.2016 | F1: 0.6316
With Fallback          | Threshold: 0.1855 | F1: 0.6316
[Figure: comparison of optimization methods on the F1 curve, with breakpoints and each method's chosen threshold]

4. Why the Fallback Mechanism Works

The key insight is that a classification metric can only change value at the unique predicted probabilities: every threshold between two consecutive breakpoints produces exactly the same predictions, so examining the breakpoints is enough to find the global optimum. Here’s why:

[6]:
def demonstrate_optimal_at_breakpoints():
    """Show that the optimal threshold is always at a breakpoint."""

    # Create example with clear optimal point
    y_true = np.array([0, 0, 1, 1, 0, 1])
    y_prob = np.array([0.2, 0.3, 0.6, 0.7, 0.8, 0.9])

    print("Demonstrating that optimal threshold is at a breakpoint...\n")
    print(f"Data: labels = {y_true}")
    print(f"      probs  = {y_prob}\n")

    # Import the correct function
    from optimal_cutoffs.metrics_core import confusion_matrix_at_threshold

    # Evaluate F1 at each unique probability
    unique_probs = np.unique(y_prob)
    print("F1 score at each unique probability (breakpoint):")

    for _i, prob in enumerate(unique_probs):
        f1 = _metric_score(y_true, y_prob, prob, 'f1')
        tp, tn, fp, fn = confusion_matrix_at_threshold(y_true, y_prob, prob)
        print(f"  t = {prob:.1f}: F1 = {f1:.3f} | TP={tp}, TN={tn}, FP={fp}, FN={fn}")

    # Find optimal
    result = optimize_thresholds(y_true, y_prob, metric='f1')
    optimal_thresh = result.thresholds[0]  # Get scalar value from array
    optimal_f1 = _metric_score(y_true, y_prob, optimal_thresh, 'f1')

    print(f"\n→ Optimal: t = {optimal_thresh:.1f}, F1 = {optimal_f1:.3f}")

    # Now test a threshold between breakpoints
    between_thresh = 0.65  # Between 0.6 and 0.7
    between_f1 = _metric_score(y_true, y_prob, between_thresh, 'f1')

    print(f"\nFor comparison, at t = {between_thresh:.2f} (between breakpoints):")
    print(f"  F1 = {between_f1:.3f} (same as t = 0.6 because both give same predictions)")

    # Visualize predictions at different thresholds
    print("\nPrediction vectors:")
    for thresh in [0.6, 0.65, 0.7]:
        # Exclusive comparison (prob > threshold), matching the F1 values printed above
        predictions = (y_prob > thresh).astype(int)
        print(f"  t = {thresh:.2f}: {predictions}")

    print("\n→ Note: t=0.6 and t=0.65 give the same predictions, hence same F1!")

demonstrate_optimal_at_breakpoints()
Demonstrating that optimal threshold is at a breakpoint...

Data: labels = [0 0 1 1 0 1]
      probs  = [0.2 0.3 0.6 0.7 0.8 0.9]

F1 score at each unique probability (breakpoint):
  t = 0.2: F1 = 0.750 | TP=3.0, TN=1.0, FP=2.0, FN=0.0
  t = 0.3: F1 = 0.857 | TP=3.0, TN=2.0, FP=1.0, FN=0.0
  t = 0.6: F1 = 0.667 | TP=2.0, TN=2.0, FP=1.0, FN=1.0
  t = 0.7: F1 = 0.400 | TP=1.0, TN=2.0, FP=1.0, FN=2.0
  t = 0.8: F1 = 0.500 | TP=1.0, TN=3.0, FP=0.0, FN=2.0
  t = 0.9: F1 = 0.000 | TP=0.0, TN=3.0, FP=0.0, FN=3.0

→ Optimal: t = 0.4, F1 = 0.857

For comparison, at t = 0.65 (between breakpoints):
  F1 = 0.667 (same as t = 0.6 because both give same predictions)

Prediction vectors:
  t = 0.60: [0 0 0 1 1 1]
  t = 0.65: [0 0 0 1 1 1]
  t = 0.70: [0 0 0 0 1 1]

→ Note: t=0.6 and t=0.65 give the same predictions, hence same F1!
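
Note also that the reported optimum t = 0.4 is not itself a breakpoint: it simply lies inside the optimal interval [0.3, 0.6), where every threshold yields the same predictions as the breakpoint at 0.3. Below is a minimal sketch of that interval argument, assuming the exclusive prob > threshold comparison that the F1 table above reflects (y_bp / p_bp are local copies of the demo data):

# Local copies of the data from the demonstration above
y_bp = np.array([0, 0, 1, 1, 0, 1])
p_bp = np.array([0.2, 0.3, 0.6, 0.7, 0.8, 0.9])

def f1_exclusive(threshold):
    # F1 under the exclusive comparison (prob > threshold), as in the table above
    pred = (p_bp > threshold).astype(int)
    tp = np.sum((pred == 1) & (y_bp == 1))
    fp = np.sum((pred == 1) & (y_bp == 0))
    fn = np.sum((pred == 0) & (y_bp == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# F1 is constant on each half-open interval [b_i, b_{i+1}) between breakpoints,
# so any interior threshold (such as the optimizer's 0.4) scores exactly the
# same as the interval's left breakpoint (0.3).
breakpoints = np.unique(p_bp)
for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
    mid = (lo + hi) / 2
    assert f1_exclusive(mid) == f1_exclusive(lo)
    print(f"[{lo:.1f}, {hi:.1f}): F1 = {f1_exclusive(lo):.3f}")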

5. Multiple Metrics Comparison

Different metrics often have different optimal thresholds:

[7]:
def compare_multiple_metrics(y_true, y_prob):
    """Show how different metrics have different optimal thresholds."""

    metrics = ['accuracy', 'f1', 'precision', 'recall']
    colors = ['blue', 'red', 'green', 'orange']

    thresholds = np.linspace(0.05, 0.95, 200)

    fig, ax = plt.subplots(1, 1, figsize=(12, 8))

    results = {}

    for metric, color in zip(metrics, colors, strict=False):
        # Calculate scores across threshold range
        scores = [_metric_score(y_true, y_prob, t, metric) for t in thresholds]
        ax.plot(thresholds, scores, color=color, linewidth=2, label=metric.capitalize())

        # Find optimal threshold
        result = optimize_thresholds(y_true, y_prob, metric=metric)
        optimal_thresh = result.thresholds[0]  # Get scalar value from array
        optimal_score = _metric_score(y_true, y_prob, optimal_thresh, metric)

        # Mark optimal point
        ax.scatter([optimal_thresh], [optimal_score], color=color, s=150,
                  marker='*', zorder=5, edgecolors='black', linewidth=1)

        results[metric] = (optimal_thresh, optimal_score)

    # Add breakpoint lines
    unique_probs = np.unique(y_prob)
    for prob in unique_probs:
        ax.axvline(x=prob, color='gray', linestyle='--', alpha=0.3)

    ax.set_xlabel('Decision Threshold')
    ax.set_ylabel('Metric Score')
    ax.set_title('Different Metrics Have Different Optimal Thresholds\n' +
                '(Stars show optimal points, dashed lines show breakpoints)')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_ylim(0, 1.05)

    plt.tight_layout()
    plt.show()

    # Print results
    print("Optimal thresholds by metric:")
    for metric, (thresh, score) in results.items():
        print(f"  {metric:9}: t = {thresh:.3f}, score = {score:.3f}")

    return results

# Demo with well-separated data
y_demo = np.array([0, 0, 0, 1, 1, 1])
p_demo = np.array([0.1, 0.3, 0.4, 0.6, 0.8, 0.9])

print(f"Demo data: labels = {y_demo}")
print(f"           probs  = {p_demo}\n")

metric_results = compare_multiple_metrics(y_demo, p_demo)
Demo data: labels = [0 0 0 1 1 1]
           probs  = [0.1 0.3 0.4 0.6 0.8 0.9]

[Figure: accuracy, F1, precision, and recall vs. threshold; stars mark each metric's optimal threshold, dashed lines mark breakpoints]
Optimal thresholds by metric:
  accuracy : t = 0.406, score = 1.000
  f1       : t = 0.500, score = 1.000
  precision: t = 0.850, score = 1.000
  recall   : t = 0.500, score = 1.000

6. Practical Implications

Key Takeaways

  1. Piecewise-Constant Nature: Classification metrics only change at unique probability values

  2. Optimization Challenge: Continuous optimizers can get stuck in flat regions and miss the global optimum (a short numerical check follows this list)

  3. Smart Solution: Evaluate metrics at all unique probabilities (guaranteed global optimum)

  4. Fallback Mechanism: Combine continuous optimization with discrete evaluation for robustness

  5. Metric Differences: Different metrics often have different optimal thresholds
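
A quick numerical check of takeaway 2, reusing the y_demo / p_demo data from Section 5 and the _metric_score helper defined at the top: between breakpoints the metric is exactly flat, so a finite-difference slope is zero and a local, gradient-style optimizer sees no direction to move in.

# F1 is flat between the breakpoints 0.4 and 0.6, so a numerical gradient vanishes
t0, eps = 0.45, 1e-6
f_here = _metric_score(y_demo, p_demo, t0, 'f1')
f_near = _metric_score(y_demo, p_demo, t0 + eps, 'f1')
print(f"F1 at {t0}: {f_here:.3f}   F1 at {t0 + eps}: {f_near:.3f}")
print(f"Finite-difference slope: {(f_near - f_here) / eps:.1f}")  # 0.0 -> no descent signal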

When This Matters Most

  • Imbalanced datasets: The default 0.5 threshold is often far from optimal (see the sketch after this list)

  • Cost-sensitive decisions: When false positives and false negatives have different costs

  • Metric optimization: When you need to maximize a specific metric (F1, precision, recall)

  • Model deployment: When converting probabilities to hard predictions
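
To make the imbalanced-data point concrete, here is a small sketch on synthetic data; the class rate, score distribution, and random seed below are illustrative assumptions, and only optimize_thresholds and the _metric_score helper from above are used.

# Synthetic imbalanced problem: ~10% positives, heavily overlapping score distributions
rng = np.random.default_rng(0)
n = 2000
y_imb = (rng.random(n) < 0.10).astype(int)
p_imb = np.clip(rng.normal(0.25 + 0.25 * y_imb, 0.15), 0.001, 0.999)

f1_default = _metric_score(y_imb, p_imb, 0.5, 'f1')        # conventional 0.5 cutoff
result = optimize_thresholds(y_imb, p_imb, metric='f1')    # F1-optimal cutoff
t_opt = result.thresholds[0]
f1_opt = _metric_score(y_imb, p_imb, t_opt, 'f1')

print(f"Default threshold 0.5      : F1 = {f1_default:.3f}")
print(f"Optimized threshold {t_opt:.3f}: F1 = {f1_opt:.3f}")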

Computational Efficiency

The sort-and-scan approach (the library's sort_scan method) is very efficient; a minimal sketch of the idea follows the list below:

  • Time complexity: O(n log n) to sort, then a single O(n) scan that scores every candidate threshold

  • Candidate thresholds: only the k ≤ n unique probabilities need to be scored (k can be much smaller than n when scores are discretized)

  • Guaranteed optimum: No risk of local minima or convergence issues
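
For intuition, here is a minimal sort-and-scan sketch in plain NumPy. It is an illustration of the idea only, not the library's implementation, and it ignores tied probabilities: one sort plus cumulative sums yields the confusion matrix, and hence F1, at every candidate cut in a single pass.

def f1_sort_scan_sketch(y_true, y_prob):
    """Score F1 at every candidate cut after one O(n log n) sort."""
    y = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_prob))   # descending by predicted probability
    y_sorted = y[order]
    p_sorted = np.asarray(y_prob)[order]
    tp = np.cumsum(y_sorted)                  # TP when the top i+1 samples are called positive
    fp = np.cumsum(1 - y_sorted)              # FP for the same cuts
    fn = y_sorted.sum() - tp                  # positives missed below each cut
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    best = int(np.argmax(f1))                 # single O(n) scan over all candidate cuts
    return p_sorted[best], f1[best]

# On the Section 5 demo data this recovers the perfect F1 split
cut_prob, best_f1 = f1_sort_scan_sketch(y_demo, p_demo)
print(f"Best cut: call samples with prob >= {cut_prob:.2f} positive -> F1 = {best_f1:.3f}")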

[8]:
# Final demonstration: efficiency comparison
import time


def efficiency_demo():
    """Demonstrate the efficiency of sort-scan vs continuous optimization."""

    # Generate larger dataset
    np.random.seed(42)
    n_samples = 1000
    y_large = np.random.randint(0, 2, n_samples)
    p_large = np.random.beta(2, 2, n_samples)

    n_unique = len(np.unique(p_large))

    print(f"Efficiency test with {n_samples} samples, {n_unique} unique probabilities\n")

    methods = [
        ('sort_scan', 'Sort-Scan Algorithm'),
        ('minimize', 'Minimize with Fallback'),
        ('gradient', 'Gradient Method')
    ]

    for method_code, method_name in methods:
        start_time = time.time()
        result = optimize_thresholds(y_large, p_large, metric='f1', method=method_code)
        end_time = time.time()

        threshold = result.thresholds[0]  # Get scalar value from array
        score = _metric_score(y_large, p_large, threshold, 'f1')
        duration = end_time - start_time

        print(f"{method_name:20} | Time: {duration:.4f}s | F1: {score:.4f} | Threshold: {threshold:.4f}")

    print(f"\n→ Sort-scan algorithm evaluates only {n_unique} points vs {n_samples} samples!")

efficiency_demo()
Gradient optimization is ineffective for piecewise-constant metrics. Use sort_scan instead.
Efficiency test with 1000 samples, 1000 unique probabilities

Sort-Scan Algorithm    | Time: 0.0006s | F1: 0.6759 | Threshold: 0.0146
Minimize with Fallback | Time: 0.0068s | F1: 0.6730 | Threshold: 0.1186
Gradient Method        | Time: 0.0012s | F1: 0.5267 | Threshold: 0.5031

→ Sort-scan scores all 1000 candidate thresholds in a single sorted pass.

Conclusion

This notebook demonstrated the piecewise-constant nature of classification metrics and why this creates challenges for traditional optimization methods. The optimal-classification-cutoffs library addresses these challenges through:

  1. Smart algorithms that leverage the mathematical structure of the problem

  2. Fallback mechanisms that ensure robust optimization

  3. Efficient implementation that scales well with dataset size

What’s Next?

Now that you understand the mathematical foundations:

  • Apply to your data: Use the techniques from our examples

  • 01_quickstart.py: Get immediate performance improvements

  • 02_business_value.py: Optimize for real business metrics

  • 03_multiclass.py: Handle complex multi-class scenarios

Additional Resources