Theory and Background

This page explains the theoretical foundations underlying the optimal threshold selection algorithms, particularly focusing on the piecewise-constant nature of classification metrics and the relationship between threshold optimization and probability calibration.

Piecewise-Constant Metrics

Understanding the Step Function Nature

Most classification metrics—including F1 score, accuracy, precision, and recall—are piecewise-constant functions with respect to the decision threshold. This fundamental property has important implications for optimization algorithms.

Why Metrics Are Piecewise-Constant

Classification metrics depend on confusion matrix counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These counts are determined by comparing predicted probabilities to a threshold:

\[\begin{split}\text{prediction} = \begin{cases} 1 & \text{if } p \geq \text{threshold} \\ 0 & \text{if } p < \text{threshold} \end{cases}\end{split}\]

Key insight: The confusion matrix counts only change when the threshold crosses one of the unique predicted probability values. Between any two consecutive probability values, all predictions remain unchanged, so the metric value stays constant.

For example, with predicted probabilities [0.2, 0.4, 0.7, 0.9]:

  • Thresholds in [0.0, 0.2] → all predictions are 1

  • Thresholds in (0.2, 0.4] → predictions are [0, 1, 1, 1]

  • Thresholds in (0.4, 0.7] → predictions are [0, 0, 1, 1]

  • Thresholds in (0.7, 0.9] → predictions are [0, 0, 0, 1]

  • Thresholds in (0.9, 1.0] → all predictions are 0

This creates a step function where the metric value jumps only at the unique probability values: {0.2, 0.4, 0.7, 0.9}.
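A quick NumPy check (independent of the library) confirms that any two thresholds inside the same interval produce the same prediction vector:

import numpy as np

probs = np.array([0.2, 0.4, 0.7, 0.9])

# Two thresholds from the same interval (0.4, 0.7] give identical predictions
print((probs >= 0.45).astype(int))  # [0 0 1 1]
print((probs >= 0.70).astype(int))  # [0 0 1 1]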

Mathematical Proof

Let \(S = \{p_1, p_2, \ldots, p_k\}\) be the set of unique predicted probabilities, sorted in ascending order. For any threshold \(t\) in the interval \((p_i, p_{i+1})\), the prediction vector remains constant:

\[\begin{split}\hat{y}_j = \begin{cases} 1 & \text{if } p_j \geq t \text{ (equivalently, if } p_j > p_i \text{)} \\ 0 & \text{otherwise} \end{cases}\end{split}\]

Since the prediction vector is constant over each interval, the confusion matrix elements (TP, TN, FP, FN) are also constant over each interval. Therefore, any metric function \(M(\text{TP}, \text{TN}, \text{FP}, \text{FN})\) is piecewise-constant with breakpoints at the unique probability values.
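In particular, the maximization can be restricted to one candidate threshold per constant region. Assuming all \(p_i \in (0, 1)\), a convenient choice is the \(k + 1\) candidates

\[T = \{0\} \;\cup\; \left\{ \tfrac{p_i + p_{i+1}}{2} : 1 \le i < k \right\} \;\cup\; \{1\},\]

which is why the sort-and-scan algorithm described below only needs to score a finite set of cuts and can report the optimal threshold as a midpoint between adjacent probabilities.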

Why Continuous Optimizers Can Miss the Maximum

Standard continuous optimization algorithms (like scipy.optimize.minimize_scalar) assume the objective function is smooth and use gradient-based or bracketing methods. However, piecewise-constant functions have several problematic properties:

  1. Zero gradients: The derivative is zero everywhere except at the breakpoints (where it’s undefined)

  2. No local information: The gradient doesn’t point toward better solutions

  3. Arbitrary convergence: Optimizers may converge to any point within a constant region

Consider an example: suppose the best F1 of 0.85 is attained at the breakpoint 0.7, and minimize_scalar converges to 0.65, which lies in the same constant region. It reports the same F1 = 0.85 but returns an arbitrary interior point rather than the breakpoint itself. Worse, because the gradient carries no information about neighboring regions, the optimizer can also stall on a plateau whose value is strictly below the global maximum.
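A small illustration of this behavior (throwaway data, not the library's API): a bounded scalar optimizer simply stops somewhere inside a flat region of the step function.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

def neg_f1(t):
    # piecewise-constant objective: negative F1 of the hard predictions at threshold t
    return -f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)

result = minimize_scalar(neg_f1, bounds=(0.0, 1.0), method="bounded")
print(result.x, -result.fun)  # an interior point of some flat region, not necessarily a breakpoint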

The Optimized O(n log n) Solution

Since the optimal threshold must occur at one of the unique probability values, we can solve this problem optimally with a sort-and-scan algorithm:

import numpy as np

# predicted_probabilities: 1-D array of scores; labels: matching array of {0, 1} labels

# Sort predictions by descending probability (O(n log n))
sorted_indices = np.argsort(-predicted_probabilities)
labels_sorted = labels[sorted_indices]

# Cumulative confusion-matrix counts at every possible cut (O(n))
tp_cumsum = np.cumsum(labels_sorted)               # true positives at each cut
fp_cumsum = np.cumsum(1 - labels_sorted)           # false positives at each cut
fn_cumsum = labels_sorted.sum() - tp_cumsum        # positives missed at each cut
tn_cumsum = (1 - labels_sorted).sum() - fp_cumsum  # negatives correctly rejected (unused by F1)

# Vectorized metric evaluation across all n cuts (O(n)); F1 shown as the example metric
all_scores = 2 * tp_cumsum / np.maximum(2 * tp_cumsum + fp_cumsum + fn_cumsum, 1)
optimal_cut = np.argmax(all_scores)
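The winning cut index can then be mapped back to a decision threshold. A sketch of the midpoint convention mentioned in the list below (not the library's internal code):

# Threshold between the last probability predicted positive and the first predicted negative
probs_sorted = predicted_probabilities[sorted_indices]
if optimal_cut + 1 < len(probs_sorted):
    best_threshold = (probs_sorted[optimal_cut] + probs_sorted[optimal_cut + 1]) / 2
else:
    # every sample is predicted positive; any threshold at or below the minimum probability works
    best_threshold = probs_sorted[-1]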

This approach:

  • True O(n log n) complexity: Single sort followed by linear scan

  • Guarantees global optimum: Evaluates every possible cut point, which covers all unique-probability breakpoints

  • Vectorized operations: Eliminates Python loops for maximum efficiency

  • Numerically stable: Returns threshold as midpoint between adjacent probabilities

  • Exact results: No approximation error from numerical optimization

Performance improvements over the original approach:

  • 50-100x speedup on typical datasets

  • Scales better with dataset size (true O(n log n) vs O(k log n), where k can be large)

  • More robust numerical properties

Fallback Mechanism for Minimize Method

The library implements a hybrid approach for the "minimize" method:

  1. Run scipy.optimize.minimize_scalar to get a candidate threshold

  2. Also evaluate all unique probability values

  3. Return whichever gives the highest metric score

This fallback ensures that even when continuous optimization finds a suboptimal threshold within a constant region, the final result is still globally optimal.
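A minimal sketch of the same idea outside the library (the f1_at_threshold helper is hypothetical, and y_true / y_prob are assumed to be NumPy arrays):

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import f1_score

def f1_at_threshold(t, y_true, y_prob):
    # hypothetical helper: F1 of the hard predictions produced by threshold t
    return f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)

def hybrid_best_threshold(y_true, y_prob):
    # Candidate 1: continuous optimization (may stop anywhere inside a flat region)
    res = minimize_scalar(lambda t: -f1_at_threshold(t, y_true, y_prob),
                          bounds=(0.0, 1.0), method="bounded")
    # Candidate 2: every unique predicted probability (covers all breakpoints)
    candidates = [res.x, *np.unique(y_prob)]
    # Return whichever candidate scores highest
    return max(candidates, key=lambda t: f1_at_threshold(t, y_true, y_prob))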

Visualization

The following example illustrates the piecewise-constant nature of the F1 score:

# Generate example data
y_true = [0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# The F1 score will be constant between each pair of consecutive probabilities
# and will only change at the breakpoints: {0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9}

See examples/piecewise_visualization.ipynb for an interactive demonstration.
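A minimal, non-interactive sketch along the same lines (assumes matplotlib and scikit-learn are installed; not part of the library):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

# Evaluate F1 on a dense grid of thresholds; only the breakpoints change the value
grid = np.linspace(0.0, 1.0, 501)
scores = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in grid]

plt.step(grid, scores, where="post")
plt.xlabel("threshold")
plt.ylabel("F1 score")
plt.title("F1 is piecewise-constant in the threshold")
plt.show()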

Non-Piecewise Metrics

Some metrics are not piecewise-constant:

  • Log-loss (cross-entropy): Depends directly on probability values, not just binary predictions

  • Brier score: Quadratic loss that uses continuous probability values

  • ROC-AUC: Based on the ranking of scores rather than on a single hard cut

For these metrics, continuous optimization methods are more appropriate, and the library automatically falls back to evaluating all unique probabilities rather than using the O(n log n) piecewise algorithm.
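As a small illustration (scikit-learn, with made-up data), these metrics consume the probabilities directly, with no thresholding step:

import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

print(log_loss(y_true, y_prob))          # cross-entropy of the raw probabilities
print(brier_score_loss(y_true, y_prob))  # mean squared error of the raw probabilities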

Calibration and Threshold Optimization

Understanding the Relationship

Calibration and threshold optimization address different aspects of probabilistic classification:

  • Calibration ensures that predicted probabilities match true frequencies (e.g., among samples with predicted probability 0.7, approximately 70% should be positive)

  • Threshold optimization finds decision boundaries that maximize specific classification metrics

These are complementary techniques, not competing alternatives.

When to Use Calibration

Use probability calibration when:

  1. Probability interpretation matters: You need reliable probability estimates for decision-making

  2. Model comparison: Comparing models based on probability quality (e.g., using Brier score or log-likelihood)

  3. Risk assessment: Converting scores to meaningful probabilities for business decisions

  4. Ranking quality: Ensuring that higher probabilities truly indicate higher likelihood

Common calibration methods:

  • Platt scaling: Fits a sigmoid function to map scores to probabilities [Platt1999]

  • Isotonic regression: Non-parametric method that fits a monotonic function [ZadroznyCElkan2002]
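Both methods are available in scikit-learn through CalibratedClassifierCV. A sketch (LogisticRegression and the X_train / y_train names are placeholders for your own estimator and data):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base = LogisticRegression()

platt = CalibratedClassifierCV(base, method="sigmoid", cv=5)      # Platt scaling
isotonic = CalibratedClassifierCV(base, method="isotonic", cv=5)  # isotonic regression

platt.fit(X_train, y_train)
isotonic.fit(X_train, y_train)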

When to Use Threshold Optimization

Use threshold optimization when:

  1. Metric optimization: You need to maximize specific classification metrics (F1, precision, recall)

  2. Binary decisions: Converting probabilities to hard classifications for deployment

  3. Imbalanced classes: Standard 0.5 threshold is often suboptimal for imbalanced datasets

  4. Business constraints: Achieving specific precision/recall trade-offs

Using Both Together

The recommended workflow combines both techniques:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

from optimal_cutoffs import ThresholdOptimizer

# Step 1: Train your base classifier
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

# Step 2: Calibrate probabilities
calibrated_model = CalibratedClassifierCV(base_model, cv=3)
calibrated_model.fit(X_train, y_train)
y_prob_cal = calibrated_model.predict_proba(X_val)[:, 1]

# Step 3: Optimize threshold on calibrated probabilities
optimizer = ThresholdOptimizer(metric='f1')
optimizer.fit(y_val, y_prob_cal)

# Step 4: Make final predictions
y_prob_test = calibrated_model.predict_proba(X_test)[:, 1]
y_pred_test = optimizer.predict(y_prob_test)

Benefits of this combined approach:

  • Reliable probabilities: Calibration ensures probability quality

  • Optimal decisions: Threshold optimization maximizes your chosen metric

  • Robust pipeline: Works well across different types of models and datasets

Calibration Diagnostics

Before and after calibration, evaluate:

  1. Reliability diagrams: Plot predicted vs observed frequencies in probability bins

  2. Calibration metrics: Brier score, expected calibration error (ECE)

  3. Sharpness: How spread out the predicted probabilities are

Example diagnostic code:

from sklearn.calibration import calibration_curve

# Before calibration
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_true, y_prob_uncalibrated, n_bins=10
)

# After calibration
fraction_of_positives_cal, mean_predicted_value_cal = calibration_curve(
    y_true, y_prob_calibrated, n_bins=10
)
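A sketch of the accompanying numeric summaries; the ECE below is a simple equal-width, 10-bin version computed by hand, and y_true / y_prob_calibrated are assumed to be NumPy arrays:

import numpy as np
from sklearn.metrics import brier_score_loss

# Brier score: mean squared error between predicted probabilities and outcomes
brier = brier_score_loss(y_true, y_prob_calibrated)

# Expected calibration error: bin-weighted gap between observed frequency and mean confidence
bins = np.linspace(0.0, 1.0, 11)
bin_ids = np.clip(np.digitize(y_prob_calibrated, bins) - 1, 0, 9)
ece = sum(
    (bin_ids == b).mean()
    * abs(y_true[bin_ids == b].mean() - y_prob_calibrated[bin_ids == b].mean())
    for b in range(10)
    if (bin_ids == b).any()
)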

When Calibration May Not Be Needed

Skip calibration if:

  1. Only rankings matter: Using ROC-AUC or ranking-based evaluation

  2. Already well-calibrated: Model probabilities are already reliable (check with diagnostics)

  3. Computational constraints: Limited time/resources for the calibration step

  4. Threshold-only usage: Only need binary predictions, never probability estimates

References

[Platt1999]

Platt, J. (1999). “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods”. In Advances in Large Margin Classifiers.

[ZadroznyCElkan2002]

Zadrozny, B. & Elkan, C. (2002). “Transforming Classifier Scores into Accurate Multiclass Probability Estimates”. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[NiculescuMizilCaruana2005]

Niculescu-Mizil, A. & Caruana, R. (2005). “Predicting Good Probabilities with Supervised Learning”. In Proceedings of the 22nd International Conference on Machine Learning.