Quick Start

This guide will get you up and running with optimal-classification-cutoffs in just a few minutes.

Basic Binary Classification

The simplest use case is finding the optimal threshold for binary classification:

from optimal_cutoffs import get_optimal_threshold
import numpy as np

# Your binary classification data
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Find optimal threshold for F1 score
threshold = get_optimal_threshold(y_true, y_prob, metric='f1')
print(f"Optimal threshold: {threshold:.3f}")

# Make predictions
predictions = (y_prob >= threshold).astype(int)
print(f"Predictions: {predictions}")

Other Metrics

You can optimize for different metrics:

# Optimize for accuracy
threshold_acc = get_optimal_threshold(y_true, y_prob, metric='accuracy')

# Optimize for precision
threshold_prec = get_optimal_threshold(y_true, y_prob, metric='precision')

# Optimize for recall
threshold_rec = get_optimal_threshold(y_true, y_prob, metric='recall')

print(f"Accuracy threshold: {threshold_acc:.3f}")
print(f"Precision threshold: {threshold_prec:.3f}")
print(f"Recall threshold: {threshold_rec:.3f}")

Multiclass Classification

For multiclass problems, the library detects the problem type from the shape of the probability array (one column per class) and returns per-class thresholds:

# Multiclass example with 3 classes
y_true = np.array([0, 1, 2, 0, 1, 2])
y_prob = np.array([
    [0.7, 0.2, 0.1],  # Sample 1: likely class 0
    [0.1, 0.8, 0.1],  # Sample 2: likely class 1
    [0.1, 0.1, 0.8],  # Sample 3: likely class 2
    [0.6, 0.3, 0.1],  # Sample 4: likely class 0
    [0.2, 0.7, 0.1],  # Sample 5: likely class 1
    [0.1, 0.2, 0.7]   # Sample 6: likely class 2
])

# Get per-class optimal thresholds
thresholds = get_optimal_threshold(y_true, y_prob, metric='f1')
print(f"Optimal thresholds per class: {thresholds}")

Using the Scikit-learn Interface

For a scikit-learn-style workflow, use the ThresholdOptimizer class:

from optimal_cutoffs import ThresholdOptimizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Generate sample data
X = np.random.randn(1000, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_prob_train = clf.predict_proba(X_train)[:, 1]
y_prob_test = clf.predict_proba(X_test)[:, 1]

# Optimize threshold
optimizer = ThresholdOptimizer(metric='f1', method='smart_brute')
optimizer.fit(y_train, y_prob_train)

# Make optimized predictions
y_pred = optimizer.predict(y_prob_test)

print(f"Optimal threshold: {optimizer.threshold_:.3f}")
print(f"Test accuracy: {np.mean(y_pred == y_test):.3f}")

Optimization Methods

The library provides several optimization methods:

# Auto method selection (recommended). These examples reuse the binary
# y_true and y_prob from the first example above.
threshold = get_optimal_threshold(y_true, y_prob, metric='f1', method='auto')

# Fast O(n log n) algorithm for piecewise-constant metrics
threshold = get_optimal_threshold(y_true, y_prob, metric='f1', method='sort_scan')

# Brute-force evaluation of every unique predicted probability
threshold = get_optimal_threshold(y_true, y_prob, metric='f1', method='smart_brute')

# SciPy-based continuous optimization
threshold = get_optimal_threshold(y_true, y_prob, metric='f1', method='minimize')
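For piecewise-constant metrics such as F1, the exact methods should agree on the same cutoff (the continuous 'minimize' method can land on a slightly different point within the same optimal interval), so a quick cross-check is cheap:

# Cross-check that the methods agree on the binary example
for method in ('auto', 'sort_scan', 'smart_brute', 'minimize'):
    t = get_optimal_threshold(y_true, y_prob, metric='f1', method=method)
    print(f"{method}: {t:.3f}")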

Cost-Sensitive Optimization

For applications where different types of errors have different costs:

# False negatives cost 5x more than false positives
threshold = get_optimal_threshold(
    y_true, y_prob,
    utility={"fp": -1.0, "fn": -5.0}
)

# With benefits for correct predictions
threshold = get_optimal_threshold(
    y_true, y_prob,
    utility={"tp": 2.0, "tn": 1.0, "fp": -1.0, "fn": -5.0}
)
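As a sanity check: for perfectly calibrated probabilities, decision theory gives a closed-form optimal threshold directly from the utility values, predicting positive whenever p*(u_tp - u_fn) >= (1 - p)*(u_tn - u_fp). This is a textbook result, not part of the library's API:

# Closed-form Bayes threshold for the utility matrix in the second example
u = {"tp": 2.0, "tn": 1.0, "fp": -1.0, "fn": -5.0}
t_star = (u["tn"] - u["fp"]) / ((u["tn"] - u["fp"]) + (u["tp"] - u["fn"]))
print(f"Closed-form threshold: {t_star:.3f}")  # 2 / 9 ≈ 0.222

On finite samples with imperfect calibration, the empirical optimum can differ from this value, which is precisely why optimizing the threshold on data is worthwhile.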

Next Steps

  • Read the User Guide for detailed explanations and advanced features

  • Check out Examples for complete, end-to-end worked examples

  • Explore Advanced Topics such as cross-validation and custom metrics

  • Read Theory and Background to understand why this approach works better than a fixed default threshold