prediction_stability

prediction_stability(models, X_oos, task='categorical')

Measure how consistent model predictions are across models on the same out-of-sample (OOS) data.

This metric quantifies prediction stability by measuring how much models agree with each other on the same out-of-sample data. Lower values indicate more stable/consistent predictions.

Parameters:
  • models (dict) – Mapping of model name -> fitted model (must have .predict() method). Requires at least 2 models.

  • X_oos (ndarray[tuple[Any, ...], dtype[floating]]) – Out-of-sample feature matrix to evaluate on.

  • task (str) – Type of prediction task; either 'categorical' or 'continuous'.

Returns:

Stability score for each model, keyed by model name (a computational sketch of both scores follows below).

For ‘categorical’:

Average pairwise DISAGREEMENT rate per model (range: 0-1). Lower is better (more stable). 0 = perfect agreement with all other models.

For ‘continuous’:

RMSE of each model’s predictions vs the ensemble mean. Lower is better (more stable). 0 = identical to ensemble mean.

Return type:

dict[str, float]
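
The two scores can be read as the following computations. This is a minimal illustrative sketch, assuming NumPy and scikit-learn-style models with a .predict() method; it is not necessarily the library's exact implementation.

import numpy as np

def stability_sketch(models, X_oos, task='categorical'):
    # Illustrative re-implementation of the documented return values;
    # the function name and structure here are assumptions, not library code.
    names = list(models)
    preds = {name: np.asarray(models[name].predict(X_oos)) for name in names}
    scores = {}
    if task == 'categorical':
        # Average pairwise disagreement: for each model, the mean fraction of
        # samples on which it differs from each of the other models.
        for name in names:
            rates = [np.mean(preds[name] != preds[other])
                     for other in names if other != name]
            scores[name] = float(np.mean(rates))
    elif task == 'continuous':
        # RMSE of each model's predictions against the ensemble mean prediction.
        ensemble_mean = np.mean([preds[name] for name in names], axis=0)
        for name in names:
            scores[name] = float(np.sqrt(np.mean((preds[name] - ensemble_mean) ** 2)))
    else:
        raise ValueError("task must be 'categorical' or 'continuous'")
    return scores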

Raises:

ValueError – If fewer than 2 models provided, or if task is not ‘categorical’ or ‘continuous’.

Examples

>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = make_classification(n_samples=100, random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> models = {
...     'tree1': DecisionTreeClassifier(random_state=1).fit(X_train, y_train),
...     'tree2': DecisionTreeClassifier(random_state=2).fit(X_train, y_train),
... }
>>> stability = prediction_stability(models, X_test, task='categorical')
>>> print(stability)  # Lower values = more stable predictions
{'tree1': 0.15, 'tree2': 0.15}
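
For the continuous case, usage looks similar. The snippet below is an illustrative sketch with regressors (the model choices and names are assumptions; exact scores depend on the data split):

>>> from sklearn.datasets import make_regression
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = make_regression(n_samples=100, random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> regressors = {
...     'reg1': DecisionTreeRegressor(random_state=1).fit(X_train, y_train),
...     'reg2': DecisionTreeRegressor(random_state=2).fit(X_train, y_train),
... }
>>> stability = prediction_stability(regressors, X_test, task='continuous')
>>> # Each value is the RMSE of that model's predictions vs the ensemble mean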

Notes

  • Stability is measured relative to other models in the collection

  • For categorical tasks, the score is the average pairwise disagreement rate (one minus the pairwise agreement rate) with the other models

  • For continuous tasks, the score is the RMSE to the ensemble-mean prediction, used as a stability proxy

  • This metric is complementary to predictive accuracy: a model can be accurate but unstable, or stable but inaccurate (see the brief illustration after this list)
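
As a brief illustration of that last point, accuracy and stability can be reported side by side. This sketch reuses models, X_test, and y_test from the categorical example above:

>>> from sklearn.metrics import accuracy_score
>>> accuracy = {name: accuracy_score(y_test, m.predict(X_test))
...             for name, m in models.items()}
>>> stability = prediction_stability(models, X_test, task='categorical')
>>> # A model can score well in `accuracy` yet have a high (unstable) value in
>>> # `stability`, or vice versa; the two metrics answer different questions.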