evaluate_models

evaluate_models(models, X, y, task='categorical')

Evaluate predictive performance of multiple models using standard metrics.

Computes task-appropriate performance metrics for each model. For classification (‘categorical’), metrics include accuracy and AUC (when the model exposes predict_proba). For regression (‘continuous’), metrics include MAE, RMSE, and R².

Parameters:
  • models (dict) – Model name -> fitted model mapping. Models must have .predict() method.

  • X (ndarray[tuple[Any, ...], dtype[floating]]) – Feature matrix for evaluation.

  • y (ndarray[tuple[Any, ...], dtype[floating]]) – Ground-truth labels (classification) or targets (regression).

  • task (str) – Type of prediction task: ‘categorical’ for classification or ‘continuous’ for regression. Defaults to ‘categorical’.

Returns:

Nested dictionary: {model_name: {metric_name: value}}

For ‘categorical’:
  • ‘acc’: Classification accuracy (0-1)

  • ‘auc’: ROC AUC score (0-1, if predict_proba available)

    For binary: standard AUC. For multi-class: one-vs-rest macro AUC.

For ‘continuous’:
  • ‘mae’: Mean Absolute Error (lower is better)

  • ‘rmse’: Root Mean Squared Error (lower is better)

  • ‘r2’: R² coefficient of determination (-∞ to 1, higher is better)

Return type:

dict[str, dict[str, float]]
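
For reference, the metrics above correspond to standard scikit-learn calls (see Notes). Below is a minimal per-model sketch of how the documented values could be computed; the helper name _score_one_model is illustrative only and is not part of the public API.

import numpy as np
from sklearn.metrics import (
    accuracy_score, mean_absolute_error, mean_squared_error,
    r2_score, roc_auc_score,
)

def _score_one_model(model, X, y, task):
    # Illustrative only: mirrors the documented metric set for a single model.
    scores = {}
    if task == 'categorical':
        scores['acc'] = accuracy_score(y, model.predict(X))
        if hasattr(model, 'predict_proba'):
            proba = model.predict_proba(X)
            if proba.shape[1] == 2:
                # Binary: standard AUC on the positive-class probability.
                scores['auc'] = roc_auc_score(y, proba[:, 1])
            else:
                # Multi-class: one-vs-rest, macro-averaged AUC.
                scores['auc'] = roc_auc_score(y, proba, multi_class='ovr',
                                              average='macro')
    elif task == 'continuous':
        y_pred = model.predict(X)
        scores['mae'] = mean_absolute_error(y, y_pred)
        scores['rmse'] = float(np.sqrt(mean_squared_error(y, y_pred)))
        scores['r2'] = r2_score(y, y_pred)
    else:
        raise ValueError("task must be 'categorical' or 'continuous'")
    return scores

The full return value would then amount to one such dictionary per entry in models, matching the documented {model_name: {metric_name: value}} shape.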

Raises:

ValueError – If task is not ‘categorical’ or ‘continuous’.

Examples

>>> from sklearn.datasets import make_regression
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = make_regression(n_samples=100, random_state=42)
>>> models = {
...     'shallow': DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y),
...     'deep': DecisionTreeRegressor(max_depth=10, random_state=42).fit(X, y),
... }
>>> performance = evaluate_models(models, X, y, task='continuous')
>>> print(performance['shallow'])
{'mae': 12.3, 'rmse': 15.7, 'r2': 0.85}
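
A classification sketch along the same lines; exact metric values depend on the fitted model, so only the returned keys are shown (assuming the documented ‘acc’/‘auc’ keys, since LogisticRegression provides predict_proba):

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> Xc, yc = make_classification(n_samples=100, random_state=42)
>>> clf_models = {'logreg': LogisticRegression(max_iter=1000).fit(Xc, yc)}
>>> clf_performance = evaluate_models(clf_models, Xc, yc, task='categorical')
>>> sorted(clf_performance['logreg'])
['acc', 'auc']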

Notes

  • AUC computation gracefully handles cases where predict_proba is not available

  • For multi-class classification, uses one-vs-rest strategy for AUC

  • All metrics use standard sklearn implementations

  • Consider using separate train/test sets to avoid overfitting bias (see the held-out example below)
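
For instance, a held-out evaluation could look like the following sketch (again, only the returned keys are shown, since the values depend on the split and the fitted model):

>>> from sklearn.datasets import make_regression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = make_regression(n_samples=200, random_state=0)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
>>> tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
>>> held_out = evaluate_models({'tree': tree}, X_te, y_te, task='continuous')
>>> sorted(held_out['tree'])
['mae', 'r2', 'rmse']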