evaluate_models
- evaluate_models(models, X, y, task='categorical')
Evaluate predictive performance of multiple models using standard metrics.
Computes task-appropriate performance metrics for each model. For classification, this includes accuracy and AUC (if predict_proba is available). For regression, this includes MAE, RMSE, and R².
- Parameters:
models (dict) – Mapping from model name to fitted model. Each model must have a .predict() method.
X (ndarray[tuple[Any, ...], dtype[floating]]) – Feature matrix for evaluation.
y (ndarray[tuple[Any, ...], dtype[floating]]) – Ground-truth labels (classification) or targets (regression).
task (str) – Type of prediction task: 'categorical' for classification or 'continuous' for regression.
- Returns:
Nested dictionary: {model_name: {metric_name: value}}
- For 'categorical':
  - 'acc': Classification accuracy (0-1)
  - 'auc': ROC AUC score (0-1, if predict_proba available). For binary classification this is the standard AUC; for multi-class, a one-vs-rest macro AUC (see the sketch below).
- For 'continuous':
  - 'mae': Mean Absolute Error (lower is better)
  - 'rmse': Root Mean Squared Error (lower is better)
  - 'r2': R² coefficient of determination (-∞ to 1, higher is better)
- Return type:
dict[str, dict[str, float]]
- Raises:
ValueError – If task is not ‘categorical’ or ‘continuous’.
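The metrics described above correspond to standard scikit-learn calls. As a rough, illustrative sketch of that logic (this is not the function's actual source; the helper name and structure below are assumptions for clarity):

import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             mean_absolute_error, mean_squared_error, r2_score)

def _evaluate_models_sketch(models, X, y, task='categorical'):
    # Illustrative re-statement of the documented behaviour, not the real implementation.
    if task not in ('categorical', 'continuous'):
        raise ValueError("task must be 'categorical' or 'continuous'")
    results = {}
    for name, model in models.items():
        preds = model.predict(X)
        if task == 'categorical':
            metrics = {'acc': accuracy_score(y, preds)}
            if hasattr(model, 'predict_proba'):  # AUC only when probabilities are available
                proba = model.predict_proba(X)
                if proba.shape[1] == 2:          # binary: standard AUC on positive-class scores
                    metrics['auc'] = roc_auc_score(y, proba[:, 1])
                else:                            # multi-class: one-vs-rest macro AUC
                    metrics['auc'] = roc_auc_score(y, proba, multi_class='ovr', average='macro')
        else:
            metrics = {'mae': mean_absolute_error(y, preds),
                       'rmse': float(np.sqrt(mean_squared_error(y, preds))),
                       'r2': r2_score(y, preds)}
        results[name] = metrics
    return results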
Examples
>>> from sklearn.datasets import make_regression
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = make_regression(n_samples=100, random_state=42)
>>> models = {
...     'shallow': DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y),
...     'deep': DecisionTreeRegressor(max_depth=10, random_state=42).fit(X, y),
... }
>>> performance = evaluate_models(models, X, y, task='continuous')
>>> print(performance['shallow'])
{'mae': 12.3, 'rmse': 15.7, 'r2': 0.85}
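A classification counterpart (illustrative only; the model and dataset are arbitrary choices, and because LogisticRegression provides predict_proba both metrics are reported):

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(n_samples=100, random_state=42)
>>> clf = LogisticRegression(max_iter=1000).fit(X, y)
>>> performance = evaluate_models({'logreg': clf}, X, y, task='categorical')
>>> sorted(performance['logreg'])
['acc', 'auc']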
Notes
AUC computation gracefully handles cases where predict_proba is not available
For multi-class classification, uses one-vs-rest strategy for AUC
All metrics use standard sklearn implementations
Consider using separate train/test sets to avoid overfitting bias (see the sketch below)
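One way to follow the last note is to fit on a training split and score on a held-out split; a sketch (the model choice and split sizes are illustrative):

>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=200, random_state=0)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
>>> models = {'rf': RandomForestClassifier(random_state=0).fit(X_tr, y_tr)}
>>> held_out = evaluate_models(models, X_te, y_te, task='categorical')  # metrics on unseen data only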