Examples

StagecoachML comes with comprehensive examples demonstrating real-world machine learning workflows. All examples use datasets from scikit-learn for easy reproduction.

Quick Start Examples

🌸 Iris Classification

File: examples/iris_classification/iris_pipeline.py

A complete classification pipeline demonstrating:

  • Multi-class classification (3 species)

  • Multiple algorithms (Random Forest, Logistic Regression)

  • Model comparison and selection

  • Feature importance analysis

  • Sample predictions with probabilities

python examples/iris_classification/iris_pipeline.py

Key Features:

  • Loads famous iris dataset (150 samples, 4 features)

  • Trains and compares two different classifiers

  • Provides detailed performance metrics

  • Shows feature importance for decision trees

  • Makes predictions on sample flower measurements

🏠 Boston Housing Regression

File: examples/boston_housing/housing_pipeline.py

A comprehensive regression workflow featuring:

  • Multiple regression algorithms

  • Feature scaling for linear models

  • Advanced performance metrics

  • Real estate price predictions

python examples/boston_housing/housing_pipeline.py

Key Features:

  • Boston housing dataset (506 samples, 13 features)

  • Four different regressors (Linear, Ridge, Random Forest, Gradient Boosting)

  • Automatic model comparison with RMSE, MAE, and R²

  • Feature importance analysis

  • Predictions on realistic house characteristics

🔢 Handwritten Digits Recognition

File: examples/digits_recognition/digits_pipeline.py

A computer vision pipeline showcasing:

  • Image classification (8×8 pixel digits)

  • Multiple classifiers (SVM, Logistic Regression, Random Forest)

  • ASCII visualization of digits

  • Detailed error analysis

python examples/digits_recognition/digits_pipeline.py

Key Features:

  • Handwritten digits dataset (1,797 samples, 64 features)

  • Image preprocessing and normalization

  • Three different classification algorithms

  • Confusion matrix and error analysis

  • Visual representation of digit patterns

Advanced Examples

🔧 Custom Stages

File: examples/custom_stages/custom_pipeline.py

Advanced example demonstrating how to extend StagecoachML:

  • Custom stage creation

  • Input/output validation

  • Configuration and retry logic

  • Performance benchmarking

python examples/custom_stages/custom_pipeline.py

Custom Stages Demonstrated:

  • DataGeneratorStage: Synthetic dataset generation

  • StatisticalAnalyzerStage: Advanced data analysis

  • CrossValidatorStage: Model validation with retry capability

  • ModelExplainerStage: Feature importance and insights

  • PerformanceBenchmarkStage: Pipeline performance metrics

Example Structure

Each example follows a consistent structure:

examples/
├── iris_classification/
│   ├── iris_pipeline.py      # Complete runnable script
│   ├── iris_config.yaml      # CLI configuration
│   └── README.md             # Detailed documentation
├── boston_housing/
│   ├── housing_pipeline.py
│   └── README.md
├── digits_recognition/
│   ├── digits_pipeline.py
│   └── README.md
└── custom_stages/
    ├── custom_pipeline.py
    └── README.md

Running Examples

Method 1: Direct Python Execution

# From the project root directory
python examples/iris_classification/iris_pipeline.py
python examples/boston_housing/housing_pipeline.py
python examples/digits_recognition/digits_pipeline.py
python examples/custom_stages/custom_pipeline.py

Method 2: Interactive Exploration

import sys
sys.path.insert(0, "src")

# Run any example
from examples.iris_classification.iris_pipeline import main
results = main()

# Access specific results
accuracy = results['evaluate']['best_accuracy']
model = results['evaluate']['best_model']

Method 3: CLI (where available)

stagecoach run examples/iris_classification/iris_config.yaml

Common Patterns Demonstrated

1. Data Loading and Preprocessing

  • Loading sklearn datasets

  • Data normalization and scaling

  • Train/test splitting with stratification

2. Model Training and Comparison

  • Multiple algorithm comparison

  • Hyperparameter configuration

  • Cross-validation techniques

3. Evaluation and Analysis

  • Performance metrics (accuracy, RMSE, R²)

  • Confusion matrices

  • Feature importance analysis

  • Error pattern identification

4. Pipeline Management

  • Stage dependencies and execution order

  • Context passing between stages

  • Result aggregation and reporting

5. Visualization and Reporting

  • ASCII art for simple visualizations

  • Detailed classification reports

  • Performance summaries

Expected Performance

Iris Classification

  • Best Model: Usually Random Forest or Logistic Regression

  • Accuracy: 95-100% (high-quality, separable dataset)

  • Runtime: <5 seconds

Boston Housing Regression

  • Best Model: Typically Random Forest or Gradient Boosting

  • RMSE: ~$3-4k (good performance for housing prices)

  • R² Score: 0.85-0.95

  • Runtime: <10 seconds

Digits Recognition

  • Best Model: Usually SVM with RBF kernel

  • Accuracy: 95-99% (excellent for 8×8 images)

  • Common Errors: 8↔9, 4↔9, 5↔6 (similar visual patterns)

  • Runtime: <15 seconds

Custom Stages

  • Demonstrates: Advanced StagecoachML features

  • Synthetic Data: Configurable dataset generation

  • Analysis: Comprehensive pipeline insights

  • Runtime: <20 seconds

Customization Ideas

Extend Existing Examples

  1. Add More Models: Include XGBoost, Neural Networks, or ensemble methods

  2. Feature Engineering: Create polynomial features or interactions

  3. Hyperparameter Tuning: Add grid search or Bayesian optimization

  4. Cross-Validation: Implement k-fold validation for robust evaluation

  5. Visualization: Add matplotlib plots for better insights

Create New Examples

  1. Time Series: Stock price or weather prediction

  2. NLP: Text classification or sentiment analysis (using sklearn’s text datasets)

  3. Clustering: Customer segmentation or image clustering

  4. Dimensionality Reduction: PCA or t-SNE visualization

  5. Anomaly Detection: Outlier detection in various domains

Best Practices Shown

Code Organization

  • Clear function separation for each stage

  • Descriptive variable names and comments

  • Consistent error handling

  • Modular design for reusability

Pipeline Design

  • Logical stage dependencies

  • Minimal coupling between stages

  • Clear context passing

  • Comprehensive result collection

Machine Learning

  • Proper data splitting

  • Feature preprocessing

  • Model validation techniques

  • Performance evaluation

  • Result interpretation

Troubleshooting

Common Issues

  1. Import Errors: Ensure src/ is in Python path

  2. Missing Dependencies: Install sklearn, numpy, pandas

  3. Dataset Warnings: sklearn may show deprecation warnings (normal)

  4. Performance Variation: Results may vary slightly due to randomness

Solutions

# Fix import issues
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

# Install missing packages
pip install scikit-learn numpy pandas

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

Next Steps

After exploring the examples:

  1. Modify Parameters: Change model hyperparameters and see the impact

  2. Add Stages: Implement additional preprocessing or analysis stages

  3. Create Custom Examples: Apply StagecoachML to your own datasets

  4. Build Production Pipelines: Scale examples for real-world applications

  5. Contribute: Share your examples with the community

These examples provide a solid foundation for understanding StagecoachML capabilities and serve as templates for building your own machine learning pipelines.