Examples¶

StagecoachML ships with runnable examples that demonstrate end-to-end machine learning workflows. The quick-start examples use datasets from scikit-learn for easy reproduction; the custom-stages example generates synthetic data.

Quick Start Examples¶

🌸 Iris Classification¶

File: examples/iris_classification/iris_pipeline.py

A complete classification pipeline demonstrating:

Multi-class classification (3 species)
Multiple algorithms (Random Forest, Logistic Regression)
Model comparison and selection
Feature importance analysis
Sample predictions with probabilities

python examples/iris_classification/iris_pipeline.py

Key Features:

Loads famous iris dataset (150 samples, 4 features)
Trains and compares two different classifiers
Provides detailed performance metrics
Shows feature importance from the tree-based model (Random Forest)
Makes predictions on sample flower measurements
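
The workflow listed above can be sketched with plain scikit-learn calls. This is a minimal illustration, not the contents of iris_pipeline.py: the hyperparameters, the 80/20 split, the variable names, and the sample flower measurements are all assumptions, and the StagecoachML stage wiring is left out.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the 150-sample, 4-feature iris dataset.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

# Illustrative hyperparameters (assumed, not taken from the example script).
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Train both classifiers and record test accuracy for comparison.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
best_model = models[best_name]

# Feature importances are only defined for the tree-based model.
importances = dict(
    zip(iris.feature_names, models["random_forest"].feature_importances_)
)

# Class probabilities for one hypothetical flower
# (sepal length, sepal width, petal length, petal width, in cm).
sample = [[5.1, 3.5, 1.4, 0.2]]
probabilities = best_model.predict_proba(sample)[0]

print(scores, best_name)
print(dict(zip(iris.target_names, probabilities.round(3))))
```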

🏠 Boston Housing Regression¶

File: examples/boston_housing/housing_pipeline.py

A comprehensive regression workflow featuring:

Multiple regression algorithms
Feature scaling for linear models
Advanced performance metrics
Real estate price predictions

python examples/boston_housing/housing_pipeline.py

Key Features:

Boston housing dataset (506 samples, 13 features)
Four different regressors (Linear, Ridge, Random Forest, Gradient Boosting)
Automatic model comparison with RMSE, MAE, and R²
Feature importance analysis
Predictions on realistic house characteristics
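
The comparison of four regressors on RMSE, MAE, and R² might look roughly like the sketch below. Because recent scikit-learn releases no longer include the Boston loader, the sketch substitutes a synthetic dataset of the same shape (506 samples, 13 features) from make_regression; that substitution, the hyperparameters, and all variable names are assumptions rather than the example's own code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Boston data (assumption).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Linear models get feature scaling; tree ensembles do not need it.
regressors = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and collect the three metrics on the held-out set.
report = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    pred = reg.predict(X_test)
    report[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

# Select the model with the lowest RMSE.
best = min(report, key=lambda name: report[name]["rmse"])
for name, metrics in report.items():
    print(name, metrics)
print("best:", best)
```

On synthetic linear data the linear models will usually win, so the ranking here says nothing about the ranking on real housing data.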

🔢 Handwritten Digits Recognition¶

File: examples/digits_recognition/digits_pipeline.py

A computer vision pipeline showcasing:

Image classification (8×8 pixel digits)
Multiple classifiers (SVM, Logistic Regression, Random Forest)
ASCII visualization of digits
Detailed error analysis

python examples/digits_recognition/digits_pipeline.py

Key Features:

Handwritten digits dataset (1,797 samples, 64 features)
Image preprocessing and normalization
Three different classification algorithms
Confusion matrix and error analysis
Visual representation of digit patterns
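
An ASCII rendering of a digit can be produced by mapping pixel intensity onto a short character ramp. The sketch below is one possible approach; the ramp characters and the to_ascii helper are hypothetical and may differ from what digits_pipeline.py does.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Pixel intensities in this dataset range from 0 to 16;
# map them onto five characters from light to dark (assumed ramp).
RAMP = " .:*#"


def to_ascii(image):
    """Render one 8x8 image as eight lines of eight characters."""
    rows = []
    for row in image:
        rows.append(
            "".join(RAMP[min(int(value) // 4, len(RAMP) - 1)] for value in row)
        )
    return "\n".join(rows)


art = to_ascii(digits.images[0])
print(f"label: {digits.target[0]}")
print(art)
```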

Advanced Examples¶

🔧 Custom Stages¶

File: examples/custom_stages/custom_pipeline.py

Advanced example demonstrating how to extend StagecoachML:

Custom stage creation
Input/output validation
Configuration and retry logic
Performance benchmarking

python examples/custom_stages/custom_pipeline.py

Custom Stages Demonstrated:

DataGeneratorStage: Synthetic dataset generation
StatisticalAnalyzerStage: Advanced data analysis
CrossValidatorStage: Model validation with retry capability
ModelExplainerStage: Feature importance and insights
PerformanceBenchmarkStage: Pipeline performance metrics
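
The combination of input/output validation and retry logic can be illustrated in plain Python. The sketch below does not use the StagecoachML API, which this page does not show: run_stage, its requires/provides parameters, and the flaky_mean stage are all hypothetical names chosen for illustration.

```python
import time


def run_stage(func, context, requires=(), provides=(), max_retries=3, retry_delay=0.0):
    """Run one stage function with input/output checks and simple retries."""
    # Input validation: fail fast if the context lacks a required key.
    missing = [key for key in requires if key not in context]
    if missing:
        raise KeyError(f"missing required inputs: {missing}")

    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            outputs = func(context)
        except Exception as exc:  # retry on any failure raised by the stage
            last_error = exc
            if attempt < max_retries:
                time.sleep(retry_delay)
            continue
        # Output validation: the stage must produce every declared key.
        absent = [key for key in provides if key not in outputs]
        if absent:
            raise ValueError(f"stage did not produce: {absent}")
        return outputs
    raise RuntimeError(f"stage failed after {max_retries} attempts") from last_error


# Usage: a stage that fails twice before succeeding.
attempts = {"count": 0}


def flaky_mean(context):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    values = context["values"]
    return {"mean": sum(values) / len(values)}


result = run_stage(
    flaky_mean, {"values": [2.0, 4.0, 6.0]}, requires=["values"], provides=["mean"]
)
print(result, "after", attempts["count"], "attempts")
```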

Example Structure¶

Each example follows a consistent structure:

examples/
├── iris_classification/
│   ├── iris_pipeline.py      # Complete runnable script
│   ├── iris_config.yaml      # CLI configuration
│   └── README.md             # Detailed documentation
├── boston_housing/
│   ├── housing_pipeline.py
│   └── README.md
├── digits_recognition/
│   ├── digits_pipeline.py
│   └── README.md
└── custom_stages/
    ├── custom_pipeline.py
    └── README.md

Running Examples¶

Method 1: Direct Python Execution¶

# From the project root directory
python examples/iris_classification/iris_pipeline.py
python examples/boston_housing/housing_pipeline.py
python examples/digits_recognition/digits_pipeline.py
python examples/custom_stages/custom_pipeline.py

Method 2: Interactive Exploration¶

# Start Python from the project root so that examples/ is importable
import sys
sys.path.insert(0, "src")

# Run any example
from examples.iris_classification.iris_pipeline import main
results = main()

# Access specific results
accuracy = results['evaluate']['best_accuracy']
model = results['evaluate']['best_model']

Method 3: CLI (where available)¶

stagecoach run examples/iris_classification/iris_config.yaml

Common Patterns Demonstrated¶

1. Data Loading and Preprocessing¶

Loading sklearn datasets
Data normalization and scaling
Train/test splitting with stratification
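
A minimal sketch of these three steps, using the wine dataset as an arbitrary stand-in (it is not one of the bundled examples). The key point is that the scaler is fitted on the training split only and then applied to both splits.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load a bundled scikit-learn dataset as (features, labels).
X, y = load_wine(return_X_y=True)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on training data only to avoid leaking test statistics.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare class proportions before and after the split.
full_ratio = np.bincount(y) / len(y)
train_ratio = np.bincount(y_train) / len(y_train)
print("full :", full_ratio.round(3))
print("train:", train_ratio.round(3))
```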

2. Model Training and Comparison¶

Multiple algorithm comparison
Hyperparameter configuration
Cross-validation techniques
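
Cross-validation with a configured model can be sketched as follows. Wrapping the scaler and classifier in one scikit-learn pipeline means the scaler is refitted inside each fold; the SVM settings and the fold count are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Scaling happens inside the pipeline, so each fold scales on its own training part.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))

# Five stratified folds with a fixed seed for repeatable results.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```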

3. Evaluation and Analysis¶

Performance metrics (accuracy, RMSE, R²)
Confusion matrices
Feature importance analysis
Error pattern identification
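
Error patterns can be read off a confusion matrix by zeroing the diagonal and ranking the remaining cells. The sketch below shows one way to do that on the digits data; the choice of classifier and the top-three cutoff are assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Pixel values range from 0 to 16, so dividing by 16 scales them to [0, 1].
clf = LogisticRegression(max_iter=5000).fit(X_train / 16.0, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test / 16.0))

# Keep only the off-diagonal cells, which are the misclassifications.
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Rank (true, predicted) pairs by how often they occur.
order = np.argsort(errors, axis=None)[::-1][:3]
rows, cols = np.unravel_index(order, errors.shape)
pairs = [
    (int(t), int(p), int(errors[t, p]))
    for t, p in zip(rows, cols)
    if errors[t, p] > 0
]

for true_label, predicted_label, count in pairs:
    print(f"true {true_label} predicted as {predicted_label}: {count} times")
```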

4. Pipeline Management¶

Stage dependencies and execution order
Context passing between stages
Result aggregation and reporting
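
The ideas of dependency-driven execution order and context passing can be illustrated with the standard library alone. This is a conceptual sketch, not StagecoachML's implementation: the stages dictionary, run_pipeline, and the three stage functions are hypothetical. Results are keyed by stage name, mirroring the results['evaluate'][...] access pattern shown under Running Examples.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def load(context):
    return {"data": [1.0, 2.0, 3.0, 4.0]}


def preprocess(context):
    # Reads the output of the "load" stage from the shared context.
    data = context["load"]["data"]
    mean = sum(data) / len(data)
    return {"centered": [x - mean for x in data]}


def evaluate(context):
    centered = context["preprocess"]["centered"]
    return {"max_abs": max(abs(x) for x in centered)}


# Each stage maps to (function, list of stages it depends on).
stages = {
    "evaluate": (evaluate, ["preprocess"]),
    "preprocess": (preprocess, ["load"]),
    "load": (load, []),
}


def run_pipeline(stages):
    """Run stages in dependency order, storing each result under its stage name."""
    graph = {name: set(deps) for name, (_, deps) in stages.items()}
    order = list(TopologicalSorter(graph).static_order())
    context = {}
    for name in order:
        func, _ = stages[name]
        context[name] = func(context)
    return order, context


order, results = run_pipeline(stages)
print(order)
print(results["evaluate"])
```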

5. Visualization and Reporting¶

ASCII art for simple visualizations
Detailed classification reports
Performance summaries

Expected Performance¶

Iris Classification¶

Best Model: Random Forest or Logistic Regression, depending on the train/test split
Accuracy: 95-100% (high-quality, separable dataset)
Runtime: <5 seconds

Boston Housing Regression¶

Best Model: Typically Random Forest or Gradient Boosting
RMSE: roughly $3,000-4,000 (the target is the median home value, recorded in thousands of dollars)
R² Score: 0.85-0.95
Runtime: <10 seconds

Digits Recognition¶

Best Model: Usually SVM with RBF kernel
Accuracy: 95-99% (excellent for 8×8 images)
Common Errors: 8↔9, 4↔9, 5↔6 (similar visual patterns)
Runtime: <15 seconds

Custom Stages¶

Demonstrates: Advanced StagecoachML features
Synthetic Data: Configurable dataset generation
Analysis: Comprehensive pipeline insights
Runtime: <20 seconds

Customization Ideas¶

Extend Existing Examples¶

Add More Models: Include XGBoost, Neural Networks, or ensemble methods
Feature Engineering: Create polynomial features or interactions
Hyperparameter Tuning: Add grid search or Bayesian optimization
Cross-Validation: Implement k-fold validation for robust evaluation
Visualization: Add matplotlib plots for better insights
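
As one illustration of the hyperparameter tuning and cross-validation ideas, the sketch below runs a small grid search with 5-fold validation on the iris data. The parameter grid is an arbitrary assumption chosen to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# A deliberately small grid: 2 x 3 = 6 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4, None],
}

# Each candidate is scored by 5-fold cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```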

Create New Examples¶

Time Series: Stock price or weather prediction
NLP: Text classification or sentiment analysis (using sklearn’s text datasets)
Clustering: Customer segmentation or image clustering
Dimensionality Reduction: PCA or t-SNE visualization
Anomaly Detection: Outlier detection in various domains

Best Practices Shown¶

Code Organization¶

Clear function separation for each stage
Descriptive variable names and comments
Consistent error handling
Modular design for reusability

Pipeline Design¶

Logical stage dependencies
Minimal coupling between stages
Clear context passing
Comprehensive result collection

Machine Learning¶

Proper data splitting
Feature preprocessing
Model validation techniques
Performance evaluation
Result interpretation

Troubleshooting¶

Common Issues¶

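Import Errors: Ensure src/ is on the Python path (sys.path or PYTHONPATH)
Missing Dependencies: Install scikit-learn, numpy, and pandas
Dataset Warnings: scikit-learn may emit deprecation warnings, which are usually harmless. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the Boston housing example may require an older scikit-learn release if it relies on that loader
Performance Variation: Results vary slightly between runs unless random seeds are fixed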

Solutions¶

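# Fix import issues (assumes the script lives at examples/<name>/<script>.py
# and src/ sits in the project root)
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "src"))

# Install missing packages (run in a shell, not in Python)
pip install scikit-learn numpy pandas

# Suppress scikit-learn deprecation warnings only, rather than all warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)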

Next Steps¶

After exploring the examples:

Modify Parameters: Change model hyperparameters and see the impact
Add Stages: Implement additional preprocessing or analysis stages
Create Custom Examples: Apply StagecoachML to your own datasets
Build Production Pipelines: Scale examples for real-world applications
Contribute: Share your examples with the community

These examples provide a solid foundation for understanding StagecoachML capabilities and serve as templates for building your own machine learning pipelines.