🚀 Getting Started with OnlineRake

Welcome to OnlineRake - a powerful Python package for streaming survey weight calibration!

This notebook demonstrates how to use OnlineRake to correct bias in real-time data streams using two complementary online algorithms:

  • SGD Raking: Fast and effective for most scenarios

  • MWU Raking: Maintains positive weights through multiplicative updates

Let's see these algorithms in action! 🎯
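
As a rough mental model of the difference between the two (toy numbers only, not OnlineRake's internal math), consider how each style would update a single observation weight:

import numpy as np

# Toy sketch: one weight w and a gradient `grad` of the calibration loss.
# Illustrative only; the package's exact update rules are in the API Reference.
w, grad, lr = 1.0, 0.3, 0.5

w_sgd = w - lr * grad           # SGD: additive step; can drift toward zero or below
w_mwu = w * np.exp(-lr * grad)  # MWU: multiplicative step; stays strictly positive
print(w_sgd, w_mwu)             # 0.85 vs ~0.861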

[1]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from onlinerake import OnlineRakingSGD, OnlineRakingMWU, Targets

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

print("πŸ“¦ Libraries imported successfully!")
print("🎨 Plotting style configured")
print("🎲 Random seed set for reproducibility")
πŸ“¦ Libraries imported successfully!
🎨 Plotting style configured
🎲 Random seed set for reproducibility

📊 Example 1: Correcting Feature Bias in an Online Survey

Imagine you're running an online survey, but your responses are biased - certain features are under- or over-represented compared to the target population.

OnlineRake to the rescue! 🦸‍♂️
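
Under the hood, raking adjusts per-respondent weights so that weighted feature proportions match the targets. A minimal sketch of the quantity being calibrated (toy numbers, not package internals):

import numpy as np

# Five respondents, one binary feature, and hypothetical calibration weights
x = np.array([1, 0, 0, 1, 0])
w = np.array([2.0, 0.5, 0.5, 2.0, 1.0])

raw_margin = x.mean()                      # 0.40: biased low
weighted_margin = (w * x).sum() / w.sum()  # 4.0 / 6.0 ≈ 0.67 after upweighting

OnlineRake performs this adjustment one observation at a time instead of in a single batch pass.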

[2]:
# Define our target population proportions
targets = Targets(
    feature_a=0.52,  # 52% have feature A
    feature_b=0.51,  # 51% have feature B
    feature_c=0.35,  # 35% have feature C
    feature_d=0.19,  # 19% have feature D
)

print("🎯 Target margins:")
for feature, target in targets.as_dict().items():
    print(f"   {feature}: {target:.1%}")

# Initialize both raking algorithms
sgd_raker = OnlineRakingSGD(targets, learning_rate=4.0)
mwu_raker = OnlineRakingMWU(targets, learning_rate=1.2)

print("\nπŸ”§ Rakers initialized!")
print(f"   SGD learning rate: {sgd_raker.learning_rate}")
print(f"   MWU learning rate: {mwu_raker.learning_rate}")
🎯 Target margins:
   feature_a: 52.0%
   feature_b: 51.0%
   feature_c: 35.0%
   feature_d: 19.0%

🔧 Rakers initialized!
   SGD learning rate: 4.0
   MWU learning rate: 1.2
[3]:
# Simulate biased survey responses
n_responses = 500
raw_totals = {"feature_a": 0, "feature_b": 0, "feature_c": 0, "feature_d": 0}

print(f"🎭 Simulating {n_responses} biased survey responses...")
print("πŸ“‰ Bias pattern: Survey with feature underrepresentation\n")

# Store history for plotting
sgd_history = []
mwu_history = []
observation_numbers = []

for i in range(n_responses):
    # Generate biased observations
    feature_a = 1 if np.random.random() < 0.30 else 0  # 30% vs target 52%
    feature_b = 1 if np.random.random() < 0.35 else 0  # 35% vs target 51%
    feature_c = 1 if np.random.random() < 0.60 else 0  # 60% vs target 35%
    feature_d = 1 if np.random.random() < 0.15 else 0  # 15% vs target 19%

    obs = {
        "feature_a": feature_a, "feature_b": feature_b,
        "feature_c": feature_c, "feature_d": feature_d
    }

    # Update both rakers
    sgd_raker.partial_fit(obs)
    mwu_raker.partial_fit(obs)

    # Track raw proportions
    for key in raw_totals:
        raw_totals[key] += obs[key]

    # Store history for plotting (every 25 observations)
    if (i + 1) % 25 == 0:
        observation_numbers.append(i + 1)
        sgd_history.append(sgd_raker.margins.copy())
        mwu_history.append(mwu_raker.margins.copy())

print("βœ… Simulation complete!")
🎭 Simulating 500 biased survey responses...
πŸ“‰ Bias pattern: Survey with feature underrepresentation

βœ… Simulation complete!
[4]:
# Calculate final results
raw_margins = {k: v / n_responses for k, v in raw_totals.items()}
sgd_margins = sgd_raker.margins
mwu_margins = mwu_raker.margins

print("πŸ“‹ RESULTS SUMMARY")
print("=" * 60)
print(f"{'Feature':<12} {'Target':<8} {'Raw':<8} {'SGD':<8} {'MWU':<8}")
print("-" * 60)

for feature in ["feature_a", "feature_b", "feature_c", "feature_d"]:
    target = targets.as_dict()[feature]
    raw = raw_margins[feature]
    sgd = sgd_margins[feature]
    mwu = mwu_margins[feature]
    print(f"{feature:<12} {target:<8.3f} {raw:<8.3f} {sgd:<8.3f} {mwu:<8.3f}")

print("\nπŸ“ˆ ALGORITHM PERFORMANCE")
print("-" * 30)
print(f"Effective Sample Size:")
print(f"   SGD: {sgd_raker.effective_sample_size:.1f}")
print(f"   MWU: {mwu_raker.effective_sample_size:.1f}")

print(f"\nFinal Loss (squared error):")
print(f"   SGD: {sgd_raker.loss:.6f}")
print(f"   MWU: {mwu_raker.loss:.6f}")

if sgd_raker.loss < mwu_raker.loss:
    print("\n🏆 SGD achieved lower loss!")
else:
    print("\n🏆 MWU achieved lower loss!")
📋 RESULTS SUMMARY
============================================================
Feature      Target   Raw      SGD      MWU
------------------------------------------------------------
feature_a    0.520    0.330    0.491    0.453
feature_b    0.510    0.344    0.491    0.452
feature_c    0.350    0.602    0.378    0.432
feature_d    0.190    0.134    0.167    0.147

📈 ALGORITHM PERFORMANCE
------------------------------
Effective Sample Size:
   SGD: 294.1
   MWU: 307.9

Final Loss (squared error):
   SGD: 0.002512
   MWU: 0.016494

πŸ† SGD achieved lower loss!
[5]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('🎯 OnlineRake Results: Before vs After Calibration', fontsize=16, fontweight='bold')

# 1. Before/After comparison
features = list(targets.feature_names)
target_vals = [targets[f] for f in features]
raw_vals = [raw_margins[f] for f in features]
sgd_vals = [sgd_margins[f] for f in features]
mwu_vals = [mwu_margins[f] for f in features]

x = np.arange(len(features))
width = 0.2

axes[0,0].bar(x - 1.5*width, target_vals, width, label='🎯 Target', alpha=0.8, color='gold')
axes[0,0].bar(x - 0.5*width, raw_vals, width, label='❌ Raw (Biased)', alpha=0.8, color='red')
axes[0,0].bar(x + 0.5*width, sgd_vals, width, label='✅ SGD Corrected', alpha=0.8, color='green')
axes[0,0].bar(x + 1.5*width, mwu_vals, width, label='✅ MWU Corrected', alpha=0.8, color='blue')

axes[0,0].set_xlabel('Features')
axes[0,0].set_ylabel('Proportion')
axes[0,0].set_title('📊 Target vs Raw vs Corrected Proportions')
axes[0,0].set_xticks(x)
axes[0,0].set_xticklabels(features, rotation=45)
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# 2. Convergence over time
for i, feature in enumerate(features):
    sgd_feature_history = [margins[feature] for margins in sgd_history]
    mwu_feature_history = [margins[feature] for margins in mwu_history]

    # Label only the first two features so the legend stays readable
    axes[0,1].plot(observation_numbers, sgd_feature_history, '-',
                  label=f'SGD {feature}' if i < 2 else '', alpha=0.7)
    axes[0,1].plot(observation_numbers, mwu_feature_history, '--',
                  label=f'MWU {feature}' if i < 2 else '', alpha=0.7)

    # Dotted line marks each feature's target margin
    axes[0,1].axhline(y=targets[feature], color='red', linestyle=':', alpha=0.5)

axes[0,1].set_xlabel('Observations')
axes[0,1].set_ylabel('Margin')
axes[0,1].set_title('📈 Margin Convergence Over Time')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# 3. Error comparison
sgd_errors = [abs(sgd_margins[f] - targets[f]) for f in features]
mwu_errors = [abs(mwu_margins[f] - targets[f]) for f in features]
raw_errors = [abs(raw_margins[f] - targets[f]) for f in features]

x = np.arange(len(features))
axes[1,0].bar(x - width, raw_errors, width, label='❌ Raw Error', alpha=0.8, color='red')
axes[1,0].bar(x, sgd_errors, width, label='✅ SGD Error', alpha=0.8, color='green')
axes[1,0].bar(x + width, mwu_errors, width, label='✅ MWU Error', alpha=0.8, color='blue')

axes[1,0].set_xlabel('Features')
axes[1,0].set_ylabel('Absolute Error')
axes[1,0].set_title('📉 Error Reduction by Feature')
axes[1,0].set_xticks(x)
axes[1,0].set_xticklabels(features, rotation=45)
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# 4. Performance metrics
metrics = ['Loss', 'ESS']
sgd_metrics = [sgd_raker.loss, sgd_raker.effective_sample_size]
mwu_metrics = [mwu_raker.loss, mwu_raker.effective_sample_size]

x = np.arange(len(metrics))
axes[1,1].bar(x - width/2, sgd_metrics, width, label='SGD', alpha=0.8, color='green')
axes[1,1].bar(x + width/2, mwu_metrics, width, label='MWU', alpha=0.8, color='blue')

axes[1,1].set_xlabel('Metrics')
axes[1,1].set_ylabel('Value')
axes[1,1].set_title('⚡ Algorithm Performance Comparison')
axes[1,1].set_xticks(x)
axes[1,1].set_xticklabels(metrics)
axes[1,1].legend()
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎨 Visualization complete! Clear evidence that OnlineRake works! ✨")
../_images/notebooks_01_getting_started_6_1.png

🎨 Visualization complete! Clear evidence that OnlineRake works! ✨

🌊 Example 2: Real-time Feature Tracking with Pattern Shifts

Now let's see how OnlineRake handles changing data patterns over time - a common challenge in real-world streaming data!

[6]:
# Set up new scenario with different targets
streaming_targets = Targets(
    feature_a=0.48,  # 48% have feature A
    feature_b=0.53,  # 53% have feature B
    feature_c=0.32,  # 32% have feature C
    feature_d=0.17,  # 17% have feature D
)

print("🌊 STREAMING SCENARIO: Time-varying bias patterns")
print("=" * 50)
print("🎯 Target feature margins:")
for feature, target in streaming_targets.as_dict().items():
    print(f"   {feature}: {target:.1%}")

raker = OnlineRakingSGD(streaming_targets, learning_rate=3.0)
print(f"\nπŸš€ SGD Raker initialized with learning rate: {raker.learning_rate}")
🌊 STREAMING SCENARIO: Time-varying bias patterns
==================================================
🎯 Target feature margins:
   feature_a: 48.0%
   feature_b: 53.0%
   feature_c: 32.0%
   feature_d: 17.0%

🚀 SGD Raker initialized with learning rate: 3.0
[7]:
# Simulate data with time-varying bias
np.random.seed(789)
n_obs = 1000

print(f"\n🎭 Simulating {n_obs} observations with time-varying bias...")
print("πŸ“Š Pattern: Feature probabilities change over time\n")

# Track evolution
margin_history = []
loss_history = []
ess_history = []
time_points = []

for i in range(n_obs):
    # Feature patterns change over time
    time_factor = i / n_obs

    # Feature A: increases over time (0.2 → 0.6)
    p_feature_a = 0.2 + 0.4 * time_factor
    feature_a = 1 if np.random.random() < p_feature_a else 0

    # Feature B: relatively stable
    feature_b = 1 if np.random.random() < 0.52 else 0

    # Feature C: decreases over time (0.6 → 0.3)
    p_feature_c = 0.6 - 0.3 * time_factor
    feature_c = 1 if np.random.random() < p_feature_c else 0

    # Feature D: relatively stable
    feature_d = 1 if np.random.random() < 0.18 else 0

    obs = {
        "feature_a": feature_a, "feature_b": feature_b,
        "feature_c": feature_c, "feature_d": feature_d
    }
    raker.partial_fit(obs)

    # Record progress every 50 observations
    if (i + 1) % 50 == 0:
        time_points.append(i + 1)
        margin_history.append(raker.margins.copy())
        loss_history.append(raker.loss)
        ess_history.append(raker.effective_sample_size)

print("βœ… Streaming simulation complete!")
print(f"πŸ“Š Tracked {len(time_points)} checkpoints")
print(f"🎯 Final ESS: {raker.effective_sample_size:.1f} / {n_obs}")

🎭 Simulating 1000 observations with time-varying bias...
📊 Pattern: Feature probabilities change over time

✅ Streaming simulation complete!
📊 Tracked 20 checkpoints
🎯 Final ESS: 743.5 / 1000
[8]:
# Visualize the streaming results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('🌊 Streaming OnlineRake: Adapting to Time-Varying Bias', fontsize=16, fontweight='bold')

# 1. Margin evolution over time
features = list(streaming_targets.feature_names)
colors = plt.cm.Set1(np.linspace(0, 1, len(features)))

for i, feature in enumerate(features):
    feature_margins = [margins[feature] for margins in margin_history]
    axes[0,0].plot(time_points, feature_margins, '-o',
                  label=f'{feature}', color=colors[i], markersize=4)
    # Add target line
    axes[0,0].axhline(y=streaming_targets[feature], color=colors[i],
                     linestyle='--', alpha=0.7, linewidth=2)

axes[0,0].set_xlabel('Observations')
axes[0,0].set_ylabel('Weighted Margin')
axes[0,0].set_title('📈 Margin Evolution Over Time')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# 2. Loss convergence
axes[0,1].plot(time_points, loss_history, '-o', color='red', markersize=4)
axes[0,1].set_xlabel('Observations')
axes[0,1].set_ylabel('Loss')
axes[0,1].set_title('📉 Loss Over Time')
axes[0,1].grid(True, alpha=0.3)
axes[0,1].set_yscale('log')

# 3. Effective Sample Size evolution
axes[1,0].plot(time_points, ess_history, '-o', color='blue', markersize=4)
axes[1,0].set_xlabel('Observations')
axes[1,0].set_ylabel('Effective Sample Size')
axes[1,0].set_title('⚡ ESS Evolution')
axes[1,0].grid(True, alpha=0.3)

# 4. Final comparison
final_margins = raker.margins
target_vals = [streaming_targets[f] for f in features]
final_vals = [final_margins[f] for f in features]
errors = [abs(final_vals[i] - target_vals[i]) for i in range(len(features))]

x = np.arange(len(features))
width = 0.35

axes[1,1].bar(x - width/2, target_vals, width, label='🎯 Target', alpha=0.8, color='gold')
axes[1,1].bar(x + width/2, final_vals, width, label='✅ Achieved', alpha=0.8, color='green')

# Add error annotations
for i, error in enumerate(errors):
    axes[1,1].text(i, max(target_vals[i], final_vals[i]) + 0.02,
                  f'Δ={error:.3f}', ha='center', fontsize=9, color='red')

axes[1,1].set_xlabel('Features')
axes[1,1].set_ylabel('Proportion')
axes[1,1].set_title('🎯 Final Results vs Targets')
axes[1,1].set_xticks(x)
axes[1,1].set_xticklabels(features, rotation=45)
axes[1,1].legend()
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎨 Streaming visualization complete!")
print("🌟 OnlineRake successfully adapted to changing patterns! ✨")
../_images/notebooks_01_getting_started_10_1.png

🎨 Streaming visualization complete!
🌟 OnlineRake successfully adapted to changing patterns! ✨
[9]:
# Print detailed final results
print("\nπŸ“‹ STREAMING RESULTS SUMMARY")
print("=" * 50)
for feature in features:
    target = streaming_targets[feature]
    final = final_margins[feature]
    error = abs(final - target)
    print(f"{feature:<12}: {final:.3f} (target: {target:.3f}, error: {error:.3f})")

avg_error = np.mean([abs(final_margins[f] - streaming_targets[f]) for f in features])
print(f"\nπŸ“Š Average absolute error: {avg_error:.4f}")
print(f"🎯 Final ESS: {raker.effective_sample_size:.1f} / {n_obs}")
print(f"πŸ“‰ Final loss: {raker.loss:.6f}")

if avg_error < 0.02:
    print("\n🏆 EXCELLENT! Very low error achieved! 🎉")
elif avg_error < 0.05:
    print("\n✅ GOOD! Acceptable error level achieved! 👍")
else:
    print("\n⚠️ MODERATE: Consider tuning parameters for better performance")

📋 STREAMING RESULTS SUMMARY
==================================================
feature_a   : 0.496 (target: 0.480, error: 0.016)
feature_b   : 0.532 (target: 0.530, error: 0.002)
feature_c   : 0.321 (target: 0.320, error: 0.001)
feature_d   : 0.173 (target: 0.170, error: 0.003)

📊 Average absolute error: 0.0054
🎯 Final ESS: 743.5 / 1000
📉 Final loss: 0.000259

🏆 EXCELLENT! Very low error achieved! 🎉

🎉 Summary: OnlineRake Success!

Congratulations! 🎊 You've successfully used OnlineRake to:

✅ Correct feature bias in real-time survey data
✅ Handle time-varying patterns in streaming data
✅ Achieve target margins with quantifiable accuracy
✅ Monitor performance with comprehensive diagnostics

🔑 Key Takeaways:

  1. SGD Raking is fast and effective for most scenarios

  2. MWU Raking maintains positive weights through multiplicative updates

  3. Learning rates can be tuned for convergence speed vs stability (see the sketch after this list)

  4. Real-time monitoring helps detect issues early

  5. Visual validation makes results immediately obvious
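
To make takeaway 3 concrete, here is a sketch that replays the Example 1 stream under an arbitrary grid of learning rates and compares final loss and ESS. It reuses only the API shown above and assumes Targets.as_dict() preserves the feature order:

# Probe the speed-vs-stability trade-off with a small, arbitrary grid of rates
probs = [0.30, 0.35, 0.60, 0.15]  # the biased sampling rates from Example 1
for lr in [0.5, 1.0, 2.0, 4.0, 8.0]:
    probe = OnlineRakingSGD(targets, learning_rate=lr)
    rng = np.random.default_rng(42)
    for _ in range(500):
        obs = {f: int(rng.random() < p) for f, p in zip(targets.as_dict(), probs)}
        probe.partial_fit(obs)
    print(f"lr={lr:>4}: loss={probe.loss:.5f}, ESS={probe.effective_sample_size:.1f}")

Too small a rate converges slowly (high residual loss); too large a rate chases noise and depresses the effective sample size.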

🚀 Next Steps:

  • Try the Performance Comparison notebook for SGD vs MWU analysis

  • Explore Advanced Diagnostics for convergence monitoring

  • Check out the API Reference for all available options

Happy raking! 🎯✨

[10]:
print("🎊 Thank you for using OnlineRake!")
print("πŸ“š Check out the documentation for more examples and advanced features")
print("πŸ› Found a bug or have a feature request? Please let us know!")
print("⭐ If you found this useful, consider starring the repository! ⭐")
🎊 Thank you for using OnlineRake!
πŸ“š Check out the documentation for more examples and advanced features
πŸ› Found a bug or have a feature request? Please let us know!
⭐ If you found this useful, consider starring the repository! ⭐