Survey Weight Calibration: A Realistic Example

This example demonstrates how to use fairlex for calibrating survey weights in a realistic scenario. We’ll simulate a political opinion survey with typical demographic biases and calibrate against US Census benchmarks.

Scenario

A polling organization conducted a survey with 200 respondents to gauge public opinion. Like most surveys, the sample has demographic biases:

  • Over-representation of older, higher-educated respondents

  • Under-representation of Hispanic and younger demographics

  • Geographic skew toward certain regions

We’ll use leximin calibration to adjust the weights to match known population demographics from the 2023 US Census.

[1]:
import numpy as np
import pandas as pd

from fairlex import evaluate_solution, leximin_residual, leximin_weight_fair

# Set random seed for reproducibility
np.random.seed(42)

Step 1: Create Realistic Survey Data

We’ll simulate survey respondents with demographic characteristics that exhibit typical survey biases.

[2]:
n_respondents = 200

# Generate biased survey sample
# Age groups: 18-29, 30-44, 45-64, 65+
age_groups = np.random.choice(
    ['18-29', '30-44', '45-64', '65+'],
    size=n_respondents,
    p=[0.15, 0.20, 0.35, 0.30]  # Skewed toward older respondents
)

# Gender: Male, Female
gender = np.random.choice(
    ['Male', 'Female'],
    size=n_respondents,
    p=[0.48, 0.52]  # Close to population
)

# Race/Ethnicity
race_ethnicity = np.random.choice(
    ['White_NH', 'Black', 'Hispanic', 'Asian', 'Other'],
    size=n_respondents,
    p=[0.70, 0.11, 0.10, 0.06, 0.03]  # Under-representation of Hispanic, over-representation of White
)

# Education: HS_or_less, Some_college, Bachelor_plus
education = np.random.choice(
    ['HS_or_less', 'Some_college', 'Bachelor_plus'],
    size=n_respondents,
    p=[0.25, 0.30, 0.45]  # Over-representation of college educated
)

# Region: Northeast, Midwest, South, West
region = np.random.choice(
    ['Northeast', 'Midwest', 'South', 'West'],
    size=n_respondents,
    p=[0.20, 0.22, 0.35, 0.23]  # Roughly representative
)

# Create DataFrame
survey_data = pd.DataFrame({
    'age_group': age_groups,
    'gender': gender,
    'race_ethnicity': race_ethnicity,
    'education': education,
    'region': region
})

print("Survey Sample Demographics:")
print("===========================")
for col in survey_data.columns:
    print(f"\n{col.replace('_', ' ').title()}:")
    print(survey_data[col].value_counts(normalize=True).round(3))
Survey Sample Demographics:
===========================

Age Group:
age_group
65+      0.300
45-64    0.290
30-44    0.245
18-29    0.165
Name: proportion, dtype: float64

Gender:
gender
Female    0.555
Male      0.445
Name: proportion, dtype: float64

Race Ethnicity:
race_ethnicity
White_NH    0.670
Hispanic    0.130
Asian       0.080
Black       0.075
Other       0.045
Name: proportion, dtype: float64

Education:
education
Bachelor_plus    0.410
Some_college     0.315
HS_or_less       0.275
Name: proportion, dtype: float64

Region:
region
South        0.34
Midwest      0.24
Northeast    0.23
West         0.19
Name: proportion, dtype: float64

Step 2: Define Population Benchmarks

These targets are based on 2023 US Census data and represent the true population distributions we want to match.
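Since each margin's categories partition the population, the target proportions within each margin should sum to one. A quick consistency check catches typos before they propagate into the calibration; the snippet below sketches the idea with a hypothetical subset of the targets:

```python
# Hypothetical subset of the population targets, for illustration
population_targets = {
    'male': 0.495, 'female': 0.505,
    'northeast': 0.17, 'midwest': 0.21, 'south': 0.38, 'west': 0.24,
}
margin_groups = {
    'gender': ['male', 'female'],
    'region': ['northeast', 'midwest', 'south', 'west'],
}

# Each margin's categories should partition the population exactly
for name, keys in margin_groups.items():
    total = sum(population_targets[k] for k in keys)
    assert abs(total - 1.0) < 1e-9, f"{name} targets sum to {total}, not 1.0"
print("all margin groups sum to 1.0")
```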

[3]:
# Population benchmarks from 2023 US Census data
population_targets = {
    # Age distribution (approximate from Census data)
    'age_18_29': 0.18,
    'age_30_44': 0.25,
    'age_45_64': 0.32,
    'age_65_plus': 0.25,

    # Gender distribution
    'male': 0.495,
    'female': 0.505,

    # Race/Ethnicity distribution
    'white_nh': 0.582,
    'black': 0.120,
    'hispanic': 0.190,
    'asian': 0.058,
    'other_race': 0.050,

    # Education distribution (adults 25+, approximate)
    'hs_or_less': 0.38,
    'some_college': 0.28,
    'bachelor_plus': 0.34,

    # Regional distribution
    'northeast': 0.17,
    'midwest': 0.21,
    'south': 0.38,
    'west': 0.24
}

print("Population Targets (2023 US Census):")
print("====================================")
for category, target in population_targets.items():
    print(f"{category.replace('_', ' ').title()}: {target:.1%}")
Population Targets (2023 US Census):
====================================
Age 18 29: 18.0%
Age 30 44: 25.0%
Age 45 64: 32.0%
Age 65 Plus: 25.0%
Male: 49.5%
Female: 50.5%
White Nh: 58.2%
Black: 12.0%
Hispanic: 19.0%
Asian: 5.8%
Other Race: 5.0%
Hs Or Less: 38.0%
Some College: 28.0%
Bachelor Plus: 34.0%
Northeast: 17.0%
Midwest: 21.0%
South: 38.0%
West: 24.0%

Step 3: Construct Membership Matrix

The membership matrix A defines which respondents belong to each demographic group. Each row represents a demographic category, and each column represents a respondent.
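To make the construction concrete, here is a toy membership matrix for three respondents and a single gender margin (the variable names here are illustrative, not part of fairlex):

```python
import numpy as np
import pandas as pd

# Toy sample: 3 respondents, one demographic variable
toy = pd.DataFrame({'gender': ['Male', 'Female', 'Female']})

# One row per category: 1.0 where the respondent belongs, else 0.0
A_toy = np.array([(toy['gender'] == g).astype(float).to_numpy()
                  for g in ['Male', 'Female']])

print(A_toy)               # [[1. 0. 0.]
                           #  [0. 1. 1.]]
print(A_toy @ np.ones(3))  # weighted count per category: [1. 2.]
```

With unit weights, `A_toy @ w` recovers the raw category counts; with calibrated weights it gives the weighted totals that must match the targets `b`.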

[4]:
# Create membership matrix A
membership_indicators = []
target_totals = []
margin_labels = []

# (column, label prefix, {category value: target proportion}) for each margin
margin_specs = [
    ('age_group', 'Age', {
        '18-29': population_targets['age_18_29'],
        '30-44': population_targets['age_30_44'],
        '45-64': population_targets['age_45_64'],
        '65+': population_targets['age_65_plus'],
    }),
    ('gender', 'Gender', {
        'Male': population_targets['male'],
        'Female': population_targets['female'],
    }),
    ('race_ethnicity', 'Race', {
        'White_NH': population_targets['white_nh'],
        'Black': population_targets['black'],
        'Hispanic': population_targets['hispanic'],
        'Asian': population_targets['asian'],
        'Other': population_targets['other_race'],
    }),
    ('education', 'Education', {
        'HS_or_less': population_targets['hs_or_less'],
        'Some_college': population_targets['some_college'],
        'Bachelor_plus': population_targets['bachelor_plus'],
    }),
    ('region', 'Region', {
        'Northeast': population_targets['northeast'],
        'Midwest': population_targets['midwest'],
        'South': population_targets['south'],
        'West': population_targets['west'],
    }),
]

for column, prefix, mapping in margin_specs:
    for value, target_prop in mapping.items():
        membership_indicators.append((survey_data[column] == value).astype(float))
        target_totals.append(target_prop * n_respondents)
        margin_labels.append(f'{prefix} {value}')

# Population total constraint
total_indicator = np.ones(n_respondents)
membership_indicators.append(total_indicator)
target_totals.append(n_respondents)
margin_labels.append('Total Population')

# Convert to arrays
A = np.array(membership_indicators, dtype=float)
b = np.array(target_totals, dtype=float)

print(f"Membership matrix shape: {A.shape}")
print(f"Number of demographic margins: {len(margin_labels)}")
print(f"Number of respondents: {n_respondents}")
Membership matrix shape: (19, 200)
Number of demographic margins: 19
Number of respondents: 200

Step 4: Set Up Base Weights

We start with equal base weights representing a simple random sample design.
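Here the design is a simple random sample, so equal weights are appropriate. For designs with unequal selection probabilities, base weights would typically be the inverse of each respondent's selection probability; the sketch below uses hypothetical probabilities, not part of this survey's design:

```python
import numpy as np

# Hypothetical per-respondent selection probabilities from a stratified design
selection_prob = np.array([0.01, 0.02, 0.01, 0.005])

# Horvitz-Thompson style design weights: inverse selection probability
design_weights = 1.0 / selection_prob

# Rescale so the weights sum to the sample size (mean weight of 1.0),
# matching the convention used for w0 above
w0_design = design_weights * (len(design_weights) / design_weights.sum())
print(w0_design.round(3))  # [0.889 0.444 0.889 1.778]
```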

[5]:
# Base weights (equal weights for simple random sample)
w0 = np.ones(n_respondents)

print(f"Base weights: {n_respondents} equal weights of {w0[0]:.1f}")
print(f"Base weight total: {w0.sum():.1f}")
Base weights: 200 equal weights of 1.0
Base weight total: 200.0

Step 5: Analyze Pre-Calibration Bias

Let’s examine the demographic bias in our sample before calibration.

[6]:
# Calculate current sample proportions
current_totals = A @ w0
current_props = current_totals / n_respondents
target_props = b / n_respondents

# Create comparison DataFrame
bias_analysis = pd.DataFrame({
    'Demographic': margin_labels,
    'Sample_%': current_props * 100,
    'Target_%': target_props * 100,
    'Difference': (current_props - target_props) * 100
})

print("Pre-Calibration Demographic Bias:")
print("==================================")
print(bias_analysis.round(1))

# Highlight largest biases
largest_biases = bias_analysis.iloc[:-1].sort_values('Difference', key=abs, ascending=False).head(5)
print("\nLargest Demographic Biases:")
print("===========================")
for _, row in largest_biases.iterrows():
    direction = "over" if row['Difference'] > 0 else "under"
    print(f"{row['Demographic']}: {abs(row['Difference']):.1f}pp {direction}-represented")
Pre-Calibration Demographic Bias:
==================================
                Demographic  Sample_%  Target_%  Difference
0                 Age 18-29      16.5      18.0        -1.5
1                 Age 30-44      24.5      25.0        -0.5
2                 Age 45-64      29.0      32.0        -3.0
3                   Age 65+      30.0      25.0         5.0
4               Gender Male      44.5      49.5        -5.0
5             Gender Female      55.5      50.5         5.0
6             Race White_NH      67.0      58.2         8.8
7                Race Black       7.5      12.0        -4.5
8             Race Hispanic      13.0      19.0        -6.0
9                Race Asian       8.0       5.8         2.2
10               Race Other       4.5       5.0        -0.5
11     Education HS_or_less      27.5      38.0       -10.5
12   Education Some_college      31.5      28.0         3.5
13  Education Bachelor_plus      41.0      34.0         7.0
14         Region Northeast      23.0      17.0         6.0
15           Region Midwest      24.0      21.0         3.0
16             Region South      34.0      38.0        -4.0
17              Region West      19.0      24.0        -5.0
18         Total Population     100.0     100.0         0.0

Largest Demographic Biases:
===========================
Education HS_or_less: 10.5pp under-represented
Race White_NH: 8.8pp over-represented
Education Bachelor_plus: 7.0pp over-represented
Race Hispanic: 6.0pp under-represented
Region Northeast: 6.0pp over-represented

Step 6: Apply Leximin Calibration

We’ll apply both calibration methods available in fairlex:

  1. Residual leximin: Minimizes the worst margin error

  2. Weight-fair leximin: Balances margin accuracy with weight stability
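Both methods are linear programs at heart. As a rough sketch of the first stage of residual leximin, minimizing the worst absolute margin error subject to weight-ratio bounds, here is a toy version built directly on `scipy.optimize.linprog`. fairlex's actual solver may differ in details, such as iteratively tightening margins that are already at their optimum:

```python
import numpy as np
from scipy.optimize import linprog

# Toy problem: 3 respondents, 2 overlapping margins
A = np.array([[1.0, 1.0, 0.0],   # margin 1 contains respondents 0 and 1
              [0.0, 1.0, 1.0]])  # margin 2 contains respondents 1 and 2
b = np.array([1.5, 1.8])         # target totals
n = A.shape[1]

# Variables: [w_0, ..., w_{n-1}, eps]; objective: minimize eps
c = np.zeros(n + 1)
c[-1] = 1.0

# |A @ w - b| <= eps, written as two sets of linear inequalities
ones = np.ones((A.shape[0], 1))
A_ub = np.vstack([np.hstack([A, -ones]),    #  A w - eps <= b
                  np.hstack([-A, -ones])])  # -A w - eps <= -b
b_ub = np.concatenate([b, -b])

bounds = [(0.2, 5.0)] * n + [(0.0, None)]   # min_ratio/max_ratio on w, eps >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

print(res.x[:n])   # calibrated weights
print(res.x[-1])   # worst absolute residual (eps)
```

In this toy problem the bounds are loose enough that every margin can be hit exactly, so the optimal eps is zero, just as in the survey results below.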

[7]:
# Method 1: Residual leximin calibration
result_residual = leximin_residual(
    A, b, w0,
    min_ratio=0.2,  # Allow weights to be as low as 0.2x original
    max_ratio=5.0   # Allow weights to be as high as 5.0x original
)

print("Residual Leximin Results:")
print("========================")
print(f"Optimization status: {result_residual.status} ({result_residual.message})")
print(f"Maximum absolute residual (epsilon): {result_residual.epsilon:.4f}")
print(f"Weight range: [{result_residual.w.min():.3f}, {result_residual.w.max():.3f}]")
print(f"Weight mean: {result_residual.w.mean():.3f}")

# Method 2: Weight-fair leximin calibration
result_weight_fair = leximin_weight_fair(
    A, b, w0,
    min_ratio=0.2,
    max_ratio=5.0,
    slack=0.001  # Allow small additional margin error for better weight stability
)

print("\nWeight-Fair Leximin Results:")
print("============================")
print(f"Optimization status: {result_weight_fair.status} ({result_weight_fair.message})")
print(f"Maximum absolute residual (epsilon): {result_weight_fair.epsilon:.4f}")
print(f"Maximum relative weight change (t): {result_weight_fair.t:.4f}")
print(f"Weight range: [{result_weight_fair.w.min():.3f}, {result_weight_fair.w.max():.3f}]")
print(f"Weight mean: {result_weight_fair.w.mean():.3f}")
Residual Leximin Results:
========================
Optimization status: 0 (Optimization terminated successfully. (HiGHS Status 7: Optimal))
Maximum absolute residual (epsilon): 0.0000
Weight range: [0.200, 5.000]
Weight mean: 1.000

Weight-Fair Leximin Results:
============================
Optimization status: 0 (Optimization terminated successfully. (HiGHS Status 7: Optimal))
Maximum absolute residual (epsilon): 0.0000
Maximum relative weight change (t): 0.5999
Weight range: [0.400, 1.600]
Weight mean: 1.000

Step 7: Evaluate Calibration Quality

Let’s assess how well each method performed using comprehensive diagnostics.

[8]:
# Evaluate both methods
metrics_residual = evaluate_solution(A, b, result_residual.w, base_weights=w0)
metrics_weight_fair = evaluate_solution(A, b, result_weight_fair.w, base_weights=w0)

# Create comparison DataFrame
comparison = pd.DataFrame({
    'Metric': list(metrics_residual.keys()),
    'Residual_Method': list(metrics_residual.values()),
    'Weight_Fair_Method': list(metrics_weight_fair.values())
})

print("Calibration Method Comparison:")
print("=============================\n")
print(comparison.round(4))

# Interpret key metrics
print("\n\nKey Insights:")
print("=============")
print(f"• Margin Accuracy: Residual method achieves max error of {metrics_residual['resid_max_abs']:.4f}")
print(f"                   Weight-fair method achieves max error of {metrics_weight_fair['resid_max_abs']:.4f}")
print(f"• Weight Stability: Residual method ESS = {metrics_residual['ESS']:.1f} (design effect = {metrics_residual['deff']:.2f})")
print(f"                    Weight-fair method ESS = {metrics_weight_fair['ESS']:.1f} (design effect = {metrics_weight_fair['deff']:.2f})")
print(f"• Weight Changes: Residual method max change = {metrics_residual['max_rel_dev']:.2%}")
print(f"                  Weight-fair method max change = {metrics_weight_fair['max_rel_dev']:.2%}")
Calibration Method Comparison:
=============================

            Metric  Residual_Method  Weight_Fair_Method
0    resid_max_abs           0.0000              0.0010
1        resid_p95           0.0000              0.0010
2     resid_median           0.0000              0.0010
3      total_error          -0.0000             -0.0010
4       weight_max           5.0000              1.5999
5       weight_min           0.2000              0.4001
6       weight_p99           5.0000              1.5999
7       weight_p95           5.0000              1.5999
8    weight_median           0.2000              0.9395
9              ESS          51.1336            148.2206
10            deff           3.9113              1.3493
11     max_rel_dev           4.0000              0.5999
12     p95_rel_dev           4.0000              0.5999
13  median_rel_dev           0.8000              0.5999


Key Insights:
=============
• Margin Accuracy: Residual method achieves max error of 0.0000
                   Weight-fair method achieves max error of 0.0010
• Weight Stability: Residual method ESS = 51.1 (design effect = 3.91)
                    Weight-fair method ESS = 148.2 (design effect = 1.35)
• Weight Changes: Residual method max change = 400.00%
                  Weight-fair method max change = 59.99%
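The ESS and design-effect figures above follow Kish's approximation: ESS = (Σw)² / Σw², and deff = n / ESS. A self-contained sketch, assuming fairlex uses these standard formulas (the stylized weights below merely mimic the two-point distribution the weight-fair method produced):

```python
import numpy as np

def kish_ess(w):
    """Kish's effective sample size: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Stylized two-point weight distribution, loosely mimicking the result above
w = np.array([0.4] * 100 + [1.6] * 100)

ess = kish_ess(w)
deff = len(w) / ess  # design effect: variance inflation from unequal weights
print(round(ess, 1), round(deff, 2))  # 147.1 1.36
```

Equal weights give ESS = n and deff = 1; the more the weights spread out, the fewer "effective" respondents remain.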

Step 8: Analyze Post-Calibration Demographics

Let’s verify that our calibration successfully corrected the demographic biases.

[9]:
# Calculate post-calibration demographics for weight-fair method
calibrated_totals = A @ result_weight_fair.w
calibrated_props = calibrated_totals / n_respondents

# Create final comparison
final_comparison = pd.DataFrame({
    'Demographic': margin_labels,
    'Original_%': (A @ w0 / n_respondents) * 100,
    'Target_%': (b / n_respondents) * 100,
    'Calibrated_%': calibrated_props * 100,
    'Final_Error': abs(calibrated_props - target_props) * 100
})

print("Post-Calibration Results (Weight-Fair Method):")
print("===============================================\n")
print(final_comparison.round(2))

# Summary statistics
max_error = final_comparison['Final_Error'].iloc[:-1].max()  # Exclude total row
mean_error = final_comparison['Final_Error'].iloc[:-1].mean()

print("\nCalibration Summary:")
print("===================")
print(f"Maximum demographic error: {max_error:.3f} percentage points")
print(f"Average demographic error: {mean_error:.3f} percentage points")
print(f"All margins calibrated to within ±{max_error:.3f}pp of targets")
Post-Calibration Results (Weight-Fair Method):
===============================================

                Demographic  Original_%  Target_%  Calibrated_%  Final_Error
0                 Age 18-29        16.5      18.0          18.0          0.0
1                 Age 30-44        24.5      25.0          25.0          0.0
2                 Age 45-64        29.0      32.0          32.0          0.0
3                   Age 65+        30.0      25.0          25.0          0.0
4               Gender Male        44.5      49.5          49.5          0.0
5             Gender Female        55.5      50.5          50.5          0.0
6             Race White_NH        67.0      58.2          58.2          0.0
7                Race Black         7.5      12.0          12.0          0.0
8             Race Hispanic        13.0      19.0          19.0          0.0
9                Race Asian         8.0       5.8           5.8          0.0
10               Race Other         4.5       5.0           5.0          0.0
11     Education HS_or_less        27.5      38.0          38.0          0.0
12   Education Some_college        31.5      28.0          28.0          0.0
13  Education Bachelor_plus        41.0      34.0          34.0          0.0
14         Region Northeast        23.0      17.0          17.0          0.0
15           Region Midwest        24.0      21.0          21.0          0.0
16             Region South        34.0      38.0          38.0          0.0
17              Region West        19.0      24.0          24.0          0.0
18         Total Population       100.0     100.0         100.0          0.0

Calibration Summary:
===================
Maximum demographic error: 0.001 percentage points
Average demographic error: 0.000 percentage points
All margins calibrated to within ±0.001pp of targets

Step 9: Practical Interpretation

Understanding what these results mean for survey analysis.

[10]:
# Weight distribution analysis
weight_stats = pd.DataFrame({
    'Statistic': ['Min', 'Q25', 'Median', 'Q75', 'Max', 'Mean', 'Std Dev'],
    'Original_Weights': [w0.min(), np.percentile(w0, 25), np.median(w0),
                         np.percentile(w0, 75), w0.max(), w0.mean(), w0.std()],
    'Calibrated_Weights': [result_weight_fair.w.min(), np.percentile(result_weight_fair.w, 25),
                           np.median(result_weight_fair.w), np.percentile(result_weight_fair.w, 75),
                           result_weight_fair.w.max(), result_weight_fair.w.mean(),
                           result_weight_fair.w.std()]
})

print("Weight Distribution Analysis:")
print("============================\n")
print(weight_stats.round(3))

print("\n\nPractical Implications:")
print("=======================")
print(f"1. Survey Representativeness: {len([x for x in final_comparison['Final_Error'].iloc[:-1] if abs(x) > 0.001])} demographic margins remain off target by more than 0.001pp")
print(f"2. Effective Sample Size: Reduced from {n_respondents} to {metrics_weight_fair['ESS']:.0f} due to weighting")
print(f"3. Design Effect: {metrics_weight_fair['deff']:.2f} (variance inflation factor)")
print(f"4. Weight Variability: Coefficient of variation = {result_weight_fair.w.std() / result_weight_fair.w.mean():.3f}")

print("\nMethod Recommendation:")
print("=====================")
if metrics_weight_fair['ESS'] > metrics_residual['ESS']:
    print("✓ Weight-fair method recommended: Better preserves effective sample size")
else:
    print("✓ Residual method recommended: Achieves better margin accuracy")

print(f"\nCalibration achieves population-representative results with {metrics_weight_fair['ESS']:.0f} effective respondents.")
Weight Distribution Analysis:
============================

  Statistic  Original_Weights  Calibrated_Weights
0       Min               1.0               0.400
1       Q25               1.0               0.400
2    Median               1.0               0.940
3       Q75               1.0               1.600
4       Max               1.0               1.600
5      Mean               1.0               1.000
6   Std Dev               0.0               0.591


Practical Implications:
=======================
1. Survey Representativeness: 0 demographic margins remain off target by more than 0.001pp
2. Effective Sample Size: Reduced from 200 to 148 due to weighting
3. Design Effect: 1.35 (variance inflation factor)
4. Weight Variability: Coefficient of variation = 0.591

Method Recommendation:
=====================
✓ Weight-fair method recommended: Better preserves effective sample size

Calibration achieves population-representative results with 148 effective respondents.

Summary

This example demonstrated realistic survey weight calibration using fairlex:

Key Features Demonstrated:

  • Realistic survey biases (age, education, race/ethnicity skews)

  • Multiple demographic margins (18 categories + total)

  • US Census population benchmarks

  • Comparison of residual vs. weight-fair methods

  • Comprehensive quality assessment

Typical Use Cases:

  • Political polling calibration

  • Market research weight adjustment

  • Social survey representativeness correction

  • Post-stratification in complex surveys

Method Selection Guidelines:

  • Residual leximin: Use when margin accuracy is paramount

  • Weight-fair leximin: Use when both accuracy and weight stability matter

  • Consider design effect and effective sample size in your choice

The leximin approach ensures that no single demographic group bears a disproportionate burden in achieving representativeness, making it particularly suitable for surveys with multiple important demographic targets.
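The leximin criterion itself can be made concrete with a small comparator: profiles of absolute margin errors are compared from the worst error down, so reducing the largest error always takes priority. A toy sketch, illustrative only and not fairlex's internal implementation:

```python
import numpy as np

def leximin_better(r1, r2):
    """True if error profile r1 is leximin-preferred to r2: compare
    absolute residuals sorted from worst to best; first difference wins."""
    s1 = np.sort(np.abs(r1))[::-1]
    s2 = np.sort(np.abs(r2))[::-1]
    for a, b in zip(s1, s2):
        if a != b:
            return bool(a < b)
    return False  # identical profiles: neither is strictly preferred

# A smaller worst-case error wins, even at the cost of more nonzero errors
print(leximin_better([0.02, 0.01, 0.01], [0.03, 0.0, 0.0]))  # True
# With the worst errors tied, the second-worst error decides
print(leximin_better([0.03, 0.01], [0.03, 0.02]))  # True
```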