Survey Weight Calibration: A Realistic Example¶
This example demonstrates how to use fairlex for calibrating survey weights in a realistic scenario. We’ll simulate a political opinion survey with typical demographic biases and calibrate against US Census benchmarks.
Scenario¶
A polling organization conducted a survey with 200 respondents to gauge public opinion. Like most surveys, the sample has demographic biases:
Over-representation of older, higher-educated respondents
Under-representation of Hispanic and younger demographics
Geographic skew toward certain regions
We’ll use leximin calibration to adjust the weights to match known population demographics from the 2023 US Census.
[1]:
import numpy as np
import pandas as pd
from fairlex import evaluate_solution, leximin_residual, leximin_weight_fair
# Set random seed for reproducibility
np.random.seed(42)
Step 1: Create Realistic Survey Data¶
We’ll simulate survey respondents with demographic characteristics that exhibit typical survey biases.
[2]:
n_respondents = 200
# Generate biased survey sample
# Age groups: 18-29, 30-44, 45-64, 65+
age_groups = np.random.choice(
['18-29', '30-44', '45-64', '65+'],
size=n_respondents,
p=[0.15, 0.20, 0.35, 0.30] # Skewed toward older respondents
)
# Gender: Male, Female
gender = np.random.choice(
['Male', 'Female'],
size=n_respondents,
p=[0.48, 0.52] # Close to population
)
# Race/Ethnicity
race_ethnicity = np.random.choice(
['White_NH', 'Black', 'Hispanic', 'Asian', 'Other'],
size=n_respondents,
p=[0.70, 0.11, 0.10, 0.06, 0.03] # Under-representation of Hispanic, over-representation of White
)
# Education: HS_or_less, Some_college, Bachelor_plus
education = np.random.choice(
['HS_or_less', 'Some_college', 'Bachelor_plus'],
size=n_respondents,
p=[0.25, 0.30, 0.45] # Over-representation of college educated
)
# Region: Northeast, Midwest, South, West
region = np.random.choice(
['Northeast', 'Midwest', 'South', 'West'],
size=n_respondents,
p=[0.20, 0.22, 0.35, 0.23] # Roughly representative
)
# Create DataFrame
survey_data = pd.DataFrame({
'age_group': age_groups,
'gender': gender,
'race_ethnicity': race_ethnicity,
'education': education,
'region': region
})
print("Survey Sample Demographics:")
print("===========================")
for col in survey_data.columns:
print(f"\n{col.replace('_', ' ').title()}:")
print(survey_data[col].value_counts(normalize=True).round(3))
Survey Sample Demographics:
===========================
Age Group:
age_group
65+ 0.300
45-64 0.290
30-44 0.245
18-29 0.165
Name: proportion, dtype: float64
Gender:
gender
Female 0.555
Male 0.445
Name: proportion, dtype: float64
Race Ethnicity:
race_ethnicity
White_NH 0.670
Hispanic 0.130
Asian 0.080
Black 0.075
Other 0.045
Name: proportion, dtype: float64
Education:
education
Bachelor_plus 0.410
Some_college 0.315
HS_or_less 0.275
Name: proportion, dtype: float64
Region:
region
South 0.34
Midwest 0.24
Northeast 0.23
West 0.19
Name: proportion, dtype: float64
Step 2: Define Population Benchmarks¶
These targets are based on 2023 US Census data and represent the true population distributions we want to match.
[3]:
# Population benchmarks from 2023 US Census data
population_targets = {
# Age distribution (approximate from Census data)
'age_18_29': 0.18,
'age_30_44': 0.25,
'age_45_64': 0.32,
'age_65_plus': 0.25,
# Gender distribution
'male': 0.495,
'female': 0.505,
# Race/Ethnicity distribution
'white_nh': 0.582,
'black': 0.120,
'hispanic': 0.190,
'asian': 0.058,
'other_race': 0.050,
# Education distribution (adults 25+, approximate)
'hs_or_less': 0.38,
'some_college': 0.28,
'bachelor_plus': 0.34,
# Regional distribution
'northeast': 0.17,
'midwest': 0.21,
'south': 0.38,
'west': 0.24
}
print("Population Targets (2023 US Census):")
print("====================================")
for category, target in population_targets.items():
print(f"{category.replace('_', ' ').title()}: {target:.1%}")
Population Targets (2023 US Census):
====================================
Age 18 29: 18.0%
Age 30 44: 25.0%
Age 45 64: 32.0%
Age 65 Plus: 25.0%
Male: 49.5%
Female: 50.5%
White Nh: 58.2%
Black: 12.0%
Hispanic: 19.0%
Asian: 5.8%
Other Race: 5.0%
Hs Or Less: 38.0%
Some College: 28.0%
Bachelor Plus: 34.0%
Northeast: 17.0%
Midwest: 21.0%
South: 38.0%
West: 24.0%
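Before building constraints, it is worth confirming that the category targets within each margin sum to 1; inconsistent margins make exact calibration impossible. A minimal sanity check (the `margin_targets` grouping below simply restates the proportions above):

```python
import numpy as np

# Category targets from the cell above, regrouped by margin; each margin's
# proportions should sum to 1 before we build calibration constraints.
margin_targets = {
    'age':       [0.18, 0.25, 0.32, 0.25],
    'gender':    [0.495, 0.505],
    'race':      [0.582, 0.120, 0.190, 0.058, 0.050],
    'education': [0.38, 0.28, 0.34],
    'region':    [0.17, 0.21, 0.38, 0.24],
}
for name, props in margin_targets.items():
    assert np.isclose(sum(props), 1.0), f"{name} targets sum to {sum(props):.3f}"
print("All margins sum to 1.0")
```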
Step 3: Construct Membership Matrix¶
The membership matrix A defines which respondents belong to each demographic group. Each row represents a demographic category, and each column represents a respondent.
[4]:
# Create membership matrix A
membership_indicators = []
target_totals = []
margin_labels = []
# Age groups
for age_group, target_prop in zip(['18-29', '30-44', '45-64', '65+'],
[population_targets['age_18_29'], population_targets['age_30_44'],
population_targets['age_45_64'], population_targets['age_65_plus']]):
indicator = (survey_data['age_group'] == age_group).astype(float)
membership_indicators.append(indicator)
target_totals.append(target_prop * n_respondents)
margin_labels.append(f'Age {age_group}')
# Gender
for gender_val, target_prop in zip(['Male', 'Female'],
[population_targets['male'], population_targets['female']]):
indicator = (survey_data['gender'] == gender_val).astype(float)
membership_indicators.append(indicator)
target_totals.append(target_prop * n_respondents)
margin_labels.append(f'Gender {gender_val}')
# Race/Ethnicity
race_mapping = {
'White_NH': population_targets['white_nh'],
'Black': population_targets['black'],
'Hispanic': population_targets['hispanic'],
'Asian': population_targets['asian'],
'Other': population_targets['other_race']
}
for race_val, target_prop in race_mapping.items():
indicator = (survey_data['race_ethnicity'] == race_val).astype(float)
membership_indicators.append(indicator)
target_totals.append(target_prop * n_respondents)
margin_labels.append(f'Race {race_val}')
# Education
edu_mapping = {
'HS_or_less': population_targets['hs_or_less'],
'Some_college': population_targets['some_college'],
'Bachelor_plus': population_targets['bachelor_plus']
}
for edu_val, target_prop in edu_mapping.items():
indicator = (survey_data['education'] == edu_val).astype(float)
membership_indicators.append(indicator)
target_totals.append(target_prop * n_respondents)
margin_labels.append(f'Education {edu_val}')
# Region
region_mapping = {
'Northeast': population_targets['northeast'],
'Midwest': population_targets['midwest'],
'South': population_targets['south'],
'West': population_targets['west']
}
for region_val, target_prop in region_mapping.items():
indicator = (survey_data['region'] == region_val).astype(float)
membership_indicators.append(indicator)
target_totals.append(target_prop * n_respondents)
margin_labels.append(f'Region {region_val}')
# Population total constraint
total_indicator = np.ones(n_respondents)
membership_indicators.append(total_indicator)
target_totals.append(n_respondents)
margin_labels.append('Total Population')
# Convert to arrays
A = np.array(membership_indicators, dtype=float)
b = np.array(target_totals, dtype=float)
print(f"Membership matrix shape: {A.shape}")
print(f"Number of demographic margins: {len(margin_labels)}")
print(f"Number of respondents: {n_respondents}")
Membership matrix shape: (19, 200)
Number of demographic margins: 19
Number of respondents: 200
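The loops above are written out explicitly for clarity. The same indicator rows can be built more compactly, since a one-hot encoding of each variable yields exactly the margin indicators; a sketch on a toy frame (not the survey data itself):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for survey_data: pd.get_dummies yields one 0/1
# column per category, and its transpose is exactly the membership matrix.
df = pd.DataFrame({
    'age_group': ['18-29', '65+', '30-44', '65+'],
    'gender': ['Male', 'Female', 'Female', 'Male'],
})
dummies = pd.get_dummies(df, dtype=float)             # one column per category
A_compact = dummies.to_numpy().T                      # rows = margins, cols = respondents
A_compact = np.vstack([A_compact, np.ones(len(df))])  # append total-population row
print(A_compact.shape)  # (6, 4): 3 age + 2 gender categories + 1 total row
```

Note that with this approach the row order follows the alphabetical column order produced by `get_dummies`, so the margin labels must be taken from `dummies.columns` rather than written by hand.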
Step 4: Set Up Base Weights¶
We start with equal base weights representing a simple random sample design.
[5]:
# Base weights (equal weights for simple random sample)
w0 = np.ones(n_respondents)
print(f"Base weights: {n_respondents} equal weights of {w0[0]:.1f}")
print(f"Base weight total: {w0.sum():.1f}")
Base weights: 200 equal weights of 1.0
Base weight total: 200.0
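Equal base weights keep this example simple. In practice, base weights usually come from the sampling design; a hedged sketch of the common case, where they are inverse inclusion probabilities rescaled to the sample size (the probabilities here are illustrative, not from any real design):

```python
import numpy as np

# Sketch: with unequal-probability sampling, base weights are typically the
# inverse inclusion probabilities, rescaled to sum to the sample size.
# The probabilities below are hypothetical.
rng = np.random.default_rng(0)
incl_prob = rng.uniform(0.2, 1.0, size=200)   # hypothetical inclusion probabilities
w0_design = 1.0 / incl_prob                   # Horvitz-Thompson style base weights
w0_design *= 200 / w0_design.sum()            # normalize total to n = 200
print(round(w0_design.sum(), 1))  # 200.0
```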
Step 5: Analyze Pre-Calibration Bias¶
Let’s examine the demographic bias in our sample before calibration.
[6]:
# Calculate current sample proportions
current_totals = A @ w0
current_props = current_totals / n_respondents
target_props = b / n_respondents
# Create comparison DataFrame
bias_analysis = pd.DataFrame({
'Demographic': margin_labels,
'Sample_%': current_props * 100,
'Target_%': target_props * 100,
'Difference': (current_props - target_props) * 100
})
print("Pre-Calibration Demographic Bias:")
print("==================================")
print(bias_analysis.round(1))
# Highlight largest biases
largest_biases = bias_analysis.iloc[:-1].sort_values('Difference', key=abs, ascending=False).head(5)
print("\nLargest Demographic Biases:")
print("===========================")
for _, row in largest_biases.iterrows():
direction = "over" if row['Difference'] > 0 else "under"
print(f"{row['Demographic']}: {abs(row['Difference']):.1f}pp {direction}-represented")
Pre-Calibration Demographic Bias:
==================================
Demographic Sample_% Target_% Difference
0 Age 18-29 16.5 18.0 -1.5
1 Age 30-44 24.5 25.0 -0.5
2 Age 45-64 29.0 32.0 -3.0
3 Age 65+ 30.0 25.0 5.0
4 Gender Male 44.5 49.5 -5.0
5 Gender Female 55.5 50.5 5.0
6 Race White_NH 67.0 58.2 8.8
7 Race Black 7.5 12.0 -4.5
8 Race Hispanic 13.0 19.0 -6.0
9 Race Asian 8.0 5.8 2.2
10 Race Other 4.5 5.0 -0.5
11 Education HS_or_less 27.5 38.0 -10.5
12 Education Some_college 31.5 28.0 3.5
13 Education Bachelor_plus 41.0 34.0 7.0
14 Region Northeast 23.0 17.0 6.0
15 Region Midwest 24.0 21.0 3.0
16 Region South 34.0 38.0 -4.0
17 Region West 19.0 24.0 -5.0
18 Total Population 100.0 100.0 0.0
Largest Demographic Biases:
===========================
Education HS_or_less: 10.5pp under-represented
Race White_NH: 8.8pp over-represented
Education Bachelor_plus: 7.0pp over-represented
Race Hispanic: 6.0pp under-represented
Region Northeast: 6.0pp over-represented
Step 6: Apply Leximin Calibration¶
We’ll apply both calibration methods available in fairlex:
Residual leximin: minimizes the worst margin error, with no penalty on how far individual weights move
Weight-fair leximin: accepts a small slack in margin error in exchange for smaller, more stable weight changes
[7]:
# Method 1: Residual leximin calibration
result_residual = leximin_residual(
A, b, w0,
min_ratio=0.2, # Allow weights to be as low as 0.2x original
max_ratio=5.0 # Allow weights to be as high as 5.0x original
)
print("Residual Leximin Results:")
print("========================")
print(f"Optimization status: {result_residual.status} ({result_residual.message})")
print(f"Maximum absolute residual (epsilon): {result_residual.epsilon:.4f}")
print(f"Weight range: [{result_residual.w.min():.3f}, {result_residual.w.max():.3f}]")
print(f"Weight mean: {result_residual.w.mean():.3f}")
# Method 2: Weight-fair leximin calibration
result_weight_fair = leximin_weight_fair(
A, b, w0,
min_ratio=0.2,
max_ratio=5.0,
slack=0.001 # Allow small additional margin error for better weight stability
)
print("\nWeight-Fair Leximin Results:")
print("============================")
print(f"Optimization status: {result_weight_fair.status} ({result_weight_fair.message})")
print(f"Maximum absolute residual (epsilon): {result_weight_fair.epsilon:.4f}")
print(f"Maximum relative weight change (t): {result_weight_fair.t:.4f}")
print(f"Weight range: [{result_weight_fair.w.min():.3f}, {result_weight_fair.w.max():.3f}]")
print(f"Weight mean: {result_weight_fair.w.mean():.3f}")
Residual Leximin Results:
========================
Optimization status: 0 (Optimization terminated successfully. (HiGHS Status 7: Optimal))
Maximum absolute residual (epsilon): 0.0000
Weight range: [0.200, 5.000]
Weight mean: 1.000
Weight-Fair Leximin Results:
============================
Optimization status: 0 (Optimization terminated successfully. (HiGHS Status 7: Optimal))
Maximum absolute residual (epsilon): 0.0000
Maximum relative weight change (t): 0.5999
Weight range: [0.400, 1.600]
Weight mean: 1.000
Step 7: Evaluate Calibration Quality¶
Let’s assess how well each method performed using comprehensive diagnostics.
[8]:
# Evaluate both methods
metrics_residual = evaluate_solution(A, b, result_residual.w, base_weights=w0)
metrics_weight_fair = evaluate_solution(A, b, result_weight_fair.w, base_weights=w0)
# Create comparison DataFrame
comparison = pd.DataFrame({
'Metric': list(metrics_residual.keys()),
'Residual_Method': list(metrics_residual.values()),
'Weight_Fair_Method': list(metrics_weight_fair.values())
})
print("Calibration Method Comparison:")
print("=============================\n")
print(comparison.round(4))
# Interpret key metrics
print("\n\nKey Insights:")
print("=============")
print(f"• Margin Accuracy: Residual method achieves max error of {metrics_residual['resid_max_abs']:.4f}")
print(f" Weight-fair method achieves max error of {metrics_weight_fair['resid_max_abs']:.4f}")
print(f"• Weight Stability: Residual method ESS = {metrics_residual['ESS']:.1f} (design effect = {metrics_residual['deff']:.2f})")
print(f" Weight-fair method ESS = {metrics_weight_fair['ESS']:.1f} (design effect = {metrics_weight_fair['deff']:.2f})")
print(f"• Weight Changes: Residual method max change = {metrics_residual['max_rel_dev']:.2%}")
print(f" Weight-fair method max change = {metrics_weight_fair['max_rel_dev']:.2%}")
Calibration Method Comparison:
=============================
Metric Residual_Method Weight_Fair_Method
0 resid_max_abs 0.0000 0.0010
1 resid_p95 0.0000 0.0010
2 resid_median 0.0000 0.0010
3 total_error -0.0000 -0.0010
4 weight_max 5.0000 1.5999
5 weight_min 0.2000 0.4001
6 weight_p99 5.0000 1.5999
7 weight_p95 5.0000 1.5999
8 weight_median 0.2000 0.9395
9 ESS 51.1336 148.2206
10 deff 3.9113 1.3493
11 max_rel_dev 4.0000 0.5999
12 p95_rel_dev 4.0000 0.5999
13 median_rel_dev 0.8000 0.5999
Key Insights:
=============
• Margin Accuracy: Residual method achieves max error of 0.0000
Weight-fair method achieves max error of 0.0010
• Weight Stability: Residual method ESS = 51.1 (design effect = 3.91)
Weight-fair method ESS = 148.2 (design effect = 1.35)
• Weight Changes: Residual method max change = 400.00%
Weight-fair method max change = 59.99%
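The ESS and design-effect figures match the standard Kish approximations, ESS = (Σw)² / Σw² and deff = n / ESS (a sketch of the formulas; `evaluate_solution` may compute them with additional refinements):

```python
import numpy as np

def kish_ess(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def design_effect(w):
    """Kish design effect: n / ESS, the variance inflation from weighting."""
    return len(w) / kish_ess(w)

equal = np.ones(200)
print(kish_ess(equal), design_effect(equal))  # 200.0 1.0 (equal weights: no loss)
```

With the weight-fair weights above, 200 / 148.2 ≈ 1.35, which is exactly the reported design effect.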
Step 8: Analyze Post-Calibration Demographics¶
Let’s verify that our calibration successfully corrected the demographic biases.
[9]:
# Calculate post-calibration demographics for weight-fair method
calibrated_totals = A @ result_weight_fair.w
calibrated_props = calibrated_totals / n_respondents
# Create final comparison
final_comparison = pd.DataFrame({
'Demographic': margin_labels,
'Original_%': (A @ w0 / n_respondents) * 100,
'Target_%': (b / n_respondents) * 100,
'Calibrated_%': calibrated_props * 100,
'Final_Error': abs(calibrated_props - target_props) * 100
})
print("Post-Calibration Results (Weight-Fair Method):")
print("===============================================\n")
print(final_comparison.round(2))
# Summary statistics
max_error = final_comparison['Final_Error'].iloc[:-1].max() # Exclude total row
mean_error = final_comparison['Final_Error'].iloc[:-1].mean()
print("\nCalibration Summary:")
print("===================")
print(f"Maximum demographic error: {max_error:.3f} percentage points")
print(f"Average demographic error: {mean_error:.3f} percentage points")
print(f"All margins calibrated within: ±{max_error:.3f}pp of targets")
Post-Calibration Results (Weight-Fair Method):
===============================================
Demographic Original_% Target_% Calibrated_% Final_Error
0 Age 18-29 16.5 18.0 18.0 0.0
1 Age 30-44 24.5 25.0 25.0 0.0
2 Age 45-64 29.0 32.0 32.0 0.0
3 Age 65+ 30.0 25.0 25.0 0.0
4 Gender Male 44.5 49.5 49.5 0.0
5 Gender Female 55.5 50.5 50.5 0.0
6 Race White_NH 67.0 58.2 58.2 0.0
7 Race Black 7.5 12.0 12.0 0.0
8 Race Hispanic 13.0 19.0 19.0 0.0
9 Race Asian 8.0 5.8 5.8 0.0
10 Race Other 4.5 5.0 5.0 0.0
11 Education HS_or_less 27.5 38.0 38.0 0.0
12 Education Some_college 31.5 28.0 28.0 0.0
13 Education Bachelor_plus 41.0 34.0 34.0 0.0
14 Region Northeast 23.0 17.0 17.0 0.0
15 Region Midwest 24.0 21.0 21.0 0.0
16 Region South 34.0 38.0 38.0 0.0
17 Region West 19.0 24.0 24.0 0.0
18 Total Population 100.0 100.0 100.0 0.0
Calibration Summary:
===================
Maximum demographic error: 0.001 percentage points
Average demographic error: 0.000 percentage points
All margins calibrated within: ±0.001pp of targets
Step 9: Practical Interpretation¶
Understanding what these results mean for survey analysis.
[10]:
# Weight distribution analysis
weight_stats = pd.DataFrame({
'Statistic': ['Min', 'Q25', 'Median', 'Q75', 'Max', 'Mean', 'Std Dev'],
'Original_Weights': [w0.min(), np.percentile(w0, 25), np.median(w0),
np.percentile(w0, 75), w0.max(), w0.mean(), w0.std()],
'Calibrated_Weights': [result_weight_fair.w.min(), np.percentile(result_weight_fair.w, 25),
np.median(result_weight_fair.w), np.percentile(result_weight_fair.w, 75),
result_weight_fair.w.max(), result_weight_fair.w.mean(),
result_weight_fair.w.std()]
})
print("Weight Distribution Analysis:")
print("============================\n")
print(weight_stats.round(3))
print("\n\nPractical Implications:")
print("=======================")
print(f"1. Survey Representativeness: {len([x for x in final_comparison['Final_Error'].iloc[:-1] if abs(x) > 0.001])} demographic margins remain biased by more than 0.001pp")
print(f"2. Effective Sample Size: Reduced from {n_respondents} to {metrics_weight_fair['ESS']:.0f} due to weighting")
print(f"3. Design Effect: {metrics_weight_fair['deff']:.2f} (variance inflation factor)")
print(f"4. Weight Variability: Coefficient of variation = {result_weight_fair.w.std() / result_weight_fair.w.mean():.3f}")
print("\nMethod Recommendation:")
print("=====================")
if metrics_weight_fair['ESS'] > metrics_residual['ESS']:
print("✓ Weight-fair method recommended: Better preserves effective sample size")
else:
print("✓ Residual method recommended: Achieves better margin accuracy")
print(f"\nCalibration achieves population-representative results with {metrics_weight_fair['ESS']:.0f} effective respondents.")
Weight Distribution Analysis:
============================
Statistic Original_Weights Calibrated_Weights
0 Min 1.0 0.400
1 Q25 1.0 0.400
2 Median 1.0 0.940
3 Q75 1.0 1.600
4 Max 1.0 1.600
5 Mean 1.0 1.000
6 Std Dev 0.0 0.591
Practical Implications:
=======================
1. Survey Representativeness: 0 demographic margins remain biased by more than 0.001pp
2. Effective Sample Size: Reduced from 200 to 148 due to weighting
3. Design Effect: 1.35 (variance inflation factor)
4. Weight Variability: Coefficient of variation = 0.591
Method Recommendation:
=====================
✓ Weight-fair method recommended: Better preserves effective sample size
Calibration achieves population-representative results with 148 effective respondents.
Summary¶
This example demonstrated realistic survey weight calibration using fairlex:
Key Features Demonstrated:
Realistic survey biases (age, education, race/ethnicity skews)
Multiple demographic margins (18 categories + total)
US Census population benchmarks
Comparison of residual vs. weight-fair methods
Comprehensive quality assessment
Typical Use Cases:
Political polling calibration
Market research weight adjustment
Social survey representativeness correction
Post-stratification in complex surveys
Method Selection Guidelines:
Residual leximin: Use when margin accuracy is paramount
Weight-fair leximin: Use when both accuracy and weight stability matter
Consider design effect and effective sample size in your choice
The leximin approach ensures that no single demographic group bears a disproportionate burden in achieving representativeness, making it particularly suitable for surveys with multiple important demographic targets.
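The leximin ordering can be stated concretely: sort each candidate solution's absolute residuals from worst to best and compare lexicographically; the solution with the smaller sorted vector is preferred. A minimal illustration with a hypothetical helper `leximin_key`:

```python
import numpy as np

def leximin_key(residuals):
    """Sorted absolute residuals, worst first, for lexicographic comparison."""
    return tuple(sorted(np.abs(residuals), reverse=True))

# Two candidate solutions with the same worst-case error of 0.05:
# B spreads the remaining error more evenly, so leximin prefers it.
res_A = [0.05, 0.05, 0.00]
res_B = [0.05, 0.03, 0.03]
assert leximin_key(res_B) < leximin_key(res_A)
print("B is leximin-better than A")
```

This is why no single demographic margin ends up absorbing the residual error: after the worst error is minimized, the next worst is minimized in turn, and so on down the sorted vector.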