API Reference
This page contains the complete API reference for fewlab.
Main Functions
Fewlab: Optimal item selection for efficient labeling and survey sampling.
Main API functions:
- items_to_label: Deterministic A-optimal selection
- pi_aopt_for_budget: A-optimal inclusion probabilities
- balanced_fixed_size: Balanced sampling with fixed size
- row_se_min_labels: Row-wise SE minimization
- calibrate_weights: GREG-style weight calibration
- core_plus_tail: Hybrid deterministic core + balanced tail
- adaptive_core_tail: Data-driven hybrid selection
- fewlab.items_to_label(counts, X, K, *, item_axis=1, ensure_full_rank=True, ridge=None)
Return a deterministic list of item identifiers to label (length K), using the A-opt square-root rule on w_j = g_j^T (X^T X)^{-1} g_j.
- Parameters:
counts (DataFrame (n x m)) – Nonnegative counts C with rows = units and columns = items. Index must align with X.index.
X (DataFrame (n x p)) – Covariate matrix used in the regression y ~ X. Index must align with counts.index.
K (int) – Desired number of items to label (K <= m).
item_axis ({1}) – Must be 1; currently only item_axis=1 (columns = items) is supported.
ensure_full_rank (bool) – If True, and X^T X is rank-deficient, add a small ridge.
ridge (float or None) – If not None, use (X^T X + ridge I)^{-1} explicitly.
- Returns:
A list of item identifiers (counts.columns) to label, deterministic.
- Return type:
Notes
We compute T_i = sum_j c_ij, v_j = c_{·j}/T, g_j = X^T v_j, and w_j = g_j^T (X^T X)^{-1} g_j. We then pick the top-K items by w_j.
This deterministically approximates the fixed-budget A-opt solution.
If ridge is None and X is ill-conditioned, a tiny ridge is applied, provided ensure_full_rank is True.
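The quantities in these Notes can be reproduced with a few lines of NumPy. The sketch below is not the fewlab implementation: the helper names `influence_weights` and `top_k_items` are hypothetical, plain arrays stand in for the DataFrames, and every row total T_i is assumed to be positive.

```python
import numpy as np

def influence_weights(counts, X, ridge=0.0):
    """Return (w, G) with w_j = g_j^T (X^T X + ridge I)^{-1} g_j and G of shape (p, m).

    Hypothetical helper: assumes every row total T_i is strictly positive.
    """
    counts = np.asarray(counts, dtype=float)
    X = np.asarray(X, dtype=float)
    T = counts.sum(axis=1, keepdims=True)        # row totals T_i, shape (n, 1)
    V = counts / T                               # column j is v_j, entries c_ij / T_i
    G = X.T @ V                                  # column j is g_j = X^T v_j
    XtX = X.T @ X + ridge * np.eye(X.shape[1])   # optional explicit ridge
    w = np.einsum("pj,pj->j", G, np.linalg.solve(XtX, G))
    return w, G

def top_k_items(w, K):
    """Positions of the K largest w_j, in decreasing order of w_j."""
    return np.argsort(-np.asarray(w), kind="stable")[:K]

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(30, 8)) + 1        # +1 keeps row totals positive
X = rng.normal(size=(30, 3))
w, G = influence_weights(counts, X)
print(top_k_items(w, 3))
```

Using `np.linalg.solve` avoids forming the inverse of X^T X explicitly, which is usually the more stable choice when X is close to rank-deficient.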
- fewlab.pi_aopt_for_budget(counts, X, K, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)
- Return A-opt first-order inclusion probabilities pi_j for expected budget K:
pi_j = clip(c * sqrt(w_j), [pi_min, 1]), with c chosen so sum pi = K.
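One way to find the constant c is bisection, since the clipped sum is non-decreasing in c. The following is a sketch under that assumption, not necessarily how fewlab solves for c; the function name `pi_for_budget` and the default `pi_min=1e-4` are illustrative (the actual value of PI_MIN_DEFAULT is not stated on this page).

```python
import numpy as np

def pi_for_budget(w, K, pi_min=1e-4, iters=100):
    """Bisection on c so that sum_j clip(c * sqrt(w_j), pi_min, 1) is close to K.

    Hypothetical helper. If K is outside the achievable range, the result
    saturates at the nearest achievable total instead of raising.
    """
    s = np.sqrt(np.asarray(w, dtype=float))
    positive = s[s > 0]

    def total(c):
        return np.clip(c * s, pi_min, 1.0).sum()

    lo = 0.0
    # At c = 1 / min positive sqrt(w_j), every positive entry is clipped to 1.
    hi = 1.0 / positive.min() if positive.size else 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) < K:
            lo = mid
        else:
            hi = mid
    return np.clip(hi * s, pi_min, 1.0)

pi_demo = pi_for_budget([100.0, 1.0, 1.0, 1.0], K=2, pi_min=0.01)
print(pi_demo)
```

In the demo the first item is clipped to 1, so the remaining budget of 1 is shared equally by the other three items.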
- fewlab.balanced_fixed_size(pi, g, K, *, seed=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)
- Heuristic fixed-size sampler that:
starts with a K-sized draw from pi (normalized),
greedily swaps items in and out to reduce ||sum_j (I_j/pi_j - 1) g_j||_2, where I_j = 1 if item j is selected and 0 otherwise.
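The two steps can be illustrated with a small first-improvement swap search. This is a sketch of the general idea only: the names `imbalance` and `greedy_balance`, the swap order, and the defaults for `max_swaps` and `tol` are assumptions, and the library's heuristic may differ in all of them.

```python
import numpy as np

def imbalance(mask, pi, G):
    """|| sum_j (I_j / pi_j - 1) g_j ||_2 for a 0/1 selection mask I."""
    return float(np.linalg.norm(G @ (mask / pi - 1.0)))

def greedy_balance(pi, G, K, seed=None, max_swaps=100, tol=1e-9):
    """Draw K items with probability proportional to pi, then swap to reduce imbalance."""
    rng = np.random.default_rng(seed)
    m = pi.size
    start = rng.choice(m, size=K, replace=False, p=pi / pi.sum())
    mask = np.zeros(m)
    mask[start] = 1.0
    best = imbalance(mask, pi, G)
    for _ in range(max_swaps):
        improved = False
        for i in np.flatnonzero(mask == 1):
            for j in np.flatnonzero(mask == 0):
                mask[i], mask[j] = 0.0, 1.0          # try swapping i out, j in
                val = imbalance(mask, pi, G)
                if val < best - tol:
                    best, improved = val, True       # keep the first improving swap
                    break
                mask[i], mask[j] = 1.0, 0.0          # undo the trial swap
            if improved:
                break
        if not improved:
            break                                    # local optimum reached
    return np.flatnonzero(mask == 1), best

rng = np.random.default_rng(0)
pi = rng.uniform(0.2, 0.8, size=12)
G = rng.normal(size=(2, 12))
selected, value = greedy_balance(pi, G, K=4, seed=0)
print(selected, value)
```

Each accepted swap keeps the sample size at exactly K, so the result is always a fixed-size sample.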
- fewlab.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, seed=None)
- Fewest-labels design subject to per-row SE caps:
sum_j q_ij / pi_j <= eps2_i + sum_j q_ij for all rows i,
where q_ij = (c_ij / T_i)^2.
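To make the constraint concrete, the sketch below evaluates it for a given vector of inclusion probabilities. It only checks feasibility and does not perform the minimization; `row_se_slack` is a hypothetical helper name, and positive row totals are assumed.

```python
import numpy as np

def row_se_slack(counts, pi, eps2):
    """Per-row slack (eps2_i + sum_j q_ij) - sum_j q_ij / pi_j.

    A non-negative entry means the SE cap for that row is satisfied.
    """
    counts = np.asarray(counts, dtype=float)
    T = counts.sum(axis=1, keepdims=True)    # row totals T_i
    Q = (counts / T) ** 2                    # q_ij = (c_ij / T_i)^2
    lhs = (Q / np.asarray(pi, dtype=float)).sum(axis=1)
    rhs = np.asarray(eps2, dtype=float) + Q.sum(axis=1)
    return rhs - lhs

counts = np.array([[1, 1], [2, 0]])
slack = row_se_slack(counts, pi=[0.5, 0.5], eps2=[0.5, 0.5])
print(slack)  # first row is exactly on the boundary, second row violates its cap
```

With pi_j = 1 for every item the slack equals eps2_i, which matches the intuition that labeling everything leaves no sampling error.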
- fewlab.calibrate_weights(pi, g, selected, pop_totals=None, *, distance='chi2', ridge=SMALL_RIDGE, nonneg=True)
Compute calibrated weights for selected items using GREG/Deville-Särndal calibration.
- Solves the optimization problem:
min ||w - d||^2 s.t. G_S w = t
where d = 1/pi are base HT weights, G_S is the matrix of g-vectors for selected items, and t are population totals.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import calibrate_weights, _influence
>>>
>>> # Sample data
>>> counts = pd.DataFrame(np.random.poisson(5, (100, 50)))
>>> X = pd.DataFrame(np.random.randn(100, 3))
>>> selected = counts.columns[:20]
>>>
>>> # Compute influence and calibrate
>>> inf = _influence(counts, X)
>>> pi = pd.Series(0.4, index=counts.columns)
>>> weights = calibrate_weights(pi, inf.g, selected)
>>>
>>> # Weights will satisfy calibration constraint:
>>> # sum(weights * g_selected) ≈ sum(g_all)
- Parameters:
pi (pd.Series) – Inclusion probabilities for all items (index = item names).
g (np.ndarray, shape (p, m)) – Regression projections g_j = X^T v_j for all items.
selected (sequence of str or pd.Index) – Item identifiers that were actually selected.
pop_totals (np.ndarray, shape (p,), optional) – Known population totals. If None, uses g.sum(axis=1).
distance ({'chi2', 'euclidean'}) – Distance measure for calibration. Currently only ‘chi2’ is implemented.
ridge (float) – Ridge regularization parameter for numerical stability.
nonneg (bool) – If True, enforce non-negative weights (may slightly violate calibration).
- Returns:
Calibrated weights indexed by selected items.
- Return type:
pd.Series
Notes
- The closed-form solution for chi-square distance is:
w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S)
where d_S are the base weights for selected items.
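The closed form translates directly into NumPy. The sketch below assumes plain arrays rather than pandas objects, omits the `nonneg` clipping step, and uses an illustrative ridge value (the actual SMALL_RIDGE constant is not given here); `calibrate_closed_form` is a hypothetical name.

```python
import numpy as np

def calibrate_closed_form(pi_S, G_S, t, ridge=1e-10):
    """w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S)."""
    d = 1.0 / np.asarray(pi_S, dtype=float)         # base HT weights for selected items
    resid = t - G_S @ d                             # calibration gap of the base weights
    M = G_S @ G_S.T + ridge * np.eye(G_S.shape[0])  # p x p system, ridge for stability
    return d + G_S.T @ np.linalg.solve(M, resid)

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 10))          # g-vectors for all items, shape (p, m)
sel = np.arange(6)                    # positions of the selected items
pi_S = np.full(6, 0.5)
t = G.sum(axis=1)                     # population totals, as in the pop_totals=None case
w_cal = calibrate_closed_form(pi_S, G[:, sel], t)
print(np.abs(G[:, sel] @ w_cal - t).max())   # calibration error, small for a tiny ridge
```

When the base weights already satisfy G_S d_S = t, the correction term vanishes and the calibrated weights equal d_S.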
References
Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376-382.
- fewlab.calibrated_ht_estimator(counts, labels, weights, *, normalize_by_total=True)
Compute calibrated Horvitz-Thompson estimator for row shares.
- For each row i, estimates:
y_i = (1/T_i) * sum_{j in S} w_j * a_j * C_ij
where w_j are calibrated weights, a_j are labels, and C_ij are counts.
- Parameters:
counts (pd.DataFrame, shape (n, m)) – Count matrix with rows=units, columns=items.
labels (pd.Series) – Item labels (only for selected items).
weights (pd.Series) – Calibrated weights for selected items.
normalize_by_total (bool) – If True, divide by row totals T_i to get shares.
- Returns:
Estimated row shares (or totals if normalize_by_total=False).
- Return type:
pd.Series
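The estimator formula for calibrated_ht_estimator can be written as a single matrix-vector product. The sketch below works on arrays and integer column positions instead of labeled pandas objects, which is an assumption made to keep it self-contained; `ht_row_estimates` is a hypothetical name.

```python
import numpy as np

def ht_row_estimates(counts, selected, labels, weights, normalize_by_total=True):
    """y_i = (1/T_i) * sum_{j in S} w_j * a_j * C_ij, with T_i taken over all items."""
    counts = np.asarray(counts, dtype=float)
    contrib = np.asarray(weights, dtype=float) * np.asarray(labels, dtype=float)
    est = counts[:, selected] @ contrib      # weighted labeled counts per row
    if normalize_by_total:
        est = est / counts.sum(axis=1)       # divide by row totals T_i to get shares
    return est

counts = np.array([[1, 2, 3], [4, 0, 4]])
shares = ht_row_estimates(counts, selected=[0, 2], labels=[1, 0], weights=[2.0, 2.0])
print(shares)
```

Passing `normalize_by_total=False` returns the weighted totals instead of shares.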
- fewlab.core_plus_tail(counts, X, K, *, tail_frac=0.2, seed=None, ensure_full_rank=True, ridge=None)
Hybrid sampler combining deterministic core with balanced probabilistic tail.
Strategy:
1. Select K_core = (1-tail_frac)*K items deterministically (highest w_j).
2. Compute A-optimal π for the full budget K.
3. Select K_tail = K - K_core items from the remainder using balanced sampling.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import core_plus_tail
>>>
>>> # Survey data with 1000 units and 200 items
>>> counts = pd.DataFrame(np.random.poisson(10, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 5))  # 5 covariates
>>>
>>> # Select 50 items: 80% deterministic, 20% probabilistic
>>> selected, pi, info = core_plus_tail(counts, X, K=50, tail_frac=0.2)
>>>
>>> # Use calibrated weights for estimation
>>> from fewlab import calibrate_weights, calibrated_ht_estimator
>>> weights = calibrate_weights(pi, info['g'], selected)
>>>
>>> # info contains:
>>> # - info['core']: 40 deterministic items (highest influence)
>>> # - info['tail']: 10 probabilistic items (balanced sampling)
>>> # - info['weights']: Standard HT weights
>>> # - info['tail_only_weights']: Mixed weights for variance reduction
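As a further illustration of the core/tail split, the following standalone sketch shows how a budget K divides into a deterministic core and a sampled tail. It is not the library code: `core_tail_split` is a hypothetical helper, the rounding of K_core is an assumption, and a plain probability-proportional draw is swapped in for the balanced sampler.

```python
import numpy as np

def core_tail_split(w, pi, K, tail_frac=0.2, seed=None):
    """Split budget K into a top-w core and a randomly drawn tail."""
    K_core = int(round((1.0 - tail_frac) * K))     # rounding rule is an assumption
    order = np.argsort(-np.asarray(w), kind="stable")
    core = order[:K_core]                          # highest-influence items
    rest = order[K_core:]                          # candidates for the tail
    rng = np.random.default_rng(seed)
    p = pi[rest] / pi[rest].sum()
    # Simple weighted draw without replacement; stands in for balanced sampling.
    tail = rng.choice(rest, size=K - K_core, replace=False, p=p)
    return core, tail

rng = np.random.default_rng(0)
w = rng.gamma(2.0, size=200)                        # stand-in influence weights
pi = np.clip(np.sqrt(w) / np.sqrt(w).max(), 0.01, 1.0)   # stand-in inclusion probabilities
core, tail = core_tail_split(w, pi, K=50, tail_frac=0.2, seed=1)
print(len(core), len(tail))
```

With K=50 and tail_frac=0.2 this reproduces the 40/10 split described in the example comments.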
- Parameters:
counts (pd.DataFrame, shape (n, m)) – Count matrix with rows=units, columns=items.
X (pd.DataFrame, shape (n, p)) – Covariate matrix, index must align with counts.index.
K (int) – Total budget (number of items to select).
tail_frac (float, default=0.2) – Fraction of budget allocated to probabilistic tail (0 < tail_frac < 1).
seed (int, optional) – Random seed for balanced tail selection.
ensure_full_rank (bool) – If True, add ridge to X^T X if rank-deficient.
ridge (float, optional) – Explicit ridge parameter for (X^T X + ridge I)^{-1}.
- Returns:
selected (pd.Index) – Selected item identifiers (length K).
pi (pd.Series) – Inclusion probabilities for all items (computed for full budget K).
info (dict) – Additional information including:
- ‘core’: Items in deterministic core
- ‘tail’: Items in probabilistic tail
- ‘weights’: Suggested weights (1/pi for selected items)
- ‘tail_only_weights’: Alternative weights (1/pi for core, 1.0 for tail)
- Return type:
- fewlab.adaptive_core_tail(counts, X, K, *, min_tail_frac=0.1, max_tail_frac=0.4, condition_threshold=1e6, seed=None)
Adaptive core+tail selection with data-driven tail fraction.
Automatically determines optimal tail_frac based on:
- Condition number of X^T X (higher -> more tail)
- Distribution of influence weights w_j (more skewed -> less tail)
- Parameters:
counts (pd.DataFrame) – Count matrix.
X (pd.DataFrame) – Covariate matrix.
K (int) – Total budget.
min_tail_frac (float, default=0.1) – Minimum fraction for tail.
max_tail_frac (float, default=0.4) – Maximum fraction for tail.
condition_threshold (float) – Threshold for considering X^T X ill-conditioned.
seed (int, optional) – Random seed.
- Return type:
Same as core_plus_tail, with adaptive tail_frac in info dict.
- fewlab.greedy_aopt_selection(counts, X, K, *, ensure_full_rank=True, ridge=None)
Select K items using a greedy A-optimal strategy.
Iteratively selects the item that maximally reduces the trace of the covariance matrix (A-optimality). Uses Sherman-Morrison rank-1 updates for efficiency.
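A generic greedy A-optimal loop with Sherman-Morrison updates might look like the sketch below. The information matrix used here (ridge I plus the sum of g_j g_j^T over chosen items) is an assumption for illustration, since the exact objective used by the library is not spelled out on this page, and `greedy_aopt` is a hypothetical name.

```python
import numpy as np

def greedy_aopt(G, K, ridge=1e-6):
    """Greedily pick K columns of G to minimize trace((ridge I + sum g g^T)^{-1}).

    Adding g changes the inverse A by -(A g)(A g)^T / (1 + g^T A g), so the
    trace drops by (g^T A^2 g) / (1 + g^T A g).
    """
    p, m = G.shape
    A = np.eye(p) / ridge                  # inverse of the initial matrix ridge * I
    remaining = list(range(m))
    chosen = []
    for _ in range(K):
        Gr = G[:, remaining]
        AG = A @ Gr                                      # columns A g_j
        num = np.einsum("pj,pj->j", AG, AG)              # g_j^T A^2 g_j (A is symmetric)
        den = 1.0 + np.einsum("pj,pj->j", Gr, AG)        # 1 + g_j^T A g_j
        best = int(np.argmax(num / den))                 # largest trace reduction
        Ag = AG[:, best]
        A = A - np.outer(Ag, Ag) / den[best]             # Sherman-Morrison rank-1 update
        chosen.append(remaining.pop(best))
    return chosen, A

rng = np.random.default_rng(1)
G = rng.normal(size=(3, 10))
chosen, A = greedy_aopt(G, K=5, ridge=1e-2)
print(chosen)
```

Each step costs O(p^2 m) for scoring plus O(p^2) for the update, so no matrix is re-inverted inside the loop.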
- Parameters:
- Returns:
List of selected item identifiers.
- Return type:
Core Module
- class fewlab.core.Influence(w, g, cols)
Bases: object
Influence data structure with memory-optimized slots.
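For readers unfamiliar with slots, the sketch below shows what a slotted container with these three fields could look like. It is an illustrative stand-in named `InfluenceSketch`, not the actual class definition, and the field descriptions in the comments are inferred from the rest of this page.

```python
class InfluenceSketch:
    """Illustrative slotted container with the fields w, g, and cols."""

    __slots__ = ("w", "g", "cols")   # no per-instance __dict__, so less memory per object

    def __init__(self, w, g, cols):
        self.w = w        # influence weights w_j
        self.g = g        # regression projections g_j
        self.cols = cols  # item identifiers

inf = InfluenceSketch(w=[0.5], g=[[1.0]], cols=["a"])
print(inf.cols)
```

Because of `__slots__`, assigning an attribute other than `w`, `g`, or `cols` raises `AttributeError`.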
- fewlab.core.pi_aopt_for_budget(counts, X, K, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)
- Return A-opt first-order inclusion probabilities pi_j for expected budget K:
pi_j = clip(c * sqrt(w_j), [pi_min, 1]), with c chosen so sum pi = K.
- fewlab.core.items_to_label(counts, X, K, *, item_axis=1, ensure_full_rank=True, ridge=None)
Return a deterministic list of item identifiers to label (length K), using the A-opt square-root rule on w_j = g_j^T (X^T X)^{-1} g_j.
- Parameters:
counts (DataFrame (n x m)) – Nonnegative counts C with rows = units and columns = items. Index must align with X.index.
X (DataFrame (n x p)) – Covariate matrix used in the regression y ~ X. Index must align with counts.index.
K (int) – Desired number of items to label (K <= m).
item_axis ({1}) – Must be 1; currently only item_axis=1 (columns = items) is supported.
ensure_full_rank (bool) – If True, and X^T X is rank-deficient, add a small ridge.
ridge (float or None) – If not None, use (X^T X + ridge I)^{-1} explicitly.
- Returns:
A list of item identifiers (counts.columns) to label, deterministic.
- Return type:
Notes
We compute T_i = sum_j c_ij, v_j = c_{·j}/T, g_j = X^T v_j, and w_j = g_j^T (X^T X)^{-1} g_j. We then pick the top-K items by w_j.
This deterministically approximates the fixed-budget A-opt solution.
If ridge is None and X is ill-conditioned, a tiny ridge is applied, provided ensure_full_rank is True.