API Reference

This page contains the complete API reference for fewlab.

Main Functions

Fewlab: Optimal item selection for efficient labeling and survey sampling.

Main API functions:

  • items_to_label: Deterministic A-optimal selection

  • pi_aopt_for_budget: A-optimal inclusion probabilities

  • balanced_fixed_size: Balanced sampling with fixed size

  • row_se_min_labels: Row-wise SE minimization

  • calibrate_weights: GREG-style weight calibration

  • core_plus_tail: Hybrid deterministic core + balanced tail

  • adaptive_core_tail: Data-driven hybrid selection

fewlab.items_to_label(counts, X, K, *, item_axis=1, ensure_full_rank=True, ridge=None)[source]

Return a deterministic list of item identifiers to label (length K), using the A-opt square-root rule on w_j = g_j^T (X^T X)^{-1} g_j.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts C with rows = units and columns = items. Index must align with X.index.

  • X (DataFrame (n x p)) – Covariate matrix used in the regression y ~ X. Index must align with counts.index.

  • K (int) – Desired number of items to label (K <= m).

  • item_axis ({1}) – Currently only axis=1 (columns=items) is supported. Must be 1.

  • ensure_full_rank (bool) – If True, and X^T X is rank-deficient, add a small ridge.

  • ridge (float or None) – If not None, use (X^T X + ridge I)^{-1} explicitly.

Returns:

A deterministic list of item identifiers (from counts.columns) to label.

Return type:

list

Notes

  • We compute row totals T_i = sum_j c_ij, normalized columns v_j with entries (v_j)_i = c_ij / T_i, projections g_j = X^T v_j, and influence weights w_j = g_j^T (X^T X)^{-1} g_j. We then pick the top-K items by w_j.

  • This deterministically approximates the fixed-budget A-opt solution.

  • If ridge is None but X is ill-conditioned, a tiny ridge is applied if ensure_full_rank is True.
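
A minimal usage sketch with synthetic data (only the signature and return type come from this page):

>>> import numpy as np
>>> import pandas as pd
>>> from fewlab import items_to_label
>>>
>>> # Synthetic counts (100 units x 50 items) and 3 covariates
>>> counts = pd.DataFrame(np.random.poisson(5, (100, 50)))
>>> X = pd.DataFrame(np.random.randn(100, 3))
>>>
>>> # Deterministic top-K items by influence weight w_j
>>> labels_needed = items_to_label(counts, X, K=10)
>>> len(labels_needed)
10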

fewlab.pi_aopt_for_budget(counts, X, K, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)[source]
Return A-opt first-order inclusion probabilities pi_j for expected budget K:

pi_j = clip(c * sqrt(w_j), [pi_min, 1]), with c chosen so sum pi = K.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts C with rows = units and columns = items. Index must align with X.index.

  • X (DataFrame (n x p)) – Covariate matrix. Index must align with counts.index.

  • K (int) – Expected budget; c is chosen so that sum pi = K.

  • pi_min (float) – Floor applied to every inclusion probability.

  • ensure_full_rank (bool) – If True, and X^T X is rank-deficient, add a small ridge.

  • ridge (float or None) – If not None, use (X^T X + ridge I)^{-1} explicitly.

Return type:

Series
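
A minimal sketch with synthetic data; the sum check assumes the clip at 1 does not bind:

>>> import numpy as np
>>> import pandas as pd
>>> from fewlab import pi_aopt_for_budget
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (100, 50)))
>>> X = pd.DataFrame(np.random.randn(100, 3))
>>> pi = pi_aopt_for_budget(counts, X, K=10)
>>> bool(np.isclose(pi.sum(), 10))
True
>>> bool((pi > 0).all() and (pi <= 1).all())
True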

fewlab.balanced_fixed_size(pi, g, K, *, seed=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)[source]
Heuristic fixed-size sampler that:
  1. starts with a K-sized draw from pi (normalized),

  2. greedily swaps items in and out to reduce the balance residual ||sum_j (I_j/pi_j - 1) g_j||_2, where I_j indicates whether item j is selected.

Parameters:
  • pi (Series, length m, index = items) – First-order inclusion probabilities.

  • g (ndarray, shape (p, m)) – Regression projections g_j.

  • K (int) – Fixed sample size.

  • seed (int or None) – Random seed for the initial draw.

  • max_swaps (int) – Maximum number of greedy swaps.

  • tol (float) – Stop once the balance residual falls below this tolerance.

Return type:

pd.Index of selected item ids (length K).
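
A minimal sketch combining pi_aopt_for_budget with this sampler; `_influence` is the private helper already used in the calibrate_weights example below:

>>> import numpy as np
>>> import pandas as pd
>>> from fewlab import pi_aopt_for_budget, balanced_fixed_size, _influence
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (100, 50)))
>>> X = pd.DataFrame(np.random.randn(100, 3))
>>> pi = pi_aopt_for_budget(counts, X, K=10)
>>> inf = _influence(counts, X)          # inf.g has shape (p, m)
>>> sel = balanced_fixed_size(pi, inf.g, K=10, seed=0)
>>> len(sel)
10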

fewlab.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, seed=None)[source]
Fewest-labels design subject to per-row SE caps:

sum_j q_ij / pi_j <= eps2_i + sum_j q_ij for all rows i,

where q_ij = (c_ij / T_i)^2.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts with rows = units and columns = items.

  • eps2 (Series, length n, index = counts.index) – Per-row squared SE caps eps2_i.

  • pi_min (float) – Floor applied to every inclusion probability.

  • max_iter (int) – Maximum number of iterations for the design search.

  • tol (float) – Convergence tolerance.

  • seed (int or None) – Random seed.

Return type:

Series of pi_j (index = item ids).
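
A minimal sketch with synthetic data, assuming eps2 is a Series of per-row squared SE caps aligned with counts.index:

>>> import numpy as np
>>> import pandas as pd
>>> from fewlab import row_se_min_labels
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (100, 50)))
>>> # Cap each row's standard error at 0.02, i.e. eps2_i = 0.02**2
>>> eps2 = pd.Series(0.02**2, index=counts.index)
>>> pi = row_se_min_labels(counts, eps2, seed=0)
>>> bool((pi > 0).all() and (pi <= 1).all())
True
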
fewlab.topk(arr, k)[source]

Return indices of top-k entries of arr (descending).

Parameters:
  • arr (array-like) – Values to rank.

  • k (int) – Number of indices to return.

Return type:

ndarray
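
A quick doctest-style illustration (tie-breaking is not specified here, so the example avoids ties):

>>> import numpy as np
>>> from fewlab import topk
>>>
>>> # Top-2 values are 5.0 (index 4) and 4.0 (index 2)
>>> topk(np.array([3.0, 1.0, 4.0, 1.0, 5.0]), 2)
array([4, 2])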

fewlab.calibrate_weights(pi, g, selected, pop_totals=None, *, distance='chi2', ridge=SMALL_RIDGE, nonneg=True)[source]

Compute calibrated weights for selected items using GREG/Deville-Särndal calibration.

Solves the optimization problem:

min ||w - d||^2 s.t. G_S w = t

where d = 1/pi are base HT weights, G_S is the matrix of g-vectors for selected items, and t are population totals.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import calibrate_weights, _influence
>>>
>>> # Sample data
>>> counts = pd.DataFrame(np.random.poisson(5, (100, 50)))
>>> X = pd.DataFrame(np.random.randn(100, 3))
>>> selected = counts.columns[:20]
>>>
>>> # Compute influence and calibrate
>>> inf = _influence(counts, X)
>>> pi = pd.Series(0.4, index=counts.columns)
>>> weights = calibrate_weights(pi, inf.g, selected)
>>>
>>> # Weights will satisfy calibration constraint:
>>> # sum(weights * g_selected) ≈ sum(g_all)
Parameters:
  • pi (pd.Series) – Inclusion probabilities for all items (index = item names).

  • g (np.ndarray, shape (p, m)) – Regression projections g_j = X^T v_j for all items.

  • selected (sequence of str or pd.Index) – Item identifiers that were actually selected.

  • pop_totals (np.ndarray, shape (p,), optional) – Known population totals. If None, uses g.sum(axis=1).

  • distance ({'chi2', 'euclidean'}) – Distance measure for calibration. Currently only 'chi2' is implemented.

  • ridge (float) – Ridge regularization parameter for numerical stability.

  • nonneg (bool) – If True, enforce non-negative weights (may slightly violate calibration).

Returns:

Calibrated weights indexed by selected items.

Return type:

pd.Series

Notes

The closed-form solution for chi-square distance is:

w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S)

where d_S are the base weights for selected items.
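
A minimal NumPy sketch of this closed form, ignoring the nonneg option (the function name and variables are ours, for illustration only):

>>> import numpy as np
>>>
>>> def chi2_calibrate(d_S, G_S, t, ridge=1e-9):
...     """Sketch of w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S)."""
...     p = G_S.shape[0]
...     gap = t - G_S @ d_S                                  # calibration gap
...     lam = np.linalg.solve(G_S @ G_S.T + ridge * np.eye(p), gap)
...     return d_S + G_S.T @ lam

As ridge -> 0, the returned weights satisfy the constraint G_S w = t exactly whenever it is feasible.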

References

Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376-382.

fewlab.calibrated_ht_estimator(counts, labels, weights, *, normalize_by_total=True)[source]

Compute calibrated Horvitz-Thompson estimator for row shares.

For each row i, estimates:

y_i = (1/T_i) * sum_{j in S} w_j * a_j * C_ij

where w_j are calibrated weights, a_j are labels, and C_ij are counts.

Parameters:
  • counts (pd.DataFrame, shape (n, m)) – Count matrix with rows=units, columns=items.

  • labels (pd.Series) – Item labels (only for selected items).

  • weights (pd.Series) – Calibrated weights for selected items.

  • normalize_by_total (bool) – If True, divide by row totals T_i to get shares.

Returns:

Estimated row shares (or totals if normalize_by_total=False).

Return type:

pd.Series
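
Continuing the calibrate_weights example above (reusing its counts, selected, and weights), a hedged end-to-end sketch with synthetic placeholder labels:

>>> from fewlab import calibrated_ht_estimator
>>>
>>> # Synthetic 0/1 labels for the selected items only
>>> labels = pd.Series(np.random.binomial(1, 0.3, len(selected)), index=selected)
>>> shares = calibrated_ht_estimator(counts, labels, weights)
>>> shares.shape   # one estimated share per row/unit
(100,)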

fewlab.core_plus_tail(counts, X, K, *, tail_frac=0.2, seed=None, ensure_full_rank=True, ridge=None)[source]

Hybrid sampler combining deterministic core with balanced probabilistic tail.

Strategy:
  1. Select K_core = (1 - tail_frac) * K items deterministically (highest w_j).

  2. Compute A-optimal pi for the full budget K.

  3. Select K_tail = K - K_core items from the remainder using balanced sampling.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import core_plus_tail
>>>
>>> # Survey data with 1000 units and 200 items
>>> counts = pd.DataFrame(np.random.poisson(10, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 5))  # 5 covariates
>>>
>>> # Select 50 items: 80% deterministic, 20% probabilistic
>>> selected, pi, info = core_plus_tail(counts, X, K=50, tail_frac=0.2)
>>>
>>> # Use calibrated weights for estimation
>>> from fewlab import calibrate_weights, calibrated_ht_estimator
>>> weights = calibrate_weights(pi, info['g'], selected)
>>>
>>> # info contains:
>>> # - info['core']: 40 deterministic items (highest influence)
>>> # - info['tail']: 10 probabilistic items (balanced sampling)
>>> # - info['weights']: Standard HT weights
>>> # - info['tail_only_weights']: Mixed weights for variance reduction
Parameters:
  • counts (pd.DataFrame, shape (n, m)) – Count matrix with rows=units, columns=items.

  • X (pd.DataFrame, shape (n, p)) – Covariate matrix, index must align with counts.index.

  • K (int) – Total budget (number of items to select).

  • tail_frac (float, default=0.2) – Fraction of budget allocated to probabilistic tail (0 < tail_frac < 1).

  • seed (int, optional) – Random seed for balanced tail selection.

  • ensure_full_rank (bool) – If True, add ridge to X^T X if rank-deficient.

  • ridge (float, optional) – Explicit ridge parameter for (X^T X + ridge I)^{-1}.

Returns:

  • selected (pd.Index) – Selected item identifiers (length K).

  • pi (pd.Series) – Inclusion probabilities for all items (computed for full budget K).

  • info (dict) – Additional information including:
      - 'core': Items in the deterministic core
      - 'tail': Items in the probabilistic tail
      - 'weights': Suggested weights (1/pi for selected items)
      - 'tail_only_weights': Alternative weights (1/pi for core, 1.0 for tail)

Return type:

tuple[Index, Series, dict[str, Any]]

fewlab.adaptive_core_tail(counts, X, K, *, min_tail_frac=0.1, max_tail_frac=0.4, condition_threshold=1e6, seed=None)[source]

Adaptive core+tail selection with data-driven tail fraction.

Automatically determines the optimal tail_frac based on:

  • Condition number of X^T X (higher -> more tail)

  • Distribution of influence weights w_j (more skewed -> less tail)

Parameters:
  • counts (pd.DataFrame) – Count matrix.

  • X (pd.DataFrame) – Covariate matrix.

  • K (int) – Total budget.

  • min_tail_frac (float, default=0.1) – Minimum fraction for tail.

  • max_tail_frac (float, default=0.4) – Maximum fraction for tail.

  • condition_threshold (float) – Threshold for considering X^T X ill-conditioned.

  • seed (int, optional) – Random seed.

Return type:

Same as core_plus_tail, with adaptive tail_frac in info dict.
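
A minimal usage sketch with synthetic data, mirroring the core_plus_tail example above:

>>> import numpy as np
>>> import pandas as pd
>>> from fewlab import adaptive_core_tail
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 5))
>>>
>>> # Tail fraction is chosen automatically within [min_tail_frac, max_tail_frac]
>>> selected, pi, info = adaptive_core_tail(counts, X, K=50, seed=0)
>>> len(selected)
50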

fewlab.greedy_aopt_selection(counts, X, K, *, ensure_full_rank=True, ridge=None)[source]

Select K items using a greedy A-optimal strategy.

Iteratively selects the item that maximally reduces the trace of the covariance matrix (A-optimality). Uses Sherman-Morrison rank-1 updates for efficiency.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts.

  • X (DataFrame (n x p)) – Covariate matrix.

  • K (int) – Number of items to select.

  • ensure_full_rank (bool) – If True, adds a small ridge if needed.

  • ridge (float | None) – Explicit ridge parameter.

Returns:

List of selected item identifiers.

Return type:

list[str]
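
To make the update rule concrete, here is one standard way to implement greedy A-optimality with Sherman-Morrison rank-1 updates over item vectors g_j; this is an illustrative sketch, not necessarily fewlab's exact objective or code:

>>> import numpy as np
>>>
>>> def greedy_aopt_sketch(g, K, ridge=1e-6):
...     """Greedily pick K columns of g (shape (p, m)) to minimize
...     tr((G_S G_S^T + ridge I)^{-1}) via Sherman-Morrison updates."""
...     p, m = g.shape
...     M = np.eye(p) / ridge            # inverse of the ridge-only information matrix
...     chosen, remaining = [], list(range(m))
...     for _ in range(K):
...         G_r = g[:, remaining]
...         MG = M @ G_r                                      # (p, r)
...         denom = 1.0 + np.einsum('pr,pr->r', G_r, MG)      # 1 + g^T M g per item
...         gain = np.einsum('pr,pr->r', MG, MG) / denom      # trace reduction per item
...         j = remaining.pop(int(np.argmax(gain)))
...         u = M @ g[:, j]
...         M -= np.outer(u, u) / (1.0 + g[:, j] @ u)         # rank-1 Sherman-Morrison update
...         chosen.append(j)
...     return chosen
...
>>> sel = greedy_aopt_sketch(np.random.randn(3, 40), K=5)
>>> len(sel)
5

Each step adds the g_j whose rank-1 update most reduces tr(M), which is the A-optimality criterion described above.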

Core Module

class fewlab.core.Influence(w, g, cols)[source]

Bases: object

Influence data structure with memory-optimized slots.

Parameters:
  • w (ndarray, length m) – Influence weights w_j.

  • g (ndarray, shape (p, m)) – Regression projections g_j.

  • cols (list[str]) – Item identifiers matching the columns of g.

Attributes:

w: ndarray
g: ndarray
cols: list[str]

__init__(w, g, cols)

Initialize the structure; parameters are as described for the class.

Return type:

None
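
A minimal construction sketch; the shapes follow the conventions used elsewhere on this page, and the stand-in weights are arbitrary:

>>> import numpy as np
>>> from fewlab.core import Influence
>>>
>>> g = np.random.randn(3, 50)               # (p, m) projections
>>> w = np.einsum('pm,pm->m', g, g)          # stand-in weights, one per item
>>> cols = [str(c) for c in range(50)]
>>> inf = Influence(w, g, cols)
>>> inf.g.shape
(3, 50)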

fewlab.core.pi_aopt_for_budget(counts, X, K, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)[source]
Return A-opt first-order inclusion probabilities pi_j for expected budget K:

pi_j = clip(c * sqrt(w_j), [pi_min, 1]), with c chosen so sum pi = K.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts C with rows = units and columns = items. Index must align with X.index.

  • X (DataFrame (n x p)) – Covariate matrix. Index must align with counts.index.

  • K (int) – Expected budget; c is chosen so that sum pi = K.

  • pi_min (float) – Floor applied to every inclusion probability.

  • ensure_full_rank (bool) – If True, and X^T X is rank-deficient, add a small ridge.

  • ridge (float or None) – If not None, use (X^T X + ridge I)^{-1} explicitly.

Return type:

Series

fewlab.core.items_to_label(counts, X, K, *, item_axis=1, ensure_full_rank=True, ridge=None)[source]

Return a deterministic list of item identifiers to label (length K), using the A-opt square-root rule on w_j = g_j^T (X^T X)^{-1} g_j.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts C with rows = units and columns = items. Index must align with X.index.

  • X (DataFrame (n x p)) – Covariate matrix used in the regression y ~ X. Index must align with counts.index.

  • K (int) – Desired number of items to label (K <= m).

  • item_axis ({1}) – Currently only axis=1 (columns=items) is supported. Must be 1.

  • ensure_full_rank (bool) – If True, and X^T X is rank-deficient, add a small ridge.

  • ridge (float or None) – If not None, use (X^T X + ridge I)^{-1} explicitly.

Returns:

A deterministic list of item identifiers (from counts.columns) to label.

Return type:

list

Notes

  • We compute row totals T_i = sum_j c_ij, normalized columns v_j with entries (v_j)_i = c_ij / T_i, projections g_j = X^T v_j, and influence weights w_j = g_j^T (X^T X)^{-1} g_j. We then pick the top-K items by w_j.

  • This deterministically approximates the fixed-budget A-opt solution.

  • If ridge is None but X is ill-conditioned, a tiny ridge is applied if ensure_full_rank is True.

Selection Module

fewlab.selection.topk(arr, k)[source]

Return indices of top-k entries of arr (descending).

Parameters:
  • arr (array-like) – Values to rank.

  • k (int) – Number of indices to return.

Return type:

ndarray

Balanced Sampling

fewlab.balanced.balanced_fixed_size(pi, g, K, *, seed=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)[source]
Heuristic fixed-size sampler that:
  1. starts with a K-sized draw from pi (normalized),

  2. greedily swaps items in and out to reduce the balance residual ||sum_j (I_j/pi_j - 1) g_j||_2, where I_j indicates whether item j is selected.

Parameters:
  • pi (Series, length m, index = items) – First-order inclusion probabilities.

  • g (ndarray, shape (p, m)) – Regression projections g_j.

  • K (int) – Fixed sample size.

  • seed (int or None) – Random seed for the initial draw.

  • max_swaps (int) – Maximum number of greedy swaps.

  • tol (float) – Stop once the balance residual falls below this tolerance.

Return type:

pd.Index of selected item ids (length K).

Row Standard Error Minimization

fewlab.rowse.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, seed=None)[source]
Fewest-labels design subject to per-row SE caps:

sum_j q_ij / pi_j <= eps2_i + sum_j q_ij for all rows i,

where q_ij = (c_ij / T_i)^2.

Parameters:
  • counts (DataFrame (n x m)) – Nonnegative counts with rows = units and columns = items.

  • eps2 (Series, length n, index = counts.index) – Per-row squared SE caps eps2_i.

  • pi_min (float) – Floor applied to every inclusion probability.

  • max_iter (int) – Maximum number of iterations for the design search.

  • tol (float) – Convergence tolerance.

  • seed (int or None) – Random seed.

Return type:

Series of pi_j (index = item ids).