API Reference¶
This page contains the complete API reference for fewlab.
Main Functions¶
Fewlab: Optimal item selection for efficient labeling and survey sampling.
Main API functions:

- items_to_label: Deterministic A-optimal selection
- pi_aopt_for_budget: A-optimal inclusion probabilities
- balanced_fixed_size: Balanced sampling with fixed size
- row_se_min_labels: Row-wise SE minimization
- calibrate_weights: GREG-style weight calibration
- core_plus_tail: Hybrid deterministic core + balanced tail
- adaptive_core_tail: Data-driven hybrid selection
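Taken together, these functions support a label-few-items workflow: pick items, collect labels, then estimate row shares. The following is a minimal sketch assembled from the signatures documented below (the labels are simulated here; in practice they come from annotators):

import numpy as np
import pandas as pd

from fewlab import Design

# Toy data: 1000 units with 3 features each, 100 candidate items.
counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
X = pd.DataFrame(np.random.randn(1000, 3))

design = Design(counts, X)
result = design.sample(budget=20, method="balanced", random_state=0)

# Simulated stand-in for human labels on the sampled items.
labels = pd.Series(np.random.binomial(1, 0.5, len(result.sample)), index=result.sample)

# Calibrated Horvitz-Thompson estimates of row shares.
est = design.estimate(result.sample, labels)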
- class fewlab.Design(counts, X, *, ridge='auto', ensure_full_rank=True)[source]
Bases: object

Primary interface for optimal experimental design with cached computations.
The class stores processed data, cached influence matrices, and diagnostics so that repeated operations such as selection, sampling, and calibration can reuse expensive intermediate results.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import Design
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> design = Design(counts, X)
>>> len(design.select(budget=20).selected)
20
- __init__(counts, X, *, ridge='auto', ensure_full_rank=True)[source]
Initialize the design with validated data and cached influence computation.
- Args:
counts: Count matrix with non-negative entries.
X: Feature matrix aligned with counts.index.
ridge: Ridge value or “auto” to infer it from conditioning.
ensure_full_rank: Whether to add a ridge when X^T X is ill-conditioned.
- calibrate_weights(selected, pop_totals=None, *, distance='chi2', ridge=SMALL_RIDGE, nonneg=True)[source]
Compute calibrated weights for selected items.
- Args:
selected: Identifiers of sampled items.
pop_totals: Optional population totals; defaults to sums of the g matrix.
distance: Calibration distance measure (e.g., “chi2”).
ridge: Ridge regularization parameter.
nonneg: Whether to enforce non-negative calibrated weights.
- Returns:
Calibrated weights indexed by the selected items.
- estimate(selected, labels, weights=None, *, normalize_by_total=True)[source]
Compute calibrated Horvitz-Thompson estimates for row shares.
- Args:
selected: Identifiers of sampled items.
labels: Observed labels for the selected items.
weights: Optional calibrated weights; if omitted they are computed internally.
normalize_by_total: Whether to divide by row totals to produce shares.
- Returns:
Estimation result with estimates, weights, and diagnostics.
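- Examples:
A hedged sketch continuing the Design example above (labels are simulated; real labels would come from annotators):
>>> sel = design.select(budget=20).selected
>>> labels = pd.Series(np.random.binomial(1, 0.5, len(sel)), index=sel)
>>> est = design.estimate(sel, labels)
>>> len(est.estimates) == design.n_units
True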
- inclusion_probabilities(budget, *, pi_min=PI_MIN_DEFAULT, method='aopt', **kwargs)[source]
Compute inclusion probabilities for a given budget.
- Args:
budget: Expected total budget (sum of inclusion probabilities).
pi_min: Minimum inclusion probability per item.
method: Probability computation strategy, “aopt” or “row_se”.
**kwargs: Additional method-specific arguments (e.g., eps2 for “row_se”).
- Returns:
Probability result with inclusion probabilities and diagnostics.
- Raises:
ValidationError: If the method name is unknown.
- property influence_weights: Series
A-optimal influence weights w_j for each item.
- property n_items: int
Number of items (columns) after preprocessing.
- property n_units: int
Number of units (rows) after preprocessing.
- sample(budget, method='balanced', *, random_state=None, **kwargs)[source]
Generate probabilistic samples using various methods.
- Args:
budget: Number of items to sample.
method: Sampling method (“balanced”, “core_plus_tail”, or “adaptive”).
random_state: Random state for reproducible sampling. Can be None, int, or Generator.
**kwargs: Method-specific parameters (e.g., tail_frac, pi_min, tolerances).
- Returns:
Sampled item identifiers.
- Raises:
ValidationError: If the method name is unknown.
- select(budget, method='deterministic')[source]
Select items using deterministic algorithms.
- Args:
budget: Number of items to select.
method: Selection algorithm: “deterministic” (batch) or “greedy” (sequential).
- Returns:
Selection result with items, influence weights, and diagnostics.
- Raises:
ValidationError: If the method name is unknown.
- fewlab.items_to_label(counts, X, budget, *, ensure_full_rank=True, ridge=None)[source]
Select items to label using deterministic A-optimal design.
Influence weights are computed as w_j = g_j^T (X^T X)^{-1} g_j, and the top entries are returned.
- Args:
counts: Count matrix with units as rows and items as columns.
X: Feature matrix aligned with counts.index.
budget: Number of items to select.
ensure_full_rank: Whether to add a ridge term when X^T X is ill-conditioned.
ridge: Optional ridge parameter overriding the automatic heuristic.
- Returns:
Selection result with items, influence weights, and diagnostics.
- See Also:
pi_aopt_for_budget: Compute inclusion probabilities for the same design.
greedy_aopt_selection: Greedy sequential variant.
core_plus_tail: Hybrid deterministic and probabilistic selection.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import items_to_label
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = items_to_label(counts, X, budget=50)
>>> len(result.selected)
50
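For intuition, the weight formula can be sketched directly in NumPy. This is an illustration rather than fewlab's internal code, and it assumes v_j is simply the j-th column of the count matrix (fewlab may normalize counts first):

import numpy as np

def influence_weights(C, X, ridge=0.0):
    # C: (n, m) counts, X: (n, p) features, as NumPy arrays.
    # w_j = g_j^T (X^T X)^{-1} g_j with g_j = X^T v_j (v_j = j-th column of C).
    G = X.T @ C                                # projections, shape (p, m)
    A = X.T @ X + ridge * np.eye(X.shape[1])   # (regularized) information matrix
    W = np.linalg.solve(A, G)                  # A^{-1} G without an explicit inverse
    return np.einsum('pj,pj->j', G, W)         # column-wise quadratic forms w_j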
- fewlab.pi_aopt_for_budget(counts, X, budget, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)[source]
Compute A-optimal first-order inclusion probabilities for a target budget.
The probabilities follow the square-root rule pi_j = clip(c * sqrt(w_j), [pi_min, 1]) with c chosen so that sum(pi) = budget.
- Args:
counts: Count matrix with non-negative values.
X: Feature matrix aligned with counts.index.
budget: Expected total budget (sum of inclusion probabilities).
pi_min: Minimum allowed inclusion probability.
ensure_full_rank: Whether to add a small ridge term when X^T X is ill-conditioned.
ridge: Explicit ridge parameter overriding the automatic heuristic.
- Returns:
Probability result with inclusion probabilities and computation diagnostics.
- Note:
If budget < m * pi_min (where m is the number of items), the budget constraint cannot be satisfied. In this case, the function returns all probabilities as pi_min, resulting in sum(pi) = m * pi_min > budget, and issues a warning. The violation details are included in the result’s diagnostics under budget_violation.
- See Also:
items_to_label: Deterministic selection using the same influence weights.
balanced_fixed_size: Fixed-size balanced sampling using these probabilities.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = pi_aopt_for_budget(counts, X, budget=50)
>>> round(result.budget_used, 1)
50.0
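The square-root rule itself is short enough to sketch. A minimal illustration (not fewlab's implementation) finds the scale c by bisection, assuming pi_min * m <= budget <= m so a solution exists (see the Note above):

import numpy as np

def pi_sqrt_rule(w, budget, pi_min=1e-3):
    # pi_j = clip(c * sqrt(w_j), pi_min, 1) with c chosen so sum(pi) = budget.
    s = np.sqrt(np.asarray(w, dtype=float))
    lo, hi = 0.0, 1.0
    while np.clip(hi * s, pi_min, 1.0).sum() < budget:   # bracket the scale
        hi *= 2.0
    for _ in range(100):                                 # bisection on c
        c = 0.5 * (lo + hi)
        if np.clip(c * s, pi_min, 1.0).sum() < budget:
            lo = c
        else:
            hi = c
    return np.clip(c * s, pi_min, 1.0)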
- fewlab.balanced_fixed_size(pi, g, budget, *, random_state=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)[source]
Fixed-size balanced sampling with variance reduction.
Implements a two-step heuristic:

1. Initial selection proportional to the inclusion probabilities pi.
2. Greedy local search to minimize the calibration residual ||sum_j (I_j/pi_j - 1) g_j||_2.
This balancing procedure aims to reduce the variance of Horvitz-Thompson estimators by making the sample more representative.
- Args:
pi: Inclusion probabilities for items. Index contains item identifiers.
g: Regression projections g_j = X^T v_j for each item j (shape (p, m)).
budget: Fixed sample size (number of items to select).
random_state: Random state for reproducible sampling. Can be None, int, or Generator.
max_swaps: Maximum number of swap iterations for balancing.
tol: Tolerance for stopping criterion (residual norm).
- Returns:
Index of selected items. Length equals budget.
- Raises:
ValidationError: If pi, g, or budget fail validation checks.
- See Also:
pi_aopt_for_budget: Compute optimal inclusion probabilities.
core_plus_tail: Hybrid deterministic + balanced sampling.
calibrate_weights: Post-stratification weight adjustment.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget, balanced_fixed_size
>>>
>>> # Setup data
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>>
>>> # Compute probabilities and influence projections
>>> result = pi_aopt_for_budget(counts, X, budget=30)
>>>
>>> # Balanced sampling
>>> selected = balanced_fixed_size(result.probabilities, result.influence_projections, budget=30, random_state=42)
>>> print(f"Selected {len(selected)} items with balanced design")
- Notes:
The balancing algorithm aims to make sum_S (I_j/pi_j - 1) * g_j ≈ 0, where S is the selected sample and I_j are selection indicators. This reduces variance in calibrated estimators.
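As a hedged sketch of that objective (illustrative only, with a boolean vector standing in for the sample indicators I_j):

import numpy as np

def balance_residual(pi, g, selected):
    # ||sum_j (I_j / pi_j - 1) g_j||_2 for indicators selected[j] in {0, 1};
    # pi has shape (m,), g has shape (p, m).
    coef = selected.astype(float) / pi - 1.0
    return float(np.linalg.norm(g @ coef))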
- fewlab.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, random_state=None, return_result=False, raise_on_failure=False)[source]
Compute inclusion probabilities that minimize total expected labels under row-wise SE limits.
The routine solves:

minimize sum_j pi_j
subject to sum_j q_ij / pi_j <= eps2_i + sum_j q_ij for every row i, where q_ij = (c_ij / T_i)^2.

- Args:
counts: Non-negative count matrix with units as rows and items as columns.
eps2: Row-wise squared standard-error tolerance; scalar applies to every row.
pi_min: Minimum allowable inclusion probability.
max_iter: Maximum optimization iterations.
tol: Convergence tolerance for constraint violations.
random_state: Random state for the stochastic subgradient steps. Can be None, int, or Generator.
return_result: If True, return a RowSEResult with diagnostics.
raise_on_failure: If True, raise a ValidationError when constraints remain violated.
- Returns:
Probability series if return_result is False (default) or a RowSEResult with diagnostics when return_result is True.
- Raises:
ValidationError: If inputs are invalid, or if raise_on_failure is True and the constraints remain violated after optimization.
- See Also:
pi_aopt_for_budget: A-optimal probabilities for a fixed budget.
items_to_label: Deterministic selection without SE constraints.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import row_se_min_labels
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (500, 50)))
>>> pi = row_se_min_labels(counts, eps2=0.05**2)
>>> float(pi.sum())
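A hedged sketch of the constraint being checked (not the solver itself), following the program above:

import numpy as np

def rowse_violations(counts, pi, eps2):
    # Positive entries mean row i violates sum_j q_ij/pi_j <= eps2 + sum_j q_ij,
    # where q_ij = (c_ij / T_i)^2 and T_i is the row total.
    C = np.asarray(counts, dtype=float)
    T = C.sum(axis=1, keepdims=True)
    q = (C / np.maximum(T, 1e-12)) ** 2
    lhs = (q / pi).sum(axis=1)
    rhs = eps2 + q.sum(axis=1)
    return np.maximum(lhs - rhs, 0.0)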
- fewlab.topk(arr, k, *, index=None)[source]
Return indices of the top-k entries of arr in descending order.

- Args:
arr: Array of scores to rank.
k: Number of entries to keep.
index: Optional index to map positions back to labels.
- Returns:
Index of the top-k entries ordered by decreasing value.
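Behaviourally this is close to a descending argsort; a minimal NumPy equivalent:

import numpy as np

arr = np.array([0.2, 0.9, 0.5, 0.7])
print(np.argsort(-arr)[:2])  # positions of the two largest scores: [1 3]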
- fewlab.calibrate_weights(pi, g, selected, pop_totals=None, *, distance='chi2', ridge=SMALL_RIDGE, nonneg=True)[source]
Compute calibrated weights for selected items using GREG/Deville-Särndal calibration.
- Args:
pi: Inclusion probabilities for all items (index = item names).
g: Regression projections g_j = X^T v_j for all items (shape (p, m)).
selected: Item identifiers drawn in the sample.
pop_totals: Known population totals (shape (p,)); defaults to g.sum(axis=1).
distance: Calibration distance measure; currently only “chi2” is supported.
ridge: Ridge regularization parameter for numerical stability.
nonneg: Whether to enforce non-negative calibrated weights.
- Returns:
Calibrated weights indexed by the selected items.
- Raises:
NotImplementedError: If distance is not “chi2”.
ValueError: If pop_totals has the wrong shape.
- Notes:
The closed-form solution for chi-square distance is w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S), where d_S are the base weights.
- References:
Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376-382.
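The closed form translates directly to NumPy. A minimal sketch, under the assumption that the base weights are the Horvitz-Thompson weights d_S = 1/pi_S:

import numpy as np

def chi2_calibrate(pi_S, G_S, t, ridge=1e-8, nonneg=True):
    # w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S)
    # pi_S: (n_S,) probabilities of sampled items; G_S: (p, n_S); t: (p,).
    d = 1.0 / pi_S
    K = G_S @ G_S.T + ridge * np.eye(G_S.shape[0])
    w = d + G_S.T @ np.linalg.solve(K, t - G_S @ d)
    return np.maximum(w, 0.0) if nonneg else w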
- fewlab.calibrated_ht_estimator(counts, labels, weights, *, normalize_by_total=True)[source]
Compute calibrated Horvitz-Thompson estimator for row shares.
- Args:
counts: Count matrix with rows as units and columns as items.
labels: Item labels for the selected items.
weights: Calibrated weights for the selected items.
normalize_by_total: Whether to divide by row totals to obtain shares.
- Returns:
Estimated row shares (or totals if normalize_by_total is False).
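One plausible reading of the estimator, sketched for binary labels (illustrative; fewlab's exact implementation may differ):

import numpy as np
import pandas as pd

def ht_row_shares(counts, labels, weights, normalize_by_total=True):
    # est_i = sum_{j in S} w_j * y_j * c_ij, optionally divided by T_i.
    S = labels.index
    wy = (weights * labels).reindex(S).to_numpy()      # w_j * y_j per sampled item
    est = (counts[S].to_numpy() * wy).sum(axis=1)      # weighted row totals
    if normalize_by_total:
        est = est / np.maximum(counts.sum(axis=1).to_numpy(), 1e-12)
    return pd.Series(est, index=counts.index)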
- fewlab.core_plus_tail(counts, X, budget, *, tail_frac=0.2, random_state=None, ensure_full_rank=True, ridge=None)[source]
Hybrid sampler combining a deterministic core with a balanced probabilistic tail.
- Strategy:
1. Select budget_core = (1 - tail_frac) * budget items deterministically (largest w_j).
2. Compute A-optimal inclusion probabilities for the full budget.
3. Draw the remaining budget_tail items using balanced sampling.
- Args:
counts: Count matrix with units as rows and candidate items as columns.
X: Feature matrix aligned with counts.index.
budget: Total number of items to select.
tail_frac: Fraction of the budget allocated to the probabilistic tail.
random_state: Random state for balanced tail selection. Can be None, int, or Generator.
ensure_full_rank: Whether to regularize X^T X if it is rank-deficient.
ridge: Optional ridge penalty added to X^T X.
- Returns:
Selection result containing the chosen items, inclusion probabilities, and metadata.
- Raises:
ValidationError: If inputs fail validation or the core/tail split is infeasible.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import core_plus_tail
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 5))
>>> result = core_plus_tail(counts, X, budget=50, tail_frac=0.2)
>>> result.selected.shape
(50,)
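The core/tail split in step 1 is simple arithmetic; the exact rounding rule here is an assumption:

budget, tail_frac = 50, 0.2
budget_core = int(round((1 - tail_frac) * budget))  # 40 deterministic items
budget_tail = budget - budget_core                  # 10 balanced-tail items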
- fewlab.adaptive_core_tail(counts, X, budget, *, min_tail_frac=0.1, max_tail_frac=0.4, condition_threshold=1e6, random_state=None)[source]
Adaptive core+tail selection with a data-driven tail fraction.
The routine increases the tail fraction when X^T X is poorly conditioned and decreases it when influence weights are highly concentrated.
- Args:
counts: Count matrix.
X: Feature matrix.
budget: Total number of items to select.
min_tail_frac: Minimum allowable tail fraction.
max_tail_frac: Maximum allowable tail fraction.
condition_threshold: Baseline condition number scale.
random_state: Random state for the balanced sampling step. Can be None, int, or Generator.
- Returns:
Selection result identical to core_plus_tail, with adaptive metadata in info.
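As a purely hypothetical sketch of such a rule (fewlab's actual adaptation, which also shrinks the fraction when influence weights are concentrated, may differ):

import numpy as np

def adaptive_tail_frac(cond_XtX, min_tail_frac=0.1, max_tail_frac=0.4, condition_threshold=1e6):
    # Interpolate between the bounds on a log scale of the condition number.
    t = np.clip(np.log10(max(cond_XtX, 1.0)) / np.log10(condition_threshold), 0.0, 1.0)
    return min_tail_frac + t * (max_tail_frac - min_tail_frac)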
- fewlab.greedy_aopt_selection(counts, X, budget, *, ensure_full_rank=True, ridge=None)[source]
Select items using greedy A-optimal sequential selection.
The algorithm iteratively chooses the item that maximally reduces the trace of the covariance matrix using Sherman-Morrison updates.
- Args:
counts: Count matrix with non-negative entries.
X: Feature matrix aligned with counts.index.
budget: Number of items to select sequentially.
ensure_full_rank: Whether to add a ridge if the information matrix becomes singular.
ridge: Optional explicit ridge parameter.
- Returns:
Selection result with items, influence weights, and diagnostics.
- See Also:
items_to_label: Batch A-optimal selection (faster, different results).
pi_aopt_for_budget: Compute inclusion probabilities for A-optimal design.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import greedy_aopt_selection
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = greedy_aopt_selection(counts, X, budget=20)
>>> len(result.selected)
20
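For intuition, one greedy step can be sketched with a rank-one Sherman-Morrison update: adding item j changes the information matrix A to A + g_j g_j^T, which lowers trace(A^{-1}) by ||A^{-1} g_j||^2 / (1 + g_j^T A^{-1} g_j). The rank-one update per item is an assumption here; fewlab's exact bookkeeping may differ:

import numpy as np

def greedy_step(A_inv, G, available):
    # Pick the candidate with the largest trace reduction, then update A^{-1}.
    B = A_inv @ G[:, available]                              # A^{-1} g_j per candidate
    gains = (B ** 2).sum(axis=0) / (1.0 + np.einsum('pj,pj->j', G[:, available], B))
    j = int(available[np.argmax(gains)])
    u = A_inv @ G[:, j]
    A_inv = A_inv - np.outer(u, u) / (1.0 + G[:, j] @ u)     # Sherman-Morrison
    return j, A_inv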
- class fewlab.CoreTailResult(selected, probabilities, core, tail, ht_weights, mixed_weights, diagnostics)[source]
Bases: object

Structured result for hybrid core+tail selection methods.
- Attributes:
selected: All selected item identifiers (core + tail).
probabilities: A-optimal inclusion probabilities for all items.
core: Deterministic core items (highest influence).
tail: Probabilistic tail items (balanced sampling).
ht_weights: Standard Horvitz-Thompson weights for the selected items.
mixed_weights: Mixed weights (1/pi for core, 1.0 for tail) for variance reduction.
diagnostics: Additional metadata such as budget splits and tail fraction.
- Properties:
budget_used: Total number of items selected.
budget_core: Number of items in the deterministic core.
budget_tail: Number of items in the probabilistic tail.
tail_frac: Fraction of the budget allocated to the tail.
- Examples:
>>> result = design.sample(budget=50, method="core_plus_tail", tail_frac=0.2)
>>> len(result.selected), len(result.core), len(result.tail)
(50, 40, 10)
- Parameters:
- selected: Index
- probabilities: Series
- core: Index
- tail: Index
- ht_weights: Series
- mixed_weights: Series
- __getitem__(key)[source]
Access selected items by index.
- __init__(selected, probabilities, core, tail, ht_weights, mixed_weights, diagnostics)
- __iter__()[source]
Iterate over selected items.
- property budget_core: int
Number of items in deterministic core.
- property budget_tail: int
Number of items in probabilistic tail.
- property budget_used: int
Total number of items selected.
- property probability_sum: float
Sum of inclusion probabilities.
- property tail_frac: float
Fraction of budget allocated to tail.
- class fewlab.SamplingResult(sample, probabilities, weights, diagnostics)[source]
Bases: object

Structured result for probabilistic sampling methods.
- Attributes:
sample: Sampled item identifiers.
probabilities: Inclusion probabilities used for sampling.
weights: Suggested sampling weights for the sampled items.
diagnostics: Sampling diagnostics and metadata.
- Properties:
sample_size: Number of sampled items.
- Examples:
>>> result = design.sample(budget=30, method="balanced")
>>> result.sample_size
30
- sample: Index
- probabilities: Series
- weights: Series
- __getitem__(key)[source]
Access sampled items by index.
- __init__(sample, probabilities, weights, diagnostics)
- __iter__()[source]
Iterate over sampled items.
- property probability_sum: float
Sum of inclusion probabilities.
- property sample_size: int
Number of sampled items.
- class fewlab.SelectionResult(selected, influence_weights, diagnostics)[source]
Bases: object

Structured result for deterministic selection methods.
- Attributes:
selected: Selected item identifiers ordered by influence.
influence_weights: A-optimal influence weights used for selection.
diagnostics: Selection diagnostics and metadata.
- Properties:
budget_used: Number of items selected.
- Examples:
>>> result = design.select(budget=30, method="deterministic")
>>> len(result.selected)
30
- selected: Index
- influence_weights: Series
- __getitem__(key)[source]
Access selected items by index.
- __init__(selected, influence_weights, diagnostics)
- __iter__()[source]
Iterate over selected items.
- property budget_used: int
Number of items selected.
- class fewlab.ProbabilityResult(probabilities, influence_projections, diagnostics)[source]
Bases: object

Structured result for probability computation methods.
Provides access to computed probabilities, influence projections, and computation diagnostics.
- probabilities
Inclusion probabilities indexed by item identifiers.
- Type:
pd.Series
- influence_projections
Regression projections g_j = X^T v_j for all items (shape (p, m)). Used for balanced sampling and weight calibration.
- Type:
np.ndarray
- Properties:
budget_used: Sum of inclusion probabilities (float).
- Examples:
>>> result = design.inclusion_probabilities(budget=50, method="aopt")
>>> print(f"Budget used: {result.budget_used:.1f}")
>>> # Now you can use influence_projections for balanced sampling
>>> from fewlab import balanced_fixed_size
>>> selected = balanced_fixed_size(result.probabilities, result.influence_projections, 50)
- probabilities: Series
- influence_projections: ndarray
- __init__(probabilities, influence_projections, diagnostics)
- property budget_used: float
Sum of inclusion probabilities.
- class fewlab.EstimationResult(estimates, weights, selected, diagnostics)[source]
Bases: object

Structured result for estimation methods.
- estimates
Row-wise estimates.
- Type:
pd.Series
- weights
Calibrated weights used for estimation.
- Type:
pd.Series
- selected
Items used for estimation.
- Type:
pd.Index
- Examples:
>>> result = design.estimate(selected, labels)
>>> print(f"Mean estimate: {result.estimates.mean():.3f}")
- estimates: Series
- weights: Series
- selected: Index
- __init__(estimates, weights, selected, diagnostics)
- class fewlab.RowSEResult(probabilities, max_violation, tolerance, iterations, best_iteration, feasible)[source]
Bases: object

Result container for row_se_min_labels.
- Attributes:
probabilities: Inclusion probabilities indexed by item identifiers.
max_violation: Maximum constraint violation encountered.
tolerance: Target violation tolerance.
iterations: Number of iterations executed.
best_iteration: Iteration index where the best solution was recorded.
feasible: Whether the best solution satisfies the tolerance.
- Parameters:
- probabilities: Series
- max_violation: float
- tolerance: float
- iterations: int
- best_iteration: int
- feasible: bool
- __init__(probabilities, max_violation, tolerance, iterations, best_iteration, feasible)
Core Module¶
- class fewlab.core.Influence(w, g, cols)[source]¶
Bases: object

Influence data structure with memory-optimized slots.
- fewlab.core.pi_aopt_for_budget(counts, X, budget, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)[source]¶
Compute A-optimal first-order inclusion probabilities for a target budget.
The probabilities follow the square-root rule pi_j = clip(c * sqrt(w_j), [pi_min, 1]) with c chosen so that sum(pi) = budget.
- Args:
counts: Count matrix with non-negative values.
X: Feature matrix aligned with counts.index.
budget: Expected total budget (sum of inclusion probabilities).
pi_min: Minimum allowed inclusion probability.
ensure_full_rank: Whether to add a small ridge term when X^T X is ill-conditioned.
ridge: Explicit ridge parameter overriding the automatic heuristic.
- Returns:
Probability result with inclusion probabilities and computation diagnostics.
- Note:
If budget < m * pi_min (where m is the number of items), the budget constraint cannot be satisfied. In this case, the function returns all probabilities as pi_min, resulting in sum(pi) = m * pi_min > budget, and issues a warning. The violation details are included in the result’s diagnostics under budget_violation.
- See Also:
items_to_label: Deterministic selection using the same influence weights.
balanced_fixed_size: Fixed-size balanced sampling using these probabilities.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = pi_aopt_for_budget(counts, X, budget=50)
>>> round(result.budget_used, 1)
50.0
- fewlab.core.items_to_label(counts, X, budget, *, ensure_full_rank=True, ridge=None)[source]¶
Select items to label using deterministic A-optimal design.
Influence weights are computed as w_j = g_j^T (X^T X)^{-1} g_j, and the top entries are returned.
- Args:
counts: Count matrix with units as rows and items as columns.
X: Feature matrix aligned with counts.index.
budget: Number of items to select.
ensure_full_rank: Whether to add a ridge term when X^T X is ill-conditioned.
ridge: Optional ridge parameter overriding the automatic heuristic.
- Returns:
Selection result with items, influence weights, and diagnostics.
- See Also:
pi_aopt_for_budget: Compute inclusion probabilities for the same design.
greedy_aopt_selection: Greedy sequential variant.
core_plus_tail: Hybrid deterministic and probabilistic selection.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import items_to_label
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = items_to_label(counts, X, budget=50)
>>> len(result.selected)
50
Selection Module¶
- fewlab.selection.topk(arr, k, *, index=None)[source]¶
Return indices of the top-k entries of arr in descending order.

- Args:
arr: Array of scores to rank.
k: Number of entries to keep.
index: Optional index to map positions back to labels.
- Returns:
Index of the top-k entries ordered by decreasing value.
Balanced Sampling¶
- fewlab.balanced.balanced_fixed_size(pi, g, budget, *, random_state=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)[source]¶
Fixed-size balanced sampling with variance reduction.
Implements a two-step heuristic:

1. Initial selection proportional to the inclusion probabilities pi.
2. Greedy local search to minimize the calibration residual ||sum_j (I_j/pi_j - 1) g_j||_2.
This balancing procedure aims to reduce the variance of Horvitz-Thompson estimators by making the sample more representative.
- Args:
pi: Inclusion probabilities for items. Index contains item identifiers.
g: Regression projections g_j = X^T v_j for each item j (shape (p, m)).
budget: Fixed sample size (number of items to select).
random_state: Random state for reproducible sampling. Can be None, int, or Generator.
max_swaps: Maximum number of swap iterations for balancing.
tol: Tolerance for stopping criterion (residual norm).
- Returns:
Index of selected items. Length equals budget.
- Raises:
ValidationError: If pi, g, or budget fail validation checks.
- See Also:
pi_aopt_for_budget: Compute optimal inclusion probabilities.
core_plus_tail: Hybrid deterministic + balanced sampling.
calibrate_weights: Post-stratification weight adjustment.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget, balanced_fixed_size
>>>
>>> # Setup data
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>>
>>> # Compute probabilities and influence projections
>>> result = pi_aopt_for_budget(counts, X, budget=30)
>>>
>>> # Balanced sampling
>>> selected = balanced_fixed_size(result.probabilities, result.influence_projections, budget=30, random_state=42)
>>> print(f"Selected {len(selected)} items with balanced design")
- Notes:
The balancing algorithm aims to make sum_S (I_j/pi_j - 1) * g_j ≈ 0, where S is the selected sample and I_j are selection indicators. This reduces variance in calibrated estimators.
Row Standard Error Minimization¶
- fewlab.rowse.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, random_state=None, return_result=False, raise_on_failure=False)[source]¶
Compute inclusion probabilities that minimize total expected labels under row-wise SE limits.
The routine solves:

minimize sum_j pi_j
subject to sum_j q_ij / pi_j <= eps2_i + sum_j q_ij for every row i, where q_ij = (c_ij / T_i)^2.

- Args:
counts: Non-negative count matrix with units as rows and items as columns.
eps2: Row-wise squared standard-error tolerance; scalar applies to every row.
pi_min: Minimum allowable inclusion probability.
max_iter: Maximum optimization iterations.
tol: Convergence tolerance for constraint violations.
random_state: Random state for the stochastic subgradient steps. Can be None, int, or Generator.
return_result: If True, return a RowSEResult with diagnostics.
raise_on_failure: If True, raise a ValidationError when constraints remain violated.
- Returns:
Probability series if return_result is False (default) or a RowSEResult with diagnostics when return_result is True.
- Raises:
ValidationError: If inputs are invalid, or if raise_on_failure is True and the constraints remain violated after optimization.
- See Also:
pi_aopt_for_budget: A-optimal probabilities for a fixed budget.
items_to_label: Deterministic selection without SE constraints.
- Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import row_se_min_labels
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (500, 50)))
>>> pi = row_se_min_labels(counts, eps2=0.05**2)
>>> float(pi.sum())