API Reference

This page contains the complete API reference for fewlab.

Main Functions

Fewlab: Optimal item selection for efficient labeling and survey sampling.

Main API functions:

  • items_to_label: Deterministic A-optimal selection
  • pi_aopt_for_budget: A-optimal inclusion probabilities
  • balanced_fixed_size: Balanced sampling with fixed size
  • row_se_min_labels: Row-wise SE minimization
  • calibrate_weights: GREG-style weight calibration
  • core_plus_tail: Hybrid deterministic core + balanced tail
  • adaptive_core_tail: Data-driven hybrid selection

class fewlab.Design(counts, X, *, ridge='auto', ensure_full_rank=True)[source]

Bases: object

Primary interface for optimal experimental design with cached computations.

The class stores processed data, cached influence matrices, and diagnostics so that repeated operations such as selection, sampling, and calibration can reuse expensive intermediate results.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import Design
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> design = Design(counts, X)
>>> design.select(budget=20).shape[0]
20
__init__(counts, X, *, ridge='auto', ensure_full_rank=True)[source]

Initialize the design with validated data and cached influence computation.

Args:

  • counts: Count matrix with non-negative entries.
  • X: Feature matrix aligned with counts.index.
  • ridge: Ridge value or "auto" to infer it from conditioning.
  • ensure_full_rank: Whether to add a ridge when X^T X is ill-conditioned.

Return type:

None

__repr__()[source]

String representation of Design object.

Return type:

str

calibrate_weights(selected, pop_totals=None, *, distance='chi2', ridge=SMALL_RIDGE, nonneg=True)[source]

Compute calibrated weights for selected items.

Args:

  • selected: Identifiers of sampled items.
  • pop_totals: Optional population totals; defaults to sums of the g matrix.
  • distance: Calibration distance measure (e.g., "chi2").
  • ridge: Ridge regularization parameter.
  • nonneg: Whether to enforce non-negative calibrated weights.

Returns:

Calibrated weights indexed by the selected items.

Return type:

Series

property diagnostics: dict[str, Any]

Comprehensive diagnostic information about the design.

estimate(selected, labels, weights=None, *, normalize_by_total=True)[source]

Compute calibrated Horvitz-Thompson estimates for row shares.

Args:

  • selected: Identifiers of sampled items.
  • labels: Observed labels for the selected items.
  • weights: Optional calibrated weights; if omitted they are computed internally.
  • normalize_by_total: Whether to divide by row totals to produce shares.

Returns:

Estimation result with estimates, weights, and diagnostics.

Return type:

EstimationResult

inclusion_probabilities(budget, *, pi_min=PI_MIN_DEFAULT, method='aopt', **kwargs)[source]

Compute inclusion probabilities for a given budget.

Args:

  • budget: Expected total budget (sum of inclusion probabilities).
  • pi_min: Minimum inclusion probability per item.
  • method: Probability computation strategy, "aopt" or "row_se".
  • **kwargs: Additional method-specific arguments (e.g., eps2 for "row_se").

Returns:

Probability result with inclusion probabilities and diagnostics.

Raises:

ValidationError: If the method name is unknown.

Return type:

ProbabilityResult

property influence_weights: Series

A-optimal influence weights w_j for each item.

property n_items: int

Number of items (columns) after preprocessing.

property n_units: int

Number of units (rows) after preprocessing.

sample(budget, method='balanced', *, random_state=None, **kwargs)[source]

Generate probabilistic samples using various methods.

Args:

  • budget: Number of items to sample.
  • method: Sampling method ("balanced", "core_plus_tail", or "adaptive").
  • random_state: Random state for reproducible sampling. Can be None, int, or Generator.
  • **kwargs: Method-specific parameters (e.g., tail_frac, pi_min, tolerances).

Returns:

Sampled item identifiers.

Raises:

ValidationError: If the method name is unknown.

Return type:

SamplingResult | CoreTailResult

select(budget, method='deterministic')[source]

Select items using deterministic algorithms.

Args:

  • budget: Number of items to select.
  • method: Selection algorithm: "deterministic" (batch) or "greedy" (sequential).

Returns:

Selection result with items, influence weights, and diagnostics.

Raises:

ValidationError: If the method name is unknown.

Parameters:
  • budget (int)

  • method (Literal['deterministic', 'greedy'])

Return type:

SelectionResult

fewlab.items_to_label(counts, X, budget, *, ensure_full_rank=True, ridge=None)[source]

Select items to label using deterministic A-optimal design.

Influence weights are computed as w_j = g_j^T (X^T X)^{-1} g_j, and the top entries are returned.
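
To make the formula concrete, the following minimal numpy sketch computes these weights from a projection matrix G of shape (p, m) whose columns are the g_j; how fewlab builds each g_j from the counts is internal, so G is taken as given here (illustrative only, not the library's implementation):

    import numpy as np

    def influence_weights(X: np.ndarray, G: np.ndarray, ridge: float = 0.0) -> np.ndarray:
        """Compute w_j = g_j^T (X^T X + ridge*I)^{-1} g_j for each column g_j of G.

        X has shape (n, p); G has shape (p, m); returns an array of m weights.
        """
        p = X.shape[1]
        XtX = X.T @ X + ridge * np.eye(p)        # (p, p) information matrix
        solved = np.linalg.solve(XtX, G)         # columns hold (X^T X)^{-1} g_j
        return np.einsum("pm,pm->m", G, solved)  # quadratic form per item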

Args:

  • counts: Count matrix with units as rows and items as columns.
  • X: Feature matrix aligned with counts.index.
  • budget: Number of items to select.
  • ensure_full_rank: Whether to add a ridge term when X^T X is ill-conditioned.
  • ridge: Optional ridge parameter overriding the automatic heuristic.

Returns:

Selection result with items, influence weights, and diagnostics.

See Also:

  • pi_aopt_for_budget: Compute inclusion probabilities for the same design.
  • greedy_aopt_selection: Greedy sequential variant.
  • core_plus_tail: Hybrid deterministic and probabilistic selection.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import items_to_label
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = items_to_label(counts, X, budget=50)
>>> len(result.selected)
50
Return type:

SelectionResult

fewlab.pi_aopt_for_budget(counts, X, budget, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)[source]

Compute A-optimal first-order inclusion probabilities for a target budget.

The probabilities follow the square-root rule pi_j = clip(c * sqrt(w_j), [pi_min, 1]) with c chosen so that sum(pi) = budget.
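
For intuition, here is a hedged sketch of the square-root rule using bisection to find the scaling constant c (the library's root-finding details may differ):

    import numpy as np

    def sqrt_rule_probabilities(w: np.ndarray, budget: float, pi_min: float = 1e-3) -> np.ndarray:
        """pi_j = clip(c * sqrt(w_j), pi_min, 1) with c chosen so sum(pi) ~= budget."""
        s = np.sqrt(np.asarray(w, dtype=float))

        def total(c: float) -> float:
            return np.clip(c * s, pi_min, 1.0).sum()

        lo, hi = 0.0, 1.0
        while total(hi) < budget and hi < 1e12:  # grow the bracket until feasible
            hi *= 2.0
        for _ in range(60):                      # bisection on the monotone map c -> sum(pi)
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if total(mid) < budget else (lo, mid)
        return np.clip(hi * s, pi_min, 1.0)

When budget < m * pi_min the clipping floor dominates and every pi_j lands at pi_min, mirroring the fallback described in the Note below.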

Args:

  • counts: Count matrix with non-negative values.
  • X: Feature matrix aligned with counts.index.
  • budget: Expected total budget (sum of inclusion probabilities).
  • pi_min: Minimum allowed inclusion probability.
  • ensure_full_rank: Whether to add a small ridge term when X^T X is ill-conditioned.
  • ridge: Explicit ridge parameter overriding the automatic heuristic.

Returns:

Probability result with inclusion probabilities and computation diagnostics.

Note:

If budget < m * pi_min (where m is the number of items), the budget constraint cannot be satisfied. In this case, the function returns all probabilities as pi_min, resulting in sum(pi) = m * pi_min > budget, and issues a warning. The violation details are included in the result’s diagnostics under budget_violation.
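
A quick pre-check for this condition (illustrative only; counts, budget, and pi_min are assumed to be in scope):

    m = counts.shape[1]  # number of items
    if budget < m * pi_min:
        print(f"infeasible: sum(pi) will be m * pi_min = {m * pi_min:.2f} > budget = {budget}")
        # expect a warning and a 'budget_violation' entry in the result's diagnostics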

See Also:

  • items_to_label: Deterministic selection using the same influence weights.
  • balanced_fixed_size: Fixed-size balanced sampling using these probabilities.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = pi_aopt_for_budget(counts, X, budget=50)
>>> round(result.budget_used, 1)
50.0
Return type:

ProbabilityResult

fewlab.balanced_fixed_size(pi, g, budget, *, random_state=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)[source]

Fixed-size balanced sampling with variance reduction.

Implements a two-step heuristic:

  1. Initial selection proportional to inclusion probabilities pi

  2. Greedy local search to minimize calibration residual ||sum((I/pi)-1) g||_2

This balancing procedure aims to reduce the variance of Horvitz-Thompson estimators by making the sample more representative.

Args:

  • pi: Inclusion probabilities for items. Index contains item identifiers.
  • g: Regression projections g_j = X^T v_j for each item j (shape (p, m)).
  • budget: Fixed sample size (number of items to select).
  • random_state: Random state for reproducible sampling. Can be None, int, or Generator.
  • max_swaps: Maximum number of swap iterations for balancing.
  • tol: Tolerance for stopping criterion (residual norm).

Returns:

Index of selected items. Length equals budget.

Raises:

ValidationError: If pi, g, or budget fail validation checks.

See Also:

  • pi_aopt_for_budget: Compute optimal inclusion probabilities.
  • core_plus_tail: Hybrid deterministic + balanced sampling.
  • calibrate_weights: Post-stratification weight adjustment.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget, balanced_fixed_size
>>>
>>> # Setup data
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>>
>>> # Probabilities and influence projections come from one call
>>> pi_result = pi_aopt_for_budget(counts, X, budget=30)
>>>
>>> # Balanced sampling on the resulting probabilities
>>> selected = balanced_fixed_size(pi_result.probabilities,
...                                pi_result.influence_projections,
...                                budget=30, random_state=42)
>>> print(f"Selected {len(selected)} items with balanced design")
Selected 30 items with balanced design
Notes:

The balancing algorithm aims to make sum_S (I_j/pi_j - 1) * g_j ≈ 0, where S is the selected sample and I_j are selection indicators. This reduces variance in calibrated estimators.
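
The residual that the swap search drives toward zero can be written down directly; a sketch assuming pi as a (m,) array, G as the (p, m) projection matrix, and selected as integer positions:

    import numpy as np

    def balance_residual(G: np.ndarray, pi: np.ndarray, selected: np.ndarray) -> float:
        """||sum_j (I_j/pi_j - 1) g_j||_2 with I_j = 1 for selected items, else 0."""
        indicator = np.zeros(pi.shape[0])
        indicator[selected] = 1.0
        coeff = indicator / pi - 1.0     # per-item coefficient (I_j/pi_j - 1)
        return float(np.linalg.norm(G @ coeff))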

Return type:

Index

fewlab.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, random_state=None, return_result=False, raise_on_failure=False)[source]

Compute inclusion probabilities that minimize total expected labels under row-wise SE limits.

The routine solves:

    minimize    sum_j pi_j
    subject to  sum_j q_ij / pi_j <= eps2_i + sum_j q_ij,    where q_ij = (c_ij / T_i)^2
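
A sketch of the per-row constraint check implied by this formulation, with T_i the row total (an illustration of the math above, not fewlab's solver):

    import numpy as np

    def row_se_violations(counts: np.ndarray, pi: np.ndarray, eps2: float) -> np.ndarray:
        """Positive entries mean sum_j q_ij/pi_j <= eps2 + sum_j q_ij is violated."""
        T = counts.sum(axis=1, keepdims=True)  # row totals T_i
        q = (counts / T) ** 2                  # q_ij = (c_ij / T_i)^2
        return (q / pi).sum(axis=1) - (eps2 + q.sum(axis=1))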

Args:

  • counts: Non-negative count matrix with units as rows and items as columns.
  • eps2: Row-wise squared standard-error tolerance; a scalar applies to every row.
  • pi_min: Minimum allowable inclusion probability.
  • max_iter: Maximum optimization iterations.
  • tol: Convergence tolerance for constraint violations.
  • random_state: Random state for the stochastic subgradient steps. Can be None, int, or Generator.
  • return_result: If True, return a RowSEResult with diagnostics.
  • raise_on_failure: If True, raise a ValidationError when constraints remain violated.

Returns:

Probability series if return_result is False (default) or a RowSEResult with diagnostics when return_result is True.

Raises:

ValidationError: If inputs are invalid, or if raise_on_failure is True and the constraints remain violated after optimization.

See Also:

  • pi_aopt_for_budget: A-optimal probabilities for a fixed budget.
  • items_to_label: Deterministic selection without SE constraints.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import row_se_min_labels
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (500, 50)))
>>> pi = row_se_min_labels(counts, eps2=0.05**2)
>>> total_expected_labels = float(pi.sum())  # total expected number of labels
Return type:

RowSEResult | Series

fewlab.topk(arr, k, *, index=None)[source]

Return indices of the top-k entries of arr in descending order.

Args:

  • arr: Array of scores to rank.
  • k: Number of entries to keep.
  • index: Optional index to map positions back to labels.

Returns:

Index of the top-k entries ordered by decreasing value.

Return type:

Index

fewlab.calibrate_weights(pi, g, selected, pop_totals=None, *, distance='chi2', ridge=SMALL_RIDGE, nonneg=True)[source]

Compute calibrated weights for selected items using GREG/Deville-Särndal calibration.

Args:

  • pi: Inclusion probabilities for all items (index = item names).
  • g: Regression projections g_j = X^T v_j for all items (shape (p, m)).
  • selected: Item identifiers drawn in the sample.
  • pop_totals: Known population totals (shape (p,)); defaults to g.sum(axis=1).
  • distance: Calibration distance measure; currently only "chi2" is supported.
  • ridge: Ridge regularization parameter for numerical stability.
  • nonneg: Whether to enforce non-negative calibrated weights.

Returns:

Calibrated weights indexed by the selected items.

Raises:

NotImplementedError: If distance is not "chi2".

ValueError: If pop_totals has the wrong shape.

Notes:

The closed-form solution for chi-square distance is w* = d_S + G_S^T (G_S G_S^T + ridge I)^{-1} (t - G_S d_S) where d_S are base weights.
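
A direct numpy transcription of this closed form (a sketch; fewlab additionally handles the nonneg option and input validation):

    import numpy as np

    def chi2_calibrate(G_S: np.ndarray, d_S: np.ndarray, t: np.ndarray, ridge: float = 1e-10) -> np.ndarray:
        """w* = d_S + G_S^T (G_S G_S^T + ridge*I)^{-1} (t - G_S d_S).

        G_S: (p, k) projections of sampled items; d_S: (k,) base weights; t: (p,) totals.
        """
        p = G_S.shape[0]
        gap = t - G_S @ d_S                    # calibration gap on the totals
        lam = np.linalg.solve(G_S @ G_S.T + ridge * np.eye(p), gap)
        return d_S + G_S.T @ lam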

References:

Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376-382.

Return type:

Series

fewlab.calibrated_ht_estimator(counts, labels, weights, *, normalize_by_total=True)[source]

Compute calibrated Horvitz-Thompson estimator for row shares.
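
As a reference point, the textbook Horvitz-Thompson form of this estimator looks as follows; fewlab's calibrated version may differ in detail, so treat this as an assumption-laden sketch:

    import numpy as np
    import pandas as pd

    def ht_row_shares(counts: pd.DataFrame, labels: pd.Series, weights: pd.Series,
                      normalize_by_total: bool = True) -> pd.Series:
        """Estimate row shares as sum_{j in S} w_j * c_ij * y_j (optionally / T_i)."""
        S = weights.index                               # sampled items
        contrib = counts[S].to_numpy() * (weights * labels).to_numpy()
        est = pd.Series(contrib.sum(axis=1), index=counts.index)
        if normalize_by_total:
            est = est / counts.sum(axis=1)              # totals -> shares
        return est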

Args:

  • counts: Count matrix with rows as units and columns as items.
  • labels: Item labels for the selected items.
  • weights: Calibrated weights for the selected items.
  • normalize_by_total: Whether to divide by row totals to obtain shares.

Returns:

Estimated row shares (or totals if normalize_by_total is False).

Return type:

Series

fewlab.core_plus_tail(counts, X, budget, *, tail_frac=0.2, random_state=None, ensure_full_rank=True, ridge=None)[source]

Hybrid sampler combining a deterministic core with a balanced probabilistic tail.

Strategy:

  1. Select budget_core = (1 - tail_frac) * budget items deterministically (largest w_j; see the budget-split sketch after this list).

  2. Compute A-optimal inclusion probabilities for the full budget.

  3. Draw the remaining budget_tail items using balanced sampling.
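
A hedged sketch of the budget split in step 1 (the rounding rule is an assumption; the library may resolve ties differently):

    def split_budget(budget: int, tail_frac: float = 0.2) -> tuple[int, int]:
        """Split the total budget into a deterministic core and a probabilistic tail."""
        budget_core = int(round((1.0 - tail_frac) * budget))
        budget_tail = budget - budget_core  # remainder goes to the balanced tail
        return budget_core, budget_tail

    # split_budget(50, 0.2) -> (40, 10), matching the CoreTailResult example below.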

Args:

  • counts: Count matrix with units as rows and candidate items as columns.
  • X: Feature matrix aligned with counts.index.
  • budget: Total number of items to select.
  • tail_frac: Fraction of the budget allocated to the probabilistic tail.
  • random_state: Random state for balanced tail selection. Can be None, int, or Generator.
  • ensure_full_rank: Whether to regularize X^T X if it is rank-deficient.
  • ridge: Optional ridge penalty added to X^T X.

Returns:

Selection result containing the chosen items, inclusion probabilities, and metadata.

Raises:

ValidationError: If inputs fail validation or the core/tail split is infeasible.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import core_plus_tail
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 5))
>>> result = core_plus_tail(counts, X, budget=50, tail_frac=0.2)
>>> result.selected.shape
(50,)
Return type:

CoreTailResult

fewlab.adaptive_core_tail(counts, X, budget, *, min_tail_frac=0.1, max_tail_frac=0.4, condition_threshold=1e6, random_state=None)[source]

Adaptive core+tail selection with a data-driven tail fraction.

The routine increases the tail fraction when X^T X is poorly conditioned and decreases it when influence weights are highly concentrated.
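
fewlab does not document the exact adaptation rule, so the following is a purely illustrative heuristic consistent with that description (every constant and the concentration measure are assumptions):

    import numpy as np

    def adaptive_tail_frac(cond_XtX: float, w: np.ndarray,
                           min_tail_frac: float = 0.1, max_tail_frac: float = 0.4,
                           condition_threshold: float = 1e6) -> float:
        """Raise the tail fraction when X^T X is ill-conditioned; lower it when
        influence weights are concentrated. Not fewlab's actual rule."""
        frac = 0.2                                    # nominal starting fraction
        if cond_XtX > condition_threshold:            # poorly conditioned: sample more
            frac *= 1.5
        top_decile = np.sort(w)[::-1][: max(1, w.size // 10)]
        if top_decile.sum() / w.sum() > 0.8:          # weights highly concentrated
            frac *= 0.5                               # lean on the deterministic core
        return float(np.clip(frac, min_tail_frac, max_tail_frac))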

Args:

  • counts: Count matrix.
  • X: Feature matrix.
  • budget: Total number of items to select.
  • min_tail_frac: Minimum allowable tail fraction.
  • max_tail_frac: Maximum allowable tail fraction.
  • condition_threshold: Baseline condition number scale.
  • random_state: Random state for the balanced sampling step. Can be None, int, or Generator.

Returns:

Selection result identical to core_plus_tail, with adaptive metadata in info.

Return type:

CoreTailResult

fewlab.greedy_aopt_selection(counts, X, budget, *, ensure_full_rank=True, ridge=None)[source]

Select items using greedy A-optimal sequential selection.

The algorithm iteratively chooses the item that maximally reduces the trace of the covariance matrix using Sherman-Morrison updates.
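
For intuition, a compact sketch of the trace-reduction bookkeeping via the Sherman-Morrison identity; this simplifies the actual algorithm, which also tracks influence weights and diagnostics:

    import numpy as np

    def sm_update(M_inv: np.ndarray, g: np.ndarray) -> np.ndarray:
        """Return (M + g g^T)^{-1} from M^{-1} via the Sherman-Morrison identity."""
        Mg = M_inv @ g
        return M_inv - np.outer(Mg, Mg) / (1.0 + g @ Mg)

    def greedy_pick(M_inv: np.ndarray, G: np.ndarray, remaining: list[int]) -> int:
        """Choose the item whose addition most reduces the trace of the covariance."""
        gains = []
        for j in remaining:
            Mg = M_inv @ G[:, j]
            gains.append((Mg @ Mg) / (1.0 + G[:, j] @ Mg))  # exact trace decrease
        return remaining[int(np.argmax(gains))]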

Args:

  • counts: Count matrix with non-negative entries.
  • X: Feature matrix aligned with counts.index.
  • budget: Number of items to select sequentially.
  • ensure_full_rank: Whether to add a ridge if the information matrix becomes singular.
  • ridge: Optional explicit ridge parameter.

Returns:

Selection result with items, influence weights, and diagnostics.

See Also:

  • items_to_label: Batch A-optimal selection (faster, different results).
  • pi_aopt_for_budget: Compute inclusion probabilities for A-optimal design.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import greedy_aopt_selection
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = greedy_aopt_selection(counts, X, budget=20)
>>> len(result.selected)
20
Return type:

SelectionResult

class fewlab.CoreTailResult(selected, probabilities, core, tail, ht_weights, mixed_weights, diagnostics)[source]

Bases: object

Structured result for hybrid core+tail selection methods.

Attributes:

  • selected: All selected item identifiers (core + tail).
  • probabilities: A-optimal inclusion probabilities for all items.
  • core: Deterministic core items (highest influence).
  • tail: Probabilistic tail items (balanced sampling).
  • ht_weights: Standard Horvitz-Thompson weights for the selected items.
  • mixed_weights: Mixed weights (1/pi for core, 1.0 for tail) for variance reduction.
  • diagnostics: Additional metadata such as budget splits and tail fraction.

Properties:

  • budget_used: Total number of items selected.
  • budget_core: Number of items in the deterministic core.
  • budget_tail: Number of items in the probabilistic tail.
  • tail_frac: Fraction of the budget allocated to the tail.

Examples:
>>> result = design.sample(budget=50, method="core_plus_tail", tail_frac=0.2)
>>> len(result.selected), len(result.core), len(result.tail)
(50, 40, 10)
Parameters:
selected: Index
probabilities: Series
core: Index
tail: Index
ht_weights: Series
mixed_weights: Series
diagnostics: dict[str, Any]
__getitem__(key)[source]

Access selected items by index.

__init__(selected, probabilities, core, tail, ht_weights, mixed_weights, diagnostics)
Return type:

None

__iter__()[source]

Iterate over selected items.

__len__()[source]

Number of selected items.

Return type:

int

__repr__()[source]

String representation.

Return type:

str

property budget_core: int

Number of items in deterministic core.

property budget_tail: int

Number of items in probabilistic tail.

property budget_used: int

Total number of items selected.

property probability_sum: float

Sum of inclusion probabilities.

property tail_frac: float

Fraction of budget allocated to tail.

class fewlab.SamplingResult(sample, probabilities, weights, diagnostics)[source]

Bases: object

Structured result for probabilistic sampling methods.

Attributes:

  • sample: Sampled item identifiers.
  • probabilities: Inclusion probabilities used for sampling.
  • weights: Suggested sampling weights for the sampled items.
  • diagnostics: Sampling diagnostics and metadata.

Properties:

sample_size: Number of sampled items.

Examples:
>>> result = design.sample(budget=30, method="balanced")
>>> result.sample_size
30
Parameters:
sample: Index
probabilities: Series
weights: Series
diagnostics: dict[str, Any]
__getitem__(key)[source]

Access sampled items by index.

__init__(sample, probabilities, weights, diagnostics)
Return type:

None

__iter__()[source]

Iterate over sampled items.

__len__()[source]

Number of sampled items.

Return type:

int

__repr__()[source]

String representation.

Return type:

str

property probability_sum: float

Sum of inclusion probabilities.

property sample_size: int

Number of sampled items.

class fewlab.SelectionResult(selected, influence_weights, diagnostics)[source]

Bases: object

Structured result for deterministic selection methods.

Attributes:

  • selected: Selected item identifiers ordered by influence.
  • influence_weights: A-optimal influence weights used for selection.
  • diagnostics: Selection diagnostics and metadata.

Properties:

budget_used: Number of items selected.

Examples:
>>> result = design.select(budget=30, method="deterministic")
>>> len(result.selected)
30
Parameters:
selected: Index
influence_weights: Series
diagnostics: dict[str, Any]
__getitem__(key)[source]

Access selected items by index.

__init__(selected, influence_weights, diagnostics)
Return type:

None

__iter__()[source]

Iterate over selected items.

__len__()[source]

Number of selected items.

Return type:

int

__repr__()[source]

String representation.

Return type:

str

property budget_used: int

Number of items selected.

class fewlab.ProbabilityResult(probabilities, influence_projections, diagnostics)[source]

Bases: object

Structured result for probability computation methods.

Provides access to computed probabilities, influence projections, and computation diagnostics.

Parameters:
probabilities

Inclusion probabilities indexed by item identifiers.

Type:

pd.Series

influence_projections

Regression projections g_j = X^T v_j for all items (shape (p, m)). Used for balanced sampling and weight calibration.

Type:

np.ndarray

diagnostics

Computation diagnostics and metadata.

Type:

dict[str, Any]

Properties:

budget_used

Sum of inclusion probabilities.

Type:

float

Examples:

>>> result = design.inclusion_probabilities(budget=50, method="aopt")
>>> print(f"Budget used: {result.budget_used:.1f}")
>>> # Now you can use influence_projections for balanced sampling
>>> from fewlab import balanced_fixed_size
>>> selected = balanced_fixed_size(result.probabilities, result.influence_projections, 50)
probabilities: Series
influence_projections: ndarray
diagnostics: dict[str, Any]
__init__(probabilities, influence_projections, diagnostics)
Return type:

None

__len__()[source]

Number of items with probabilities.

Return type:

int

__repr__()[source]

String representation.

Return type:

str

property budget_used: float

Sum of inclusion probabilities.

class fewlab.EstimationResult(estimates, weights, selected, diagnostics)[source]

Bases: object

Structured result for estimation methods.

Parameters:
estimates

Row-wise estimates.

Type:

pd.Series

weights

Calibrated weights used for estimation.

Type:

pd.Series

selected

Items used for estimation.

Type:

pd.Index

diagnostics

Estimation diagnostics.

Type:

dict[str, Any]

Examples:

>>> result = design.estimate(selected, labels)
>>> print(f"Mean estimate: {result.estimates.mean():.3f}")
estimates: Series
weights: Series
selected: Index
diagnostics: dict[str, Any]
__init__(estimates, weights, selected, diagnostics)
Return type:

None

__repr__()[source]

String representation.

Return type:

str

class fewlab.RowSEResult(probabilities, max_violation, tolerance, iterations, best_iteration, feasible)[source]

Bases: object

Result container for row_se_min_labels.

Attributes:

  • probabilities: Inclusion probabilities indexed by item identifiers.
  • max_violation: Maximum constraint violation encountered.
  • tolerance: Target violation tolerance.
  • iterations: Number of iterations executed.
  • best_iteration: Iteration index where the best solution was recorded.
  • feasible: Whether the best solution satisfies the tolerance.

Parameters:
probabilities: Series
max_violation: float
tolerance: float
iterations: int
best_iteration: int
feasible: bool
__init__(probabilities, max_violation, tolerance, iterations, best_iteration, feasible)
Return type:

None

__repr__()[source]

String representation.

Return type:

str

to_dict()[source]

Return diagnostic information as a dict.

Return type:

dict[str, Any]

to_series()[source]

Return a copy of the probabilities as a Series.

Return type:

Series

Core Module

class fewlab.core.Influence(w, g, cols)[source]

Bases: object

Influence data structure with memory-optimized slots.

Parameters:
w: ndarray
g: ndarray
cols: list[str]
__init__(w, g, cols)
Return type:

None

fewlab.core.pi_aopt_for_budget(counts, X, budget, *, pi_min=PI_MIN_DEFAULT, ensure_full_rank=True, ridge=None)[source]

Compute A-optimal first-order inclusion probabilities for a target budget.

The probabilities follow the square-root rule pi_j = clip(c * sqrt(w_j), [pi_min, 1]) with c chosen so that sum(pi) = budget.

Args:

  • counts: Count matrix with non-negative values.
  • X: Feature matrix aligned with counts.index.
  • budget: Expected total budget (sum of inclusion probabilities).
  • pi_min: Minimum allowed inclusion probability.
  • ensure_full_rank: Whether to add a small ridge term when X^T X is ill-conditioned.
  • ridge: Explicit ridge parameter overriding the automatic heuristic.

Returns:

Probability result with inclusion probabilities and computation diagnostics.

Note:

If budget < m * pi_min (where m is the number of items), the budget constraint cannot be satisfied. In this case, the function returns all probabilities as pi_min, resulting in sum(pi) = m * pi_min > budget, and issues a warning. The violation details are included in the result’s diagnostics under budget_violation.

See Also:

  • items_to_label: Deterministic selection using the same influence weights.
  • balanced_fixed_size: Fixed-size balanced sampling using these probabilities.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = pi_aopt_for_budget(counts, X, budget=50)
>>> round(result.budget_used, 1)
50.0
Return type:

ProbabilityResult

fewlab.core.items_to_label(counts, X, budget, *, ensure_full_rank=True, ridge=None)[source]

Select items to label using deterministic A-optimal design.

Influence weights are computed as w_j = g_j^T (X^T X)^{-1} g_j, and the top entries are returned.

Args:

  • counts: Count matrix with units as rows and items as columns.
  • X: Feature matrix aligned with counts.index.
  • budget: Number of items to select.
  • ensure_full_rank: Whether to add a ridge term when X^T X is ill-conditioned.
  • ridge: Optional ridge parameter overriding the automatic heuristic.

Returns:

Selection result with items, influence weights, and diagnostics.

See Also:

  • pi_aopt_for_budget: Compute inclusion probabilities for the same design.
  • greedy_aopt_selection: Greedy sequential variant.
  • core_plus_tail: Hybrid deterministic and probabilistic selection.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import items_to_label
>>>
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 200)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>> result = items_to_label(counts, X, budget=50)
>>> len(result.selected)
50
Return type:

SelectionResult

Selection Module

fewlab.selection.topk(arr, k, *, index=None)[source]

Return indices of the top-k entries of arr in descending order.

Args:

  • arr: Array of scores to rank.
  • k: Number of entries to keep.
  • index: Optional index to map positions back to labels.

Returns:

Index of the top-k entries ordered by decreasing value.

Return type:

Index

Balanced Sampling

fewlab.balanced.balanced_fixed_size(pi, g, budget, *, random_state=None, max_swaps=MAX_SWAPS_BALANCED, tol=TOLERANCE_DEFAULT)[source]

Fixed-size balanced sampling with variance reduction.

Implements a two-step heuristic:

  1. Initial selection proportional to inclusion probabilities pi

  2. Greedy local search to minimize calibration residual ||sum((I/pi)-1) g||_2

This balancing procedure aims to reduce the variance of Horvitz-Thompson estimators by making the sample more representative.

Args:

  • pi: Inclusion probabilities for items. Index contains item identifiers.
  • g: Regression projections g_j = X^T v_j for each item j (shape (p, m)).
  • budget: Fixed sample size (number of items to select).
  • random_state: Random state for reproducible sampling. Can be None, int, or Generator.
  • max_swaps: Maximum number of swap iterations for balancing.
  • tol: Tolerance for stopping criterion (residual norm).

Returns:

Index of selected items. Length equals budget.

Raises:

ValidationError: If pi, g, or budget fail validation checks.

See Also:

  • pi_aopt_for_budget: Compute optimal inclusion probabilities.
  • core_plus_tail: Hybrid deterministic + balanced sampling.
  • calibrate_weights: Post-stratification weight adjustment.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import pi_aopt_for_budget, balanced_fixed_size
>>>
>>> # Setup data
>>> counts = pd.DataFrame(np.random.poisson(5, (1000, 100)))
>>> X = pd.DataFrame(np.random.randn(1000, 3))
>>>
>>> # Probabilities and influence projections come from one call
>>> pi_result = pi_aopt_for_budget(counts, X, budget=30)
>>>
>>> # Balanced sampling on the resulting probabilities
>>> selected = balanced_fixed_size(pi_result.probabilities,
...                                pi_result.influence_projections,
...                                budget=30, random_state=42)
>>> print(f"Selected {len(selected)} items with balanced design")
Selected 30 items with balanced design
Notes:

The balancing algorithm aims to make sum_S (I_j/pi_j - 1) * g_j ≈ 0, where S is the selected sample and I_j are selection indicators. This reduces variance in calibrated estimators.

Return type:

Index

Row Standard Error Minimization

fewlab.rowse.row_se_min_labels(counts, eps2, *, pi_min=PI_MIN_DEFAULT, max_iter=MAX_ITER_ROWSE, tol=TOLERANCE_DEFAULT, random_state=None, return_result=False, raise_on_failure=False)[source]

Compute inclusion probabilities that minimize total expected labels under row-wise SE limits.

The routine solves:

    minimize    sum_j pi_j
    subject to  sum_j q_ij / pi_j <= eps2_i + sum_j q_ij,    where q_ij = (c_ij / T_i)^2

Args:

  • counts: Non-negative count matrix with units as rows and items as columns.
  • eps2: Row-wise squared standard-error tolerance; a scalar applies to every row.
  • pi_min: Minimum allowable inclusion probability.
  • max_iter: Maximum optimization iterations.
  • tol: Convergence tolerance for constraint violations.
  • random_state: Random state for the stochastic subgradient steps. Can be None, int, or Generator.
  • return_result: If True, return a RowSEResult with diagnostics.
  • raise_on_failure: If True, raise a ValidationError when constraints remain violated.

Returns:

Probability series if return_result is False (default) or a RowSEResult with diagnostics when return_result is True.

Raises:

ValidationError: If inputs are invalid, or if raise_on_failure is True and the constraints remain violated after optimization.

See Also:

  • pi_aopt_for_budget: A-optimal probabilities for a fixed budget.
  • items_to_label: Deterministic selection without SE constraints.

Examples:
>>> import pandas as pd
>>> import numpy as np
>>> from fewlab import row_se_min_labels
>>>
>>> counts = pd.DataFrame(np.random.poisson(10, (500, 50)))
>>> pi = row_se_min_labels(counts, eps2=0.05**2)
>>> total_expected_labels = float(pi.sum())  # total expected number of labels
Return type:

RowSEResult | Series