Mathematical Background

This section explains the mathematical foundations behind fewlab’s item selection algorithms.

Problem Setup

Suppose we have:

  • \(n\) users (rows) and \(m\) items (columns)

  • Usage count matrix \(C \in \mathbb{R}^{n \times m}\) where \(C_{ij}\) is the number of times user \(i\) used item \(j\)

  • User feature matrix \(X \in \mathbb{R}^{n \times p}\) containing \(p\) features for each user

  • Budget to label \(K\) items out of \(m\) total items

Our goal is to choose which \(K\) items to label so as to minimize the variance of the regression coefficients in the model:

\[y_j = X\beta_j + \epsilon_j\]

where \(y_j\) is the (unknown) label vector for item \(j\).
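
To make the notation concrete, the sketches later in this section assume a small synthetic instance like the following; all names, sizes, and distributions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, p, K = 200, 50, 4, 10          # users, items, features, label budget
C = rng.poisson(1.0, size=(n, m))    # usage counts C[i, j]
X = rng.normal(size=(n, p))          # user feature matrix
# The label vectors y_j are unknown; the task is to pick the K columns
# of C whose labels are most informative for the regression above.
```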

Statistical Framework

A-Optimal Design

We use A-optimal experimental design theory. For a given item \(j\), if we observe its labels with probability \(\pi_j\), the variance of the regression coefficient estimator is proportional to:

\[\text{Var}(\hat{\beta}_j) \propto \text{tr}((X^T X)^{-1} X^T \Sigma_j X (X^T X)^{-1}) / \pi_j\]

where \(\Sigma_j\) is the covariance matrix of the response for item \(j\).

The A-optimal weights are:

\[w_j = \text{tr}((X^T X)^{-1} X^T \Sigma_j X (X^T X)^{-1})\]

For usage data, we approximate \(\Sigma_j\) using the empirical usage patterns.

Usage-Based Approximation

We approximate the response covariance using usage proportions. For user \(i\) and item \(j\):

\[V_{ij} = \frac{C_{ij}}{T_i}\]

where \(T_i = \sum_{j'} C_{ij'}\) is the total usage for user \(i\).

Plugging this usage-based surrogate into the A-optimal criterion, the weight reduces to:

\[w_j = g_j^T (X^T X)^{-1} g_j\]

where \(g_j = X^T v_j\) and \(v_j\) is the \(j\)-th column of \(V\).

Algorithm Details

Main Algorithm (fewlab.items_to_label())

  1. Compute usage proportions: \(V_{ij} = C_{ij} / T_i\) for users with \(T_i > 0\)

  2. Calculate regression projections: \(G = X^T V\) where \(G \in \mathbb{R}^{p \times m}\)

  3. Compute A-optimal weights:

    \[w_j = g_j^T (X^T X)^{-1} g_j\]

    where \(g_j\) is the \(j\)-th column of \(G\).

  4. Select the top \(K\) items: Return the \(K\) items with the largest \(w_j\) values (a runnable sketch of steps 1-4 follows below).
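
A minimal NumPy sketch of these four steps, assuming dense inputs and a well-conditioned \(X^T X\); the real items_to_label() adds the safeguards described under Numerical Considerations, so treat the function name and structure here as illustrative:

```python
import numpy as np

def items_to_label_sketch(C, X, K):
    """Illustrative A-optimal item selection following steps 1-4."""
    T = C.sum(axis=1)                      # total usage per user
    keep = T > 0                           # drop users with zero usage
    V = C[keep] / T[keep, None]            # step 1: usage proportions
    G = X[keep].T @ V                      # step 2: projections, p x m
    A = X[keep].T @ X[keep]                # X^T X
    W = np.linalg.solve(A, G)              # (X^T X)^{-1} G
    w = np.einsum("pj,pj->j", G, W)        # step 3: w_j = g_j^T A^{-1} g_j
    return np.argsort(w)[::-1][:K]         # step 4: top-K items by weight
```

On data shaped like the synthetic instance above, items_to_label_sketch(C, X, K) returns the indices of the \(K\) highest-weight items.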

Budget-Constrained Optimization (fewlab.pi_aopt_for_budget())

For a given labeling budget \(B\), solve:

\[\min_{\pi} \sum_{j=1}^m \frac{w_j}{\pi_j}\]

subject to:

\[\sum_{j=1}^m \pi_j \leq B, \quad \pi_j \geq \pi_{\min}\]

where \(\pi_{\min} > 0\) is a small floor that keeps the inverse-probability weights \(1/\pi_j\) bounded.

The optimal solution has the form:

\[\pi_j^* = \begin{cases} \sqrt{w_j}/\lambda & \text{if } \sqrt{w_j} > \lambda\,\pi_{\min} \\ \pi_{\min} & \text{otherwise} \end{cases}\]

where \(\lambda\) is chosen to satisfy the budget constraint.
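
A sketch of this water-filling solution, using bisection on \(\lambda\) to meet the budget; the floor \(\pi_{\min}\), the bracket, and the function name are illustrative choices rather than fewlab's actual defaults:

```python
import numpy as np

def pi_for_budget_sketch(w, B, pi_min=1e-4):
    """Solve min sum(w/pi) s.t. sum(pi) <= B, pi >= pi_min (illustrative)."""
    s = np.sqrt(w)

    def total(lam):
        return np.maximum(s / lam, pi_min).sum()

    # Bracket lambda so that total(lo) > B >= total(hi); assumes the
    # budget is feasible, i.e. B >= len(w) * pi_min.
    lo = 1e-12
    hi = s.sum() / max(B - len(w) * pi_min, 1e-12)
    for _ in range(100):                   # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if total(mid) > B:
            lo = mid                       # total too large -> raise lambda
        else:
            hi = mid
    return np.maximum(s / hi, pi_min)
```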

Balanced Fixed-Size Sampling (fewlab.balanced_fixed_size())

This heuristic algorithm:

  1. Initial sampling: Draw \(K\) items with probabilities proportional to \(\pi_j\)

  2. Greedy improvement: Iteratively swap selected items for unselected ones to minimize \(\|\sum_{j=1}^m (I_j/\pi_j - 1) g_j\|_2^2\)

where \(S\) is the selected set and \(I_j\) is the indicator that item \(j\) is selected.
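
A sketch of the two-phase heuristic; the initialization and first-improvement sweep order are illustrative and may differ from fewlab's actual procedure:

```python
import numpy as np

def balanced_fixed_size_sketch(G, pi, K, seed=0):
    """Illustrative greedy-swap balancing over a fixed-size sample."""
    rng = np.random.default_rng(seed)
    m = len(pi)
    # Phase 1: initial sample of K items, probabilities proportional to pi.
    S = set(rng.choice(m, size=K, replace=False, p=pi / pi.sum()))

    def imbalance(S):
        ind = np.zeros(m)
        ind[list(S)] = 1.0
        # ||sum_j (I_j / pi_j - 1) g_j||^2 for the candidate set S
        return np.sum((G @ (ind / pi - 1.0)) ** 2)

    # Phase 2: accept any swap that strictly reduces the imbalance,
    # restarting the scan after each accepted swap until none improves.
    best = imbalance(S)
    improved = True
    while improved:
        improved = False
        for j_out in list(S):
            for j_in in range(m):
                if j_in in S:
                    continue
                cand = (S - {j_out}) | {j_in}
                val = imbalance(cand)
                if val < best:
                    S, best, improved = cand, val, True
                    break
            if improved:
                break
    return sorted(S)
```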

Row-Wise Standard Error Constraints (fewlab.row_se_min_labels())

Minimizes total labeling subject to per-row standard error constraints:

\[\min_{\pi} \sum_{j=1}^m \pi_j\]

subject to:

\[\sum_{j=1}^m \frac{q_{ij}}{\pi_j} \leq \epsilon_i^2 + \sum_{j=1}^m q_{ij} \quad \forall i\]

where \(q_{ij} = (C_{ij}/T_i)^2\). The constraint is equivalent to \(\sum_{j=1}^m q_{ij}(1/\pi_j - 1) \leq \epsilon_i^2\), i.e. it bounds the sampling variance of the inverse-probability-weighted estimate for row \(i\) by \(\epsilon_i^2\).

This is solved using a dual optimization approach with gradient descent.
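
One hedged way to implement such a dual approach: the Lagrangian is stationary in \(\pi\) at \(\pi_j = \sqrt{\sum_i \mu_i q_{ij}}\), so we can alternate that closed-form update with a projected gradient step on the dual variables \(\mu_i \geq 0\). The step size, iteration count, and function name below are illustrative:

```python
import numpy as np

def row_se_min_labels_sketch(Q, eps, step=0.1, iters=2000):
    """Illustrative dual ascent: min sum(pi) s.t. per-row SE constraints.

    Q[i, j] = q_ij = (C[i, j] / T_i)**2; eps[i] is the SE target for row i.
    """
    b = eps**2 + Q.sum(axis=1)              # right-hand side per row
    mu = np.ones(Q.shape[0])                # one dual variable per constraint
    for _ in range(iters):
        # Stationarity of the Lagrangian in pi, clipped into (0, 1].
        pi = np.clip(np.sqrt(mu @ Q), 1e-12, 1.0)
        violation = Q @ (1.0 / pi) - b      # positive where row i's SE is too large
        mu = np.maximum(mu + step * violation, 0.0)  # projected dual step
    return pi
```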

Numerical Considerations

Ridge Regularization

When \(X^T X\) is ill-conditioned, we add ridge regularization:

\[(X^T X + \lambda I)^{-1}\]

The default strategy chooses \(\lambda = 10^{-8}\) when \(\text{cond}(X^T X) > 10^{12}\).
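
A sketch of that default strategy (the function name is illustrative):

```python
import numpy as np

def regularized_gram(X, lam=1e-8, cond_threshold=1e12):
    """Add a small ridge term to X^T X only when it is ill-conditioned."""
    A = X.T @ X
    if np.linalg.cond(A) > cond_threshold:
        A = A + lam * np.eye(A.shape[0])
    return A
```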

Handling Sparse Data

  • Users with zero total usage (\(T_i = 0\)) are automatically removed

  • Numerical stability is maintained through careful matrix conditioning checks

  • Small regularization prevents divide-by-zero issues
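
A sketch of the zero-usage filtering and row normalization using SciPy sparse matrices; fewlab's own input handling may differ:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative: row-normalize a sparse usage matrix, dropping users
# whose total usage is zero before forming proportions.
C = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
T = np.asarray(C.sum(axis=1)).ravel()      # total usage per user
active = T > 0                             # users with nonzero usage
V = sp.diags(1.0 / T[active]) @ C[active]  # V[i, j] = C[i, j] / T[i]
```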

Computational Complexity

  • Time complexity: \(O(np^2 + mp^2 + \mathrm{nnz}(C)\,p)\) for the main algorithm, where \(\mathrm{nnz}(C)\) is the number of nonzero usage counts (the cost of forming \(G = X^T V\))

  • Space complexity: \(O(nm + mp + p^2)\)

  • Bottleneck: Matrix inversion \((X^T X)^{-1}\) when \(p\) is large

The algorithm scales well for typical usage scenarios where \(p \ll n\) and \(p \ll m\).

Theoretical Properties

Optimality

Under the assumption that item labels follow the usage-proportion model, the A-optimal weights \(w_j\) minimize the expected trace of the covariance matrix of regression coefficient estimates.

Robustness

The algorithm is robust to:

  • Missing data: Handled through zero usage entries

  • Outliers: Proportional scaling reduces sensitivity to extreme usage patterns

  • Collinearity: Ridge regularization prevents numerical instability

Limitations

  • Assumes usage patterns are informative about labeling importance

  • Optimal for trace-of-covariance criterion (A-optimality), not other design criteria

  • Performance depends on the assumption that high-influence items should be prioritized

References

The mathematical framework builds on classical optimal experimental design theory:

  • Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press.

  • Pukelsheim, F. (2006). Optimal Design of Experiments. SIAM.

  • Atkinson, A. C., Donev, A. N., and Tobias, R. D. (2007). Optimum Experimental Designs, with SAS. Oxford University Press.