# Mathematical Background

## The GLS Problem
Consider a system of K regression equations, each with N observations:

y_j = X_j β_j + ε_j,  j = 1, …, K

where:

- y_j is the N×1 response vector for equation j
- X_j is the N×p_j design matrix for equation j
- β_j is the p_j×1 coefficient vector
- ε_j is the N×1 error vector
Stacking all equations column-wise:

Y = [y_1, …, y_K]  (N×K matrix)
ε = [ε_1, …, ε_K]  (N×K matrix)
## Error Structure
The key assumption is that errors are correlated across equations but independent across observations:
E[ε_i ε_i^T] = Σ (K×K covariance matrix)
where ε_i is the K×1 vector of errors for observation i (the i-th row of ε, written as a column).
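As a concrete illustration, the following sketch simulates this setup with NumPy. The dimensions and the dense Σ are illustrative choices, not values from this library:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, p = 200, 5, 3   # observations, equations, predictors per equation

# One design matrix and coefficient vector per equation.
X = [rng.normal(size=(N, p)) for _ in range(K)]
beta = [rng.normal(size=p) for _ in range(K)]

# Errors: rows (observations) independent, columns (equations) correlated.
Sigma = 0.5 * np.eye(K) + 0.5                               # equicorrelated, positive definite
eps = rng.multivariate_normal(np.zeros(K), Sigma, size=N)   # N×K

# Stack the K responses into the N×K matrix Y.
Y = np.column_stack([X[j] @ beta[j] for j in range(K)]) + eps
```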
## Low-Rank + Diagonal Structure
Instead of estimating all K(K+1)/2 free parameters of the full K×K covariance matrix, we assume:
Σ = FF^T + diag(D)
where:

- F is the K×k matrix of factor loadings (Kk parameters)
- D is the K×1 vector of diagonal noise variances (K parameters)
- k ≪ K is the latent factor rank

Total parameters: K(k+1), far fewer than K(K+1)/2 (see the sketch below).
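A minimal sketch of the structure and the parameter count, with illustrative K and k:

```python
import numpy as np

rng = np.random.default_rng(0)
K, k = 50, 3
F = rng.normal(size=(K, k)) / np.sqrt(k)   # K×k factor loadings
D = rng.uniform(0.5, 1.5, size=K)          # K per-equation noise variances

Sigma = F @ F.T + np.diag(D)               # full K×K matrix, built here only for reference

# K(k+1) parameters versus K(K+1)/2 for an unrestricted covariance.
print(K * (k + 1), K * (K + 1) // 2)       # 200 vs 1275
```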
## The Woodbury Identity
The key to efficient computation is the Woodbury matrix identity:
(A + UCV)^(-1) = A^(-1) - A^(-1)U(C^(-1) + VA^(-1)U)^(-1)VA^(-1)
Applied to our case with A = diag(D), U = F, C = I, V = F^T:
Σ^(-1) = D^(-1) - D^(-1) F (I_k + F^T D^(-1) F)^(-1) F^T D^(-1)

where D^(-1) denotes diag(D)^(-1). Only the k×k matrix I_k + F^T D^(-1) F has to be factorized, reducing the inversion from K×K to k×k.
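The following sketch implements this inverse and checks it against a dense inverse; woodbury_inv is an illustrative name, not an API from this library:

```python
import numpy as np

def woodbury_inv(F, D):
    """Σ^{-1} for Σ = F F^T + diag(D), factorizing only a k×k system."""
    Dinv = 1.0 / D                           # diag(D)^{-1}, stored as a vector
    DinvF = Dinv[:, None] * F                # diag(D)^{-1} F, K×k
    M = np.eye(F.shape[1]) + F.T @ DinvF     # capacitance matrix I_k + F^T diag(D)^{-1} F
    return np.diag(Dinv) - DinvF @ np.linalg.solve(M, DinvF.T)

rng = np.random.default_rng(0)
K, k = 40, 3
F, D = rng.normal(size=(K, k)), rng.uniform(0.5, 1.5, size=K)
assert np.allclose(woodbury_inv(F, D), np.linalg.inv(F @ F.T + np.diag(D)))
```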
## ALS Algorithm
The algorithm alternates between two steps:
### Step 1: Update β given (F, D)
Holding (F, D) fixed, all coefficient vectors are updated jointly by minimizing the GLS criterion

β̂ = argmin_β Σ_i r_i^T Σ^(-1) r_i,  where r_i is the i-th row of Y - Ŷ,

in which Σ^(-1) couples the equations through its off-diagonal entries. We solve the resulting normal equations with conjugate gradient in matrix-free form, so the stacked system is never built explicitly (see the sketch below).
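A matrix-free sketch of this step, assuming SciPy is available. Here W stands for Σ^(-1) (e.g. from the Woodbury step), update_beta is an illustrative name, and the ridge term λ_B is omitted for brevity:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def update_beta(X, Y, W):
    """Minimize sum_i r_i^T W r_i over the stacked β, with r_i the i-th row of Y - Ŷ."""
    K = Y.shape[1]
    sizes = [Xj.shape[1] for Xj in X]
    splits = np.cumsum(sizes)[:-1]

    def predict(b):                          # stacked β -> N×K fitted values
        parts = np.split(b, splits)
        return np.column_stack([X[j] @ parts[j] for j in range(K)])

    def matvec(b):                           # normal-equations operator, applied but never formed
        Z = predict(b) @ W
        return np.concatenate([X[j].T @ Z[:, j] for j in range(K)])

    YW = Y @ W
    rhs = np.concatenate([X[j].T @ YW[:, j] for j in range(K)])
    n = sum(sizes)
    sol, info = cg(LinearOperator((n, n), matvec=matvec), rhs)
    return np.split(sol, splits)             # one β_j per equation
```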
### Step 2: Update (F, D) given β
Given the residual matrix R = Y - Ŷ, refit the factor model (sketched below):

1. Compute the sample covariance: S = R^T R / N
2. Set F from the k leading eigenpairs of S, scaling each eigenvector by the square root of its eigenvalue
3. Set D = diag(S - FF^T)
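A sketch of this update in the principal-factor style described above; update_factors and the d_floor default are illustrative:

```python
import numpy as np

def update_factors(R, k, d_floor=1e-6):
    """Fit Σ ≈ F F^T + diag(D) to the residual matrix R (N×K)."""
    S = R.T @ R / R.shape[0]                  # sample covariance of residuals
    evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    F = evecs[:, -k:] * np.sqrt(np.maximum(evals[-k:], 0.0))  # scaled top-k eigenvectors
    D = np.maximum(np.diag(S - F @ F.T), d_floor)             # floored diagonal residual
    return F, D
```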
## Objective Function
Both ALS and EM minimize the negative log-likelihood:
NLL = (N/2)[K log(2π) + log|Σ| + tr(Σ^(-1)S)]
where S is the sample covariance of residuals.
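The NLL can be evaluated without forming any K×K inverse: the matrix determinant lemma gives log|Σ| from the same k×k capacitance matrix used in the Woodbury step. A sketch, with illustrative names:

```python
import numpy as np

def nll(F, D, S, N):
    """NLL of Σ = F F^T + diag(D) against the residual covariance S."""
    Dinv = 1.0 / D
    DinvF = Dinv[:, None] * F
    M = np.eye(F.shape[1]) + F.T @ DinvF          # shared with the Woodbury step
    # log|Σ| = Σ_j log D_j + log|M|  (matrix determinant lemma)
    logdet = np.sum(np.log(D)) + np.linalg.slogdet(M)[1]
    # tr(Σ^{-1} S) via Woodbury, touching only K×k and k×k arrays
    trace = Dinv @ np.diag(S) - np.trace(np.linalg.solve(M, DinvF.T @ S @ DinvF))
    K = len(D)
    return 0.5 * N * (K * np.log(2.0 * np.pi) + logdet + trace)
```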
## Regularization
To ensure numerical stability, we add ridge penalties (sketched below):

- λ_F||F||² on the factor loadings
- λ_B||β||² on the regression coefficients
- d_floor, a lower bound on the diagonal noise variances
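A sketch of how these terms enter the objective, reusing the nll function from the sketch in the previous section; the default values are illustrative:

```python
import numpy as np

def penalized_nll(F, D, S, N, beta_stacked, lam_F=1e-3, lam_B=1e-3):
    """NLL plus the ridge terms λ_F‖F‖² and λ_B‖β‖²."""
    return (nll(F, D, S, N)
            + lam_F * np.sum(F ** 2)              # λ_F ||F||²
            + lam_B * np.sum(beta_stacked ** 2))  # λ_B ||β||²

# d_floor is enforced wherever D is updated:
# D = np.maximum(D, d_floor)
```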
## Convergence Criteria
The algorithm stops when any of the following holds (see the sketch after this list):

- the relative NLL improvement falls below rel_tol (default 1e-6)
- the number of sweeps exceeds max_sweeps
- the line search fails (accept_t = 0)
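As a sketch, the three conditions combine into a single check; the function name and the max_sweeps default are illustrative:

```python
def converged(nll_prev, nll_new, sweep, accept_t,
              rel_tol=1e-6, max_sweeps=100):
    """True if any stopping condition from the list above holds."""
    rel_impr = (nll_prev - nll_new) / max(abs(nll_prev), 1.0)
    return rel_impr < rel_tol or sweep >= max_sweeps or accept_t == 0
```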
## Computational Complexity

| Operation     | EM          | ALS         |
|---------------|-------------|-------------|
| Memory        | O(K²)       | O(Kk)       |
| Per iteration | O(K³ + NKp) | O(k³ + NKp) |
| Iterations    | 20-50       | 5-10        |
## Statistical Properties

Under standard regularity conditions:

- Consistency: β̂ → β₀ as N → ∞
- Efficiency: achieves full GLS efficiency when (F, D) are known
- Robustness: ridge regularization prevents singularities
## Extensions

The low-rank framework extends to:

- Time-varying factors: F_t
- Sparse factors: an L1 penalty on F
- Hierarchical models: multi-level factor structures
- Missing data: EM naturally handles missing Y values