# Mathematical Background

## The GLS Problem

Consider a system of K regression equations with N observations:

y_j = X_j β_j + ε_j,   j = 1, ..., K

where:

- y_j is the N×1 response vector for equation j
- X_j is the N×p_j design matrix for equation j
- β_j is the p_j×1 coefficient vector
- ε_j is the N×1 error vector

Stacking all equations column-wise:

Y = [y_1, ..., y_K]   (N×K matrix)
ε = [ε_1, ..., ε_K]   (N×K matrix)

## Error Structure

The key assumption is that errors are correlated across equations but independent across observations:

E[ε_i ε_i^T] = Σ   (K×K covariance matrix)

where ε_i is the K×1 vector formed by the i-th row of ε, i.e. the errors for observation i across all K equations.

## Low-Rank + Diagonal Structure

Instead of estimating the full K×K covariance matrix, which has K(K+1)/2 free parameters, we assume:

Σ = FF^T + diag(D)

where:

- F is the K×k factor-loadings matrix (Kk parameters)
- D is the K×1 vector of diagonal noise variances (K parameters)
- k ≪ K is the latent factor rank

Total parameters: K(k+1) ≪ K(K+1)/2. For example, with K = 1000 and k = 10 this is 11,000 parameters instead of 500,500.

## The Woodbury Identity

The key to efficient computation is the Woodbury matrix identity:

(A + UCV)^(-1) = A^(-1) - A^(-1)U(C^(-1) + VA^(-1)U)^(-1)VA^(-1)

Applied to our case with A = diag(D), U = F, C = I, V = F^T:

Σ^(-1) = D^(-1) - D^(-1)F(I + F^T D^(-1)F)^(-1)F^T D^(-1)

where D^(-1) is shorthand for diag(D)^(-1). This reduces the inversion from a K×K problem to a k×k one.
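As a concrete illustration, here is a minimal NumPy sketch of the identity in action: it applies Σ^(-1) to a block of vectors using only diagonal scaling and one k×k solve, never materializing the K×K matrix. The function name `sigma_inv_apply` and its interface are illustrative choices, not part of any library API.

```python
import numpy as np

def sigma_inv_apply(F, d, B):
    """Compute Sigma^{-1} @ B for Sigma = F F^T + diag(d) via Woodbury.

    F : (K, k) factor loadings, d : (K,) diagonal variances,
    B : (K, m) block of right-hand sides.
    """
    DinvB = B / d[:, None]                    # diag(d)^{-1} B
    DinvF = F / d[:, None]                    # diag(d)^{-1} F
    C = np.eye(F.shape[1]) + F.T @ DinvF      # k x k capacitance matrix
    # Woodbury: D^{-1}B - D^{-1}F (I + F^T D^{-1}F)^{-1} F^T D^{-1}B
    return DinvB - DinvF @ np.linalg.solve(C, F.T @ DinvB)

# Sanity check against the dense formula (feasible only for small K)
rng = np.random.default_rng(0)
K, k = 200, 5
F = rng.standard_normal((K, k))
d = rng.uniform(0.5, 2.0, size=K)
b = rng.standard_normal((K, 1))
x = sigma_inv_apply(F, d, b)
assert np.allclose((F @ F.T + np.diag(d)) @ x, b)
```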
## ALS Algorithm

The algorithm alternates between two steps.

### Step 1: Update β given (F, D)

For each equation j, solve the weighted least-squares problem

β_j = argmin Σ_i w_ij (y_ij - x_ij^T β_j)²

where x_ij^T is the i-th row of X_j and the weights w_ij come from Σ^(-1). (The off-diagonal entries of Σ^(-1) couple the equations; in a coordinate sweep they enter equation j's problem as a fixed offset to the response, built from the other equations' current residuals.) We use conjugate gradient so the normal equations are never formed explicitly, as sketched below.
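A sketch of one equation's update under those conventions, assuming a weight vector `w_j` derived from Σ^(-1) and a working response `y_tilde_j` with the cross-equation offset already folded in (both names are hypothetical). The ridge-penalized operator X_j^T diag(w_j) X_j + λ_B I is applied matrix-free inside SciPy's conjugate-gradient solver:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def update_beta_j(X_j, y_tilde_j, w_j, lam_B=1e-6):
    """Weighted ridge least squares for equation j, solved by CG.

    The matrix X_j^T diag(w_j) X_j + lam_B I is never formed; CG only
    needs its action on a vector.
    """
    n, p = X_j.shape

    def matvec(v):
        return X_j.T @ (w_j * (X_j @ v)) + lam_B * v

    A = LinearOperator((p, p), matvec=matvec, dtype=X_j.dtype)
    rhs = X_j.T @ (w_j * y_tilde_j)
    beta_j, info = cg(A, rhs, atol=1e-10)
    if info != 0:
        raise RuntimeError("CG did not converge")
    return beta_j
```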
### Step 2: Update (F, D) given β

Given the residual matrix R = Y - Ŷ, refit the factor model:

1. Compute the sample covariance: S = R^T R / N
2. Set the columns of F to the k leading eigenvectors of S, each scaled by the square root of its eigenvalue (so that FF^T matches the best rank-k approximation of S)
3. Set D = diag(S - FF^T), floored at d_floor
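This step translates almost line for line into NumPy; a sketch follows. It forms the K×K matrix S exactly as the steps above describe, so for very large K one would instead take a truncated SVD of R/√N to stay within the O(Kk) memory budget claimed later. The `d_floor` argument anticipates the floor introduced under Regularization.

```python
import numpy as np

def update_factors(R, k, d_floor=1e-8):
    """Refit Sigma = F F^T + diag(d) from residuals R of shape (N, K)."""
    N = R.shape[0]
    S = R.T @ R / N                          # sample covariance, K x K
    evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    lead_vals = np.maximum(evals[-k:], 0.0)  # k leading eigenvalues
    lead_vecs = evecs[:, -k:]                # k leading eigenvectors
    F = lead_vecs * np.sqrt(lead_vals)       # scale columns by sqrt(eigenvalue)
    d = np.maximum(np.diag(S - F @ F.T), d_floor)  # floored diagonal
    return F, d
```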
## Objective Function

Both ALS and EM minimize the negative log-likelihood

NLL = (N/2)[K log(2π) + log|Σ| + tr(Σ^(-1)S)]

where S is the sample covariance of the residuals. With the low-rank structure, log|Σ| is also cheap: by the matrix determinant lemma, log|Σ| = log|I + F^T D^(-1)F| + Σ_j log D_j.
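Combining the determinant lemma with the Woodbury solve gives an NLL evaluation whose cost is linear in K. The sketch below is self-contained and follows the formula above; the function name is illustrative.

```python
import numpy as np

def neg_log_likelihood(R, F, d):
    """NLL of residuals R (N x K) under Sigma = F F^T + diag(d),
    evaluated without forming or inverting any K x K matrix."""
    N, K = R.shape
    DinvF = F / d[:, None]
    C = np.eye(F.shape[1]) + F.T @ DinvF
    # Matrix determinant lemma: log|Sigma| = log|C| + sum(log d)
    logdet = np.linalg.slogdet(C)[1] + np.log(d).sum()
    # tr(Sigma^{-1} S) with S = R^T R / N, using the Woodbury identity
    Rt = R.T
    DinvR = Rt / d[:, None]
    SinvR = DinvR - DinvF @ np.linalg.solve(C, F.T @ DinvR)
    quad = np.sum(Rt * SinvR) / N
    return 0.5 * N * (K * np.log(2 * np.pi) + logdet + quad)
```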
## Regularization

To ensure numerical stability, we add ridge penalties:

- λ_F||F||² on the factor loadings
- λ_B||β||² on the regression coefficients
- d_floor as a minimum diagonal variance

## Convergence Criteria

The algorithm stops when any of the following holds:

1. The relative NLL improvement falls below rel_tol (default 1e-6)
2. The number of sweeps exceeds max_sweeps
3. The line search fails (accept_t = 0)
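Putting the pieces together, a driver loop enforcing criteria 1 and 2 might look like the sketch below. `init_factors`, `update_coefficients`, and `residuals` are hypothetical placeholders for the steps above, and the line-search exit (criterion 3) is assumed to live inside the step updates.

```python
import numpy as np

def fit_als(Y, X_list, k, max_sweeps=100, rel_tol=1e-6):
    """ALS driver skeleton; helper routines are hypothetical placeholders."""
    F, d = init_factors(Y, k)                          # e.g. PCA of raw Y
    prev_nll = np.inf
    for sweep in range(max_sweeps):                    # criterion 2
        betas = update_coefficients(Y, X_list, F, d)   # Step 1
        R = residuals(Y, X_list, betas)
        F, d = update_factors(R, k)                    # Step 2
        cur_nll = neg_log_likelihood(R, F, d)
        if prev_nll - cur_nll < rel_tol * abs(prev_nll):  # criterion 1
            break
        prev_nll = cur_nll
    return betas, F, d
```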
## Computational Complexity

| Cost               | EM          | ALS         |
|--------------------|-------------|-------------|
| Memory             | O(K²)       | O(Kk)       |
| Per iteration      | O(K³ + NKp) | O(k³ + NKp) |
| Typical iterations | 20-50       | 5-10        |

## Statistical Properties

Under standard regularity conditions:

1. **Consistency**: β̂ → β₀ as N → ∞
2. **Efficiency**: achieves full GLS efficiency when (F, D) are known
3. **Robustness**: ridge regularization prevents singularities

## Extensions

The low-rank framework extends to:

- Time-varying factors: F_t
- Sparse factors: an L1 penalty on F
- Hierarchical models: multi-level factor structures
- Missing data: EM naturally handles missing entries of Y