Mathematical Background

The GLS Problem

Consider a system of K regression equations with N observations:

y_j = X_j β_j + ε_j, j = 1, …, K

where:

  • y_j is the N×1 response vector for equation j

  • X_j is the N×p_j design matrix for equation j

  • β_j is the p_j×1 coefficient vector

  • ε_j is the N×1 error vector

Stacking all equations:

Y = [y_1, …, y_K] (N×K matrix)
ε = [ε_1, …, ε_K] (N×K matrix)

Error Structure

The key assumption is that errors are correlated across equations but independent across observations:

E[ε_i ε_i^T] = Σ (K×K covariance matrix)

where ε_i is the K×1 vector of errors for observation i across all K equations (the i-th row of ε, written as a column).

Low-Rank + Diagonal Structure

Instead of estimating the full K×K covariance matrix (K(K+1)/2 free parameters), we assume:

Σ = FF^T + diag(D)

where:

  • F is K×k factor loadings matrix (Kk parameters)

  • D is K×1 diagonal noise vector (K parameters)

  • k << K is the latent factor rank

Total parameters: K(k+1), far fewer than the K(K+1)/2 of an unrestricted covariance.
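The structured covariance above can be assembled in a few lines. This is a minimal sketch: the dimensions K = 50, k = 3 and the random values of F and D are illustrative, not taken from the text.

```python
import numpy as np

# Sketch: assemble Sigma = F F^T + diag(D) and compare parameter counts.
rng = np.random.default_rng(0)
K, k = 50, 3
F = rng.standard_normal((K, k))   # factor loadings, K x k
D = rng.uniform(0.5, 2.0, K)      # diagonal noise variances, length K

Sigma = F @ F.T + np.diag(D)      # K x K, symmetric positive definite

# Structured model: K*(k+1) parameters vs K*(K+1)/2 for a full covariance.
n_structured = K * (k + 1)
n_full = K * (K + 1) // 2
```

With K = 50 and k = 3 this is 200 parameters instead of 1275.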

The Woodbury Identity

The key to efficient computation is the Woodbury matrix identity:

(A + UCV)^(-1) = A^(-1) - A^(-1)U(C^(-1) + VA^(-1)U)^(-1)VA^(-1)

Applied to our case with A = diag(D), U = F, C = I, V = F^T:

Σ^(-1) = D^(-1) - D^(-1)F(I + F^T D^(-1)F)^(-1)F^T D^(-1)

This reduces the inversion from K×K to k×k.
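The Woodbury formula above translates directly into code. The sketch below (the helper name `woodbury_inverse` is ours) solves only a k×k linear system and can be checked against a direct K×K inversion:

```python
import numpy as np

def woodbury_inverse(F, D):
    """Invert Sigma = F F^T + diag(D) via the Woodbury identity.

    F : (K, k) factor loadings; D : (K,) diagonal noise variances.
    Only a k x k system is solved, never a K x K one.
    """
    Dinv = 1.0 / D                             # diag(D)^{-1} as a vector
    FtDinv = F.T * Dinv                        # k x K: F^T D^{-1}
    core = np.eye(F.shape[1]) + FtDinv @ F     # k x k: I + F^T D^{-1} F
    # D^{-1} F (I + F^T D^{-1} F)^{-1} F^T D^{-1}
    correction = FtDinv.T @ np.linalg.solve(core, FtDinv)
    return np.diag(Dinv) - correction
```

For large K one would return the pieces (Dinv, F, core factorization) rather than the dense K×K matrix; the dense return here is only for verification.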

ALS Algorithm

The algorithm alternates between two steps:

Step 1: Update β given (F, D)

For each equation j, solve the weighted least squares problem:

β_j = argmin Σ_i w_ij (y_ij - x_ij^T β_j)²

where the weights come from Σ^(-1). We use conjugate gradient, applying the normal-equations operator only through matrix-vector products, so X^T W X is never formed explicitly.
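A CG-based weighted solve for a single equation can be sketched as follows. This is an illustration, not the package's implementation: the name `wls_cg` and the per-observation weight vector `w` are assumptions, and the cross-equation coupling through Σ^(-1) is ignored for simplicity.

```python
import numpy as np

def wls_cg(X, y, w, n_iter=200, tol=1e-10):
    """Weighted least squares via conjugate gradient.

    Solves X^T W X beta = X^T W y using only matrix-vector products,
    so the normal-equation matrix X^T W X is never formed explicitly.
    """
    def matvec(v):
        return X.T @ (w * (X @ v))     # apply X^T W X without materializing it

    b = X.T @ (w * y)
    beta = np.zeros(X.shape[1])
    r = b - matvec(beta)               # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        beta += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return beta
```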

Step 2: Update (F, D) given β

Given residuals R = Y - Ŷ, estimate the factor model:

  1. Compute sample covariance: S = R^T R / N

  2. Extract the k leading eigenpairs (λ_1, v_1), …, (λ_k, v_k) of S and set F = [√λ_1 v_1, …, √λ_k v_k]

  3. Set D = diag(S - FF^T)
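The three steps above can be sketched as follows (the helper name `update_factor_model` is ours; the `d_floor` clamp mirrors the minimum diagonal variance described under Regularization):

```python
import numpy as np

def update_factor_model(R, k, d_floor=1e-6):
    """Step 2 sketch: fit (F, D) from the N x K residual matrix R."""
    N = R.shape[0]
    S = R.T @ R / N                          # sample covariance, K x K
    evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    top = slice(-k, None)                    # k leading eigenpairs
    # loadings: eigenvectors scaled by sqrt eigenvalues, so F F^T tracks S
    F = evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))
    # diagonal noise, clamped at d_floor for stability
    D = np.maximum(np.diag(S - F @ F.T), d_floor)
    return F, D
```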

Objective Function

Both ALS and EM minimize the negative log-likelihood:

NLL = (N/2)[K log(2π) + log|Σ| + tr(Σ^(-1)S)]

where S is the sample covariance of residuals.
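The NLL can be evaluated without ever inverting or taking the determinant of a K×K matrix: log|Σ| follows from the matrix determinant lemma (log|diag(D) + FF^T| = log|I + F^T D^(-1) F| + Σ_i log D_i) and the trace term uses the Woodbury inverse. A sketch (the helper name `nll` is ours):

```python
import numpy as np

def nll(F, D, S, N):
    """NLL = (N/2)[K log(2 pi) + log|Sigma| + tr(Sigma^{-1} S)]
    for Sigma = F F^T + diag(D), using only k x k dense operations."""
    K, k = F.shape
    Dinv = 1.0 / D
    FtDinv = F.T * Dinv                          # k x K: F^T D^{-1}
    core = np.eye(k) + FtDinv @ F                # k x k: I + F^T D^{-1} F
    # matrix determinant lemma
    logdet = np.linalg.slogdet(core)[1] + np.sum(np.log(D))
    # Woodbury inverse (dense here only for clarity)
    Sigma_inv = np.diag(Dinv) - FtDinv.T @ np.linalg.solve(core, FtDinv)
    return 0.5 * N * (K * np.log(2 * np.pi) + logdet + np.trace(Sigma_inv @ S))
```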

Regularization

To ensure stability, we add ridge penalties:

  • λ_F||F||² for factor loadings

  • λ_B||β||² for regression coefficients

  • d_floor as minimum diagonal variance

Convergence Criteria

The algorithm stops when:

  1. Relative NLL improvement < rel_tol (default 1e-6)

  2. Number of sweeps exceeds max_sweeps

  3. Line search fails (accept_t = 0)
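The stopping logic for criteria 1 and 2 can be sketched as below (the line-search criterion depends on internals not shown here). The names `als_loop`, `step`, and `nll_fn` are illustrative placeholders: `step` performs one full sweep (β update followed by the (F, D) update) and `nll_fn` evaluates the objective.

```python
def als_loop(step, nll_fn, state, rel_tol=1e-6, max_sweeps=100):
    """Outer ALS loop sketch: stop on small relative NLL improvement
    (criterion 1) or after max_sweeps sweeps (criterion 2)."""
    prev = nll_fn(state)
    for sweep in range(max_sweeps):
        state = step(state)
        cur = nll_fn(state)
        if prev - cur < rel_tol * abs(prev):   # criterion 1
            break
        prev = cur
    return state, sweep + 1                    # state and sweeps performed
```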

Computational Complexity

Operation        EM             ALS

Memory           O(K²)          O(Kk)
Per iteration    O(K³ + NKp)    O(k³ + NKp)
Iterations       20-50          5-10

Statistical Properties

Under standard regularity conditions:

  1. Consistency: β̂ → β₀ as N → ∞

  2. Efficiency: Achieves GLS efficiency when (F, D) are known

  3. Robustness: Ridge regularization prevents singularities

Extensions

The low-rank framework extends to:

  • Time-varying factors: F_t

  • Sparse factors: L1 penalty on F

  • Hierarchical models: Multi-level factor structures

  • Missing data: EM naturally handles missing Y values