# ALS vs EM Comparison
This section compares the Alternating Least Squares (ALS) and Expectation-Maximization (EM) approaches for low-rank+diagonal GLS estimation.
## Mathematical Background
Both algorithms solve the same statistical problem: estimating regression coefficients β when the error covariance has a low-rank+diagonal structure:
Σ = FF^T + diag(D)
where F is a K×k matrix of factor loadings and D is a length-K vector of diagonal noise variances.
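As a quick illustration of why this structure is cheap to store, the snippet below builds such a Σ from a skinny loading matrix and a positive noise vector; the array names and random draws are illustrative, not part of the alsgls API:

```python
import numpy as np

K, k = 100, 5
rng = np.random.default_rng(0)
F = rng.standard_normal((K, k))        # K×k factor loadings
D = rng.uniform(0.5, 1.5, size=K)      # length-K positive noise variances
Sigma = F @ F.T + np.diag(D)           # implied K×K error covariance

# Free covariance parameters: K*k loadings + K noise terms,
# versus K(K+1)/2 for an unrestricted covariance.
print(K * k + K, K * (K + 1) // 2)     # 600 vs 5050
```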
### EM Algorithm
The EM algorithm alternates between:
- E-step: Compute expected sufficient statistics given the current parameters.
- M-step: Update the parameters by solving weighted least squares with the full Σ^(-1).
The critical issue: EM’s M-step explicitly forms the K×K inverse covariance matrix, requiring O(K²) memory even though the model has only O(Kk) parameters.
```python
# EM's expensive step (pseudocode)
Sigma = F @ F.T + np.diag(D)        # K×K dense matrix
Sigma_inv = np.linalg.inv(Sigma)    # O(K³) inversion of a K×K matrix
# Use Sigma_inv for β updates...
```
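For context, the β update that consumes Sigma_inv is the standard SUR/GLS solve. The sketch below is one way to write it, assuming Xs is a list of K per-equation design matrices and Y is the N×K response matrix; the helper name gls_beta_update is illustrative and not the package's actual M-step:

```python
import numpy as np

def gls_beta_update(Xs, Y, Sigma_inv):
    """One GLS update of the stacked SUR coefficients given a dense K×K Sigma_inv (illustrative)."""
    K = len(Xs)
    ps = [X.shape[1] for X in Xs]                   # coefficients per equation
    offsets = np.concatenate(([0], np.cumsum(ps)))
    P = offsets[-1]
    A = np.zeros((P, P))
    b = np.zeros(P)
    for i in range(K):
        for j in range(K):
            w = Sigma_inv[i, j]
            A[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] += w * (Xs[i].T @ Xs[j])
            b[offsets[i]:offsets[i + 1]] += w * (Xs[i].T @ Y[:, j])
    beta = np.linalg.solve(A, b)                    # stacked GLS solution
    return [beta[offsets[i]:offsets[i + 1]] for i in range(K)]  # per-equation coefficients
```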
### ALS Algorithm
ALS also alternates between updating β and updating (F, D), but uses the Woodbury matrix identity to avoid forming dense matrices:
Σ^(-1) = D^(-1) - D^(-1)F(I + F^T D^(-1)F)^(-1)F^T D^(-1)
The key insight: we only need to invert a k×k matrix (I + F^T D^(-1)F), not K×K.
```python
# ALS's efficient step (pseudocode)
D_inv = 1.0 / D                                                     # K×1 vector, diagonal of D^(-1)
small_inv = np.linalg.inv(np.eye(k) + F.T @ (D_inv[:, None] * F))   # only a k×k inversion
# Apply the Woodbury formula without forming any K×K matrix
```
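To make the identity concrete, here is a sketch of applying Σ^(-1) to a matrix of residuals without materializing any K×K array, checked against the dense inverse on a small problem; the function name apply_sigma_inv is illustrative, not part of the alsgls API:

```python
import numpy as np

def apply_sigma_inv(R, F, D):
    """Compute R @ Sigma^{-1} for Sigma = F F^T + diag(D), with R of shape N×K."""
    D_inv = 1.0 / D                                # length-K diagonal of D^(-1)
    k = F.shape[1]
    core = np.eye(k) + F.T @ (D_inv[:, None] * F)  # k×k capacitance matrix
    W = np.linalg.solve(core, F.T)                 # k×K, equals core^(-1) F^T
    RD = R * D_inv                                 # R D^(-1)
    return RD - (RD @ F) @ (W * D_inv)             # R D^(-1) - R D^(-1) F core^(-1) F^T D^(-1)

# Sanity check against the dense inverse (small K only)
K, k, N = 40, 3, 10
rng = np.random.default_rng(1)
F = rng.standard_normal((K, k))
D = rng.uniform(0.5, 1.5, size=K)
R = rng.standard_normal((N, K))
dense = R @ np.linalg.inv(F @ F.T + np.diag(D))
print(np.allclose(apply_sigma_inv(R, F, D), dense))   # True
```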
## Memory Comparison
| Algorithm | Largest Array | Memory Complexity |
|---|---|---|
| EM | Σ^(-1) (K×K dense) | O(K²) |
| ALS | F (K×k skinny) | O(Kk) |
For K=100 equations and k=5 factors:
- EM: 100×100 = 10,000 floats
- ALS: 100×5 = 500 floats
- Ratio: 20×
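The same bookkeeping in code, counting float64 storage (purely illustrative arithmetic):

```python
K, k = 100, 5
bytes_per_float = 8                        # float64
em_bytes = K * K * bytes_per_float         # dense K×K Sigma^{-1}: 80,000 bytes
als_bytes = K * k * bytes_per_float        # skinny K×k F: 4,000 bytes
print(em_bytes / als_bytes)                # 20.0
```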
## Computational Comparison
```python
from alsgls import simulate_sur, als_gls, em_gls
import time

# Generate train/test problem (the test split is reused in the next section)
K = 100   # equations
N = 300   # observations
k = 5     # factors
Xs_tr, Y_tr, Xs_te, Y_te = simulate_sur(N_tr=N, N_te=50, K=K, p=3, k=k)

# Time ALS
t0 = time.time()
B_als, F_als, D_als, mem_als, info_als = als_gls(
    Xs_tr, Y_tr, k=k, sweeps=8
)
time_als = time.time() - t0

# Time EM
t0 = time.time()
B_em, F_em, D_em, mem_em, info_em = em_gls(
    Xs_tr, Y_tr, k=k, iters=30
)
time_em = time.time() - t0

print(f"ALS: {time_als:.2f}s, {mem_als:.1f}MB")
print(f"EM:  {time_em:.2f}s, {mem_em:.1f}MB")
print(f"Memory ratio: {mem_em/mem_als:.1f}×")
```
## Statistical Equivalence
Despite different computational approaches, both algorithms optimize the same likelihood and produce statistically indistinguishable estimates:
```python
from alsgls import XB_from_Blist, mse
import numpy as np

# Compare predictions on the held-out test split
Y_pred_als = XB_from_Blist(Xs_te, B_als)
Y_pred_em = XB_from_Blist(Xs_te, B_em)

# MSE should be nearly identical
mse_als = mse(Y_te, Y_pred_als)
mse_em = mse(Y_te, Y_pred_em)
print(f"ALS MSE: {mse_als:.6f}")
print(f"EM MSE:  {mse_em:.6f}")
print(f"Difference: {abs(mse_als - mse_em):.2e}")

# Factor structures should be equivalent (up to rotation)
cov_als = F_als @ F_als.T + np.diag(D_als)
cov_em = F_em @ F_em.T + np.diag(D_em)
print(f"Covariance difference: {np.linalg.norm(cov_als - cov_em):.2e}")
```
## When to Use Each
Use ALS when:

- K is large (>50 equations)
- Memory is constrained
- You need the Woodbury form for other computations
- k << K (the low-rank assumption holds)

Use EM when:

- K is small (<30 equations)
- You need the full covariance matrix anyway
- You are working within a standard EM framework
- You are debugging (EM is conceptually simpler)
## Implementation Details
The package provides both algorithms for comparison:

- `als_gls()`: memory-efficient ALS implementation
- `em_gls()`: baseline EM implementation
Both use the same convergence criteria and regularization options for fair comparison.