Mathematical Theory¶
Overview¶
Projection pursuit finds interesting low-dimensional projections of multivariate data by optimizing an “interestingness” index. pyppur implements two main approaches for dimensionality reduction:
Distance Distortion: Preserves pairwise distances between data points
Reconstruction Error: Minimizes reconstruction error using ridge functions
Ridge Functions¶
pyppur uses the hyperbolic tangent as the ridge function:

\[
g(t) = \tanh(\alpha t)
\]

where \(\alpha\) is the steepness parameter that controls the degree of nonlinearity.

The gradient of the ridge function is:

\[
g'(t) = \alpha \left( 1 - \tanh^2(\alpha t) \right)
\]
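As a minimal NumPy sketch of these two formulas (the function names are illustrative, not part of pyppur's API):

```python
import numpy as np

def ridge(t, alpha=1.0):
    """Ridge function g(t) = tanh(alpha * t), applied element-wise."""
    return np.tanh(alpha * t)

def ridge_grad(t, alpha=1.0):
    """Gradient g'(t) = alpha * (1 - tanh(alpha * t)^2)."""
    return alpha * (1.0 - np.tanh(alpha * t) ** 2)
```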
Reconstruction Objectives¶
Tied-Weights Ridge Autoencoder (Default)¶
The default reconstruction approach uses tied weights, where the encoder and decoder share the same parameters:

\[
Z = g(X A^T), \qquad \hat{X} = Z A
\]

The objective is to minimize the reconstruction error:

\[
\min_{A} \; \bigl\| X - g(X A^T)\, A \bigr\|_F^2
\]
where:
\(X \in \mathbb{R}^{n \times p}\) is the input data matrix
\(A \in \mathbb{R}^{k \times p}\) are the encoder/decoder weights
\(Z \in \mathbb{R}^{n \times k}\) is the projected data
\(g(\cdot)\) is applied element-wise
\(n\) is the number of samples, \(p\) is the number of features, \(k\) is the number of components
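Continuing the NumPy sketch above, this objective can be written as (illustrative only, not pyppur's internal code):

```python
def tied_reconstruction_loss(A, X, alpha=1.0):
    """Squared Frobenius reconstruction error with tied weights A (k x p)."""
    Z = np.tanh(alpha * (X @ A.T))   # projected data Z = g(X A^T), shape (n, k)
    X_hat = Z @ A                    # reconstruction X_hat = Z A, shape (n, p)
    return np.sum((X - X_hat) ** 2)  # ||X - g(X A^T) A||_F^2
```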
Free Decoder Ridge Autoencoder¶
With tied_weights=False, the encoder and decoder have separate parameters:

\[
Z = g(X A^T), \qquad \hat{X} = Z B
\]

The objective includes optional L2 regularization on the decoder:

\[
\min_{A, B} \; \bigl\| X - g(X A^T)\, B \bigr\|_F^2 + \lambda \| B \|_F^2
\]
where:
\(A \in \mathbb{R}^{k \times p}\) are the encoder weights
\(B \in \mathbb{R}^{k \times p}\) are the decoder weights
\(\lambda \geq 0\) is the L2 regularization parameter
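A corresponding sketch with a free decoder, under the same assumptions as the tied-weights version:

```python
def untied_reconstruction_loss(A, B, X, alpha=1.0, lam=0.0):
    """Reconstruction error with encoder A (k x p) and decoder B (k x p)."""
    Z = np.tanh(alpha * (X @ A.T))      # encoding Z = g(X A^T)
    X_hat = Z @ B                       # reconstruction X_hat = Z B
    penalty = lam * np.sum(B ** 2)      # L2 term lambda * ||B||_F^2
    return np.sum((X - X_hat) ** 2) + penalty
```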
Distance Distortion Objectives¶
Distance Distortion with Nonlinearity (Default)¶
The default distance objective compares pairwise distances in the original space with those in the nonlinearly transformed projection:

\[
\mathcal{L}(A) = \sum_{i < j} \bigl( d_X(i,j) - d_Z(i,j) \bigr)^2
\]
where:
\(d_X(i,j) = \|x_i - x_j\|_2\) are pairwise distances in the original space
\(d_Z(i,j) = \|z_i - z_j\|_2\) are pairwise distances in the transformed space
\(Z = g(X A^T)\) is the nonlinearly transformed projection
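A sketch using SciPy's condensed pairwise distances; the use_nonlinearity flag here is illustrative and mirrors, but is not, pyppur's use_nonlinearity_in_distance parameter:

```python
from scipy.spatial.distance import pdist

def distance_distortion_loss(A, X, alpha=1.0, use_nonlinearity=True):
    """Sum of squared differences between pairwise distances."""
    d_X = pdist(X)              # d_X(i, j) for all i < j, condensed form
    P = X @ A.T                 # linear projection Y = X A^T
    if use_nonlinearity:
        P = np.tanh(alpha * P)  # transformed projection Z = g(X A^T)
    return np.sum((d_X - pdist(P)) ** 2)
```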
Linear Distance Distortion¶
With use_nonlinearity_in_distance=False, distances are compared in the linear projection space:

\[
\mathcal{L}(A) = \sum_{i < j} \bigl( d_X(i,j) - d_Y(i,j) \bigr)^2
\]
where:
\(Y = X A^T\) is the linear projection
\(d_Y(i,j) = \|y_i - y_j\|_2\) are distances in the linear projection space
This is the raw-stress criterion of metric multidimensional scaling restricted to linear projections; when \(k = p\), it is minimized exactly by any orthogonal \(A\), which preserves all pairwise distances.
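In the sketch above, this variant corresponds to disabling the nonlinearity:

```python
# Compare distances in the linear projection Y = X A^T rather than in Z.
loss = distance_distortion_loss(A, X, use_nonlinearity=False)
```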
Weighted Distance Distortion¶
When weight_by_distance=True, the objective is weighted to emphasize preservation of small distances:

\[
\mathcal{L}(A) = \sum_{i < j} w_{ij} \bigl( d_X(i,j) - d_Z(i,j) \bigr)^2
\]

where the weights are:

\[
w_{ij} = \frac{1}{d_X(i,j) + \epsilon}
\]

with \(\epsilon = 0.1\) to avoid division by zero; the weights are then normalized to sum to 1.
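A sketch of the weighted variant, under the same assumptions as before:

```python
def weighted_distance_distortion_loss(A, X, alpha=1.0, eps=0.1):
    """Distance distortion with weights favoring small original distances."""
    d_X = pdist(X)
    w = 1.0 / (d_X + eps)   # w_ij = 1 / (d_X(i, j) + eps)
    w /= w.sum()            # normalize weights to sum to 1
    d_Z = pdist(np.tanh(alpha * (X @ A.T)))
    return np.sum(w * (d_X - d_Z) ** 2)
```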
Optimization¶
Constraint Handling¶
All optimization is performed under the constraint that encoder directions have unit norm:

\[
\| a_j \|_2 = 1, \qquad j = 1, \dots, k,
\]

where \(a_j\) is the \(j\)-th row of \(A\).
This constraint is enforced by:
Normalizing the encoder directions after each optimization step
Using a projected gradient approach within the L-BFGS-B optimizer
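The normalization step can be sketched as projecting each row of the encoder back onto the unit sphere:

```python
def normalize_rows(A, eps=1e-12):
    """Rescale each encoder direction so that ||a_j||_2 = 1."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A / np.maximum(norms, eps)  # eps guards against zero rows
```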
Initialization Strategies¶
pyppur uses multiple initialization strategies:
PCA Initialization: Use the first \(k\) principal components as starting points
Random Initialization: Sample \(n_{\text{init}}\) random matrices with orthonormal rows
For untied weights, the decoder is initialized with small random values scaled by 0.1.
The best result across all initializations is retained.
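A sketch of the two initialization strategies (the helper names are illustrative):

```python
def pca_init(X, k):
    """First k principal components (as rows) from an SVD of centered data."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[:k]              # shape (k, p); rows are already unit-norm

def random_init(p, k, rng):
    """Random k x p matrix with orthonormal rows via QR decomposition."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
    return Q.T                 # shape (k, p) with orthonormal rows

# For untied weights, a small random decoder as described above, e.g.
# B0 = 0.1 * rng.standard_normal((k, p))
```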
Multi-Start Optimization¶
The optimization procedure:
Try PCA initialization
Try n_init random initializations
Select the result with the lowest objective value
Return the optimal encoder (and decoder if applicable)
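Putting the pieces together, a minimal multi-start loop over L-BFGS-B, reusing the helpers sketched above (illustrative, not pyppur's actual fit routine):

```python
from scipy.optimize import minimize

def fit_multi_start(X, k, loss, n_init=5, seed=0):
    """Optimize the encoder from a PCA start and n_init random starts."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    starts = [pca_init(X, k)] + [random_init(p, k, rng) for _ in range(n_init)]
    best = None
    for A0 in starts:
        obj = lambda a: loss(normalize_rows(a.reshape(k, p)), X)
        res = minimize(obj, A0.ravel(), method="L-BFGS-B")
        if best is None or res.fun < best.fun:   # keep lowest objective value
            best = res
    return normalize_rows(best.x.reshape(k, p))  # optimal encoder, unit-norm rows
```

For example, fit_multi_start(X, 2, distance_distortion_loss) would run this procedure on the default distance objective from the sketches above.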
Computational Complexity¶
Time Complexity¶
For \(n\) samples, \(p\) features, and \(k\) components:
Distance distortion: \(O(n^2 k + n p k)\) per iteration (dominated by distance computation)
Reconstruction: \(O(n p k)\) per iteration
Overall: \(O(T \cdot (\text{per-iteration cost}))\) where \(T\) is the number of iterations
Space Complexity¶
Distance distortion: \(O(n^2)\) for storing distance matrices
Reconstruction: \(O(n k + p k)\) for intermediate computations
Parameters: \(O(p k)\) for tied weights, \(O(2 p k)\) for untied weights
Theoretical Properties¶
Convergence¶
The optimization uses L-BFGS-B, which has the following properties:
Local convergence: Guaranteed to converge to a local minimum under regularity conditions
Global optimum: Not guaranteed due to non-convexity of the objectives
Multi-start: Helps find better local optima
Expressiveness¶
The ridge function autoencoder can represent a rich class of nonlinear transformations:
Universal approximation: With sufficiently many components, linear combinations of \(\tanh\) ridge functions can approximate any continuous function on a compact set
Regularization: The \(\tanh\) nonlinearity keeps the projected values naturally bounded
Interpretability: Each component corresponds to a direction in the original space