Mathematical Theory

Overview

Projection pursuit finds interesting low-dimensional projections of multivariate data by optimizing an “interestingness” index. pyppur implements two main approaches for dimensionality reduction:

  1. Distance Distortion: Preserves pairwise distances between data points

  2. Reconstruction Error: Minimizes the reconstruction error of a ridge-function autoencoder

Ridge Functions

pyppur uses the hyperbolic tangent as the ridge function:

\[g(z) = \tanh(\alpha \cdot z)\]

where \(\alpha\) is the steepness parameter that controls the degree of nonlinearity.

The gradient of the ridge function is:

\[g'(z) = \alpha \cdot \operatorname{sech}^2(\alpha \cdot z) = \alpha \cdot (1 - \tanh^2(\alpha \cdot z))\]
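
As a minimal NumPy illustration (the helper names below are ours for exposition, not part of the pyppur API):

    import numpy as np

    def ridge(z, alpha=1.0):
        """Ridge function g(z) = tanh(alpha * z), applied element-wise."""
        return np.tanh(alpha * z)

    def ridge_grad(z, alpha=1.0):
        """Gradient g'(z) = alpha * (1 - tanh(alpha * z)**2)."""
        return alpha * (1.0 - np.tanh(alpha * z) ** 2)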

Reconstruction Objectives

Tied-Weights Ridge Autoencoder (Default)

The default reconstruction approach uses tied weights where the encoder and decoder share the same parameters:

\[\begin{aligned} Z &= g(X A^T) \\ \hat{X} &= Z A \end{aligned}\]

The objective is to minimize the reconstruction error:

\[L_{\text{tied}} = \frac{1}{n} \|X - \hat{X}\|_F^2 = \frac{1}{n} \|X - g(X A^T) A\|_F^2\]

where:

  • \(X \in \mathbb{R}^{n \times p}\) is the input data matrix

  • \(A \in \mathbb{R}^{k \times p}\) are the encoder/decoder weights

  • \(Z \in \mathbb{R}^{n \times k}\) is the projected data

  • \(g(\cdot)\) is applied element-wise

  • \(n\) is the number of samples, \(p\) is the number of features, \(k\) is the number of components
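
A naive NumPy reference for this objective (a sketch of the formula above, not pyppur's internal implementation; the function name is hypothetical):

    import numpy as np

    def tied_reconstruction_loss(A, X, alpha=1.0):
        """(1/n) * ||X - g(X A^T) A||_F^2 with g(z) = tanh(alpha * z)."""
        n = X.shape[0]
        Z = np.tanh(alpha * (X @ A.T))  # (n, k) projected data
        X_hat = Z @ A                   # (n, p) reconstruction
        return np.sum((X - X_hat) ** 2) / n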

Free Decoder Ridge Autoencoder

With tied_weights=False, the encoder and decoder have separate parameters:

\[\begin{aligned} Z &= g(X A^T) \\ \hat{X} &= Z B \end{aligned}\]

The objective includes optional L2 regularization on the decoder:

\[L_{\text{untied}} = \frac{1}{n} \|X - Z B\|_F^2 + \lambda \|B\|_F^2\]

where:

  • \(A \in \mathbb{R}^{k \times p}\) are the encoder weights

  • \(B \in \mathbb{R}^{k \times p}\) are the decoder weights

  • \(\lambda \geq 0\) is the L2 regularization parameter
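
The corresponding sketch for the untied case (again a hypothetical reference implementation):

    import numpy as np

    def untied_reconstruction_loss(A, B, X, alpha=1.0, lam=0.0):
        """(1/n) * ||X - g(X A^T) B||_F^2 + lam * ||B||_F^2."""
        n = X.shape[0]
        Z = np.tanh(alpha * (X @ A.T))  # (n, k) projected data
        X_hat = Z @ B                   # (n, p) reconstruction
        return np.sum((X - X_hat) ** 2) / n + lam * np.sum(B ** 2)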

Distance Distortion Objectives

Distance Distortion with Nonlinearity (Default)

The default distance objective compares pairwise distances between the original space and the nonlinearly transformed projection:

\[L_{\text{dist-nl}} = \frac{1}{n^2} \sum_{i,j} (d_{X}(i,j) - d_{Z}(i,j))^2\]

where:

  • \(d_X(i,j) = \|x_i - x_j\|_2\) are pairwise distances in the original space

  • \(d_Z(i,j) = \|z_i - z_j\|_2\) are pairwise distances in the transformed space

  • \(Z = g(X A^T)\) is the nonlinearly transformed projection
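
A direct sketch of this stress (O(n^2) memory; assumes SciPy is available):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def distance_distortion_loss(A, X, alpha=1.0):
        """(1/n^2) * sum_ij (d_X(i,j) - d_Z(i,j))^2 with Z = tanh(alpha * X A^T)."""
        n = X.shape[0]
        Z = np.tanh(alpha * (X @ A.T))
        d_X = squareform(pdist(X))  # (n, n) distances in the original space
        d_Z = squareform(pdist(Z))  # (n, n) distances in the transformed space
        return np.sum((d_X - d_Z) ** 2) / n**2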

Linear Distance Distortion

With use_nonlinearity_in_distance=False, distances are compared in the linear projection space:

\[L_{\text{dist-linear}} = \frac{1}{n^2} \sum_{i,j} (d_{X}(i,j) - d_{Y}(i,j))^2\]

where:

  • \(Y = X A^T\) is the linear projection

  • \(d_Y(i,j) = \|y_i - y_j\|_2\) are distances in the linear projection space

This objective is metric multidimensional scaling (raw stress minimization) restricted to linear projections; when \(k = p\) and \(A\) is orthogonal, all pairwise distances are preserved exactly and the loss is zero.
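
The linear variant differs from the sketch above only in skipping the nonlinearity:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def linear_distance_distortion_loss(A, X):
        """Same stress, with distances taken in the linear projection Y = X A^T."""
        n = X.shape[0]
        Y = X @ A.T
        d_X = squareform(pdist(X))
        d_Y = squareform(pdist(Y))
        return np.sum((d_X - d_Y) ** 2) / n**2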

Weighted Distance Distortion

When weight_by_distance=True, the objective is weighted to emphasize preservation of small distances:

\[L_{\text{weighted}} = \sum_{i,j} w_{ij} (d_{X}(i,j) - d_{Z}(i,j))^2\]

where the weights are:

\[w_{ij} = \frac{1}{d_X(i,j) + \epsilon}\]

with \(\epsilon = 0.1\) to avoid division by zero, and weights are normalized to sum to 1.
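
A sketch of the weighted stress under these conventions (diagonal terms contribute zero since \(d_X(i,i) = d_Z(i,i) = 0\)):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def weighted_distance_distortion_loss(A, X, alpha=1.0, eps=0.1):
        """Weighted stress with w_ij = 1 / (d_X(i,j) + eps), normalized to sum to 1."""
        Z = np.tanh(alpha * (X @ A.T))
        d_X = squareform(pdist(X))
        d_Z = squareform(pdist(Z))
        W = 1.0 / (d_X + eps)
        W /= W.sum()  # normalize weights to sum to 1
        return np.sum(W * (d_X - d_Z) ** 2)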

Optimization

Constraint Handling

All optimization is performed under the constraint that encoder directions have unit norm:

\[\|a_j\|_2 = 1 \quad \text{for all } j = 1, \ldots, k\]

This constraint is enforced by:

  1. Normalizing the encoder directions after each optimization step

  2. Using a projected gradient approach within the L-BFGS-B optimizer
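
The renormalization in step 1 amounts to projecting each encoder row back onto the unit sphere, e.g.:

    import numpy as np

    def normalize_rows(A):
        """Enforce ||a_j||_2 = 1 by rescaling each encoder direction."""
        norms = np.linalg.norm(A, axis=1, keepdims=True)
        return A / np.maximum(norms, 1e-12)  # guard against zero rows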

Initialization Strategies

pyppur uses multiple initialization strategies:

  1. PCA Initialization: Use the first \(k\) principal components as starting points

  2. Random Initialization: Sample \(n_{\text{init}}\) random matrices with orthonormal rows

For untied weights, the decoder is initialized with small random values scaled by 0.1.

The best result across all initializations is retained.
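
Both strategies can be sketched in a few lines (assuming \(p \geq k\); these helpers are illustrative, not the pyppur API):

    import numpy as np

    def pca_init(X, k):
        """Rows are the first k principal directions (right singular vectors)."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:k]  # (k, p), orthonormal rows

    def random_orthonormal_init(p, k, rng):
        """Random (k, p) matrix with orthonormal rows via QR decomposition."""
        Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
        return Q.T  # (k, p)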

Multi-Start Optimization

The optimization procedure:

  1. Try PCA initialization

  2. Try n_init random initializations

  3. Select the result with the lowest objective value

  4. Return the optimal encoder (and decoder if applicable)
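
A schematic multi-start loop mirroring steps 1-4, reusing the helpers sketched above (this is an illustrative sketch, not pyppur's internal code, which also handles the decoder parameters in the untied case):

    import numpy as np
    from scipy.optimize import minimize

    def multi_start_optimize(loss, X, k, n_init=5, seed=0):
        """L-BFGS-B from a PCA start plus n_init random starts; keep the best."""
        p = X.shape[1]
        rng = np.random.default_rng(seed)
        starts = [pca_init(X, k)]
        starts += [random_orthonormal_init(p, k, rng) for _ in range(n_init)]
        best_A, best_val = None, np.inf
        for A0 in starts:
            res = minimize(
                lambda a: loss(normalize_rows(a.reshape(k, p)), X),
                A0.ravel(), method="L-BFGS-B",
            )
            if res.fun < best_val:
                best_A = normalize_rows(res.x.reshape(k, p))
                best_val = res.fun
        return best_A, best_val

For example, multi_start_optimize(distance_distortion_loss, X, k=2) would fit a 2-component distance-distortion projection under these sketches.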

Computational Complexity

Time Complexity

For \(n\) samples, \(p\) features, and \(k\) components:

  • Distance distortion: \(O(n^2 k + n p k)\) per iteration (dominated by distance computation)

  • Reconstruction: \(O(n p k)\) per iteration

  • Overall: \(O(T \cdot (\text{per-iteration cost}))\) where \(T\) is the number of iterations

Space Complexity

  • Distance distortion: \(O(n^2)\) for storing distance matrices

  • Reconstruction: \(O(n k + p k)\) for intermediate computations

  • Parameters: \(p k\) values for tied weights, \(2 p k\) for untied weights

Theoretical Properties

Convergence

The optimization uses L-BFGS-B, which has the following properties:

  • Local convergence: Converges to a stationary point (in practice, a local minimum) under standard regularity conditions

  • Global optimum: Not guaranteed due to non-convexity of the objectives

  • Multi-start: Helps find better local optima

Expressiveness

The ridge function autoencoder can represent a rich class of nonlinear transformations:

  • Universal approximation: Sums of ridge functions can approximate any continuous function on a compact set, given enough components (as in projection pursuit regression)

  • Regularization: The \(\tanh\) nonlinearity provides naturally bounded outputs

  • Interpretability: Each component corresponds to a direction in the original space