Mathematical Theory

Overview

Projection pursuit finds interesting low-dimensional projections of multivariate data by optimizing an “interestingness” index. pyppur implements two main approaches for dimensionality reduction:

  1. Distance Distortion: Preserves pairwise distances between data points

  2. Reconstruction Error: Minimizes the reconstruction error of a ridge-function autoencoder

Ridge Functions

pyppur uses the hyperbolic tangent as the ridge function:

\[g(z) = \tanh(\alpha \cdot z)\]

where \(\alpha\) is the steepness parameter that controls the degree of nonlinearity.

The gradient of the ridge function is:

\[g'(z) = \alpha \cdot \operatorname{sech}^2(\alpha \cdot z) = \alpha \cdot (1 - \tanh^2(\alpha \cdot z))\]
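
As a minimal NumPy illustration (the helper names below are ours for exposition, not part of the pyppur API):

    import numpy as np

    def ridge(z, alpha=1.0):
        """Ridge function g(z) = tanh(alpha * z), applied element-wise."""
        return np.tanh(alpha * z)

    def ridge_grad(z, alpha=1.0):
        """Gradient g'(z) = alpha * (1 - tanh(alpha * z)**2)."""
        return alpha * (1.0 - np.tanh(alpha * z) ** 2)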

Reconstruction Objectives

Tied-Weights Ridge Autoencoder (Default)

The default reconstruction approach uses tied weights where the encoder and decoder share the same parameters:

\[\begin{aligned} Z &= g(X A^T) \\ \hat{X} &= Z A \end{aligned}\]

The objective is to minimize the reconstruction error:

\[L_{\text{tied}} = \frac{1}{n} \|X - \hat{X}\|_F^2 = \frac{1}{n} \|X - g(X A^T) A\|_F^2\]

where:

  • \(X \in \mathbb{R}^{n \times p}\) is the input data matrix

  • \(A \in \mathbb{R}^{k \times p}\) are the encoder/decoder weights

  • \(Z \in \mathbb{R}^{n \times k}\) is the projected data

  • \(g(\cdot)\) is applied element-wise

  • \(n\) is the number of samples, \(p\) is the number of features, \(k\) is the number of components
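
A naive NumPy reference for this objective (a sketch of the formula above, not pyppur's internal implementation; the function name is hypothetical):

    import numpy as np

    def tied_reconstruction_loss(A, X, alpha=1.0):
        """(1/n) * ||X - g(X A^T) A||_F^2 with g(z) = tanh(alpha * z)."""
        n = X.shape[0]
        Z = np.tanh(alpha * (X @ A.T))  # (n, k) projected data
        X_hat = Z @ A                   # (n, p) reconstruction
        return np.sum((X - X_hat) ** 2) / n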

Free Decoder Ridge Autoencoder

With tied_weights=False, the encoder and decoder have separate parameters:

\[\begin{aligned} Z &= g(X A^T) \\ \hat{X} &= Z B \end{aligned}\]

The objective includes optional L2 regularization on the decoder:

\[L_{\text{untied}} = \frac{1}{n} \|X - Z B\|_F^2 + \lambda \|B\|_F^2\]

where:

  • \(A \in \mathbb{R}^{k \times p}\) are the encoder weights

  • \(B \in \mathbb{R}^{k \times p}\) are the decoder weights

  • \(\lambda \geq 0\) is the L2 regularization parameter
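
The corresponding sketch for the untied case (again a hypothetical reference implementation):

    import numpy as np

    def untied_reconstruction_loss(A, B, X, alpha=1.0, lam=0.0):
        """(1/n) * ||X - g(X A^T) B||_F^2 + lam * ||B||_F^2."""
        n = X.shape[0]
        Z = np.tanh(alpha * (X @ A.T))  # (n, k) projected data
        X_hat = Z @ B                   # (n, p) reconstruction
        return np.sum((X - X_hat) ** 2) / n + lam * np.sum(B ** 2)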

Distance Distortion Objectives

Distance Distortion with Nonlinearity (Default)

The default distance objective compares pairwise distances between the original space and the nonlinearly transformed projection:

\[L_{\text{dist-nl}} = \frac{1}{n^2} \sum_{i,j} (d_{X}(i,j) - d_{Z}(i,j))^2\]

where:

  • \(d_X(i,j) = \|x_i - x_j\|_2\) are pairwise distances in the original space

  • \(d_Z(i,j) = \|z_i - z_j\|_2\) are pairwise distances in the transformed space

  • \(Z = g(X A^T)\) is the nonlinearly transformed projection
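
A direct sketch of this stress (O(n^2) memory; assumes SciPy is available):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def distance_distortion_loss(A, X, alpha=1.0):
        """(1/n^2) * sum_ij (d_X(i,j) - d_Z(i,j))^2 with Z = tanh(alpha * X A^T)."""
        n = X.shape[0]
        Z = np.tanh(alpha * (X @ A.T))
        d_X = squareform(pdist(X))  # (n, n) distances in the original space
        d_Z = squareform(pdist(Z))  # (n, n) distances in the transformed space
        return np.sum((d_X - d_Z) ** 2) / n**2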

Linear Distance Distortion

With use_nonlinearity_in_distance=False, distances are compared in the linear projection space:

\[L_{\text{dist-linear}} = \frac{1}{n^2} \sum_{i,j} (d_{X}(i,j) - d_{Y}(i,j))^2\]

where:

  • \(Y = X A^T\) is the linear projection

  • \(d_Y(i,j) = \|y_i - y_j\|_2\) are distances in the linear projection space

This objective is metric multidimensional scaling (raw stress minimization) restricted to linear projections; when \(k = p\) and \(A\) is orthogonal, all pairwise distances are preserved exactly and the loss is zero.
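
The linear variant differs from the sketch above only in skipping the nonlinearity:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def linear_distance_distortion_loss(A, X):
        """Same stress, with distances taken in the linear projection Y = X A^T."""
        n = X.shape[0]
        Y = X @ A.T
        d_X = squareform(pdist(X))
        d_Y = squareform(pdist(Y))
        return np.sum((d_X - d_Y) ** 2) / n**2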

Weighted Distance Distortion

When weight_by_distance=True, the objective is weighted to emphasize preservation of small distances:

\[L_{\text{weighted}} = \sum_{i,j} w_{ij} (d_{X}(i,j) - d_{Z}(i,j))^2\]

where the weights are:

\[w_{ij} = \frac{1}{d_X(i,j) + \epsilon}\]

with \(\epsilon = 0.1\) to avoid division by zero, and weights are normalized to sum to 1.
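
A sketch of the weighted stress under these conventions (diagonal terms contribute zero since \(d_X(i,i) = d_Z(i,i) = 0\)):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def weighted_distance_distortion_loss(A, X, alpha=1.0, eps=0.1):
        """Weighted stress with w_ij = 1 / (d_X(i,j) + eps), normalized to sum to 1."""
        Z = np.tanh(alpha * (X @ A.T))
        d_X = squareform(pdist(X))
        d_Z = squareform(pdist(Z))
        W = 1.0 / (d_X + eps)
        W /= W.sum()  # normalize weights to sum to 1
        return np.sum(W * (d_X - d_Z) ** 2)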

Optimization

Constraint Handling

All optimization is performed under the constraint that encoder directions have unit norm:

\[\|a_j\|_2 = 1 \quad \text{for all } j = 1, \ldots, k\]

This constraint is enforced by:

  1. Normalizing the encoder directions after each optimization step

  2. Using a projected gradient approach within the L-BFGS-B optimizer
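
The renormalization in step 1 amounts to projecting each encoder row back onto the unit sphere, e.g.:

    import numpy as np

    def normalize_rows(A):
        """Enforce ||a_j||_2 = 1 by rescaling each encoder direction."""
        norms = np.linalg.norm(A, axis=1, keepdims=True)
        return A / np.maximum(norms, 1e-12)  # guard against zero rows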

Initialization Strategies

pyppur uses multiple initialization strategies:

  1. PCA Initialization: Use the first \(k\) principal components as starting points

  2. Random Initialization: Sample \(n_{\text{init}}\) random matrices with orthonormal rows

For untied weights, the decoder is initialized with small random values scaled by 0.1.

The best result across all initializations is retained.
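
Both strategies can be sketched in a few lines (assuming \(p \geq k\); these helpers are illustrative, not the pyppur API):

    import numpy as np

    def pca_init(X, k):
        """Rows are the first k principal directions (right singular vectors)."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:k]  # (k, p), orthonormal rows

    def random_orthonormal_init(p, k, rng):
        """Random (k, p) matrix with orthonormal rows via QR decomposition."""
        Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
        return Q.T  # (k, p)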

Multi-Start Optimization

The optimization procedure:

  1. Try PCA initialization

  2. Try n_init random initializations

  3. Select the result with the lowest objective value

  4. Return the optimal encoder (and decoder if applicable)
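
A schematic multi-start loop mirroring steps 1-4, reusing the helpers sketched above (this is an illustrative sketch, not pyppur's internal code, which also handles the decoder parameters in the untied case):

    import numpy as np
    from scipy.optimize import minimize

    def multi_start_optimize(loss, X, k, n_init=5, seed=0):
        """L-BFGS-B from a PCA start plus n_init random starts; keep the best."""
        p = X.shape[1]
        rng = np.random.default_rng(seed)
        starts = [pca_init(X, k)]
        starts += [random_orthonormal_init(p, k, rng) for _ in range(n_init)]
        best_A, best_val = None, np.inf
        for A0 in starts:
            res = minimize(
                lambda a: loss(normalize_rows(a.reshape(k, p)), X),
                A0.ravel(), method="L-BFGS-B",
            )
            if res.fun < best_val:
                best_A = normalize_rows(res.x.reshape(k, p))
                best_val = res.fun
        return best_A, best_val

For example, multi_start_optimize(distance_distortion_loss, X, k=2) would fit a 2-component distance-distortion projection under these sketches.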

Computational Complexity

Time Complexity

For \(n\) samples, \(p\) features, and \(k\) components:

  • Distance distortion: \(O(n^2 k + n p k)\) per iteration (dominated by distance computation)

  • Reconstruction: \(O(n p k)\) per iteration

  • Overall: \(O(T \cdot (\text{per-iteration cost}))\) where \(T\) is the number of iterations

Space Complexity

  • Distance distortion: \(O(n^2)\) for storing distance matrices

  • Reconstruction: \(O(n k + p k)\) for intermediate computations

  • Parameters: \(p k\) values for tied weights, \(2 p k\) for untied weights

Theoretical Properties

Convergence

The optimization uses L-BFGS-B, which has the following properties:

  • Local convergence: Converges to a stationary point (in practice, a local minimum) under standard regularity conditions

  • Global optimum: Not guaranteed due to non-convexity of the objectives

  • Multi-start: Helps find better local optima

Expressiveness

The ridge function autoencoder can represent a rich class of nonlinear transformations:

  • Universal approximation: Sums of ridge functions can approximate any continuous function on a compact set, given enough components (as in projection pursuit regression)

  • Regularization: The \(\tanh\) nonlinearity provides naturally bounded outputs

  • Interpretability: Each component corresponds to a direction in the original space