Principal Component Analysis (PCA)

Principal Component Analysis finds the directions of maximum variance in a dataset. It is one of the most widely used techniques for dimensionality reduction: by projecting data onto the top few principal components, we can capture most of the information in fewer dimensions.

The key insight is that the covariance matrix encodes all the variance structure. Its eigenvectors point along the principal directions and its eigenvalues measure the variance along each one.

In this notebook we:

  1. Generate 2D data with a known elongated, rotated structure
  2. Compute the covariance matrix with np_cov
  3. Find principal components via eigendecomposition with np_eig
  4. Project, visualize, and reconstruct the data
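As a cross-reference for the whole pipeline, here is a minimal sketch of the same four steps in plain NumPy (an assumption for illustration; the notebook itself uses Maxima's `np_*` bindings, which mirror the NumPy calls shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Rotated 2D cloud: sigma1 = 3 along one axis, sigma2 = 0.5 along the other,
#    rotated by theta = 30 degrees
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.standard_normal((200, 2)) * [3.0, 0.5]) @ R.T

# 2. Sample covariance matrix (rowvar=False: columns are the variables)
C = np.cov(X, rowvar=False)

# 3. Eigendecomposition, sorted so PC1 is the dominant component
vals, vecs = np.linalg.eigh(C)           # eigh: symmetric input, ascending order
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# 4. Project onto the principal components
Z = (X - X.mean(axis=0)) @ vecs

print(vals)   # variances along PC1 and PC2 -- close to 9 and 0.25
```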
In [12]:
load("numerics")$
load("ax-plots")$

Generating 2D Data with Known Structure

We create an elongated cloud of points rotated at 30 degrees. The data is generated along two orthogonal "true" principal axes with very different variances ($\sigma_1 = 3$, $\sigma_2 = 0.5$), then rotated by the angle $\theta = \pi/6$ within the $x$-$y$ plane.

This gives us ground truth to verify that PCA recovers the correct directions and variance magnitudes.
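The ground truth can be written down explicitly: the population covariance of the rotated data is $C = R \, \mathrm{diag}(\sigma_1^2, \sigma_2^2) \, R^T$. A quick NumPy sketch (NumPy here is an illustration, not the notebook's `np_*` API):

```python
import numpy as np

# Population covariance implied by the construction: rotate diag(sigma1^2, sigma2^2)
sigma1, sigma2, theta = 3.0, 0.5, np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C_true = R @ np.diag([sigma1**2, sigma2**2]) @ R.T

print(np.round(C_true, 4))
# diagonal: 6.8125 and 2.4375; off-diagonal: (9 - 0.25) * sin(theta) * cos(theta) ~ 3.7889
```

The sample covariance computed below should land close to this matrix, and its eigenvalues close to 9 and 0.25.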

In [13]:
n : 200$
angle : float(%pi/6)$   /* 30 degrees */

/* Generate data along the true principal axes */
pc1 : np_scale(3.0, np_randn([n]))$   /* high variance axis */
pc2 : np_scale(0.5, np_randn([n]))$   /* low variance axis */

/* Rotate into x-y coordinates */
x : np_add(np_scale(cos(angle), pc1), np_scale(-sin(angle), pc2))$
y : np_add(np_scale(sin(angle), pc1), np_scale(cos(angle), pc2))$

print("Generated", n, "points with sigma1=3, sigma2=0.5, angle=30 deg")$
Generated 200 points with sigma1=3, sigma2=0.5, angle=30 deg
In [14]:
/* Plot the raw data */
ax_draw2d(
  marker_size=4, opacity=0.5,
  points(x, y),
  title="Original Data",
  xlabel="x", ylabel="y",
  aspect_ratio=true
)$

Computing the Covariance Matrix

The covariance matrix $C$ summarizes how each pair of variables co-varies. For our 2D data (stacked as an $n \times 2$ matrix), $C$ is $2 \times 2$:

$$C = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

The diagonal entries are the variances of $x$ and $y$; the off-diagonal entries are the covariance between them.
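The formula above is easy to verify against a library implementation. A NumPy cross-check (a sketch for illustration; the notebook uses `np_cov`):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.3],
                                              [0.3, 1.0]])   # arbitrary correlated 2D data

# Sample covariance from the definition: mean-center, then (1/(n-1)) * Xc^T Xc
n = X.shape[0]
Xc = X - X.mean(axis=0)
C_manual = (Xc.T @ Xc) / (n - 1)

# Agrees with the built-in (rowvar=False means columns are the variables)
C_np = np.cov(X, rowvar=False)
print(np.allclose(C_manual, C_np))   # True
```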

In [15]:
/* Stack x, y into an n-by-2 matrix */
data : np_hstack(
  np_reshape(x, [n, 1]),
  np_reshape(y, [n, 1])
)$
print("Data shape:", np_shape(data))$

/* Compute the sample covariance matrix */
C : np_cov(data)$
print("Covariance matrix:")$
np_to_matrix(C);
Data shape: [200,2]
Covariance matrix:
Out[15]:

Eigendecomposition

The eigenvalues of the covariance matrix give the variance along each principal component, and the eigenvectors give the directions. The eigenvector with the largest eigenvalue points along the direction of greatest spread in the data.

We sort the eigenvalues in descending order so that PC1 always corresponds to the dominant component.
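The same decompose-and-sort step in NumPy, applied to a covariance matrix close to the one this notebook's construction produces (the matrix entries are taken from the 30-degree ground truth, not from the notebook's run):

```python
import numpy as np

# Covariance implied by the 30-degree construction (rounded)
C = np.array([[6.8125, 3.7889],
              [3.7889, 2.4375]])

# eigh is the right choice for symmetric matrices; it returns ascending eigenvalues
vals, vecs = np.linalg.eigh(C)

# Reorder descending so column 0 of vecs is the dominant component (PC1)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

print(np.round(vals, 3))          # ~ [9.0, 0.25]
print(np.round(vecs[:, 0], 3))    # PC1 direction ~ +-(cos 30, sin 30) = +-(0.866, 0.5)
```

Note the sign ambiguity: an eigenvector is only defined up to a factor of -1, so PC1 may point in either direction along the elongated axis.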

In [16]:
/* Eigendecomposition of the covariance matrix */
[vals, vecs] : np_eig(C)$

/* Sort by descending eigenvalue so PC1 = dominant component */
if np_ref(vals, 0) < np_ref(vals, 1) then (
  vals : ndarray([np_ref(vals, 1), np_ref(vals, 0)], [2]),
  vecs : np_hstack(np_col(vecs, 1), np_col(vecs, 0))
)$

print("Eigenvalues (descending):")$
print(np_to_list(vals))$

print("Eigenvectors (columns):")$
np_to_matrix(vecs);
Eigenvalues (descending):
[9.883969942230777,0.24901138222797123]
Eigenvectors (columns):
Out[16]:
In [17]:
/* Explained variance ratio: fraction of total variance per component */
total_var : np_sum(vals)$
ratios : np_to_list(np_scale(1.0 / total_var, vals))$

print("Explained variance ratios:")$
for i : 1 thru length(ratios) do
  printf(true, "  PC~d: ~,1f%~%", i, ratios[i] * 100)$
Explained variance ratios:
  PC1: 97.5%
  PC2: 2.5%

Visualizing Principal Components

We overlay the eigenvectors on the scatter plot, scaled by the square root of the corresponding eigenvalue (so the arrow length is proportional to the standard deviation along that direction). The longer arrow should point along the elongated direction of the cloud.

In [18]:
/* Compute the mean of x and y for centering the arrows */
mx : np_mean(x)$
my : np_mean(y)$

/* Extract eigenvectors and scale by sqrt(eigenvalue) */
ev1 : np_to_list(np_col(vecs, 0))$
ev2 : np_to_list(np_col(vecs, 1))$
scale1 : sqrt(np_ref(vals, 0))$
scale2 : sqrt(np_ref(vals, 1))$

ax_draw2d(
  color="steelblue", marker_size=4, opacity=0.4,
  points(x, y),
  /* First eigenvector (larger variance β€” dominant PC) */
  color="darkred", line_width=3,
  lines([mx, mx + scale1 * ev1[1]], [my, my + scale1 * ev1[2]]),
  /* Second eigenvector (smaller variance) */
  color="red", line_width=3,
  lines([mx, mx + scale2 * ev2[1]], [my, my + scale2 * ev2[2]]),
  title="Data with Principal Component Directions",
  xlabel="x", ylabel="y",
  aspect_ratio=true
)$

Projecting onto Principal Components

To project the data onto the principal components, we:

  1. Mean-center the data (subtract the column means)
  2. Multiply by the eigenvector matrix: $Z = (X - \bar{X}) \, V$

In the projected coordinates, the axes are aligned with the principal directions and the covariance matrix becomes diagonal.
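A NumPy sketch of the projection and the diagonality check (for illustration; the notebook does the same with `np_matmul` and `np_cov`):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.standard_normal((200, 2)) * [3.0, 0.5]) @ R.T

vals, V = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(vals)[::-1]
vals, V = vals[order], V[:, order]

# Project: Z = (X - mean) V
Z = (X - X.mean(axis=0)) @ V

# In PC coordinates the sample covariance is diagonal,
# with the eigenvalues on the diagonal
C_pc = np.cov(Z, rowvar=False)
print(np.round(C_pc, 6))
```

The off-diagonal entries vanish up to floating-point error because $V$ diagonalizes the sample covariance exactly: $V^T C V = \mathrm{diag}(\lambda_1, \lambda_2)$.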

In [19]:
/* Mean-center each column */
mean_vec : np_mean(data, 0)$
centered : np_sub(data, np_reshape(mean_vec, [1, 2]))$

/* Project onto principal components */
projected : np_matmul(centered, vecs)$
print("Projected data shape:", np_shape(projected))$

/* Verify the projected covariance is (approximately) diagonal */
print("Covariance in PC space (should be nearly diagonal):")$
np_to_matrix(np_cov(projected));
Projected data shape: [200,2]
Covariance in PC space (should be nearly diagonal):
Out[19]:
In [20]:
/* Plot the projected data -- axes should be aligned with variance */
proj_x : np_col(projected, 0)$
proj_y : np_col(projected, 1)$

ax_draw2d(
  color="darkorange", marker_size=4, opacity=0.5,
  points(proj_x, proj_y),
  title="Data in Principal Component Space",
  xlabel="PC1", ylabel="PC2",
  aspect_ratio=true
)$

Explained Variance Plot

A bar chart of the explained variance ratio shows how much information each principal component captures. For our 2D data most of the variance should be in PC1 (the dominant eigenvalue, corresponding to $\sigma_1 = 3$).
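The ratios themselves are just the eigenvalues normalized by their sum. Using the (rounded) eigenvalues printed earlier in this notebook:

```python
import numpy as np

# Eigenvalues copied (rounded) from the notebook output above
vals = np.array([9.884, 0.249])
ratios = vals / vals.sum()
print(np.round(ratios, 3))   # -> [0.975 0.025]
```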

In [21]:
ax_draw2d(
  color="steelblue",
  ax_bar(["PC1", "PC2"], ratios),
  title="Explained Variance Ratio",
  ylabel="Fraction of Total Variance"
)$

Reconstruction from the First Principal Component

Dimensionality reduction means keeping only the top $k$ principal components. Here we reconstruct the data using only PC1 (the dominant component) and discard PC2 (the low-variance direction).

The reconstruction formula is:

$$\hat{X} = Z_k \, V_k^T + \bar{X}$$

where $Z_k$ contains only the scores for the kept components and $V_k$ contains the corresponding eigenvectors.
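The reconstruction formula can be sketched in NumPy, along with a useful sanity check: the mean squared reconstruction error equals (up to the $1/(n-1)$ convention) the eigenvalue of the discarded component. This is a cross-check for illustration, not the notebook's `np_*` code:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.standard_normal((200, 2)) * [3.0, 0.5]) @ R.T

mean = X.mean(axis=0)
vals, V = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(vals)[::-1]
vals, V = vals[order], V[:, order]

# Keep k = 1: scores Z_k on PC1, then X_hat = Z_k V_k^T + mean
Vk = V[:, :1]                    # (2, 1) dominant eigenvector
Zk = (X - mean) @ Vk             # (n, 1) scores
X_hat = Zk @ Vk.T + mean

# The squared residuals are exactly the discarded PC2 scores squared
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse)   # close to the smaller eigenvalue, ~ 0.25
```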

In [22]:
/* Keep only the dominant PC (column 0, the largest eigenvalue) */
scores_dominant : np_reshape(np_col(projected, 0), [n, 1])$
vec_dominant : np_reshape(np_col(vecs, 0), [1, 2])$

/* Reconstruct: project back and add the mean */
reconstructed : np_add(
  np_matmul(scores_dominant, vec_dominant),
  np_reshape(mean_vec, [1, 2])
)$

rx : np_col(reconstructed, 0)$
ry : np_col(reconstructed, 1)$

ax_draw2d(
  color="steelblue", marker_size=4, opacity=0.3,
  name="Original", points(x, y),
  color="red", marker_size=4, opacity=0.7,
  name="Reconstructed (1 PC)", points(rx, ry),
  title="Original vs Reconstructed from Dominant PC",
  xlabel="x", ylabel="y",
  aspect_ratio=true
)$

Summary

Principal Component Analysis decomposes the covariance matrix to find the orthogonal directions of maximum variance:

  1. Covariance matrix (np_cov) -- encodes all pairwise variance and covariance between variables.
  2. Eigendecomposition (np_eig) -- eigenvalues give the variance along each principal direction; eigenvectors give the directions.
  3. Projection (np_matmul) -- transforms data into the principal component coordinate system where the covariance is diagonal.
  4. Reconstruction -- keeping only the top $k$ components gives a low-dimensional approximation that preserves the most variance.

PCA is the foundation for many techniques in statistics, machine learning, and signal processing. The covariance matrix is the only input needed -- all the geometric structure of the data is encoded in its eigenstructure.