Dimensionality Reduction Reference

Complete reference for FormulaML dimensionality reduction functions including PCA and Kernel PCA.

Functions for reducing the number of features while preserving important information.

ML.DIM_REDUCTION Namespace

ML.DIM_REDUCTION.PCA()

Creates a Principal Component Analysis (PCA) transformer.

Syntax:

=ML.DIM_REDUCTION.PCA(n_components, whiten, svd_solver, tol, iterated_power, n_oversamples, power_iteration_normalizer, random_state)

Parameters:

  • n_components (Integer/String, Optional): Number of components to keep
    • Integer: Exact number of components
    • "mle": Automatic selection via maximum likelihood estimation (MLE)
    • None: Keep all components
  • whiten (Boolean, Optional): Rescale components to unit variance (default: FALSE)
  • svd_solver (String, Optional): Decomposition algorithm (default: "auto")
    • "auto": Automatic selection
    • "full": Full SVD
    • "arpack": Truncated SVD
    • "randomized": Faster approximation
  • tol (Number, Optional): Tolerance for the "arpack" solver (default: 0.0)
  • iterated_power (String/Integer, Optional): Power iterations for "randomized" (default: "auto")
  • n_oversamples (Integer, Optional): Oversampling for "randomized" (default: 10)
  • power_iteration_normalizer (String, Optional): Normalizer for "randomized" (default: "auto")
    • "auto", "QR", "LU", "none"
  • random_state (Integer, Optional): Random seed for reproducibility

Returns: PCA transformer object

Use Case: Dimensionality reduction, visualization, noise reduction

Example:

# Basic PCA to 2 components
Cell A1: =ML.DIM_REDUCTION.PCA(2)
Result: <PCA>

# PCA with whitening
Cell A2: =ML.DIM_REDUCTION.PCA(10, TRUE)

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
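
The "mle" option described above can be passed in place of an integer; a minimal sketch:

# Automatic component selection via MLE
Cell A3: =ML.DIM_REDUCTION.PCA("mle")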

ML.DIM_REDUCTION.PCA.RESULTS()

Extracts detailed PCA results and statistics.

Syntax:

=ML.DIM_REDUCTION.PCA.RESULTS(pca_obj)

Parameters:

  • pca_obj (Object, Required): Fitted PCA object

Returns: DataFrame with PCA statistics

  • Columns: Components | Explained Variance | Explained Variance Ratio | Singular Values

Use Case: Analyze PCA components, determine optimal n_components

Example:

# Fit PCA
Cell A1: =ML.DIM_REDUCTION.PCA()
Cell B1: =ML.FIT(A1, X_data)

# Get detailed results
Cell C1: =ML.DIM_REDUCTION.PCA.RESULTS(B1)

# Result shows:
# Components | Explained Variance | Explained Variance Ratio | Singular Values
# 1          | 4.22               | 0.73                     | 25.09
# 2          | 0.93               | 0.16                     | 11.77
# ...
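
The ratio column is additive: here the first two components retain 0.73 + 0.16 = 0.89, i.e. roughly 89% of the total variance.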

ML.DIM_REDUCTION.KERNEL_PCA() ⭐

Creates a Kernel PCA transformer for non-linear dimensionality reduction (Premium feature).

Syntax:

=ML.DIM_REDUCTION.KERNEL_PCA(n_components, kernel, degree, gamma, coef0)

Parameters:

  • n_components (Integer, Optional): Number of components (default: None = all)
  • kernel (String, Optional): Kernel type (default: "linear")
    • "linear": Linear kernel
    • "poly": Polynomial kernel
    • "rbf": Radial basis function
    • "sigmoid": Sigmoid kernel
    • "cosine": Cosine similarity
  • degree (Integer, Optional): Polynomial degree (default: 3)
  • gamma (Number, Optional): Kernel coefficient for "rbf", "poly", and "sigmoid" (default: 1.0; see kernel definitions below)
  • coef0 (Number, Optional): Independent term for "poly" and "sigmoid" (default: 0.0)
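
For reference, the standard definitions of these kernels (assuming FormulaML follows the usual conventions; this reference does not spell them out):

  • "rbf": k(x, y) = exp(-gamma * ||x - y||^2)
  • "poly": k(x, y) = (gamma * <x, y> + coef0)^degree
  • "sigmoid": k(x, y) = tanh(gamma * <x, y> + coef0)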

Returns: Kernel PCA transformer object

Use Case: Non-linear dimensionality reduction, complex data patterns

Example:

# RBF Kernel PCA (degree argument skipped; gamma = 0.1)
Cell A1: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.1)
Result: <KernelPCA>

# Polynomial Kernel PCA
Cell A2: =ML.DIM_REDUCTION.KERNEL_PCA(2, "poly", 3, 1.0, 1.0)

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
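
As with linear PCA, the fitted transformer can presumably be reused on held-out data (a sketch mirroring the PCA example above; X_test is a placeholder range):

# Project test data onto the learned components
Cell C1: =ML.TRANSFORM(A1, X_test)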

Common Patterns

Basic PCA for Visualization

# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # 64 features

# Scale data (important for PCA)
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Reduce to 2D for visualization
Cell E1: =ML.DIM_REDUCTION.PCA(2)
Cell F1: =ML.FIT_TRANSFORM(E1, D1)

# Sample results
Cell G1: =ML.DATA.SAMPLE(F1, 100)
# Plot in Excel: scatter chart with 2 columns

Finding Optimal Number of Components

# Load and scale data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Fit PCA with all components
Cell E1: =ML.DIM_REDUCTION.PCA()
Cell F1: =ML.FIT(E1, D1)

# Get explained variance
Cell G1: =ML.DIM_REDUCTION.PCA.RESULTS(F1)

# Analyze cumulative variance
# Choose n where cumulative variance > 0.95 (95%)
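
A minimal sketch of that analysis with native Excel formulas, assuming the RESULTS table in G1 spills onto the sheet with headers in row 1 and Explained Variance Ratio in column I (the exact spill layout is an assumption):

# Cumulative variance ratio; fill down alongside the spilled table
Cell K2: =SUM($I$2:I2)
# Flag the rows where the 95% threshold is reached
Cell L2: =IF(K2>=0.95, "enough components", "")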

PCA in Pipeline

# Create complete pipeline
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.DIM_REDUCTION.PCA(10)
Cell A3: =ML.CLASSIFICATION.SVM()

# Combine steps
Cell B1: =ML.PIPELINE(A1, A2, A3)

# Train pipeline
Cell C1: =ML.FIT(B1, X_train, y_train)

# Predict
Cell D1: =ML.PREDICT(C1, X_test)

# Evaluate
Cell E1: =ML.EVAL.SCORE(C1, X_test, y_test)

Feature Extraction and Model Training

# Prepare data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Target

# Split data
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0)  # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1)  # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0)  # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1)  # y_test

# Extract principal components
Cell F1: =ML.DIM_REDUCTION.PCA(20)
Cell G1: =ML.FIT_TRANSFORM(F1, D1)  # Reduced train
Cell G2: =ML.TRANSFORM(F1, D2)       # Reduced test

# Train on reduced features
Cell H1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)
Cell I1: =ML.FIT(H1, G1, E1)

# Evaluate
Cell J1: =ML.EVAL.SCORE(I1, G2, E2)

Comparing Linear and Kernel PCA

# Prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Linear PCA
Cell E1: =ML.DIM_REDUCTION.PCA(2)
Cell F1: =ML.FIT_TRANSFORM(E1, D1)

# RBF Kernel PCA
Cell E2: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.5)
Cell F2: =ML.FIT_TRANSFORM(E2, D1)

# Compare transformed data
Cell G1: =ML.DATA.SAMPLE(F1, 20)  # Linear PCA
Cell G2: =ML.DATA.SAMPLE(F2, 20)  # Kernel PCA

PCA for Noise Reduction

# Load noisy data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:J1000, TRUE)

# Scale features
Cell B1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C1: =ML.FIT_TRANSFORM(B1, A1)

# Keep components explaining 95% variance
Cell D1: =ML.DIM_REDUCTION.PCA()
Cell E1: =ML.FIT(D1, C1)

# Check explained variance
Cell F1: =ML.DIM_REDUCTION.PCA.RESULTS(E1)

# Reduce dimensions (remove noise)
Cell G1: =ML.DIM_REDUCTION.PCA(5)  # Based on F1 analysis
Cell H1: =ML.FIT_TRANSFORM(G1, C1)
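
To verify how much variance the 5 retained components explain, the RESULTS helper can be applied to a separately fitted copy (a sketch; assumes ML.FIT returns a fitted object, as in the examples above):

# Fit the 5-component PCA and inspect its retained variance
Cell I1: =ML.FIT(G1, C1)
Cell J1: =ML.DIM_REDUCTION.PCA.RESULTS(I1)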

Grid Search with PCA

# Create pipeline with PCA
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.DIM_REDUCTION.PCA()
Cell A3: =ML.CLASSIFICATION.SVM()
Cell B1: =ML.PIPELINE(A1, A2, A3)

# Parameter grid
# Model | Parameter      | V1  | V2  | V3
Cell C1: "pca" | "n_components" | 5   | 10  | 20
Cell C2: "svm" | "C"            | 0.1 | 1   | 10
Cell C3: "svm" | "kernel"       | "linear" | "rbf" |

# Grid search
Cell D1: =ML.EVAL.GRID_SEARCH(B1, C1:G3, "accuracy", 5, TRUE)
Cell E1: =ML.FIT(D1, X_train, y_train)

# Best PCA components and SVM params
Cell F1: =ML.EVAL.BEST_PARAMS(E1)
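
Assuming the fitted search object in E1 can be scored like a fitted pipeline (not stated explicitly in this reference), the best configuration can be evaluated on held-out data:

# Score the refit best estimator on the test set
Cell G1: =ML.EVAL.SCORE(E1, X_test, y_test)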

Tips and Best Practices

  1. When to Use PCA

    • High-dimensional data (many features)
    • Visualization (reduce to 2-3D)
    • Remove multicollinearity
    • Speed up training
    • Noise reduction
  2. Preprocessing for PCA

    • Always scale features first
    • Use StandardScaler or MinMaxScaler
    • PCA sensitive to feature scales
    • Center data (automatic in PCA)
  3. Choosing n_components

    • Visualization: 2 or 3 components
    • 95% variance: Common threshold
    • Elbow method: Plot explained variance
    • Cross-validation: Test different values
    • None: Keep all for analysis
  4. Interpreting Results

    • Explained variance ratio: Proportion of information retained
    • Cumulative variance: Total information up to component n
    • Singular values: Scale of each component
    • Later components capture progressively less variance
  5. Linear vs Kernel PCA

    • Linear PCA: Fast, linear relationships
    • Kernel PCA: Slower, non-linear patterns
    • RBF kernel: Good default for non-linear
    • Linear kernel: Equivalent to standard PCA
  6. Performance Optimization

    • Use svd_solver="randomized" for large datasets (see the sketch after this list)
    • Specify n_components to speed up
    • Consider incremental PCA for very large data
    • Cache fitted PCA objects
  7. Common Patterns

    • Visualization: Scale → PCA(2) → Plot
    • Preprocessing: Scale → PCA (enough components for 95% variance) → Model
    • Analysis: PCA (all components) → Analyze variance → Reduce
    • Pipeline: Scale → PCA → Classifier
  8. Avoiding Pitfalls

    • ❌ PCA on unscaled data
    • ❌ Fitting PCA on test data
    • ❌ Over-reducing (too few components)
    • ❌ Ignoring interpretability loss
    • ✅ Scale before PCA
    • ✅ Fit on train only
    • ✅ Preserve 90-95% variance
    • ✅ Document transformation
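
Following tip 6, a minimal sketch of a randomized-solver PCA using the full signature documented above (50 components and random_state = 42 are illustrative values; skipped arguments keep their defaults):

# Randomized SVD for large datasets, with a fixed seed for reproducibility
Cell A1: =ML.DIM_REDUCTION.PCA(50, FALSE, "randomized", , , , , 42)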