Dimensionality Reduction Reference
Functions for reducing the number of features while preserving important information.
ML.DIM_REDUCTION Namespace
ML.DIM_REDUCTION.PCA()
Creates a Principal Component Analysis (PCA) transformer.
Syntax:
=ML.DIM_REDUCTION.PCA(n_components, whiten, svd_solver, tol, iterated_power, n_oversamples, power_iteration_normalizer, random_state)
Parameters:
n_components
(Integer/String, Optional): Number of components to keep
- Integer: Exact number of components
- "mle": Automatic selection via MLE
- None: Keep all components
whiten
(Boolean, Optional): Whiten components to ensure uncorrelated outputs (default: FALSE)
svd_solver
(String, Optional): Decomposition algorithm (default: "auto")
- "auto": Automatic selection
- "full": Full SVD
- "arpack": Truncated SVD
- "randomized": Faster approximation
tol
(Number, Optional): Tolerance for the "arpack" solver (default: 0.0)
iterated_power
(String/Integer, Optional): Power iterations for "randomized" (default: "auto")
n_oversamples
(Integer, Optional): Oversampling for "randomized" (default: 10)
power_iteration_normalizer
(String, Optional): Power iteration normalizer for "randomized" (default: "auto")
- Options: "auto", "QR", "LU", "none"
random_state
(Integer, Optional): Random seed for reproducibility
Returns: PCA transformer object
Use Case: Dimensionality reduction, visualization, noise reduction
Example:
# Basic PCA to 2 components
Cell A1: =ML.DIM_REDUCTION.PCA(2)
Result: <PCA>
# PCA with whitening
Cell A2: =ML.DIM_REDUCTION.PCA(10, TRUE)
# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
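n_components also accepts the automatic mode described in the parameter list above:
# Automatic component selection via MLE
Cell A3: =ML.DIM_REDUCTION.PCA("mle")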
ML.DIM_REDUCTION.PCA.RESULTS()
Extracts detailed PCA results and statistics.
Syntax:
=ML.DIM_REDUCTION.PCA.RESULTS(pca_obj)
Parameters:
pca_obj
(Object, Required): Fitted PCA object
Returns: DataFrame with PCA statistics
- Columns: Components | Explained Variance | Explained Variance Ratio | Singular Values
Use Case: Analyze PCA components, determine optimal n_components
Example:
# Fit PCA
Cell A1: =ML.DIM_REDUCTION.PCA()
Cell B1: =ML.FIT(A1, X_data)
# Get detailed results
Cell C1: =ML.DIM_REDUCTION.PCA.RESULTS(B1)
# Result shows:
# Components | Explained Variance | Explained Variance Ratio | Singular Values
# 1 | 4.22 | 0.73 | 25.09
# 2 | 0.93 | 0.16 | 11.77
# ...
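Individual statistics can be pulled from the returned table with ordinary Excel functions. The reference below assumes the table spills as a dynamic array at C1 (an assumption about the add-in's output layout):
# Explained variance ratio of the first component (data row 2, column 3 of the spill)
Cell D1: =INDEX(C1#, 2, 3)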
ML.DIM_REDUCTION.KERNEL_PCA() ⭐
Creates a Kernel PCA transformer for non-linear dimensionality reduction (Premium feature).
Syntax:
=ML.DIM_REDUCTION.KERNEL_PCA(n_components, kernel, degree, gamma, coef0)
Parameters:
n_components
(Integer, Optional): Number of components (default: None = all)
kernel
(String, Optional): Kernel type (default: "linear")
- "linear": Linear kernel
- "poly": Polynomial kernel
- "rbf": Radial basis function
- "sigmoid": Sigmoid kernel
- "cosine": Cosine similarity
degree
(Integer, Optional): Polynomial degree (default: 3)
gamma
(Number, Optional): Kernel coefficient (default: 1.0)
coef0
(Number, Optional): Independent term for "poly" and "sigmoid" kernels (default: 0.0)
Returns: Kernel PCA transformer object
Use Case: Non-linear dimensionality reduction, complex data patterns
Example:
# RBF Kernel PCA (degree skipped; gamma = 0.1)
Cell A1: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.1)
Result: <KernelPCA>
# Polynomial Kernel PCA
Cell A2: =ML.DIM_REDUCTION.KERNEL_PCA(2, "poly", 3, 1.0, 1.0)
# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
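As with linear PCA, the fitted transformer can then be applied to held-out data:
# Apply the fitted kernel mapping to new data
Cell C1: =ML.TRANSFORM(A1, X_test)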
Common Patterns
Basic PCA for Visualization
# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63") # 64 features
# Scale data (important for PCA)
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Reduce to 2D for visualization
Cell E1: =ML.DIM_REDUCTION.PCA(2)
Cell F1: =ML.FIT_TRANSFORM(E1, D1)
# Sample results
Cell G1: =ML.DATA.SAMPLE(F1, 100)
# Plot in Excel: scatter chart with 2 columns
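To color the chart by class, pull the label column alongside the two components. The target column index is an assumption here, borrowed from the Feature Extraction pattern below, which treats column 64 of DIGITS as the target:
# Labels for coloring the scatter plot (assumed target column)
Cell H1: =ML.DATA.SELECT_COLUMNS(A1, 64)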
Finding Optimal Number of Components
# Load and scale data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Fit PCA with all components
Cell E1: =ML.DIM_REDUCTION.PCA()
Cell F1: =ML.FIT(E1, D1)
# Get explained variance
Cell G1: =ML.DIM_REDUCTION.PCA.RESULTS(F1)
# Analyze cumulative variance
# Choose n where cumulative variance > 0.95 (95%)
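The running total can be computed with a plain anchored SUM. The layout below is an assumption (it depends on where the RESULTS table in G1 spills; here the Explained Variance Ratio column is assumed to land in I2:I5):
# Cumulative explained variance (fill down through J5)
Cell J2: =SUM($I$2:I2)
# The first row where the running total reaches 0.95 gives your n_components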
PCA in Pipeline
# Create complete pipeline
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.DIM_REDUCTION.PCA(10)
Cell A3: =ML.CLASSIFICATION.SVM()
# Combine steps
Cell B1: =ML.PIPELINE(A1, A2, A3)
# Train pipeline
Cell C1: =ML.FIT(B1, X_train, y_train)
# Predict
Cell D1: =ML.PREDICT(C1, X_test)
# Evaluate
Cell E1: =ML.EVAL.SCORE(C1, X_test, y_test)
Feature Extraction and Model Training
# Prepare data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63") # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64) # Target
# Split data
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0) # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1) # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0) # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1) # y_test
# Extract principal components
Cell F1: =ML.DIM_REDUCTION.PCA(20)
Cell G1: =ML.FIT_TRANSFORM(F1, D1) # Reduced train
Cell G2: =ML.TRANSFORM(F1, D2) # Reduced test
# Train on reduced features
Cell H1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)
Cell I1: =ML.FIT(H1, G1, E1)
# Evaluate
Cell J1: =ML.EVAL.SCORE(I1, G2, E2)
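To see what the 20-component compression costs, it can help to train the same classifier on the full 64 features and compare scores; a sketch using only functions shown above:
# Baseline on all 64 original features, for comparison with J1
Cell H2: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)
Cell I2: =ML.FIT(H2, D1, E1)
Cell J2: =ML.EVAL.SCORE(I2, D2, E2)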
Comparing Linear and Kernel PCA
# Prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)
# Linear PCA
Cell E1: =ML.DIM_REDUCTION.PCA(2)
Cell F1: =ML.FIT_TRANSFORM(E1, D1)
# RBF Kernel PCA
Cell E2: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.5)
Cell F2: =ML.FIT_TRANSFORM(E2, D1)
# Compare transformed data
Cell G1: =ML.DATA.SAMPLE(F1, 20) # Linear PCA
Cell G2: =ML.DATA.SAMPLE(F2, 20) # Kernel PCA
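As noted under "Linear vs Kernel PCA" in the tips below, a linear kernel should reproduce standard PCA, which makes a quick sanity check of the two code paths:
# Linear-kernel PCA: expected to match F1 up to sign flips of the components
Cell E3: =ML.DIM_REDUCTION.KERNEL_PCA(2, "linear")
Cell F3: =ML.FIT_TRANSFORM(E3, D1)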
PCA for Noise Reduction
# Load noisy data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:J1000, TRUE)
# Scale features
Cell B1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C1: =ML.FIT_TRANSFORM(B1, A1)
# First fit with all components to see how many explain 95% of the variance
Cell D1: =ML.DIM_REDUCTION.PCA()
Cell E1: =ML.FIT(D1, C1)
# Check explained variance
Cell F1: =ML.DIM_REDUCTION.PCA.RESULTS(E1)
# Reduce dimensions (remove noise)
Cell G1: =ML.DIM_REDUCTION.PCA(5) # Based on F1 analysis
Cell H1: =ML.FIT_TRANSFORM(G1, C1)
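Denoising normally ends by projecting the reduced data back to the original feature space. This reference does not document an inverse transform, so the function name below is hypothetical; if the backend mirrors scikit-learn's inverse_transform, the call might look like:
# Hypothetical function name: project the 5 components back to the original space
Cell I1: =ML.INVERSE_TRANSFORM(G1, H1)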
Grid Search with PCA
# Create pipeline with PCA
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.DIM_REDUCTION.PCA()
Cell A3: =ML.CLASSIFICATION.SVM()
Cell B1: =ML.PIPELINE(A1, A2, A3)
# Parameter grid
# Model | Parameter | V1 | V2 | V3
Cell C1: "pca" | "n_components" | 5 | 10 | 20
Cell C2: "svm" | "C" | 0.1 | 1 | 10
Cell C3: "svm" | "kernel" | "linear" | "rbf" |
# Grid search
Cell C5: =ML.EVAL.GRID_SEARCH(B1, C1:G3, "accuracy", 5, TRUE)
Cell C6: =ML.FIT(C5, X_train, y_train)
# Best PCA components and SVM params
Cell C7: =ML.EVAL.BEST_PARAMS(C6)
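Assuming the fitted grid-search object in C6 behaves like a fitted pipeline (an assumption; refit behavior is not documented here), it can be used for prediction and scoring directly:
# Predict and score with the best parameters found
Cell C8: =ML.PREDICT(C6, X_test)
Cell C9: =ML.EVAL.SCORE(C6, X_test, y_test)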
Tips and Best Practices
When to Use PCA
- High-dimensional data (many features)
- Visualization (reduce to 2-3D)
- Remove multicollinearity
- Speed up training
- Noise reduction
Preprocessing for PCA
- Always scale features first
- Use StandardScaler or MinMaxScaler
- PCA is sensitive to feature scales
- Centering is handled automatically by PCA
Choosing n_components
- Visualization: 2 or 3 components
- 95% variance: Common threshold
- Elbow method: Plot explained variance
- Cross-validation: Test different values
- None: Keep all for analysis
Interpreting Results
- Explained variance ratio: Proportion of total variance retained by a component (see the formula after this list)
- Cumulative variance: Total information up to component n
- Singular values: Scale of each component
- Higher components capture less variance
Linear vs Kernel PCA
- Linear PCA: Fast, linear relationships
- Kernel PCA: Slower, non-linear patterns
- RBF kernel: Good default for non-linear
- Linear kernel: Equivalent to standard PCA
Performance Optimization
- Use svd_solver="randomized" for large datasets
- Specify n_components to speed up fitting
- Consider incremental PCA for very large data
- Cache fitted PCA objects
Common Patterns
- Visualization: Scale → PCA(2) → Plot
- Preprocessing: Scale → PCA(95%) → Model
- Analysis: PCA(all) → Analyze variance → Reduce
- Pipeline: Scale → PCA → Classifier
Avoiding Pitfalls
- ❌ PCA on unscaled data
- ❌ Fitting PCA on test data
- ❌ Over-reducing (too few components)
- ❌ Ignoring the loss of interpretability
- ✅ Scale before PCA
- ✅ Fit on train only
- ✅ Preserve 90-95% variance
- ✅ Document transformation
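For reference, the explained variance ratio described under "Interpreting Results" is the standard PCA quantity, each component's variance over the total:

$$\text{ratio}_k = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i}, \qquad \text{cumulative}_n = \frac{\sum_{k=1}^{n} \lambda_k}{\sum_{i=1}^{d} \lambda_i}$$

where $\lambda_k$ is the variance captured by the $k$-th principal component (the $k$-th eigenvalue of the data covariance matrix) and $d$ is the number of original features.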
Related Functions
- ML.PREPROCESSING.STANDARD_SCALER() - Scale before PCA
- ML.FIT_TRANSFORM() - Fit and reduce
- ML.PIPELINE() - Combine with models
- ML.CLUSTERING.KMEANS() - Cluster in reduced space