Dimensionality Reduction Reference

Complete reference for FormulaML dimensionality reduction functions including PCA and Kernel PCA.

Functions for reducing the number of features while preserving important information.

ML.DIM_REDUCTION Namespace

ML.DIM_REDUCTION.PCA()

Creates a Principal Component Analysis (PCA) transformer.

Syntax:

=ML.DIM_REDUCTION.PCA(n_components, whiten, svd_solver, tol, iterated_power, n_oversamples, power_iteration_normalizer, random_state)

Parameters:

  • n_components (Integer/String, Optional): Number of components to keep
    • Integer: Exact number of components
    • "mle": Automatic selection via maximum likelihood estimation (MLE)
    • None: Keep all components
  • whiten (Boolean, Optional): Rescale components to unit variance (default: FALSE)
  • svd_solver (String, Optional): Decomposition algorithm (default: "auto")
    • "auto": Automatic selection
    • "full": Full SVD
    • "arpack": Truncated SVD
    • "randomized": Faster approximation
  • tol (Number, Optional): Tolerance for the "arpack" solver (default: 0.0)
  • iterated_power (String/Integer, Optional): Power iterations for "randomized" (default: "auto")
  • n_oversamples (Integer, Optional): Oversampling for "randomized" (default: 10)
  • power_iteration_normalizer (String, Optional): Normalizer for "randomized" (default: "auto")
    • "auto", "QR", "LU", "none"
  • random_state (Integer, Optional): Random seed for reproducibility

Returns: PCA transformer object

Use Case: Dimensionality reduction, visualization, noise reduction

Example:

# Basic PCA to 2 components
Cell A1: =ML.DIM_REDUCTION.PCA(2)
Result: <PCA>

# PCA with whitening
Cell A2: =ML.DIM_REDUCTION.PCA(10, TRUE)

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
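
The "mle" option described above can be passed in place of an integer; a minimal sketch:

# Automatic component selection via MLE
Cell A3: =ML.DIM_REDUCTION.PCA("mle")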

ML.DIM_REDUCTION.PCA.RESULTS()

Extracts detailed PCA results and statistics.

Syntax:

=ML.DIM_REDUCTION.PCA.RESULTS(pca_obj)

Parameters:

  • pca_obj (Object, Required): Fitted PCA object

Returns: DataFrame with PCA statistics

  • Columns: Components | Explained Variance | Explained Variance Ratio | Singular Values

Use Case: Analyze PCA components, determine optimal n_components

Example:

# Fit PCA
Cell A1: =ML.DIM_REDUCTION.PCA()
Cell B1: =ML.FIT(A1, X_data)

# Get detailed results
Cell C1: =ML.DIM_REDUCTION.PCA.RESULTS(B1)

# Result shows:
# Components | Explained Variance | Explained Variance Ratio | Singular Values
# 1          | 4.22               | 0.73                     | 25.09
# 2          | 0.93               | 0.16                     | 11.77
# ...
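
The ratio column is additive: here the first two components retain 0.73 + 0.16 = 0.89, i.e. roughly 89% of the total variance.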

ML.DIM_REDUCTION.KERNEL_PCA() ⭐

Creates a Kernel PCA transformer for non-linear dimensionality reduction (Premium feature).

Syntax:

=ML.DIM_REDUCTION.KERNEL_PCA(n_components, kernel, degree, gamma, coef0)

Parameters:

  • n_components (Integer, Optional): Number of components (default: None = all)
  • kernel (String, Optional): Kernel type (default: "linear")
    • "linear": Linear kernel
    • "poly": Polynomial kernel
    • "rbf": Radial basis function
    • "sigmoid": Sigmoid kernel
    • "cosine": Cosine similarity
  • degree (Integer, Optional): Polynomial degree (default: 3)
  • gamma (Number, Optional): Kernel coefficient for "rbf", "poly", and "sigmoid" (default: 1.0; see kernel definitions below)
  • coef0 (Number, Optional): Independent term for "poly" and "sigmoid" (default: 0.0)
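
For reference, the standard definitions of these kernels (assuming FormulaML follows the usual conventions; this reference does not spell them out):

  • "rbf": k(x, y) = exp(-gamma * ||x - y||^2)
  • "poly": k(x, y) = (gamma * <x, y> + coef0)^degree
  • "sigmoid": k(x, y) = tanh(gamma * <x, y> + coef0)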

Returns: Kernel PCA transformer object

Use Case: Non-linear dimensionality reduction, complex data patterns

Example:

# RBF Kernel PCA (degree argument skipped; gamma = 0.1)
Cell A1: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.1)
Result: <KernelPCA>

# Polynomial Kernel PCA
Cell A2: =ML.DIM_REDUCTION.KERNEL_PCA(2, "poly", 3, 1.0, 1.0)

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
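
As with linear PCA, the fitted transformer can presumably be reused on held-out data (a sketch mirroring the PCA example above; X_test is a placeholder range):

# Project test data onto the learned components
Cell C1: =ML.TRANSFORM(A1, X_test)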

Common Patterns

Basic PCA for Visualization

# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # 64 features

# Scale data (important for PCA)
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Reduce to 2D for visualization
Cell E1: =ML.DIM_REDUCTION.PCA(2)
Cell F1: =ML.FIT_TRANSFORM(E1, D1)

# Sample results
Cell G1: =ML.DATA.SAMPLE(F1, 100)
# Plot in Excel: scatter chart with 2 columns

Finding Optimal Number of Components

# Load and scale data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Fit PCA with all components
Cell E1: =ML.DIM_REDUCTION.PCA()
Cell F1: =ML.FIT(E1, D1)

# Get explained variance
Cell G1: =ML.DIM_REDUCTION.PCA.RESULTS(F1)

# Analyze cumulative variance
# Choose n where cumulative variance > 0.95 (95%)
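
A minimal sketch of that analysis with native Excel formulas, assuming the RESULTS table in G1 spills onto the sheet with headers in row 1 and Explained Variance Ratio in column I (the exact spill layout is an assumption):

# Cumulative variance ratio; fill down alongside the spilled table
Cell K2: =SUM($I$2:I2)
# Flag the rows where the 95% threshold is reached
Cell L2: =IF(K2>=0.95, "enough components", "")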

PCA in Pipeline

# Create complete pipeline
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.DIM_REDUCTION.PCA(10)
Cell A3: =ML.CLASSIFICATION.SVM()

# Combine steps
Cell B1: =ML.PIPELINE(A1, A2, A3)

# Train pipeline
Cell C1: =ML.FIT(B1, X_train, y_train)

# Predict
Cell D1: =ML.PREDICT(C1, X_test)

# Evaluate
Cell E1: =ML.EVAL.SCORE(C1, X_test, y_test)

Feature Extraction and Model Training

# Prepare data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # Features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Target

# Split data
Cell D1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0)  # X_train
Cell D2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1)  # X_test
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0)  # y_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1)  # y_test

# Extract principal components
Cell F1: =ML.DIM_REDUCTION.PCA(20)
Cell G1: =ML.FIT_TRANSFORM(F1, D1)  # Reduced train
Cell G2: =ML.TRANSFORM(F1, D2)       # Reduced test

# Train on reduced features
Cell H1: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)
Cell I1: =ML.FIT(H1, G1, E1)

# Evaluate
Cell J1: =ML.EVAL.SCORE(I1, G2, E2)

Comparing Linear and Kernel PCA

# Prepare data
Cell A1: =ML.DATASETS.IRIS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, {0,1,2,3})
Cell C1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell D1: =ML.FIT_TRANSFORM(C1, B1)

# Linear PCA
Cell E1: =ML.DIM_REDUCTION.PCA(2)
Cell F1: =ML.FIT_TRANSFORM(E1, D1)

# RBF Kernel PCA
Cell E2: =ML.DIM_REDUCTION.KERNEL_PCA(2, "rbf", , 0.5)
Cell F2: =ML.FIT_TRANSFORM(E2, D1)

# Compare transformed data
Cell G1: =ML.DATA.SAMPLE(F1, 20)  # Linear PCA
Cell G2: =ML.DATA.SAMPLE(F2, 20)  # Kernel PCA

PCA for Noise Reduction

# Load noisy data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:J1000, TRUE)

# Scale features
Cell B1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell C1: =ML.FIT_TRANSFORM(B1, A1)

# Keep components explaining 95% variance
Cell D1: =ML.DIM_REDUCTION.PCA()
Cell E1: =ML.FIT(D1, C1)

# Check explained variance
Cell F1: =ML.DIM_REDUCTION.PCA.RESULTS(E1)

# Reduce dimensions (remove noise)
Cell G1: =ML.DIM_REDUCTION.PCA(5)  # Based on F1 analysis
Cell H1: =ML.FIT_TRANSFORM(G1, C1)
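
To verify how much variance the 5 retained components explain, the RESULTS helper can be applied to a separately fitted copy (a sketch; assumes ML.FIT returns a fitted object, as in the examples above):

# Fit the 5-component PCA and inspect its retained variance
Cell I1: =ML.FIT(G1, C1)
Cell J1: =ML.DIM_REDUCTION.PCA.RESULTS(I1)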

Grid Search with PCA

# Create pipeline with PCA
Cell A1: =ML.PREPROCESSING.STANDARD_SCALER()
Cell A2: =ML.DIM_REDUCTION.PCA()
Cell A3: =ML.CLASSIFICATION.SVM()
Cell B1: =ML.PIPELINE(A1, A2, A3)

# Parameter grid
# Model | Parameter      | V1  | V2  | V3
Cell C1: "pca" | "n_components" | 5   | 10  | 20
Cell C2: "svm" | "C"            | 0.1 | 1   | 10
Cell C3: "svm" | "kernel"       | "linear" | "rbf" |

# Grid search
Cell D1: =ML.EVAL.GRID_SEARCH(B1, C1:G3, "accuracy", 5, TRUE)
Cell E1: =ML.FIT(D1, X_train, y_train)

# Best PCA components and SVM params
Cell F1: =ML.EVAL.BEST_PARAMS(E1)
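
Assuming the fitted search object in E1 can be scored like a fitted pipeline (not stated explicitly in this reference), the best configuration can be evaluated on held-out data:

# Score the refit best estimator on the test set
Cell G1: =ML.EVAL.SCORE(E1, X_test, y_test)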

Tips and Best Practices

  1. When to Use PCA

    • High-dimensional data (many features)
    • Visualization (reduce to 2-3D)
    • Remove multicollinearity
    • Speed up training
    • Noise reduction
  2. Preprocessing for PCA

    • Always scale features first
    • Use StandardScaler or MinMaxScaler
    • PCA sensitive to feature scales
    • Center data (automatic in PCA)
  3. Choosing n_components

    • Visualization: 2 or 3 components
    • 95% variance: Common threshold
    • Elbow method: Plot explained variance
    • Cross-validation: Test different values
    • None: Keep all for analysis
  4. Interpreting Results

    • Explained variance ratio: Proportion of information retained
    • Cumulative variance: Total information up to component n
    • Singular values: Scale of each component
    • Later components capture progressively less variance
  5. Linear vs Kernel PCA

    • Linear PCA: Fast, linear relationships
    • Kernel PCA: Slower, non-linear patterns
    • RBF kernel: Good default for non-linear
    • Linear kernel: Equivalent to standard PCA
  6. Performance Optimization

    • Use svd_solver="randomized" for large datasets (see the sketch after this list)
    • Specify n_components to speed up
    • Consider incremental PCA for very large data
    • Cache fitted PCA objects
  7. Common Patterns

    • Visualization: Scale → PCA(2) → Plot
    • Preprocessing: Scale → PCA (enough components for 95% variance) → Model
    • Analysis: PCA (all components) → Analyze variance → Reduce
    • Pipeline: Scale → PCA → Classifier
  8. Avoiding Pitfalls

    • ❌ PCA on unscaled data
    • ❌ Fitting PCA on test data
    • ❌ Over-reducing (too few components)
    • ❌ Ignoring interpretability loss
    • ✅ Scale before PCA
    • ✅ Fit on train only
    • ✅ Preserve 90-95% variance
    • ✅ Document transformation
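
Following tip 6, a minimal sketch of a randomized-solver PCA using the full signature documented above (50 components and random_state = 42 are illustrative values; skipped arguments keep their defaults):

# Randomized SVD for large datasets, with a fixed seed for reproducibility
Cell A1: =ML.DIM_REDUCTION.PCA(50, FALSE, "randomized", , , , , 42)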