Impute & Feature Selection Reference

Complete reference for FormulaML imputation and feature selection functions.

Functions for handling missing values and selecting important features.

ML.IMPUTE Namespace

ML.IMPUTE.SIMPLE_IMPUTER()

Imputes missing values using basic strategies.

Syntax:

=ML.IMPUTE.SIMPLE_IMPUTER(strategy, fill_value)

Parameters:

  • strategy (String, Required): Imputation strategy
    • "mean": Replace with column mean (numeric only)
    • "median": Replace with column median (numeric only)
    • "most_frequent": Replace with mode (works for all types)
    • "constant": Replace with specific value (use fill_value)
  • fill_value (Any, Optional): Value for "constant" strategy

Returns: SimpleImputer transformer object

Use Case: Handle missing data before training models

Example:

# Mean imputation for numeric columns
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Result: <SimpleImputer>

# Constant imputation
Cell A2: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 0)

# Most frequent for categorical
Cell A3: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
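
The parameter names here mirror scikit-learn's SimpleImputer; assuming that correspondence, a minimal Python sketch of the fit-on-train, transform-test pattern shown above:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
X_test = np.array([[np.nan, 1.0]])

# Learn column means from the training data only
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# Reuse the learned training means on the test data
X_test_filled = imputer.transform(X_test)
print(X_test_filled)  # [[3. 1.]] -- NaN filled with column 0's training mean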

ML.FEATURE_SELECTION Namespace

ML.FEATURE_SELECTION.SELECT_PERCENTILE()

Selects the highest-scoring features, keeping a given percentile.

Syntax:

=ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile)

Parameters:

  • score_func (String, Required): Scoring function to use
    • "chi2": Chi-squared test (classification; non-negative features only)
    • "f_classif": F-test (classification; any features)
    • "f_regression": F-test (regression; any features)
    • "mutual_info": Mutual information (any task; see the scoring function table at the end of this reference)
  • percentile (Integer, Required): Percentile of features to keep
    • Value between 1 and 100
    • 10 = keep top 10% of features

Returns: SelectPercentile transformer object

Use Case: Reduce features by statistical significance

Example:

# Keep top 25% of features
Cell A1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 25)
Result: <SelectPercentile>

# Fit on training data
Cell B1: =ML.FIT(A1, X_train, y_train)

# Transform both train and test
Cell C1: =ML.TRANSFORM(B1, X_train)
Cell D1: =ML.TRANSFORM(B1, X_test)
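
SELECT_PERCENTILE appears to wrap scikit-learn's SelectPercentile; under that assumption, the equivalent Python flow is:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_digits(return_X_y=True)  # 64 non-negative pixel features

# Keep the top 25% of features ranked by chi-squared score
selector = SelectPercentile(score_func=chi2, percentile=25)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 16)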

Common Patterns

Basic Missing Value Imputation

# Load data with missing values
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)

# Check for missing values
Cell B1: =ML.DATA.INFO(A1)  # Shows null counts

# Impute with mean
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell D1: =ML.FIT_TRANSFORM(C1, A1)

# Verify no missing values
Cell E1: =ML.DATA.INFO(D1)

Different Strategies for Different Columns

# Load data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:G1000, TRUE)

# Mean for numeric columns
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell C1: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {"age", "income", "score"})

# Most frequent for categorical
Cell B2: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(B2, {"category", "region"})

# Combine imputers
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(C1, C2)

# Use in pipeline
Cell E1: =ML.CLASSIFICATION.LOGISTIC()
Cell F1: =ML.PIPELINE(D1, E1)
Cell G1: =ML.FIT(F1, X_train, y_train)
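
The COLUMN_TRANSFORMER / DATA_TRANSFORMER pairing corresponds to scikit-learn's ColumnTransformer; a sketch of the same column-wise imputation under that assumption (column names echo the example above):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":      [25.0, np.nan, 40.0],
    "income":   [50_000.0, 60_000.0, np.nan],
    "score":    [0.5, 0.7, 0.9],
    "category": ["a", np.nan, "b"],
    "region":   ["x", "y", np.nan],
})

# Mean for the numeric columns, most frequent for the categorical
# ones; each imputer sees only its own column subset
imputers = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age", "income", "score"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["category", "region"]),
])
print(imputers.fit_transform(df))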

Imputation in Complete Pipeline

# Load data
Cell A1: =ML.DATASETS.DIABETES()

# (This dataset is complete; with real data, missing values
# would already be present and handled by the imputer below)

# Step 1: Impute missing values
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")

# Step 2: Scale features
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()

# Step 3: Select features (f_regression suits the regression target;
# chi2 would fail here because scaled features can be negative)
Cell B3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("f_regression", 50)

# Step 4: Train model
Cell B4: =ML.REGRESSION.RIDGE(1.0)

# Create pipeline
Cell C1: =ML.PIPELINE(B1, B2, B3, B4)

# Fit and predict
Cell D1: =ML.FIT(C1, X_train, y_train)
Cell E1: =ML.PREDICT(D1, X_test)
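
Assuming the formulas above map onto the usual scikit-learn classes, the same four-step pipeline for the diabetes regression task looks like:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Impute -> scale -> select -> model; the imputer is a no-op on this
# complete dataset but guards against gaps in real data
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SelectPercentile(score_func=f_regression, percentile=50),
    Ridge(alpha=1.0),
)
pipe.fit(X_train, y_train)
print(pipe.predict(X_test)[:5])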

Constant Value Imputation

# Load data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)

# Impute age with a fixed, domain-informed value (here, 40)
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 40)
Cell C1: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {"age"})

# Impute categorical with "Unknown"
Cell B2: =ML.IMPUTE.SIMPLE_IMPUTER("constant", "Unknown")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(B2, {"category"})

# Combine
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(C1, C2)

# Process data
Cell E1: =ML.FIT_TRANSFORM(D1, A1)

Feature Selection Workflow

# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # 64 features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Target

# Select top 50% most important features
Cell D1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)

# Fit selector on training data
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0)  # X_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1)  # X_test
Cell F1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0)  # y_train
Cell F2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1)  # y_test

# Fit feature selector
Cell G1: =ML.FIT(D1, E1, F1)

# Transform to selected features
Cell H1: =ML.TRANSFORM(G1, E1)  # Train with top features
Cell H2: =ML.TRANSFORM(G1, E2)  # Test with top features

# Train model on selected features
Cell I1: =ML.CLASSIFICATION.SVM()
Cell J1: =ML.FIT(I1, H1, F1)

# Evaluate
Cell K1: =ML.EVAL.SCORE(J1, H2, F2)
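
The fit-on-train, transform-both discipline above maps directly onto scikit-learn; a condensed sketch under that assumption:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the selector on the training split only, then apply it to both
selector = SelectPercentile(score_func=chi2, percentile=50).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

model = SVC().fit(X_train_sel, y_train)
print(model.score(X_test_sel, y_test))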

Comparing Feature Selection Percentiles

# Load data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)

# Try different percentiles
Cell D1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 10)  # Top 10%
Cell D2: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 25)  # Top 25%
Cell D3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)  # Top 50%

# Create pipelines with each
Cell E1: =ML.CLASSIFICATION.SVM()
Cell F1: =ML.PIPELINE(D1, E1)
Cell F2: =ML.PIPELINE(D2, E1)
Cell F3: =ML.PIPELINE(D3, E1)

# Cross-validate each
Cell G1: =ML.EVAL.CV_SCORE(F1, B1, C1, 5, "accuracy")
Cell G2: =ML.EVAL.CV_SCORE(F2, B1, C1, 5, "accuracy")
Cell G3: =ML.EVAL.CV_SCORE(F3, B1, C1, 5, "accuracy")

# Compare average scores
Cell H1: =AVERAGE(G1#)  # 10% features
Cell H2: =AVERAGE(G2#)  # 25% features
Cell H3: =AVERAGE(G3#)  # 50% features
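
In scikit-learn terms this comparison is a cross_val_score loop; a sketch assuming that correspondence:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Cross-validate one pipeline per candidate percentile
for pct in (10, 25, 50):
    pipe = make_pipeline(SelectPercentile(score_func=chi2, percentile=pct), SVC())
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"top {pct:>2}% of features: mean accuracy = {scores.mean():.3f}")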

Complete Preprocessing Pipeline

# Load messy data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:J1000, TRUE)

# 1. Impute missing values
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")

# 2. Scale features
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()

# 3. Select best features (f_classif, because scaled features can be
#    negative and chi2 requires non-negative values)
Cell B3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("f_classif", 70)

# 4. Train model
Cell B4: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)

# Build full pipeline
Cell C1: =ML.PIPELINE(B1, B2, B3, B4)

# Fit and evaluate
Cell D1: =ML.FIT(C1, X_train, y_train)
Cell E1: =ML.EVAL.SCORE(D1, X_test, y_test)

# Make predictions
Cell F1: =ML.PREDICT(D1, X_test)

Tips and Best Practices

Imputation Strategies

  1. Choosing Imputation Strategy

    • Mean: Symmetric distributions, no outliers
    • Median: Skewed distributions, has outliers
    • Most Frequent: Categorical data, discrete values
    • Constant: Domain knowledge suggests specific value
  2. When to Impute

    • Before any other preprocessing step that cannot handle missing values
    • Preferably: fit the imputer on the training split, then transform both splits
    • Imputing before the split is simpler, but it leaks test-set statistics into training
  3. Imputation Best Practices

    • ✅ Understand why data is missing
    • ✅ Use median for outlier-prone data
    • ✅ Document imputation choices
    • ✅ Check imputation impact on results
    • ❌ Don’t ignore missing data patterns
    • ❌ Don’t use mean with outliers
    • ❌ Don’t impute without checking distribution
  4. Imputation Order (see the sketch below)

    Best: Load → Split → Impute (fit on train) → Scale → Model
    Acceptable for exploration: Load → Impute → Split → Scale → Model
    Wrong: Load → Split → Scale → Impute (scaling statistics are
    computed before the gaps are filled)
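
A leakage-safe sketch of the recommended order, assuming standard scikit-learn components:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])
X_train, X_test = train_test_split(X, test_size=0.4, random_state=0)

# Fit on the training fold only: the fill value is a train-fold
# statistic, so nothing from the test fold leaks into it
imputer = SimpleImputer(strategy="mean").fit(X_train)
print(imputer.statistics_)        # train-fold column mean
print(imputer.transform(X_test))  # test-fold NaNs (if any) filled with it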
    

Feature Selection

  1. Choosing Percentile

    • Start with 50% (half the features)
    • Use cross-validation to optimize
    • Balance: fewer features vs accuracy
    • Consider computational cost
  2. Feature Selection Strategies

    • 10-25%: Very high dimensionality
    • 25-50%: Medium dimensionality
    • 50-75%: Mild reduction
    • 75-90%: Conservative approach
  3. When to Use Feature Selection

    • High-dimensional data (many features)
    • Reduce overfitting
    • Speed up training
    • Improve interpretability
    • Remove redundant features
  4. Feature Selection Best Practices

    • ✅ Always fit on training data only
    • ✅ Use same selector for train and test
    • ✅ Try different percentiles
    • ✅ Validate with cross-validation
    • ❌ Don’t select features on full dataset
    • ❌ Don’t use different selections for train/test

Combined Strategies

  1. Preprocessing Order

    1. Handle missing values (impute)
    2. Encode categorical features
    3. Scale numerical features
    4. Select features (if needed)
    5. Train model
    
  2. Pipeline Integration

    • Put imputation first in pipeline
    • Feature selection before model
    • Use compose for column-specific imputation
    • Validate entire pipeline with CV
  3. Performance Tips

    • Impute early (reduces missing value errors)
    • Select features to speed up training
    • Cache imputer/selector for production
    • Monitor feature importance after selection
  4. Common Patterns

    Quick Fix: IMPUTE(mean) → MODEL
    Better: IMPUTE → SCALE → MODEL
    Best: IMPUTE → SCALE → SELECT → MODEL
    Column-wise: COMPOSE(imputers) → COMPOSE(scalers) → MODEL
    

Imputation Strategy Guide

| Data Type   | Distribution | Has Outliers | Strategy                     |
|-------------|--------------|--------------|------------------------------|
| Numeric     | Normal       | No           | Mean                         |
| Numeric     | Normal       | Yes          | Median                       |
| Numeric     | Skewed       | Any          | Median                       |
| Categorical | Any          | N/A          | Most Frequent                |
| Any         | Any          | N/A          | Constant (with domain value) |

Feature Selection Scoring Functions

| Function     | Use Case       | Data Requirements     |
|--------------|----------------|-----------------------|
| chi2         | Classification | Non-negative features |
| f_classif    | Classification | Any features          |
| f_regression | Regression     | Any features          |
| mutual_info  | Any            | Any features          |
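
The chi2 restriction is the one most easily tripped over. Assuming these names map onto scikit-learn's chi2 and f_classif, a quick Python check illustrates the difference:

import numpy as np
from sklearn.feature_selection import chi2, f_classif

X = np.array([[-1.0, 2.0], [0.5, 1.0], [1.5, 0.0], [2.0, 3.0]])
y = np.array([0, 0, 1, 1])

print(f_classif(X, y))  # works: the F-test accepts negative feature values

try:
    chi2(X, y)
except ValueError as err:
    print("chi2 raised:", err)  # chi2 requires non-negative features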