Impute & Feature Selection Reference

Complete reference for FormulaML imputation and feature selection functions.

Functions for handling missing values and selecting important features.

ML.IMPUTE Namespace

ML.IMPUTE.SIMPLE_IMPUTER()

Imputes missing values using basic strategies.

Syntax:

=ML.IMPUTE.SIMPLE_IMPUTER(strategy, fill_value)

Parameters:

  • strategy (String, Required): Imputation strategy
    • "mean": Replace with column mean (numeric only)
    • "median": Replace with column median (numeric only)
    • "most_frequent": Replace with mode (works for all types)
    • "constant": Replace with specific value (use fill_value)
  • fill_value (Any, Optional): Value for "constant" strategy

Returns: SimpleImputer transformer object

Use Case: Handle missing data before training models

Example:

# Mean imputation for numeric columns
Cell A1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Result: <SimpleImputer>

# Constant imputation
Cell A2: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 0)

# Most frequent for categorical
Cell A3: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")

# Fit and transform
Cell B1: =ML.FIT_TRANSFORM(A1, X_train)
Cell C1: =ML.TRANSFORM(A1, X_test)
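
The parameter names here mirror scikit-learn's SimpleImputer; assuming that correspondence, a minimal Python sketch of the fit-on-train, transform-test pattern shown above:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
X_test = np.array([[np.nan, 1.0]])

# Learn column means from the training data only
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# Reuse the learned training means on the test data
X_test_filled = imputer.transform(X_test)
print(X_test_filled)  # [[3. 1.]] -- NaN filled with column 0's training mean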

ML.FEATURE_SELECTION Namespace

ML.FEATURE_SELECTION.SELECT_PERCENTILE()

Selects the highest-scoring features, keeping a given percentile.

Syntax:

=ML.FEATURE_SELECTION.SELECT_PERCENTILE(score_func, percentile)

Parameters:

  • score_func (String, Required): Scoring function to use
    • "chi2": Chi-squared test (classification; non-negative features only)
    • "f_classif": F-test (classification; any features)
    • "f_regression": F-test (regression; any features)
    • "mutual_info": Mutual information (any task; see the scoring function table at the end of this reference)
  • percentile (Integer, Required): Percentile of features to keep
    • Value between 1 and 100
    • 10 = keep top 10% of features

Returns: SelectPercentile transformer object

Use Case: Reduce features by statistical significance

Example:

# Keep top 25% of features
Cell A1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 25)
Result: <SelectPercentile>

# Fit on training data
Cell B1: =ML.FIT(A1, X_train, y_train)

# Transform both train and test
Cell C1: =ML.TRANSFORM(B1, X_train)
Cell D1: =ML.TRANSFORM(B1, X_test)
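
SELECT_PERCENTILE appears to wrap scikit-learn's SelectPercentile; under that assumption, the equivalent Python flow is:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_digits(return_X_y=True)  # 64 non-negative pixel features

# Keep the top 25% of features ranked by chi-squared score
selector = SelectPercentile(score_func=chi2, percentile=25)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 16)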

Common Patterns

Basic Missing Value Imputation

# Load data with missing values
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:F1000, TRUE)

# Check for missing values
Cell B1: =ML.DATA.INFO(A1)  # Shows null counts

# Impute with mean
Cell C1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell D1: =ML.FIT_TRANSFORM(C1, A1)

# Verify no missing values
Cell E1: =ML.DATA.INFO(D1)

Different Strategies for Different Columns

# Load data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:G1000, TRUE)

# Mean for numeric columns
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("mean")
Cell C1: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {"age", "income", "score"})

# Most frequent for categorical
Cell B2: =ML.IMPUTE.SIMPLE_IMPUTER("most_frequent")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(B2, {"category", "region"})

# Combine imputers
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(C1, C2)

# Use in pipeline
Cell E1: =ML.CLASSIFICATION.LOGISTIC()
Cell F1: =ML.PIPELINE(D1, E1)
Cell G1: =ML.FIT(F1, X_train, y_train)
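
The COLUMN_TRANSFORMER / DATA_TRANSFORMER pairing corresponds to scikit-learn's ColumnTransformer; a sketch of the same column-wise imputation under that assumption (column names echo the example above):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age":      [25.0, np.nan, 40.0],
    "income":   [50_000.0, 60_000.0, np.nan],
    "score":    [0.5, 0.7, 0.9],
    "category": ["a", np.nan, "b"],
    "region":   ["x", "y", np.nan],
})

# Mean for the numeric columns, most frequent for the categorical
# ones; each imputer sees only its own column subset
imputers = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age", "income", "score"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["category", "region"]),
])
print(imputers.fit_transform(df))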

Imputation in Complete Pipeline

# Load data
Cell A1: =ML.DATASETS.DIABETES()

# (This dataset is complete; with real data, missing values
# would already be present and handled by the imputer below)

# Step 1: Impute missing values
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")

# Step 2: Scale features
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()

# Step 3: Select features (f_regression suits the regression target;
# chi2 would fail here because scaled features can be negative)
Cell B3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("f_regression", 50)

# Step 4: Train model
Cell B4: =ML.REGRESSION.RIDGE(1.0)

# Create pipeline
Cell C1: =ML.PIPELINE(B1, B2, B3, B4)

# Fit and predict
Cell D1: =ML.FIT(C1, X_train, y_train)
Cell E1: =ML.PREDICT(D1, X_test)
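
Assuming the formulas above map onto the usual scikit-learn classes, the same four-step pipeline for the diabetes regression task looks like:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Impute -> scale -> select -> model; the imputer is a no-op on this
# complete dataset but guards against gaps in real data
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SelectPercentile(score_func=f_regression, percentile=50),
    Ridge(alpha=1.0),
)
pipe.fit(X_train, y_train)
print(pipe.predict(X_test)[:5])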

Constant Value Imputation

# Load data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:E1000, TRUE)

# Impute age with a fixed, domain-informed value (here, 40)
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("constant", 40)
Cell C1: =ML.COMPOSE.COLUMN_TRANSFORMER(B1, {"age"})

# Impute categorical with "Unknown"
Cell B2: =ML.IMPUTE.SIMPLE_IMPUTER("constant", "Unknown")
Cell C2: =ML.COMPOSE.COLUMN_TRANSFORMER(B2, {"category"})

# Combine
Cell D1: =ML.COMPOSE.DATA_TRANSFORMER(C1, C2)

# Process data
Cell E1: =ML.FIT_TRANSFORM(D1, A1)

Feature Selection Workflow

# Load high-dimensional data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")  # 64 features
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)      # Target

# Select top 50% most important features
Cell D1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)

# Fit selector on training data
Cell E1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 0)  # X_train
Cell E2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(B1, 0.2, 42, 1)  # X_test
Cell F1: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 0)  # y_train
Cell F2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(C1, 0.2, 42, 1)  # y_test

# Fit feature selector
Cell G1: =ML.FIT(D1, E1, F1)

# Transform to selected features
Cell H1: =ML.TRANSFORM(G1, E1)  # Train with top features
Cell H2: =ML.TRANSFORM(G1, E2)  # Test with top features

# Train model on selected features
Cell I1: =ML.CLASSIFICATION.SVM()
Cell J1: =ML.FIT(I1, H1, F1)

# Evaluate
Cell K1: =ML.EVAL.SCORE(J1, H2, F2)
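
The fit-on-train, transform-both discipline above maps directly onto scikit-learn; a condensed sketch under that assumption:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the selector on the training split only, then apply it to both
selector = SelectPercentile(score_func=chi2, percentile=50).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

model = SVC().fit(X_train_sel, y_train)
print(model.score(X_test_sel, y_test))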

Comparing Feature Selection Percentiles

# Load data
Cell A1: =ML.DATASETS.DIGITS()
Cell B1: =ML.DATA.SELECT_COLUMNS(A1, "0:63")
Cell C1: =ML.DATA.SELECT_COLUMNS(A1, 64)

# Try different percentiles
Cell D1: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 10)  # Top 10%
Cell D2: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 25)  # Top 25%
Cell D3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("chi2", 50)  # Top 50%

# Create pipelines with each
Cell E1: =ML.CLASSIFICATION.SVM()
Cell F1: =ML.PIPELINE(D1, E1)
Cell F2: =ML.PIPELINE(D2, E1)
Cell F3: =ML.PIPELINE(D3, E1)

# Cross-validate each
Cell G1: =ML.EVAL.CV_SCORE(F1, B1, C1, 5, "accuracy")
Cell G2: =ML.EVAL.CV_SCORE(F2, B1, C1, 5, "accuracy")
Cell G3: =ML.EVAL.CV_SCORE(F3, B1, C1, 5, "accuracy")

# Compare average scores
Cell H1: =AVERAGE(G1#)  # 10% features
Cell H2: =AVERAGE(G2#)  # 25% features
Cell H3: =AVERAGE(G3#)  # 50% features
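
In scikit-learn terms this comparison is a cross_val_score loop; a sketch assuming that correspondence:

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Cross-validate one pipeline per candidate percentile
for pct in (10, 25, 50):
    pipe = make_pipeline(SelectPercentile(score_func=chi2, percentile=pct), SVC())
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"top {pct:>2}% of features: mean accuracy = {scores.mean():.3f}")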

Complete Preprocessing Pipeline

# Load messy data
Cell A1: =ML.DATA.CONVERT_TO_DF(Sheet1!A1:J1000, TRUE)

# 1. Impute missing values
Cell B1: =ML.IMPUTE.SIMPLE_IMPUTER("median")

# 2. Scale features
Cell B2: =ML.PREPROCESSING.STANDARD_SCALER()

# 3. Select best features (f_classif, because scaled features can be
#    negative and chi2 requires non-negative values)
Cell B3: =ML.FEATURE_SELECTION.SELECT_PERCENTILE("f_classif", 70)

# 4. Train model
Cell B4: =ML.CLASSIFICATION.RANDOM_FOREST_CLF(100)

# Build full pipeline
Cell C1: =ML.PIPELINE(B1, B2, B3, B4)

# Fit and evaluate
Cell D1: =ML.FIT(C1, X_train, y_train)
Cell E1: =ML.EVAL.SCORE(D1, X_test, y_test)

# Make predictions
Cell F1: =ML.PREDICT(D1, X_test)

Tips and Best Practices

Imputation Strategies

  1. Choosing Imputation Strategy

    • Mean: Symmetric distributions, no outliers
    • Median: Skewed distributions, has outliers
    • Most Frequent: Categorical data, discrete values
    • Constant: Domain knowledge suggests specific value
  2. When to Impute

    • Before any other preprocessing step that cannot handle missing values
    • Preferably: fit the imputer on the training split, then transform both splits
    • Imputing before the split is simpler, but it leaks test-set statistics into training
  3. Imputation Best Practices

    • ✅ Understand why data is missing
    • ✅ Use median for outlier-prone data
    • ✅ Document imputation choices
    • ✅ Check imputation impact on results
    • ❌ Don’t ignore missing data patterns
    • ❌ Don’t use mean with outliers
    • ❌ Don’t impute without checking distribution
  4. Imputation Order (see the sketch below)

    Best: Load → Split → Impute (fit on train) → Scale → Model
    Acceptable for exploration: Load → Impute → Split → Scale → Model
    Wrong: Load → Split → Scale → Impute (scaling statistics are
    computed before the gaps are filled)
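
A leakage-safe sketch of the recommended order, assuming standard scikit-learn components:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])
X_train, X_test = train_test_split(X, test_size=0.4, random_state=0)

# Fit on the training fold only: the fill value is a train-fold
# statistic, so nothing from the test fold leaks into it
imputer = SimpleImputer(strategy="mean").fit(X_train)
print(imputer.statistics_)        # train-fold column mean
print(imputer.transform(X_test))  # test-fold NaNs (if any) filled with it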
    

Feature Selection

  1. Choosing Percentile

    • Start with 50% (half the features)
    • Use cross-validation to optimize
    • Balance: fewer features vs accuracy
    • Consider computational cost
  2. Feature Selection Strategies

    • 10-25%: Very high dimensionality
    • 25-50%: Medium dimensionality
    • 50-75%: Mild reduction
    • 75-90%: Conservative approach
  3. When to Use Feature Selection

    • High-dimensional data (many features)
    • Reduce overfitting
    • Speed up training
    • Improve interpretability
    • Remove redundant features
  4. Feature Selection Best Practices

    • ✅ Always fit on training data only
    • ✅ Use same selector for train and test
    • ✅ Try different percentiles
    • ✅ Validate with cross-validation
    • ❌ Don’t select features on full dataset
    • ❌ Don’t use different selections for train/test

Combined Strategies

  1. Preprocessing Order

    1. Handle missing values (impute)
    2. Encode categorical features
    3. Scale numerical features
    4. Select features (if needed)
    5. Train model
    
  2. Pipeline Integration

    • Put imputation first in pipeline
    • Feature selection before model
    • Use compose for column-specific imputation
    • Validate entire pipeline with CV
  3. Performance Tips

    • Impute early (reduces missing value errors)
    • Select features to speed up training
    • Cache imputer/selector for production
    • Monitor feature importance after selection
  4. Common Patterns

    Quick Fix: IMPUTE(mean) → MODEL
    Better: IMPUTE → SCALE → MODEL
    Best: IMPUTE → SCALE → SELECT → MODEL
    Column-wise: COMPOSE(imputers) → COMPOSE(scalers) → MODEL
    

Imputation Strategy Guide

| Data Type   | Distribution | Has Outliers | Strategy                     |
|-------------|--------------|--------------|------------------------------|
| Numeric     | Normal       | No           | Mean                         |
| Numeric     | Normal       | Yes          | Median                       |
| Numeric     | Skewed       | Any          | Median                       |
| Categorical | Any          | N/A          | Most Frequent                |
| Any         | Any          | N/A          | Constant (with domain value) |

Feature Selection Scoring Functions

| Function     | Use Case       | Data Requirements     |
|--------------|----------------|-----------------------|
| chi2         | Classification | Non-negative features |
| f_classif    | Classification | Any features          |
| f_regression | Regression     | Any features          |
| mutual_info  | Any            | Any features          |
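
The chi2 restriction is the one most easily tripped over. Assuming these names map onto scikit-learn's chi2 and f_classif, a quick Python check illustrates the difference:

import numpy as np
from sklearn.feature_selection import chi2, f_classif

X = np.array([[-1.0, 2.0], [0.5, 1.0], [1.5, 0.0], [2.0, 3.0]])
y = np.array([0, 0, 1, 1])

print(f_classif(X, y))  # works: the F-test accepts negative feature values

try:
    chi2(X, y)
except ValueError as err:
    print("chi2 raised:", err)  # chi2 requires non-negative features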